Open-source TTS has finally moved to real-time streaming. Higgs Audio v3's latency control is quite impressive, and zero-shot cloning plus emotion tagging give you plenty to play with.

CoinNetwork
Boson AI open-sources Higgs Audio v3, a 4B audio model supporting streaming and emotion control
Boson AI has open-sourced the weights of its Higgs Audio v3 TTS model. Built on Qwen3-4B, with approximately 4 billion parameters, it is optimized for real-time streaming conversation: synthesis starts before the input text is complete, which reduces latency. It supports 100+ languages and dialects, with character/word error rates brought down to single digits. It also supports zero-shot voice cloning, and 20+ emotions and several types of control tags can be embedded in the text. Inference was optimized end to end together with LMSYS in the SGLang-Omni framework; on a single H100, the real-time factor (RTF) at a concurrency of one is 0.147. The weights have been released on Hugging Face under a non-commercial research license.
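To put the 0.147 figure in perspective, here is a rough sketch of the arithmetic. It assumes the number is a real-time factor in the usual sense (synthesis time divided by the duration of the audio produced), which the news item does not spell out. The function names are made up for illustration and are not part of any Higgs Audio or SGLang API.

```python
def synthesis_seconds(audio_seconds: float, rtf: float = 0.147) -> float:
    """Estimate compute time needed to generate `audio_seconds` of speech.

    Assumes RTF = synthesis time / audio duration (an assumption, see above).
    """
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


def speedup_over_real_time(rtf: float = 0.147) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be > 0")
    return 1.0 / rtf


if __name__ == "__main__":
    # 10 seconds of speech at the reported single-concurrency figure
    print(f"{synthesis_seconds(10.0):.2f} s of compute")    # 1.47 s of compute
    print(f"{speedup_over_real_time():.1f}x real time")     # 6.8x real time
```

Under that reading, an RTF below 1 means audio is generated faster than it plays back, which is what leaves headroom for streaming.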