Xiaomi Open-Sources OmniVoice, a Voice-Cloning Model Covering 646 Languages


CryptoWorld News: The Next-gen Kaldi team at Xiaomi AI Lab has open-sourced OmniVoice, a zero-shot voice-cloning TTS (text-to-speech) model that supports 646 languages. The model can clone a speaker's timbre from only a few seconds of reference audio and supports cross-language conversion. The code, weights, and training data are all released under the Apache-2.0 license. OmniVoice's architecture is deliberately minimal: the model consists of a single bidirectional Transformer that maps text directly to multi-codebook acoustic tokens, with no two-stage pipeline. The training data comes from 50 open-source speech datasets that were denoised and quality-filtered, totaling 580k hours. In tests across 24 languages, the model surpasses several commercial systems in speaker similarity and intelligibility. In tests across 102 languages, its intelligibility approaches or even exceeds that of real recordings. Beyond voice cloning, the model also supports text-based voice customization and automatic denoising of noisy reference audio.
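
To make the "single bidirectional Transformer to multi-codebook tokens" idea concrete, here is a minimal toy sketch of the data flow in plain Python. It is not OmniVoice's code or API. The weights are random and untrained, and every name and size (`predict_acoustic_tokens`, the masked-frame placeholders, one attention layer, 4 codebooks of 16 entries) is an assumption made for illustration. It shows only the two points the description makes: attention is unmasked, so every position sees the whole sequence, and one pass emits a token for each codebook at each audio frame.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def add_positions(states):
    # Simple sinusoidal position signal so identical placeholders differ by position.
    dim = len(states[0])
    return [
        [x + math.sin(pos / (10000 ** (d / dim))) for d, x in enumerate(vec)]
        for pos, vec in enumerate(states)
    ]


def bidirectional_attention(states):
    # No causal mask: each position attends to all positions, left and right.
    dim = len(states[0])
    mixed_states = []
    for query in states:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in states]
        weights = softmax(scores)
        context = [sum(w * vec[d] for w, vec in zip(weights, states)) for d in range(dim)]
        mixed_states.append([q + c for q, c in zip(query, context)])  # residual connection
    return mixed_states


def predict_acoustic_tokens(text_ids, num_frames, num_codebooks=4,
                            codebook_size=16, dim=8, seed=0):
    # Hypothetical toy model: random weights, one attention layer, one linear head per codebook.
    rng = random.Random(seed)
    text_table = random_matrix(rng, max(text_ids) + 1, dim)
    mask_vector = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    heads = [random_matrix(rng, codebook_size, dim) for _ in range(num_codebooks)]

    # Input sequence: text embeddings followed by placeholder (masked) audio frames.
    states = [list(text_table[t]) for t in text_ids]
    states += [list(mask_vector) for _ in range(num_frames)]
    states = bidirectional_attention(add_positions(states))
    frames = states[len(text_ids):]

    # Single stage: every codebook's token is read off the same frame states.
    tokens = []
    for head in heads:
        row = []
        for frame in frames:
            logits = [sum(w * x for w, x in zip(weights, frame)) for weights in head]
            row.append(logits.index(max(logits)))
        tokens.append(row)
    return tokens


tokens = predict_acoustic_tokens([3, 1, 4], num_frames=5)
print(len(tokens), len(tokens[0]))  # number of codebooks, number of frames
```

A real system would stack many trained layers and pass the predicted tokens to an audio codec decoder to produce a waveform. The sketch stops at the token grid because that is the part the description covers.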
