Xiaomi Open-Sources OmniVoice: A Voice Cloning Model Covering 646 Languages, Trained Solely on Open-Source Data, That Outperforms Several Commercial Systems

According to Beating Monitoring, the Next-gen Kaldi team at Xiaomi AI Lab has open-sourced OmniVoice, a zero-shot voice cloning TTS (text-to-speech) model that supports 646 languages. It can clone a speaker's timbre from just a few seconds of reference audio, and the cloned voice carries across languages: given a Chinese recording, the model can speak Japanese, Korean, or other languages in the same voice. The code, weights, and training data are all released as open source under the Apache-2.0 license.

Architecturally, OmniVoice takes a minimalist approach. The entire model is a single bidirectional Transformer that maps text directly to multi-codebook acoustic tokens (discrete encodings of sound), skipping the usual two-stage pipeline that first predicts semantic tokens and then converts them to acoustic tokens. Two design choices make this simple structure work: a full-codebook random masking strategy that improves training efficiency, and initialization from the pre-trained parameters of a large language model, which improves pronunciation accuracy. Inference runs 40 times faster than real time (a real-time factor of roughly 0.025) in plain PyTorch, without additional optimization.
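The article does not spell out how the full-codebook random masking works, so the following is only a minimal sketch of one plausible reading: every position in every codebook is hidden independently with some probability, and the model would be trained to predict the hidden tokens. The `MASK_ID` sentinel, the function name, and the nested-list layout are all assumptions made for illustration, not details taken from the OmniVoice release.

```python
import random

# Hypothetical sentinel for a hidden token; not taken from the OmniVoice code.
MASK_ID = -1


def mask_all_codebooks(tokens, mask_ratio, rng):
    """Randomly hide tokens across every codebook.

    tokens: list of codebooks, each a list of integer token ids (one per frame).
    mask_ratio: probability in [0, 1] that any single token is hidden.
    rng: a random.Random instance, passed in so results can be reproduced.

    Returns (masked, is_masked), both with the same shape as `tokens`.
    """
    masked, is_masked = [], []
    for codebook in tokens:
        row, flags = [], []
        for tok in codebook:
            hide = rng.random() < mask_ratio
            row.append(MASK_ID if hide else tok)
            flags.append(hide)
        masked.append(row)
        is_masked.append(flags)
    return masked, is_masked


if __name__ == "__main__":
    # Two codebooks, four frames each; the ids are made up for the demo.
    example = [[11, 12, 13, 14], [21, 22, 23, 24]]
    masked, flags = mask_all_codebooks(example, 0.5, random.Random(0))
    print(masked)
    print(flags)
```

In a training loop, the loss would typically be computed only at positions where `is_masked` is true, and the mask ratio would be resampled for each example.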

All training data comes from 50 open-source speech datasets, totaling 580,000 hours after noise reduction and quality filtering. Low-resource languages are dynamically upsampled during training so that they still receive enough training signal. In tests across 24 languages, OmniVoice surpasses several commercial systems in both voice similarity and intelligibility. In tests across 102 languages, its intelligibility approaches or even exceeds that of real recordings. Even languages with fewer than 10 hours of training data can be synthesized effectively.
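The article does not say how the dynamic upsampling is computed. One scheme commonly used in multilingual training is exponent-based (temperature) sampling, where each language is drawn with probability proportional to its share of the data raised to a power `alpha` below 1. The sketch below assumes that scheme; the function name, the `alpha` value, and the hour counts are hypothetical and are not OmniVoice's actual settings.

```python
def sampling_weights(hours_by_lang, alpha=0.5):
    """Return per-language sampling probabilities.

    hours_by_lang: dict mapping a language code to its hours of training audio.
    alpha: exponent in (0, 1]; 1.0 keeps the raw proportions, smaller values
           shift probability mass toward low-resource languages.
    """
    total = sum(hours_by_lang.values())
    scaled = {lang: (h / total) ** alpha for lang, h in hours_by_lang.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


if __name__ == "__main__":
    # Made-up hour counts: one high-resource and one low-resource language.
    hours = {"en": 9000.0, "sw": 10.0}
    for lang, weight in sampling_weights(hours).items():
        raw_share = hours[lang] / sum(hours.values())
        print(f"{lang}: raw share {raw_share:.4f}, sampling weight {weight:.4f}")
```

With an exponent below 1, the low-resource language is sampled far more often than its raw share of the data would suggest, while the high-resource language still dominates each batch.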

In addition to voice cloning, the model supports designing a voice from a text description (such as “male, middle-aged, very low pitch” or “female, young, Sichuan dialect”), automatic denoising of the reference audio, insertion of non-verbal cues such as laughter or sighs, and pronunciation correction for polyphonic characters and proper nouns in Chinese and English.
