Xiaomi AI Laboratory's open-source OmniVoice is a zero-shot voice cloning TTS model for 646 languages. It uses a single bidirectional Transformer to map text directly to discrete acoustic tokens, with no two-stage pipeline. Its core design combines full-codebook random masking with initialization from pre-trained large language model parameters, and it reaches 40x real-time inference in plain PyTorch. Training data comes from 50 open-source datasets totaling 580k hours, with upsampling for low-resource languages. In tests on 24 languages it outperforms several commercial systems, and on 102 languages its intelligibility approaches that of real recordings. It can also customize voice timbre from a text description, denoise reference audio automatically, insert non-verbal markers such as laughter or sighs, and correct the pronunciation of polyphones and proper nouns.

BlockBeatNews

2026-05-07 10:35:45

According to Beating Monitoring, the next-generation Kaldi team at Xiaomi AI Laboratory has open-sourced OmniVoice, a zero-shot voice cloning TTS (text-to-speech) model supporting 646 languages. It can clone a voice's timbre from just a few seconds of reference audio, and the clone carries across languages: given a Chinese recording, the model can speak Japanese, Korean, or other languages in the same voice. All code, weights, and training data are open source under the Apache-2.0 license.

Architecturally, OmniVoice takes a minimalist approach. The entire model is a single bidirectional Transformer that maps text directly to multi-codebook acoustic tokens (discrete sound encodings), skipping the two-stage pipeline that first predicts semantic tokens and then converts them to acoustic tokens. Two design choices support this simple structure: a full-codebook random masking strategy that improves training efficiency, and initialization from pre-trained large language model parameters, which improves pronunciation accuracy. Inference runs 40 times faster than real time in plain PyTorch, with no additional optimization.
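
The report does not spell out how full-codebook random masking works. As a rough illustration only, the sketch below assumes it means hiding a random fraction of token positions across every codebook at once, so that the model learns to fill them in from the text and the remaining tokens. The function name, the `MASK_ID` value, and the sizes (6 frames, 8 codebooks, 1024 entries per codebook) are hypothetical and are not taken from OmniVoice.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the mask token


def mask_all_codebooks(tokens, mask_ratio, rng):
    """Hide a random subset of positions across all codebooks.

    tokens: list of frames, each frame a list of per-codebook token ids.
    Returns (masked_tokens, mask); mask[t][k] is True where the token was hidden.
    """
    masked, mask = [], []
    for frame in tokens:
        # Each (frame, codebook) position is masked independently.
        row_mask = [rng.random() < mask_ratio for _ in frame]
        masked.append([MASK_ID if m else tok for tok, m in zip(frame, row_mask)])
        mask.append(row_mask)
    return masked, mask


rng = random.Random(0)
num_frames, num_codebooks, vocab_size = 6, 8, 1024  # illustrative sizes only
tokens = [[rng.randrange(vocab_size) for _ in range(num_codebooks)]
          for _ in range(num_frames)]

ratio = rng.random()  # assumption: a fresh mask ratio is drawn per training example
masked, mask = mask_all_codebooks(tokens, ratio, rng)

# A training loss would then be computed only at positions where mask is True.
hidden = sum(m for row in mask for m in row)
print(f"masked {hidden} of {num_frames * num_codebooks} positions")
```

In this reading, one pass over an utterance gives the model prediction targets in every codebook rather than in one codebook level at a time, which is one plausible source of the training-efficiency gain the report mentions.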

All training data comes from 50 open-source speech datasets, totaling 580k hours after noise reduction and quality filtering. Low-resource languages are dynamically upsampled during training so that they are not drowned out by high-resource ones. In tests across 24 languages, OmniVoice's voice similarity and intelligibility surpass those of several commercial systems. In tests across 102 languages, its intelligibility approaches or even exceeds that of real recordings. Even languages with fewer than 10 hours of training data can be synthesized effectively.
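
The report gives no details of the upsampling scheme. One common way to rebalance multilingual data, shown here purely as an illustration and not as OmniVoice's confirmed method, is temperature-style sampling: each language is drawn with probability proportional to its share of the data raised to a power `alpha` below 1. The function name, the `alpha` value, and the hour counts are made up for the example.

```python
def sampling_weights(hours_by_lang, alpha=0.5):
    """Return per-language sampling probabilities proportional to share ** alpha.

    alpha = 1.0 reproduces the natural data proportions;
    smaller values shift probability toward low-resource languages.
    """
    total = sum(hours_by_lang.values())
    scaled = {lang: (h / total) ** alpha for lang, h in hours_by_lang.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


# Made-up hour counts, chosen only to show the effect.
hours = {"high_resource": 10000.0, "mid_resource": 100.0, "low_resource": 1.0}
weights = sampling_weights(hours)

for lang, w in weights.items():
    natural = hours[lang] / sum(hours.values())
    print(f"{lang}: natural share {natural:.4%}, sampling weight {w:.4%}")
```

With these numbers the smallest language is sampled far more often than its raw share of hours would allow, while the largest language still dominates. A "dynamic" variant could recompute the weights as training progresses; whether OmniVoice does this is not stated in the report.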

In addition to voice cloning, the model supports customizing voice timbre via text descriptions (such as “male, middle-aged, very low pitch” or “female, young, Sichuan dialect”), automatic noise reduction for reference audio, inline symbols for non-verbal cues such as laughter or sighs, and pronunciation correction for polyphonic characters and proper nouns in Chinese and English.

Xiaomi Open Sources OmniVoice: A voice cloning model covering 646 languages, trained solely on open-source data to outperform commercial systems
