National University of Singapore and Nanyang Technological University have open-sourced Mega-ASR, reducing hallucinations and word omissions in ASR under extreme noise conditions.

ME News Report, May 22 (UTC+8): According to Beating Monitoring, teams from the National University of Singapore, Nanyang Technological University, and Shanghai Artificial Intelligence Laboratory have jointly open-sourced Mega-ASR, described as the first all-scenario robust speech recognition base model. It targets hallucinations, word omissions, and blank outputs in real-world speech recognition. The model is built on Qwen3-ASR 1.7B and reportedly improves performance by up to nearly 30% over models such as Whisper, Gemini 3 Pro, and Seed-ASR in extremely complex acoustic environments. The project is available on GitHub, with all code and model weights released under the Apache-2.0 license.

The research team built a training dataset called Voices-in-the-wild-2M, containing 2.4 million samples totaling 11k hours. Using a physics-based spectral pipeline, the dataset synthesizes seven atomic acoustic effects (reverberation, echo, additive noise, far-field, frequency packet loss, bandwidth limitation, and clipping distortion) and derives 54 complex environmental scenarios from them. To keep training stable, the team filtered out samples with a word error rate above 70% and calibrated the dataset's difficulty distribution through physical plausibility checks.

For training, Mega-ASR introduces A2S-SFT, a progressive acoustic-to-semantic supervised fine-tuning method that aligns audio features in stages to strengthen the model's semantic recovery under heavy interference. For policy optimization, the model uses DG-WGPO, a dual-granularity, word-error-rate-gated strategy for reinforcement learning. When input audio quality is good and the word error rate is low, the system emphasizes character-level reconstruction of acoustic detail. When the audio is severely distorted and the word error rate is high, the decision mechanism shifts to sentence-level semantic reconstruction, which is said to significantly reduce the hallucinations and omissions common in large models.

To counter the slight drop in recognition accuracy that can occur on clean audio, Mega-ASR adds a dynamic routing mechanism. A routing module evaluates the quality of the incoming audio and decides whether to load the LoRA fine-tuning weights, so that the model performs well in both clean and noisy scenarios. (Source: BlockBeats)
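To make the idea of composable "atomic" acoustic effects more concrete, here is a minimal Python sketch of two of the seven effects named in the report, additive noise and clipping distortion, chained into one scenario. It is a simplified time-domain illustration and not the team's physics-based spectral pipeline; the function names (`add_noise`, `clip`, `compose`, `power`) and all parameter values are assumptions made for this example.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal given as a list of floats."""
    return sum(x * x for x in signal) / len(signal)


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled so the result has the requested SNR (dB)."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Noise power needed for the target signal-to-noise ratio.
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [x + scale * n for x, n in zip(signal, noise)]


def clip(signal, threshold):
    """Hard-clip every sample to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, x)) for x in signal]


def compose(*effects):
    """Chain atomic effects, applied left to right, into one scenario."""
    def scenario(signal):
        for effect in effects:
            signal = effect(signal)
        return signal
    return scenario


# Hypothetical scenario: a noisy recording captured by an overdriven microphone.
rng = random.Random(0)
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy_and_clipped = compose(
    lambda s: add_noise(s, snr_db=5.0, rng=rng),
    lambda s: clip(s, threshold=0.5),
)(tone)
```

Chaining small effects like this is one plausible way a handful of atomic degradations could be combined into many scenarios; the report does not say how the 54 scenarios were actually derived.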
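The filtering step mentioned in the report, dropping samples whose word error rate exceeds 70%, can be sketched in a few lines of Python. The report does not say which recognizer produced the transcripts used for scoring, so the `hypothesis` field below is an assumption, as are the function names and the sample layout.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] = edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def filter_samples(samples, max_wer=0.70):
    """Keep samples whose transcript WER does not exceed max_wer."""
    return [
        s for s in samples
        if word_error_rate(s["reference"], s["hypothesis"]) <= max_wer
    ]


# Hypothetical samples: the second one is too degraded to be useful.
samples = [
    {"reference": "the cat sat on the mat", "hypothesis": "the cat sat on mat"},
    {"reference": "a b c d", "hypothesis": "x y z d"},
]
kept = filter_samples(samples)
```

In this toy input the first sample has one deleted word out of six and is kept, while the second has three substitutions out of four words (WER 0.75) and is dropped.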
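The dual-granularity gating described for DG-WGPO can be illustrated with a toy reward function in Python. This is only a sketch of the gating idea under stated assumptions: the gate threshold of 0.3, the character-error-rate reward, and the word-overlap F1 used as a crude stand-in for sentence-level semantic similarity are all hypothetical choices for this example, not details taken from the project.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (0 if r == h else 1)))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / max(1, len(ref))


def overlap_f1(ref_words, hyp_words):
    """Order-insensitive word-overlap F1 (stand-in for a semantic score)."""
    common = sum((Counter(ref_words) & Counter(hyp_words)).values())
    if common == 0:
        return 0.0
    precision = common / len(hyp_words)
    recall = common / len(ref_words)
    return 2 * precision * recall / (precision + recall)


def gated_reward(reference, hypothesis, gate=0.3):
    """Pick the reward granularity based on the word error rate."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if error_rate(ref_words, hyp_words) <= gate:
        # Low WER: reward fine-grained, character-level accuracy.
        cer = error_rate("".join(ref_words), "".join(hyp_words))
        return max(0.0, 1.0 - cer)
    # High WER: reward recovering the sentence's content instead.
    return overlap_f1(ref_words, hyp_words)


near_miss = gated_reward("turn on the kitchen light", "turn on the kitchen lights")
garbled = gated_reward("turn on the kitchen light", "kitchen light on")
```

With these inputs the first hypothesis falls under the gate and is scored on characters (one extra character against 21), while the second exceeds it and is scored on word overlap, so a scrambled but content-preserving output still earns partial reward.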
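The routing decision, whether to apply the LoRA weights based on audio quality, might look roughly like the Python sketch below. The report does not describe how the routing module measures quality, so the energy-based SNR estimate, the 15 dB threshold, and every function name here are assumptions for illustration only.

```python
import math


def frame_energies(signal, frame_len=160):
    """Mean energy of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def estimate_snr_db(signal, frame_len=160):
    """Crude SNR estimate: loudest 10% of frames versus quietest 10%."""
    energies = sorted(frame_energies(signal, frame_len))
    if not energies:
        raise ValueError("signal is shorter than one frame")
    k = max(1, len(energies) // 10)
    noise_floor = sum(energies[:k]) / k
    speech_level = sum(energies[-k:]) / k
    eps = 1e-12  # avoids division by zero and log of zero on silent input
    return 10 * math.log10((speech_level + eps) / (noise_floor + eps))


def should_load_adapter(signal, snr_threshold_db=15.0):
    """Route to the noise-robust adapter only when the audio looks degraded."""
    return estimate_snr_db(signal) < snr_threshold_db


# Hypothetical inputs: quiet pauses plus loud speech versus constant noise.
clean_like = [0.001] * 1600 + [0.5] * 1600
noisy_like = [0.3 if i % 2 == 0 else -0.3 for i in range(3200)]
routes = (should_load_adapter(clean_like), should_load_adapter(noisy_like))
```

The design point is that clean audio bypasses the adapter and keeps the base model's accuracy, while degraded audio takes the fine-tuned path.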
Comment
NeonIceMelt
· 4h ago
What does an extremely complex acoustic environment refer to? Subway + bar + construction site?
GateUser-1bc81bb2
· 4h ago
Led by the domestic team, is this wave of domestically developed models going overseas or is it international collaboration?
MistBlueLily
· 5h ago
Seed-ASR is also being dragged out for criticism, ByteDance: ?
ThereIsAChainInTheReflection.
· 5h ago
Robustness in real-world environments is the hard truth; laboratory metrics look good but fall apart when implemented in practice.
MevInRetrospect
· 5h ago
2.4 million samples, 11k hours, data engineering just looks exhausting to work on