ME News, May 22 (UTC+8): According to Beating Monitoring, teams from the National University of Singapore, Nanyang Technological University, and Shanghai Artificial Intelligence Laboratory have jointly open-sourced Mega-ASR, described as the first all-scenario robust speech recognition base model. It targets hallucinations, word omissions, and blank outputs in real-world speech recognition. The model is built on Qwen3-ASR 1.7B and achieves a performance improvement of up to nearly 30% over models such as Whisper, Gemini 3 Pro, and Seed-ASR in extremely complex acoustic environments. The project is now available on GitHub, with all code and model weights released under the Apache-2.0 license.

The research team built a training dataset called Voices-in-the-wild-2M, containing 2.4 million samples totaling 11k hours. Using a physics-based spectral pipeline, the dataset synthesizes seven atomic acoustic effects (reverberation, echo, additive noise, far-field, frequency packet loss, bandwidth limitation, and clipping distortion) and combines them into 54 complex environmental scenarios. To keep training stable, the team filtered out samples with a word error rate above 70% and calibrated the dataset's difficulty distribution through physical plausibility checks.

For training, Mega-ASR introduces A2S-SFT, a progressive acoustic-to-semantic supervised fine-tuning method that aligns audio features in stages to strengthen the model's semantic recovery under heavy interference. For policy optimization, the model uses DG-WGPO, a dual-granularity, word-error-rate-gated strategy for reinforcement learning. When the input audio quality is good and the word error rate is low, the system emphasizes character-level reconstruction of acoustic detail. When the audio is severely distorted and the word error rate is high, it shifts to sentence-level semantic reconstruction, significantly reducing the hallucinations and omissions common in large models.

To offset a possible slight drop in recognition accuracy on clean audio, Mega-ASR adds a dynamic routing mechanism. The routing module automatically evaluates the quality of the current audio and decides whether to load the LoRA fine-tuning weights, so that the model produces its best output in both clean and noisy conditions. (Source: BlockBeats)
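The report gives no implementation details, so the following is only a minimal Python sketch of the two places where word error rate (WER) acts as a gate: dropping training samples above 70% WER, and switching between character-level and sentence-level scoring. Everything beyond the 70% figure is an assumption made for illustration: the function names, the idea that WER is measured against a baseline transcript, the `GATE_WER` threshold of 0.30, and the word-overlap score, which is a crude stand-in for whatever sentence-level semantic measure DG-WGPO actually uses.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


def cer(ref: str, hyp: str) -> float:
    """Character error rate, ignoring spaces."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


def word_overlap(ref: str, hyp: str) -> float:
    """Jaccard overlap of word sets (stand-in for a semantic score)."""
    r, h = set(ref.split()), set(hyp.split())
    if not r and not h:
        return 1.0
    return len(r & h) / len(r | h)


MAX_TRAIN_WER = 0.70  # filtering threshold stated in the report
GATE_WER = 0.30       # hypothetical gate value, not from the report


def keep_for_training(ref: str, baseline_hyp: str) -> bool:
    """Keep a sample only if its baseline WER does not exceed 70%."""
    return wer(ref, baseline_hyp) <= MAX_TRAIN_WER


def gated_reward(ref: str, hyp: str, sample_wer: float) -> float:
    """Character-level score for easy audio, sentence-level otherwise."""
    if sample_wer <= GATE_WER:
        return max(0.0, 1.0 - cer(ref, hyp))
    return word_overlap(ref, hyp)
```

In this sketch, a sample whose baseline transcript gets one word in four right (WER 0.75) would be dropped by `keep_for_training`, while `gated_reward` scores a hypothesis by character accuracy when `sample_wer` is at or below the gate and by word overlap when it is above.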

NeonIceMelt

· 4h ago

What does an extremely complex acoustic environment refer to? Subway + bar + construction site?

GateUser-1bc81bb2

· 4h ago

Led by the domestic team, is this wave of domestically developed models going overseas or is it international collaboration?

MistBlueLily

· 5h ago

Seed-ASR is also being dragged out for criticism, ByteDance: ?

ThereIsAChainInTheReflection.

· 5h ago

Robustness in real-world environments is the hard truth; laboratory metrics look good but fall apart when implemented in practice.

MevInRetrospect

· 5h ago

2.4 million samples, 11k hours, data engineering just looks exhausting to work on

National University of Singapore and Nanyang Technological University have open-sourced Mega-ASR, reducing hallucinations and word omissions in ASR under extreme noise conditions.
