Muon quietly "starves" 25% of neurons: Aurora's repair boosts data efficiency by a hundredfold
According to Beating Monitoring, Tilde Research discovered that Muon, the optimizer used in leading models such as DeepSeek V4, Kimi K2.5, and GLM-5, has a hidden flaw: it causes more than a quarter of the neurons in the MLP layers to die permanently early in training. Based on this finding, the team designed an alternative optimizer called Aurora and open-sourced it. A 1.1B-parameter model trained on only about 100B tokens matched the performance of Qwen3-1.7B, which was trained on 36T tokens, on language-understanding benchmarks such as HellaSwag and Winogrande.
The problem lies in a mathematical property of how Muon handles the MLP weight matrices. Early in training, some neurons happen to receive weaker gradient signals. Traditional optimizers such as AdamW normalize each parameter's update individually, naturally smoothing out these differences; Muon's orthogonalization step, however, passes the weak signals through unchanged. Weak neurons keep receiving weak updates and grow increasingly silent, a "rich get richer" death spiral. By around the 500th training step, over a quarter of the neurons have effectively died, wasting parameter capacity.
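To make the mechanism concrete, here is a small numerical sketch (not Tilde's code) of the effect described above. It uses the 5-step Newton-Schulz orthogonalization commonly found in Muon-style implementations and compares the row norms of the resulting update against a crude AdamW-style per-element normalization; the coefficients, matrix sizes, and the 1e-6 scaling of the "weak" neuron are illustrative assumptions.

import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximate msign(g) = U V^T, as done in Muon-style optimizers.
    a, b, c = 3.4445, -4.7750, 2.0315        # widely used quintic coefficients
    x = g / (g.norm() + 1e-7)                # scale so all singular values are <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

torch.manual_seed(0)
grad = torch.randn(8, 32)                    # 8 "neurons", 32 inputs
grad[0] *= 1e-6                              # neuron 0 happens to get an almost-zero gradient

muon_update = newton_schulz_orthogonalize(grad)
adamw_like_update = grad / (grad.abs() + 1e-8)   # crude stand-in for per-element RMS scaling

print("row norms, Muon-style update :", muon_update.norm(dim=1))
print("row norms, AdamW-style update:", adamw_like_update.norm(dim=1))
# The orthogonalized update leaves row 0 near zero (the neuron keeps "starving"),
# while per-element normalization gives every row a comparable magnitude.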
An earlier improved variant, NorMuon, mitigated this by forcing each row's update magnitude to be uniform, but at the cost of destroying the orthogonality of the update matrix (the property that makes each update as efficient as possible and is Muon's core advantage), which reduced optimization precision. Aurora instead treats "uniform updates" and "orthogonality" as joint constraints and iteratively satisfies both, giving every neuron a fair chance to learn without sacrificing update precision.
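The report does not spell out Aurora's exact procedure, but the description suggests something like alternating projections: repeatedly pull the update toward a semi-orthogonal matrix and toward uniform row norms. The sketch below is only one possible reading of that idea and reuses newton_schulz_orthogonalize and grad from the previous snippet; the function name, iteration count, and projection order are assumptions, not Tilde's released implementation.

def aurora_like_update(grad: torch.Tensor, outer_iters: int = 3) -> torch.Tensor:
    # Alternate between the two constraints named above:
    # (1) orthogonality of the update, (2) uniform per-row update magnitudes.
    x = grad
    for _ in range(outer_iters):
        x = newton_schulz_orthogonalize(x)             # pull toward a semi-orthogonal matrix
        x = x / (x.norm(dim=1, keepdim=True) + 1e-7)   # pull toward uniform row norms
    return x

update = aurora_like_update(grad)
print("row norms, Aurora-like update:", update.norm(dim=1))
# Every neuron now receives a comparably sized update, while the matrix stays
# close to orthogonal after a few alternations.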
In its unparameterized form, Aurora adds only about 6% computational overhead compared to Muon and can be used as a drop-in replacement. In modded-nanoGPT benchmark runs, Aurora set the current record at 3,175 steps. Its advantage also grows with MLP width: the higher the expansion factor, the more significant the improvement.
The code and the 1.1B pre-trained model have been open-sourced.