Muon quietly "starves" 25% of neurons: Aurora's repair boosts data efficiency by a hundredfold

According to Beating Monitoring, Tilde Research has found that Muon, the optimizer used in top models such as DeepSeek V4, Kimi K2.5, and GLM-5, has a hidden flaw: it causes more than a quarter of the neurons in the MLP layers to die permanently early in training. Based on this finding, the team designed an alternative optimizer called Aurora and open-sourced it. A 1.1B-parameter model trained with Aurora on only about 100B tokens nearly matches Qwen3-1.7B, which was trained on 36T tokens, on language-understanding benchmarks such as HellaSwag and Winogrande.

The problem lies in a mathematical property of how Muon handles the MLP weight matrices. Early in training, some neurons happen to receive weaker gradient signals. Traditional optimizers like AdamW normalize each parameter's update by its own gradient history, which naturally smooths out these differences; Muon’s orthogonalization step, however, passes the weak signals through largely unchanged. Weak neurons keep receiving weak updates and grow increasingly silent, a “winner-takes-all” death spiral. By the 500th training step, over a quarter of the neurons have effectively died, wasting parameter capacity.
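A minimal numerical sketch of this effect (illustrative only, not Tilde’s code): orthogonalize a toy MLP-gradient matrix the way Muon approximately does, and compare per-neuron update magnitudes against an Adam-style element-wise normalization. The matrix shape, the 1e-3 scaling, and the use of exact SVD in place of Muon’s Newton-Schulz iterations are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy gradient for an MLP up-projection: 16 neurons (rows) x 8 inputs,
# i.e. an expansion factor of 2. Neuron 3 happens to get a weak gradient.
grad = rng.standard_normal((16, 8))
grad[3] *= 1e-3

# Muon-style update: project the gradient onto the nearest semi-orthogonal
# matrix. Muon approximates this with Newton-Schulz iterations; exact SVD
# is used here only for clarity.
u, _, vt = np.linalg.svd(grad, full_matrices=False)
muon_update = u @ vt

# Adam-style update (second-moment normalization only, no momentum):
# every element is scaled by its own magnitude, so row norms come out
# nearly uniform regardless of how weak the original gradient row was.
adam_update = grad / (np.abs(grad) + 1e-8)

print("per-neuron norms, Muon-style:", np.linalg.norm(muon_update, axis=1).round(3))
print("per-neuron norms, Adam-style:", np.linalg.norm(adam_update, axis=1).round(3))
# Row 3 of the Muon-style update stays tiny: the weak neuron keeps getting
# weak updates, the "winner-takes-all" dynamic described above.
```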

An earlier improved variant, NorMuon, mitigated this by forcing every row’s update magnitude to be uniform, but at the cost of destroying the orthogonality of the update matrix (orthogonality makes each update as efficient as possible and is a core advantage of Muon), which reduced optimization precision. Aurora instead treats “uniform updates” and “orthogonality” as joint constraints and satisfies both iteratively: every neuron gets a fair chance to learn without sacrificing update precision.
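The article does not spell out Aurora’s exact procedure, so the sketch below is only a generic illustration of the idea it describes, not Tilde’s published algorithm: it alternates between projecting the update onto semi-orthogonal matrices and onto matrices with uniform row norms, so that both properties approximately hold at the end. The helper names and the iteration count are arbitrary choices.

```python
import numpy as np

def orthogonalize(m):
    # Nearest semi-orthogonal matrix (exact SVD here; Muon uses a
    # Newton-Schulz approximation of the same projection).
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def equalize_rows(m):
    # Rescale every row (neuron) to the same norm. Used alone, this is the
    # NorMuon-style step that breaks orthogonality.
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m * (norms.mean() / (norms + 1e-12))

def joint_update(grad, n_iter=5):
    # Alternate the two projections so the final update is approximately
    # orthogonal *and* gives every neuron a comparable update magnitude.
    m = grad
    for _ in range(n_iter):
        m = orthogonalize(m)
        m = equalize_rows(m)
    return m

rng = np.random.default_rng(0)
grad = rng.standard_normal((16, 8))
grad[3] *= 1e-3                  # the "starved" neuron from the previous sketch

update = joint_update(grad)
print("per-neuron norms:", np.linalg.norm(update, axis=1).round(3))

gram = update.T @ update
scale = np.trace(gram) / gram.shape[0]
print("relative deviation from orthogonality:",
      round(float(np.linalg.norm(gram - scale * np.eye(gram.shape[0])) / scale), 4))
```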

Aurora introduces no new hyperparameters, adds only about 6% computational overhead over Muon, and can be used as a drop-in replacement. In modded-nanoGPT benchmark runs it set a new record of 3,175 steps. Aurora’s advantage also grows with wider MLPs: the higher the expansion factor, the more significant the improvement.

The code and the 1.1B pre-trained model have been open-sourced.
