Muon quietly "starves" 25% of neurons: Aurora's repair boosts data efficiency by a hundredfold

According to Beating Monitoring, Tilde Research has found that Muon, the optimizer used in top models such as DeepSeek V4, Kimi K2.5, and GLM-5, has a hidden flaw: it causes more than a quarter of the neurons in the MLP layers to die permanently early in training. Based on this finding, the team designed an alternative optimizer called Aurora and open-sourced it. A 1.1B-parameter model trained with Aurora on only about 100B tokens nearly matches Qwen3-1.7B, which was trained on 36T tokens, on language-understanding benchmarks such as HellaSwag and Winogrande.

The problem lies in a mathematical property of how Muon handles the MLP weight matrices. Early in training, some neurons happen to receive weaker gradient signals. Traditional optimizers like AdamW normalize each parameter's update by its own gradient history, which naturally smooths out these differences; Muon’s orthogonalization step, however, passes the weak signals through largely unchanged. Weak neurons keep receiving weak updates and grow increasingly silent, a “winner-takes-all” death spiral. By the 500th training step, over a quarter of the neurons have effectively died, wasting parameter capacity.
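A minimal numerical sketch of this effect (illustrative only, not Tilde’s code): orthogonalize a toy MLP-gradient matrix the way Muon approximately does, and compare per-neuron update magnitudes against an Adam-style element-wise normalization. The matrix shape, the 1e-3 scaling, and the use of exact SVD in place of Muon’s Newton-Schulz iterations are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy gradient for an MLP up-projection: 16 neurons (rows) x 8 inputs,
# i.e. an expansion factor of 2. Neuron 3 happens to get a weak gradient.
grad = rng.standard_normal((16, 8))
grad[3] *= 1e-3

# Muon-style update: project the gradient onto the nearest semi-orthogonal
# matrix. Muon approximates this with Newton-Schulz iterations; exact SVD
# is used here only for clarity.
u, _, vt = np.linalg.svd(grad, full_matrices=False)
muon_update = u @ vt

# Adam-style update (second-moment normalization only, no momentum):
# every element is scaled by its own magnitude, so row norms come out
# nearly uniform regardless of how weak the original gradient row was.
adam_update = grad / (np.abs(grad) + 1e-8)

print("per-neuron norms, Muon-style:", np.linalg.norm(muon_update, axis=1).round(3))
print("per-neuron norms, Adam-style:", np.linalg.norm(adam_update, axis=1).round(3))
# Row 3 of the Muon-style update stays tiny: the weak neuron keeps getting
# weak updates, the "winner-takes-all" dynamic described above.
```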

An earlier improved variant, NorMuon, mitigated this by forcing every row’s update magnitude to be uniform, but at the cost of destroying the orthogonality of the update matrix (orthogonality makes each update as efficient as possible and is a core advantage of Muon), which reduced optimization precision. Aurora instead treats “uniform updates” and “orthogonality” as joint constraints and satisfies both iteratively: every neuron gets a fair chance to learn without sacrificing update precision.
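The article does not spell out Aurora’s exact procedure, so the sketch below is only a generic illustration of the idea it describes, not Tilde’s published algorithm: it alternates between projecting the update onto semi-orthogonal matrices and onto matrices with uniform row norms, so that both properties approximately hold at the end. The helper names and the iteration count are arbitrary choices.

```python
import numpy as np

def orthogonalize(m):
    # Nearest semi-orthogonal matrix (exact SVD here; Muon uses a
    # Newton-Schulz approximation of the same projection).
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def equalize_rows(m):
    # Rescale every row (neuron) to the same norm. Used alone, this is the
    # NorMuon-style step that breaks orthogonality.
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m * (norms.mean() / (norms + 1e-12))

def joint_update(grad, n_iter=5):
    # Alternate the two projections so the final update is approximately
    # orthogonal *and* gives every neuron a comparable update magnitude.
    m = grad
    for _ in range(n_iter):
        m = orthogonalize(m)
        m = equalize_rows(m)
    return m

rng = np.random.default_rng(0)
grad = rng.standard_normal((16, 8))
grad[3] *= 1e-3                  # the "starved" neuron from the previous sketch

update = joint_update(grad)
print("per-neuron norms:", np.linalg.norm(update, axis=1).round(3))

gram = update.T @ update
scale = np.trace(gram) / gram.shape[0]
print("relative deviation from orthogonality:",
      round(float(np.linalg.norm(gram - scale * np.eye(gram.shape[0])) / scale), 4))
```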

Aurora introduces no new hyperparameters, adds only about 6% computational overhead over Muon, and can be used as a drop-in replacement. In modded-nanoGPT benchmark runs it set a new record of 3,175 steps. Aurora’s advantage also grows with wider MLPs: the higher the expansion factor, the more significant the improvement.

The code and the 1.1B pre-trained model have been open-sourced.
