Muon quietly "starves" 25% of neurons: Aurora's repair boosts data efficiency by a hundredfold
According to Beating Monitoring, Tilde Research discovered that Muon, the optimizer used in leading models such as DeepSeek V4, Kimi K2.5, and GLM-5, has a hidden flaw: it causes more than a quarter of the neurons in the MLP layers to die permanently early in training. Based on this finding, the team designed an alternative optimizer called Aurora and open-sourced it. A 1.1B-parameter model trained on only about 100B tokens matched the performance of Qwen3-1.7B, which was trained on 36T tokens, on language-understanding benchmarks such as HellaSwag and Winogrande.
The problem lies in a mathematical property of how Muon handles the MLP weight matrices. Early in training, some neurons happen to receive weaker gradient signals. Traditional optimizers such as AdamW normalize each parameter's update individually, naturally smoothing out these differences; Muon's orthogonalization step, however, passes the weak signals through unchanged. Weak neurons keep receiving weak updates and grow increasingly silent, a "rich get richer" death spiral. By around the 500th training step, over a quarter of the neurons have effectively died, wasting parameter capacity.
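To make the mechanism concrete, here is a small numerical sketch (not Tilde's code) of the effect described above. It uses the 5-step Newton-Schulz orthogonalization commonly found in Muon-style implementations and compares the row norms of the resulting update against a crude AdamW-style per-element normalization; the coefficients, matrix sizes, and the 1e-6 scaling of the "weak" neuron are illustrative assumptions.

import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximate msign(g) = U V^T, as done in Muon-style optimizers.
    a, b, c = 3.4445, -4.7750, 2.0315        # widely used quintic coefficients
    x = g / (g.norm() + 1e-7)                # scale so all singular values are <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

torch.manual_seed(0)
grad = torch.randn(8, 32)                    # 8 "neurons", 32 inputs
grad[0] *= 1e-6                              # neuron 0 happens to get an almost-zero gradient

muon_update = newton_schulz_orthogonalize(grad)
adamw_like_update = grad / (grad.abs() + 1e-8)   # crude stand-in for per-element RMS scaling

print("row norms, Muon-style update :", muon_update.norm(dim=1))
print("row norms, AdamW-style update:", adamw_like_update.norm(dim=1))
# The orthogonalized update leaves row 0 near zero (the neuron keeps "starving"),
# while per-element normalization gives every row a comparable magnitude.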
An earlier improved variant, NorMuon, mitigated this by forcing each row's update magnitude to be uniform, but at the cost of destroying the orthogonality of the update matrix (the property that makes each update as efficient as possible and is Muon's core advantage), which reduced optimization precision. Aurora instead treats "uniform updates" and "orthogonality" as joint constraints and iteratively satisfies both, giving every neuron a fair chance to learn without sacrificing update precision.
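The report does not spell out Aurora's exact procedure, but the description suggests something like alternating projections: repeatedly pull the update toward a semi-orthogonal matrix and toward uniform row norms. The sketch below is only one possible reading of that idea and reuses newton_schulz_orthogonalize and grad from the previous snippet; the function name, iteration count, and projection order are assumptions, not Tilde's released implementation.

def aurora_like_update(grad: torch.Tensor, outer_iters: int = 3) -> torch.Tensor:
    # Alternate between the two constraints named above:
    # (1) orthogonality of the update, (2) uniform per-row update magnitudes.
    x = grad
    for _ in range(outer_iters):
        x = newton_schulz_orthogonalize(x)             # pull toward a semi-orthogonal matrix
        x = x / (x.norm(dim=1, keepdim=True) + 1e-7)   # pull toward uniform row norms
    return x

update = aurora_like_update(grad)
print("row norms, Aurora-like update:", update.norm(dim=1))
# Every neuron now receives a comparably sized update, while the matrix stays
# close to orthogonal after a few alternations.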
In its unparameterized form, Aurora adds only about 6% computational overhead compared to Muon and can be used as a drop-in replacement. In modded-nanoGPT benchmark runs, Aurora set the current record at 3,175 steps. Its advantage also grows with MLP width: the higher the expansion factor, the more significant the improvement.
The code and the 1.1B pre-trained model have been open-sourced.