DeepMind researcher speculates on the DeepSeek V4 delay: doubling training data to 33T tokens caused serious instability

According to Beating Monitoring, the DeepSeek V4 technical report discloses that V4-Flash and V4-Pro were pre-trained on 32T and 33T tokens respectively, roughly double the ~15T tokens used for V3.
The report acknowledges that “significant instability challenges were encountered” during training: loss spikes occurred repeatedly, and the root cause was traced to anomalies in the MoE layers. The routing mechanism itself can further amplify these anomalies, and a simple rollback cannot fully resolve the issue.
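
For context, the “simple rollback” referenced here is the standard countermeasure for loss spikes: restore a recent checkpoint and skip the data batches around the spike. The sketch below illustrates that control loop under stated assumptions; the spike heuristic, the window size, and the `trainer` interface are hypothetical, as the report does not describe the exact procedure.

```python
from collections import deque


def is_loss_spike(loss: float, history: deque, threshold: float = 1.5) -> bool:
    """Flag a spike when the current loss exceeds the recent moving
    average by a multiplicative threshold (hypothetical heuristic)."""
    if len(history) < history.maxlen:
        return False
    return loss > threshold * (sum(history) / len(history))


def train_with_rollback(trainer, num_steps: int, ckpt_every: int = 1000,
                        skip_on_spike: int = 200) -> None:
    """Plain rollback loop: on a spike, restore the last checkpoint and
    skip a window of batches. `trainer` is a hypothetical object exposing
    step(), save_checkpoint(step), restore_last_checkpoint() and skip_batches(n)."""
    history = deque(maxlen=100)
    step = 0
    while step < num_steps:
        loss = trainer.step()
        if is_loss_spike(loss, history):
            # Restore the most recent checkpoint (returns its step index)
            # and jump past the data that appeared to trigger the spike.
            step = trainer.restore_last_checkpoint()
            trainer.skip_batches(skip_on_spike)
            history.clear()
            continue
        history.append(loss)
        step += 1
        if step % ckpt_every == 0:
            trainer.save_checkpoint(step)
```

As the report notes, this alone could not fully resolve V4's instability.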

DeepSeek identified two mitigations and applied them during the actual training run: Anticipatory Routing, which decouples the routing index computation from backbone network updates and is triggered automatically only when a loss spike is detected, at roughly 20% extra overhead; and SwiGLU Clamping, which clamps activation values to a fixed range to directly suppress the anomalies.
The report states that both are effective but admits that “the underlying principles are not yet fully understood.”
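
The report does not publish code for either fix, but SwiGLU Clamping in particular maps onto a small change to the expert MLP. Below is a minimal PyTorch sketch; the module layout, the clamp bound of 50.0, and the placement of the clamp after the gated product are assumptions made for illustration, not DeepSeek's published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClampedSwiGLU(nn.Module):
    """SwiGLU expert MLP with activation clamping (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, clamp_value: float = 50.0):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)
        self.clamp_value = clamp_value  # hypothetical bound; not from the report

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard SwiGLU: SiLU(gate(x)) multiplied elementwise with up(x).
        h = F.silu(self.w_gate(x)) * self.w_up(x)
        # Clamp the hidden activations to a fixed range so that rare
        # extreme values cannot propagate into the loss as a spike.
        h = h.clamp(-self.clamp_value, self.clamp_value)
        return self.w_down(h)
```

A detector like the is_loss_spike heuristic sketched earlier could in principle serve as the trigger that switches Anticipatory Routing on, since the report says it activates only when a loss spike is detected; how the routing indices are then decoupled from the backbone update is not detailed.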

Google DeepMind researcher Susan Zhang (formerly at Meta AI and OpenAI) commented that the instability caused by doubling training data “explains the delay,” describing these two solutions as “band-aids,” while also praising DeepSeek’s technical transparency.
