DeepMind researcher speculates on the reason for DeepSeek V4's delay: doubling the training data to 33T tokens caused severe instability.

ME News, April 24 (UTC+8): According to monitoring by Dongcha Beating, the DeepSeek V4 technical report discloses that V4-Flash and V4-Pro were pre-trained on 32T and 33T tokens respectively, roughly double the approximately 15T tokens used for V3.
The report admits that "significant instability challenges were encountered" during training, with loss spikes recurring throughout. It traces the root cause to outliers in the MoE layers, which the routing mechanism itself exacerbates, so a simple rollback cannot fundamentally solve the problem.
DeepSeek identified two remedies and applied both in actual training. Anticipatory Routing decouples the routing index computation from the backbone network update; it is triggered automatically only when a loss spike is detected and adds about 20% overhead. SwiGLU Clamping clamps activation values to a fixed range to suppress outliers directly.
The report states that both are effective, but acknowledges that "the underlying principles are not yet fully understood."
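As a rough illustration of the clamping idea only, the sketch below bounds the hidden values of a SwiGLU unit to a fixed range. It is not DeepSeek's implementation: the function names, the limit of 10.0, and the choice to clamp the gated product (rather than some other intermediate value) are all assumptions made for this example.

```python
import math


def silu(x: float) -> float:
    # SiLU (swish) activation: x * sigmoid(x), written with tanh for numerical stability.
    return x * 0.5 * (1.0 + math.tanh(0.5 * x))


def clamp(value: float, limit: float) -> float:
    # Restrict value to the symmetric range [-limit, limit].
    return max(-limit, min(limit, value))


def swiglu_hidden(gate_pre, up_pre, limit=None):
    """Element-wise SwiGLU hidden values: silu(gate) * up.

    gate_pre and up_pre stand in for the outputs of the gate and up
    projections. If limit is given, each hidden value is clamped to
    [-limit, limit]. Both the limit and the clamping point are
    hypothetical choices for illustration.
    """
    hidden = []
    for g, u in zip(gate_pre, up_pre):
        h = silu(g) * u
        if limit is not None:
            h = clamp(h, limit)
        hidden.append(h)
    return hidden


# One ordinary value, one outlier, one small negative value.
gate_pre = [0.5, 40.0, -3.0]
up_pre = [1.0, 50.0, 2.0]

raw = swiglu_hidden(gate_pre, up_pre)
bounded = swiglu_hidden(gate_pre, up_pre, limit=10.0)
```

In this toy input the outlier in the second position is cut down to the limit while the in-range values pass through unchanged, which is the sense in which clamping "directly suppresses" outliers.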
Google DeepMind researcher Susan Zhang (formerly at Meta AI and OpenAI) commented that the instability caused by doubling the training data "explains the delay," describing the two solutions as "band-aids," while also affirming DeepSeek's technical transparency.
(Source: BlockBeats)