DeepMind researcher speculates on the DeepSeek V4 delay: doubling training data to 33T tokens caused serious instability

According to Beating Monitoring, the DeepSeek V4 technical report discloses that V4-Flash and V4-Pro were pre-trained on 32T and 33T tokens respectively, roughly double the ~15T tokens used for V3.
The report acknowledges that “significant instability challenges were encountered” during training: loss spikes occurred repeatedly, and the root cause was traced to anomalies in the MoE layers. The routing mechanism itself can further amplify these anomalies, and a simple rollback cannot fully resolve the issue.
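
For context, the “simple rollback” referenced here is the standard countermeasure for loss spikes: restore a recent checkpoint and skip the data batches around the spike. The sketch below illustrates that control loop under stated assumptions; the spike heuristic, the window size, and the `trainer` interface are hypothetical, as the report does not describe the exact procedure.

```python
from collections import deque


def is_loss_spike(loss: float, history: deque, threshold: float = 1.5) -> bool:
    """Flag a spike when the current loss exceeds the recent moving
    average by a multiplicative threshold (hypothetical heuristic)."""
    if len(history) < history.maxlen:
        return False
    return loss > threshold * (sum(history) / len(history))


def train_with_rollback(trainer, num_steps: int, ckpt_every: int = 1000,
                        skip_on_spike: int = 200) -> None:
    """Plain rollback loop: on a spike, restore the last checkpoint and
    skip a window of batches. `trainer` is a hypothetical object exposing
    step(), save_checkpoint(step), restore_last_checkpoint() and skip_batches(n)."""
    history = deque(maxlen=100)
    step = 0
    while step < num_steps:
        loss = trainer.step()
        if is_loss_spike(loss, history):
            # Restore the most recent checkpoint (returns its step index)
            # and jump past the data that appeared to trigger the spike.
            step = trainer.restore_last_checkpoint()
            trainer.skip_batches(skip_on_spike)
            history.clear()
            continue
        history.append(loss)
        step += 1
        if step % ckpt_every == 0:
            trainer.save_checkpoint(step)
```

As the report notes, this alone could not fully resolve V4's instability.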

DeepSeek identified two mitigations and applied them during the actual training run: Anticipatory Routing, which decouples the routing index computation from backbone network updates and is triggered automatically only when a loss spike is detected, at roughly 20% extra overhead; and SwiGLU Clamping, which clamps activation values to a fixed range to directly suppress the anomalies.
The report states that both are effective but admits that “the underlying principles are not yet fully understood.”
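
The report does not publish code for either fix, but SwiGLU Clamping in particular maps onto a small change to the expert MLP. Below is a minimal PyTorch sketch; the module layout, the clamp bound of 50.0, and the placement of the clamp after the gated product are assumptions made for illustration, not DeepSeek's published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClampedSwiGLU(nn.Module):
    """SwiGLU expert MLP with activation clamping (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, clamp_value: float = 50.0):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)
        self.clamp_value = clamp_value  # hypothetical bound; not from the report

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard SwiGLU: SiLU(gate(x)) multiplied elementwise with up(x).
        h = F.silu(self.w_gate(x)) * self.w_up(x)
        # Clamp the hidden activations to a fixed range so that rare
        # extreme values cannot propagate into the loss as a spike.
        h = h.clamp(-self.clamp_value, self.clamp_value)
        return self.w_down(h)
```

A detector like the is_loss_spike heuristic sketched earlier could in principle serve as the trigger that switches Anticipatory Routing on, since the report says it activates only when a loss spike is detected; how the routing indices are then decoupled from the backbone update is not detailed.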

Google DeepMind researcher Susan Zhang (formerly at Meta AI and OpenAI) commented that the instability caused by doubling training data “explains the delay,” describing these two solutions as “band-aids,” while also praising DeepSeek’s technical transparency.
