Yifan Zhang discloses complete DeepSeek V4 technical specifications: 1.6T parameters, 384 experts with 6 active

ME News Report, April 22 (UTC+8): according to Beating Monitoring, Princeton PhD student Yifan Zhang has updated the technical details of DeepSeek V4 on X. He had announced "V4 next week" on April 19 and listed three architectural component names; tonight he provided the complete parameter table and disclosed for the first time a lightweight version, V4-Lite, with 285B parameters.

V4 has 1.6T total parameters. The attention mechanism is DSA2, which combines two sparse attention schemes: DSA (DeepSeek Sparse Attention), previously used in DeepSeek V3.2, and NSA (Native Sparse Attention), proposed in a paper earlier this year. It uses a head dimension of 512 and is paired with Sparse MQA and SWA (Sliding Window Attention). The MoE layer has 384 experts, of which 6 are activated at a time, and uses a Fused MoE Mega-Kernel. Residual connections continue to use Hyper-Connections. Training-side details disclosed for the first time include the Muon optimizer (a matrix-level optimizer that applies Newton-Schulz orthogonalization to momentum updates), a pre-training context length of 32K, and a reinforcement learning phase that uses GRPO with KL divergence correction. The final context length is extended to 1M. The model is text-only.

Zhang is not employed at DeepSeek, and DeepSeek has not officially responded to the above information. (Source: BlockBeats)
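To make the reported MoE figures concrete, the following is a minimal sketch of top-k expert routing with 384 experts and 6 active per token. Only those two numbers come from the report; the softmax gating, the function name `route_token`, and the random router scores are illustrative assumptions, and nothing here is confirmed as DeepSeek's actual router.

```python
import math
import random

NUM_EXPERTS = 384  # expert count from the reported spec
TOP_K = 6          # experts activated per token, from the reported spec


def route_token(logits, k=TOP_K):
    """Pick the k highest-scoring experts and normalize their gate weights.

    Generic softmax top-k gating; an assumption, not a confirmed DeepSeek design.
    """
    # Indices of the k largest router logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for stability).
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
# Hypothetical router scores for a single token.
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = route_token(logits)

active_fraction = TOP_K / NUM_EXPERTS
for expert_id, weight in routing:
    print(f"expert {expert_id}: gate weight {weight:.3f}")
print(f"share of experts active per token: {active_fraction:.4%}")
```

With 6 of 384 experts selected, about 1.56% of the experts run for any given token. This is the share of experts, not of total parameters; the report does not give an active-parameter count.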