Yifan Zhang Reveals Complete Technical Specifications of DeepSeek V4: 1.6T Parameters, 384 Experts with 6 Activated

According to monitoring by Dongcha Beating, Princeton PhD student Yifan Zhang has posted technical details of DeepSeek V4 on X. On April 19 he previewed 'V4 next week' and named three architecture components; tonight he posted a complete parameter table and, for the first time, disclosed a lightweight variant, V4-Lite, with 285B parameters.

Per the table, V4 has 1.6T total parameters. The attention mechanism is DSA2, which combines two sparse-attention schemes: DSA (DeepSeek Sparse Attention), used in V3.2, and NSA (Native Sparse Attention), proposed in a paper earlier this year. The head dimension is 512, paired with sparse MQA and SWA (Sliding Window Attention). Each MoE layer has 384 experts in total, of which 6 are activated per token, executed via a fused MoE mega-kernel. Residual connections follow Hyper-Connections.

For the training phase, Zhang lists: the Muon optimizer (a matrix-level optimizer that applies Newton-Schulz orthogonalization to momentum updates), a pre-training context length of 32K, and a reinforcement-learning phase using GRPO with a KL-divergence correction added. The final context length is extended to 1M. The modality is text only. Illustrative sketches of the SWA masking, expert routing, Muon update, and GRPO objective appear below.

Zhang does not hold a position at DeepSeek, and DeepSeek has not responded to the above information.
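Of the attention components listed, only SWA has a simple generic form. As a point of reference (not DeepSeek's implementation; V4's window size and how SWA composes with DSA2 are undisclosed), a causal sliding-window mask looks like this in PyTorch:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where each query attends only to the most recent
    `window` keys, causally. Generic SWA illustration; window size here
    is arbitrary, as V4's actual value is not disclosed."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions, [seq_len, 1]
    j = torch.arange(seq_len).unsqueeze(0)   # key positions,   [1, seq_len]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).int())  # lower-triangular band of width 3
```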
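The 384-experts / 6-activated figure describes standard top-k MoE routing. A minimal sketch under common conventions (softmax-over-top-k gating is an assumption; Zhang's post gives only the counts, and the fused mega-kernel is not modeled here):

```python
import torch
import torch.nn.functional as F

NUM_EXPERTS = 384   # total routed experts (figure from Zhang's table)
TOP_K = 6           # experts activated per token (figure from Zhang's table)

def route_tokens(hidden: torch.Tensor, router_weight: torch.Tensor):
    """Pick TOP_K of NUM_EXPERTS per token. hidden: [tokens, d_model],
    router_weight: [d_model, NUM_EXPERTS]. Gating details are assumed."""
    logits = hidden @ router_weight                   # [tokens, NUM_EXPERTS]
    top_logits, top_idx = logits.topk(TOP_K, dim=-1)  # 6 experts per token
    gates = F.softmax(top_logits, dim=-1)             # renormalized weights
    return top_idx, gates

tokens = torch.randn(8, 1024)            # toy dims; V4's d_model is unknown
w_router = torch.randn(1024, NUM_EXPERTS)
idx, gates = route_tokens(tokens, w_router)
print(idx.shape, gates.shape)            # [8, 6] and [8, 6]
```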
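Muon itself is public (Jordan et al.), so the "Newton-Schulz orthogonalization of momentum updates" can be shown concretely; whether V4's variant matches this reference recipe is unknown. The quintic coefficients below follow the public reference implementation:

```python
import torch

def newton_schulz_orthogonalize(m: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a momentum matrix to the nearest semi-orthogonal
    matrix via the quintic Newton-Schulz iteration used in the public Muon
    optimizer. Coefficients are from Jordan et al.'s reference code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = m / (m.norm() + 1e-7)        # normalize so the iteration converges
    if x.size(0) > x.size(1):        # iterate on the smaller Gram matrix
        x = x.T
    for _ in range(steps):
        g = x @ x.T
        x = a * x + (b * g + c * g @ g) @ x
    return x.T if m.size(0) > m.size(1) else x

# One Muon-style step for a 2-D weight (lr and momentum values are illustrative):
w = torch.randn(256, 512)
grad = torch.randn_like(w)
momentum = torch.zeros_like(w)
momentum = 0.95 * momentum + grad
w = w - 0.02 * newton_schulz_orthogonalize(momentum)
```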
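GRPO, introduced in DeepSeekMath, replaces a learned critic with group-relative advantages; the "KL divergence correction" is presumably the per-token penalty against a reference policy. A sketch assuming the public paper's clipped objective and k3 KL estimator (the beta and clip values are illustrative, not from Zhang's post):

```python
import torch

def grpo_loss(logp, logp_old, logp_ref, rewards, clip_eps=0.2, beta=0.04):
    """GRPO sketch for one prompt: logp/logp_old/logp_ref are per-response
    log-probs of shape [G], rewards is [G]. The group mean and std act as
    the baseline, so no value network is needed."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_term = torch.min(ratio * adv, clipped * adv)
    # k3 estimator of KL(pi_theta || pi_ref), as in the public GRPO paper
    log_r = logp_ref - logp
    kl = torch.exp(log_r) - log_r - 1
    return -(policy_term - beta * kl).mean()

G = 8                                       # sampled responses per prompt
logp = torch.randn(G, requires_grad=True)   # current policy log-probs
loss = grpo_loss(logp, torch.randn(G), torch.randn(G), torch.rand(G))
loss.backward()
```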
