Yifan Zhang discloses DeepSeek V4's complete technical specifications: 1.6T parameters, 384 experts with 6 activated

ME News report: On April 22 (UTC+8), according to BlockBeats monitoring, Princeton PhD student Yifan Zhang posted technical details of DeepSeek V4 on X. He had teased "V4 next week" on April 19 and listed three architectural component names; tonight he provided the complete parameter table and, for the first time, disclosed the existence of a lightweight version, V4-Lite, with 285B parameters. V4 has a total parameter count of 1.6T. The attention mechanism is DSA2, which combines two sparse attention approaches DeepSeek has used before: DSA (DeepSeek Sparse Attention), introduced in V3.2, and NSA (Native Sparse Attention), proposed in a paper earlier this year. It uses a head dimension of 512 and is paired with Sparse MQA and SWA (Sliding Window Attention). The MoE layer has 384 experts in total, of which 6 are activated per token, and runs on a Fused MoE Mega-Kernel. The residual connections continue to use Hyper-Connections.
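For readers unfamiliar with this routing pattern, below is a minimal sketch of top-k MoE routing at the disclosed shape: 384 experts with 6 activated per token. Only those two numbers come from the post; the layer sizes, gating function, and expert MLPs here are hypothetical toy choices, and the actual DeepSeek V4 router, load-balancing scheme, and Fused MoE Mega-Kernel are not public.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    # Toy dimensions so the sketch runs anywhere; only the expert count
    # (384) and the number activated per token (6) follow the post.
    def __init__(self, d_model=64, d_ff=128, n_experts=384, k=6):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def forward(self, x):                        # x: (n_tokens, d_model)
        logits = self.router(x)                  # (n_tokens, 384)
        gate, idx = logits.topk(self.k, dim=-1)  # choose 6 experts per token
        gate = F.softmax(gate, dim=-1)           # renormalize over the chosen 6
        out = torch.zeros_like(x)
        for slot in range(self.k):               # naive loop; a fused kernel batches this
            e = idx[:, slot]                     # expert id per token
            h = torch.einsum("td,tdf->tf", x, self.w_in[e]).relu()
            out += gate[:, slot:slot + 1] * torch.einsum("tf,tfd->td", h, self.w_out[e])
        return out

moe = TopKMoE()
print(moe(torch.randn(4, 64)).shape)             # torch.Size([4, 64])
```

The per-slot loop makes the sparsity explicit: each token touches only 6 of the 384 expert MLPs, which is why total parameters (1.6T) can dwarf the compute per token. A production fused kernel would gather tokens by expert and dispatch them in one batched call instead.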

Details disclosed for the first time about the training stage include: the optimizer is Muon (a matrix-level optimizer that applies Newton-Schulz orthogonalization to momentum updates), the pre-training context length is 32K, and the reinforcement learning phase uses GRPO with an added KL-divergence correction. The final context length is extended to 1M. The model is text-only. Zhang is not employed by DeepSeek, and DeepSeek officials have not responded to the above information. (Source: BlockBeats)
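For context on the optimizer claim, below is a minimal sketch of a Muon-style step following the publicly known recipe: keep an ordinary momentum buffer per 2-D weight matrix, then orthogonalize that buffer with a few Newton-Schulz iterations before applying it. The quintic coefficients and shape-based rescaling come from the public Muon reference implementation; whether DeepSeek's variant matches it exactly is not disclosed, and `muon_step` with its hyperparameters is illustrative.

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G, i.e. push its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic coefficients from public Muon
    X = G / (G.norm() + eps)                 # normalize so the iteration converges
    transposed = G.shape[0] > G.shape[1]
    if transposed:                           # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a 2-D weight matrix W, applied in place."""
    momentum.mul_(beta).add_(grad)           # ordinary SGD momentum buffer
    update = newton_schulz(momentum)         # orthogonalized search direction
    # shape-dependent rescaling used by the public reference implementation
    update *= max(1.0, W.shape[0] / W.shape[1]) ** 0.5
    W.add_(update, alpha=-lr)

W = torch.randn(256, 128)
buf = torch.zeros_like(W)
muon_step(W, torch.randn_like(W), buf)
print(W.shape)                               # torch.Size([256, 128])
```

The orthogonalization is what makes Muon "matrix-level": rather than scaling each coordinate independently as Adam does, it normalizes the whole update matrix's spectrum, which is cheap here because Newton-Schulz needs only a handful of matrix multiplies.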
