According to Beating Monitoring, Princeton PhD student Yifan Zhang has posted updated technical details of DeepSeek V4 on X.
On April 19 he previewed “V4 next week” and listed the names of three architecture components;
tonight he released the complete parameter table and also disclosed, for the first time, a lightweight version, V4-Lite, with 285B parameters.
V4 has 1.6T total parameters.
The attention mechanism is DSA2, which combines the two sparse attention schemes DeepSeek has used before: DSA (DeepSeek Sparse Attention), introduced in DeepSeek V3.2, and NSA (Native Sparse Attention), proposed earlier this year.
The head dimension is 512, paired with Sparse MQA and SWA (Sliding Window Attention).
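To make the combination concrete, here is a minimal single-head toy in PyTorch that unions a sliding-window mask with a per-query top-k key selection. It only illustrates the general sparse-plus-window idea; the function name, the window and top-k sizes, and the selection rule are assumptions, not DeepSeek's actual DSA2.

```python
import torch

def sparse_window_attention(q, k, v, window=128, topk=64):
    """Toy single-head attention: each query attends to the union of a
    causal sliding window (SWA) and its top-k highest-scoring past keys
    (a stand-in for a DSA/NSA-style selection stage). Illustrative only."""
    T, d = q.shape
    scores = (q @ k.T) / d ** 0.5                       # (T, T) full scores, fine for a toy
    pos = torch.arange(T)
    causal = pos[None, :] <= pos[:, None]               # causal mask
    local = (pos[:, None] - pos[None, :]) < window      # sliding-window band
    masked = scores.masked_fill(~causal, float('-inf'))
    idx = masked.topk(min(topk, T), dim=-1).indices     # top-k keys per query
    sel = torch.zeros(T, T, dtype=torch.bool)
    sel[torch.arange(T)[:, None], idx] = True
    mask = causal & (local | sel)                       # window OR selected, always causal
    attn = masked.masked_fill(~mask, float('-inf')).softmax(dim=-1)
    return attn @ v

# toy usage with the reported head dimension of 512
T, d = 256, 512
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out = sparse_window_attention(q, k, v)                  # (256, 512)
```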
The MoE layers have 384 experts in total, with 6 activated per token, and run on a Fused MoE Mega-Kernel.
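A rough sketch of what "384 experts, 6 activated" means at the routing level, assuming standard token-choice top-k routing; the class, dimensions, and per-expert loop are illustrative, and a real Fused MoE Mega-Kernel would replace the loop with a single fused GPU kernel.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy token-choice MoE: 384 experts, top-6 activated per token."""
    def __init__(self, dim=64, n_experts=384, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, dim)
        gate = self.router(x).softmax(dim=-1)              # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)       # top-6 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                     # naive loop; a fused kernel does this at once
            for e in idx[:, slot].unique():
                rows = idx[:, slot] == e
                out[rows] += weights[rows, slot].unsqueeze(-1) * self.experts[int(e)](x[rows])
        return out

moe = TinyMoE()
y = moe(torch.randn(8, 64))                                # (8, 64)
```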
Residual connections continue to use Hyper-Connections.
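Hyper-Connections generalize the single residual stream into several parallel streams combined by learned weights: each layer reads a learned mix of the streams and writes its output back with learned weights. The sketch below shows only that general idea, assuming a simple static variant; it is not the published formulation, and the stream count is made up.

```python
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    """Rough sketch of a hyper-connection-style residual: the residual is
    widened into n streams, the wrapped layer reads a learned combination of
    them, and its output is written back with learned weights. Illustrative
    only; the published method differs in detail."""
    def __init__(self, layer, n_streams=4):
        super().__init__()
        self.layer = layer
        self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))  # streams -> layer input
        self.write = nn.Parameter(torch.zeros(n_streams))                    # layer output -> streams
        self.mix = nn.Parameter(torch.eye(n_streams))                        # stream <-> stream mixing

    def forward(self, streams):                        # streams: (n_streams, batch, seq, dim)
        x = torch.einsum('n,n...->...', self.read, streams)   # combined layer input
        y = self.layer(x)                                      # e.g. attention or MLP sublayer
        streams = torch.einsum('nm,m...->n...', self.mix, streams)
        return streams + self.write.view(-1, 1, 1, 1) * y

block = HyperConnectionBlock(nn.Linear(32, 32))
streams = torch.randn(4, 2, 16, 32)                    # widened residual: 4 streams
out = block(streams)                                   # (4, 2, 16, 32)
```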
Training details disclosed for the first time include:
the optimizer is Muon (a matrix-level optimizer that applies Newton-Schulz orthogonalization to momentum updates),
the pre-training context length is 32K,
and the reinforcement learning phase uses GRPO with a KL-divergence correction (toy sketches of both the optimizer step and the RL objective follow below).
The final context length is extended to 1M.
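The Muon optimizer is publicly documented, so the following sketch follows its open-source reference: momentum is accumulated as usual, then the matrix-shaped update is approximately orthogonalized with a quintic Newton-Schulz iteration. Whether V4's training run used exactly these coefficients or hyperparameters is not confirmed by the post.

```python
import torch

def newton_schulz_orthogonalize(M, steps=5):
    """Approximately orthogonalize a matrix with the quintic Newton-Schulz
    iteration used by the open-source Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + 1e-7)              # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One sketched Muon-style update: momentum accumulation, then
    orthogonalization of the (Nesterov-style) update matrix."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(beta * momentum + grad)
    param.add_(update, alpha=-lr)

# toy usage on a single weight matrix
W = torch.randn(256, 128)
m = torch.zeros_like(W)
muon_step(W, torch.randn(256, 128), m)
```

GRPO with a KL correction, sketched at a similarly toy level: rewards for a group of sampled responses to the same prompt are standardized within the group to form advantages (no value network), combined with a clipped policy-gradient ratio and a KL penalty against a reference policy. The hyperparameters and the unbiased KL estimator shown here are the commonly used choices, not disclosed V4 settings.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    """Toy GRPO objective for one prompt's group of sampled responses."""
    # group-relative advantage: standardize rewards across the group's samples
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
    # unbiased KL estimator: exp(x) - x - 1 with x = logp_ref - logp_new
    x = logp_ref - logp_new
    kl = (x.exp() - x - 1).mean()
    return policy_loss + kl_coef * kl

# toy usage: 4 sampled responses for one prompt, sequence-level log-probs
loss = grpo_loss(torch.randn(4), torch.randn(4), torch.randn(4),
                 torch.tensor([1.0, 0.0, 0.5, 0.0]))
```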
The model is text-only.
Zhang is not employed by DeepSeek, and DeepSeek has not officially responded to the above information.