Yifan Zhang discloses full DeepSeek V4 technical specs: 1.6T parameters, 384 experts with 6 activated


According to monitoring by Beating, Princeton PhD student Yifan Zhang posted updated technical details of DeepSeek V4 on X.
He previewed "V4 next week" on April 19 and listed the names of three architecture components,
and tonight he released the complete parameter table, also disclosing for the first time a lightweight version, V4-Lite, with 285B parameters.

V4 has 1.6T total parameters.
The attention mechanism is DSA2, which combines two sparse attention schemes: DSA (DeepSeek Sparse Attention), used previously in DeepSeek V3.2, and NSA (Native Sparse Attention), proposed earlier this year.
The head dimension is 512, coupled with sparse MQA and SWA (Sliding Window Attention).
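The exact DSA2 formulation has not been published; as a minimal sketch of just the SWA component named above, here is a causal sliding-window attention mask (the sequence length of 8 and window of 4 are placeholder values, not disclosed figures):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal sliding-window mask: position i may attend to
    positions [i - window + 1, i]. True means attention is allowed."""
    idx = torch.arange(seq_len)
    diff = idx.unsqueeze(1) - idx.unsqueeze(0)  # query index minus key index
    return (diff >= 0) & (diff < window)

# Placeholder example: 8 tokens with a window of 4
mask = sliding_window_mask(8, 4)
```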
Each MoE layer has 384 experts in total, with 6 activated per token, executed via a Fused MoE Mega-Kernel.
Residual connections continue to use Hyper-Connections.
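The routing function and load-balancing scheme were not disclosed; the sketch below assumes a generic top-k softmax router and matches only the stated 384-expert / 6-active configuration (the hidden dimension and token count are placeholders):

```python
import torch
import torch.nn.functional as F

NUM_EXPERTS = 384   # total experts per MoE layer (as disclosed)
TOP_K = 6           # experts activated per token (as disclosed)

def route_tokens(hidden: torch.Tensor, router_weight: torch.Tensor):
    """Generic top-k softmax routing: hidden is (tokens, d_model),
    router_weight is (d_model, NUM_EXPERTS). Returns each token's
    chosen expert indices and their renormalized gate weights."""
    logits = hidden @ router_weight                  # (tokens, NUM_EXPERTS)
    top_vals, top_idx = logits.topk(TOP_K, dim=-1)   # pick 6 of 384
    gates = F.softmax(top_vals, dim=-1)              # renormalize over the chosen 6
    return top_idx, gates

# Placeholder example: 4 tokens with a hidden size of 1024
hidden = torch.randn(4, 1024)
router_weight = torch.randn(1024, NUM_EXPERTS)
idx, gates = route_tokens(hidden, router_weight)
```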

Training details disclosed for the first time include:
the optimizer is Muon (a matrix-level optimizer that applies Newton-Schulz orthogonalization to the momentum update),
the pre-training context length is 32K,
and the reinforcement learning phase uses GRPO with a KL divergence correction (both the optimizer and the RL objective are sketched below).
The final context length is extended to 1M.
The modality is text-only.
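The post does not give Muon's exact implementation; this minimal sketch uses the textbook cubic Newton-Schulz iteration X <- 1.5X - 0.5XX^TX (widely used Muon implementations substitute a tuned quintic polynomial), with placeholder learning-rate and momentum values:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the orthogonal (polar) factor of a 2D matrix G with
    the cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X.
    Dividing by the Frobenius norm first keeps the spectral norm <= 1,
    which the iteration requires to converge."""
    X = G / (G.norm() + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a 2D weight: accumulate momentum,
    orthogonalize it, apply it as the step. lr and beta are
    placeholder values, not disclosed hyperparameters."""
    momentum.mul_(beta).add_(grad)
    param.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)
```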
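What the "KL divergence correction" refers to is likewise unspecified; a common reading, assumed here, is the original GRPO formulation: advantages normalized within each group of sampled completions, plus an unbiased per-token KL estimator against a reference policy as a penalty:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each completion's reward
    by the mean and std of its own group. rewards: (groups, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def kl_penalty(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Unbiased per-token KL estimator, exp(d) - d - 1 with
    d = logp_ref - logp_policy; non-negative by construction."""
    delta = logp_ref - logp_policy
    return delta.exp() - delta - 1.0

# Placeholder example: 2 prompts, 4 sampled completions each
adv = grpo_advantages(torch.randn(2, 4))
```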

Zhang is not employed by DeepSeek, and DeepSeek has not officially responded to this information.
