According to Beating Monitoring, Princeton PhD student Yifan Zhang has posted updated technical details of DeepSeek V4 on X.
On April 19 he previewed “V4 next week” and listed the names of three architecture components;
tonight he released the complete parameter table and also disclosed, for the first time, a lightweight version, V4-Lite, with 285B parameters.
V4 has 1.6T total parameters.
The attention mechanism is DSA2, which combines the two sparse attention schemes DeepSeek has used before: DSA (DeepSeek Sparse Attention), introduced in DeepSeek V3.2, and NSA (Native Sparse Attention), proposed earlier this year.
The head dimension is 512, paired with Sparse MQA and SWA (Sliding Window Attention).
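To make the combination concrete, here is a minimal single-head toy in PyTorch that unions a sliding-window mask with a per-query top-k key selection. It only illustrates the general sparse-plus-window idea; the function name, the window and top-k sizes, and the selection rule are assumptions, not DeepSeek's actual DSA2.

```python
import torch

def sparse_window_attention(q, k, v, window=128, topk=64):
    """Toy single-head attention: each query attends to the union of a
    causal sliding window (SWA) and its top-k highest-scoring past keys
    (a stand-in for a DSA/NSA-style selection stage). Illustrative only."""
    T, d = q.shape
    scores = (q @ k.T) / d ** 0.5                       # (T, T) full scores, fine for a toy
    pos = torch.arange(T)
    causal = pos[None, :] <= pos[:, None]               # causal mask
    local = (pos[:, None] - pos[None, :]) < window      # sliding-window band
    masked = scores.masked_fill(~causal, float('-inf'))
    idx = masked.topk(min(topk, T), dim=-1).indices     # top-k keys per query
    sel = torch.zeros(T, T, dtype=torch.bool)
    sel[torch.arange(T)[:, None], idx] = True
    mask = causal & (local | sel)                       # window OR selected, always causal
    attn = masked.masked_fill(~mask, float('-inf')).softmax(dim=-1)
    return attn @ v

# toy usage with the reported head dimension of 512
T, d = 256, 512
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out = sparse_window_attention(q, k, v)                  # (256, 512)
```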
The MoE layers have 384 experts in total, with 6 activated per token, and run on a Fused MoE Mega-Kernel.
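A rough sketch of what "384 experts, 6 activated" means at the routing level, assuming standard token-choice top-k routing; the class, dimensions, and per-expert loop are illustrative, and a real Fused MoE Mega-Kernel would replace the loop with a single fused GPU kernel.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy token-choice MoE: 384 experts, top-6 activated per token."""
    def __init__(self, dim=64, n_experts=384, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, dim)
        gate = self.router(x).softmax(dim=-1)              # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)       # top-6 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                     # naive loop; a fused kernel does this at once
            for e in idx[:, slot].unique():
                rows = idx[:, slot] == e
                out[rows] += weights[rows, slot].unsqueeze(-1) * self.experts[int(e)](x[rows])
        return out

moe = TinyMoE()
y = moe(torch.randn(8, 64))                                # (8, 64)
```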
Residual connections continue to use Hyper-Connections.
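Hyper-Connections generalize the single residual stream into several parallel streams combined by learned weights: each layer reads a learned mix of the streams and writes its output back with learned weights. The sketch below shows only that general idea, assuming a simple static variant; it is not the published formulation, and the stream count is made up.

```python
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    """Rough sketch of a hyper-connection-style residual: the residual is
    widened into n streams, the wrapped layer reads a learned combination of
    them, and its output is written back with learned weights. Illustrative
    only; the published method differs in detail."""
    def __init__(self, layer, n_streams=4):
        super().__init__()
        self.layer = layer
        self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))  # streams -> layer input
        self.write = nn.Parameter(torch.zeros(n_streams))                    # layer output -> streams
        self.mix = nn.Parameter(torch.eye(n_streams))                        # stream <-> stream mixing

    def forward(self, streams):                        # streams: (n_streams, batch, seq, dim)
        x = torch.einsum('n,n...->...', self.read, streams)   # combined layer input
        y = self.layer(x)                                      # e.g. attention or MLP sublayer
        streams = torch.einsum('nm,m...->n...', self.mix, streams)
        return streams + self.write.view(-1, 1, 1, 1) * y

block = HyperConnectionBlock(nn.Linear(32, 32))
streams = torch.randn(4, 2, 16, 32)                    # widened residual: 4 streams
out = block(streams)                                   # (4, 2, 16, 32)
```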
Training details disclosed for the first time include:
the optimizer is Muon (a matrix-level optimizer that applies Newton-Schulz orthogonalization to momentum updates),
the pre-training context length is 32K,
and the reinforcement learning phase uses GRPO with a KL-divergence correction (toy sketches of both the optimizer step and the RL objective follow below).
The final context length is extended to 1M.
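The Muon optimizer is publicly documented, so the following sketch follows its open-source reference: momentum is accumulated as usual, then the matrix-shaped update is approximately orthogonalized with a quintic Newton-Schulz iteration. Whether V4's training run used exactly these coefficients or hyperparameters is not confirmed by the post.

```python
import torch

def newton_schulz_orthogonalize(M, steps=5):
    """Approximately orthogonalize a matrix with the quintic Newton-Schulz
    iteration used by the open-source Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + 1e-7)              # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One sketched Muon-style update: momentum accumulation, then
    orthogonalization of the (Nesterov-style) update matrix."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(beta * momentum + grad)
    param.add_(update, alpha=-lr)

# toy usage on a single weight matrix
W = torch.randn(256, 128)
m = torch.zeros_like(W)
muon_step(W, torch.randn(256, 128), m)
```

GRPO with a KL correction, sketched at a similarly toy level: rewards for a group of sampled responses to the same prompt are standardized within the group to form advantages (no value network), combined with a clipped policy-gradient ratio and a KL penalty against a reference policy. The hyperparameters and the unbiased KL estimator shown here are the commonly used choices, not disclosed V4 settings.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    """Toy GRPO objective for one prompt's group of sampled responses."""
    # group-relative advantage: standardize rewards across the group's samples
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
    # unbiased KL estimator: exp(x) - x - 1 with x = logp_ref - logp_new
    x = logp_ref - logp_new
    kl = (x.exp() - x - 1).mean()
    return policy_loss + kl_coef * kl

# toy usage: 4 sampled responses for one prompt, sequence-level log-probs
loss = grpo_loss(torch.randn(4), torch.randn(4), torch.randn(4),
                 torch.tensor([1.0, 0.0, 0.5, 0.0]))
```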
The model is text-only.
Zhang is not employed by DeepSeek, and DeepSeek has not officially responded to the above information.