Did Huawei Chips Delay the DeepSeek V4 Launch? Same Kernel Achieves Nearly Double Speed on Both NVIDIA and Ascend

According to monitoring by Dongcha Beating, before DeepSeek V4 was released there was widespread community speculation that the launch had been delayed by difficulties porting the model from NVIDIA hardware to the Huawei Ascend platform. The V4 technical report does not address this rumor directly, but the performance data it discloses undercuts it. The report states that the Fine-Grained Expert Partitioning Scheme (Fine-Grained EP Scheme) has been deployed and validated on both NVIDIA GPUs and Huawei Ascend NPUs, delivering speedups of 1.50x to 1.73x on regular inference workloads and up to 1.96x in latency-sensitive scenarios such as RL rollout and high-speed agent services. The team has also open-sourced MegaMoE, the CUDA version of the kernel, as part of DeepGEMM. In other words, V4 has demonstrated efficiency close to the theoretical limit on both hardware platforms, and cross-platform adaptation has not caused a performance loss.
