Huawei and USTC jointly break NVIDIA's monopoly: Ascend A3 speeds up large-model expert computation by up to 58%

According to Beating Monitoring, as large-scale MoE architectures evolve, training large models on domestically developed Ascend chips has become a key direction for building autonomous and controllable AI computing power. However, most mainstream large-model frameworks are built on NVIDIA's CUDA ecosystem, and porting them directly to the Ascend platform often runs into problems such as uneven hardware queue scheduling and low compute utilization. The University of Science and Technology of China (USTC), Huawei, and Peking University have jointly released HyperParallel-MoE, a compilation and scheduling framework that controls the Ascend A3's hardware queues at tile granularity, aiming to break through the efficiency bottleneck of scheduling heterogeneous compute in parallel.

The Ascend A3 has two core types: the AIC is responsible for matrix multiplication, while the AIV handles vector computation and communication. Under traditional serial operator scheduling, however, the two core types can only take turns: while one works, the other sits idle. Measurements show that when running a 671B DeepSeek-style model on a 256-node cluster, AIC utilization is only 67%, and 39% of expert-routing communication latency is exposed on the critical computation path.
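As a rough illustration of why strict alternation caps utilization, the toy model below assumes each step consists of one matrix phase followed by one vector/communication phase with no overlap. The function name and the 2:1 time ratio are hypothetical choices for illustration only; the article does not break down where the measured time goes.

```python
def matrix_core_utilization(matrix_time: float, vector_time: float) -> float:
    """Fraction of wall-clock time the matrix cores are busy when the
    matrix phase and the vector/communication phase strictly alternate."""
    return matrix_time / (matrix_time + vector_time)


# Hypothetical phase lengths in arbitrary time units (not measured data).
print(f"{matrix_core_utilization(2.0, 1.0):.3f}")  # 0.667
```

In this simplified picture, the only way to raise utilization without shrinking the vector phase is to run the two phases concurrently.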

HyperParallel-MoE makes three key changes. First, it designs an AIV-driven one-way write primitive so that computation is triggered as soon as a data tile arrives, without waiting for the entire batch to be ready. Second, it introduces dependency-aware tile task generation, which gives communication and computation operators a unified abstraction. Third, it uses a static scheduler to pre-generate the task sequence: within a single kernel, it drives both core types in parallel and shares intermediate results through the high-speed L2 cache, reducing the latency of writing back to and reading from slower HBM.
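The benefit of starting computation per tile rather than per batch can be sketched with a small two-stage pipeline model. This is a simplified illustration, not the framework's scheduler: the `Tile` type, the function names, and the timings are all hypothetical, and the model assumes one communication queue and one compute queue that each process tiles in order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    comm: float     # time to deliver the tile (vector/communication side)
    compute: float  # time to process the tile (matrix side)


def serial_makespan(tiles: list[Tile]) -> float:
    """Operator-level schedule: all communication finishes before any compute starts."""
    return sum(t.comm for t in tiles) + sum(t.compute for t in tiles)


def pipelined_makespan(tiles: list[Tile]) -> float:
    """Tile-level schedule: a tile's compute starts as soon as its data
    has arrived and the compute queue is free."""
    comm_done = 0.0
    compute_done = 0.0
    for t in tiles:
        comm_done += t.comm  # tiles arrive one after another
        compute_done = max(comm_done, compute_done) + t.compute
    return compute_done


# Eight identical tiles with hypothetical timings in arbitrary units.
tiles = [Tile(comm=1.0, compute=2.0) for _ in range(8)]
print(serial_makespan(tiles))     # 24.0
print(pipelined_makespan(tiles))  # 17.0
```

In this toy setup only the first tile's communication stays on the critical path; the rest is hidden behind computation on earlier tiles.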

Testing shows that under balanced routing on 64 nodes, the latency of the core module responsible for expert computation (MoE-FFN) is reduced by about 36%, which corresponds to a data processing speed improvement of up to 58% (that is, speedups ranging from 1.49x to 1.58x). End to end on the full system, single-step training speed also improves by 8% to 9%. This indicates that Ascend's practical efficiency depends not only on hardware specifications, but also on whether the compiler and runtime can schedule the AIC and AIV cores efficiently.
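The latency and speedup figures are two views of the same measurement: cutting latency by a fraction r gives a speedup of 1 / (1 - r). The short check below, with hypothetical function names, converts between the two using the reported numbers.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Speedup factor implied by cutting latency by `reduction` (0 <= reduction < 1)."""
    return 1.0 / (1.0 - reduction)


def latency_reduction_from_speedup(speedup: float) -> float:
    """Fractional latency reduction implied by a speedup factor (> 0)."""
    return 1.0 - 1.0 / speedup


print(f"{speedup_from_latency_reduction(0.36):.4f}")  # 1.5625
print(f"{latency_reduction_from_speedup(1.58):.3f}")  # 0.367
print(f"{latency_reduction_from_speedup(1.49):.3f}")  # 0.329
```

So the 1.49x to 1.58x range corresponds to latency reductions of roughly 33% to 37%, consistent with the "about 36%" figure.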

Comment
GateUser-76dcd439
· 1h ago
Domestic chips finally have a dedicated optimization framework for MoE, and the HyperParallel-MoE tile-level scheduling approach is quite detailed.
TreatEarningsAsSnacks
· 5h ago
The CUDA ecosystem moat is too deep; domestic replacements can't just copy it directly and require this kind of fundamental restructuring.
CapitalFlowInATeacup
· 5h ago
Autonomy and controllability are not just slogans; they are built from lines of code like these.
LiquidityLifeguard
· 5h ago
Peking University focuses on systems, the University of Science and Technology of China on architecture, and Huawei implements it. This model of industry-university-research collaboration is the right fit.
BridgeSideEyes
· 5h ago
Low compute utilization has always been a pain point for Ascend. How much does this improve it? Is there any data?
GateUser-de0b9e3b
· 5h ago
Huawei is serious about developing compilers. From MindSpore to this framework, the ecosystem is gradually being built up.
GateUser-26374bb4
· 5h ago
MoE is inherently dependent on scheduling; domestic chips must focus on these details to catch up.