According to Beating Monitoring, as large-scale MoE architectures evolve, training large models on domestically developed Ascend chips has become a key direction for building autonomous and controllable AI computing power. However, most mainstream large-model frameworks are built on NVIDIA’s CUDA ecosystem, and porting them directly to the Ascend platform often runs into problems such as uneven hardware queue scheduling and low compute utilization. The University of Science and Technology of China (USTC), Huawei, and Peking University have jointly launched HyperParallel-MoE, a compilation and scheduling framework that controls the Ascend A3's distinctive hardware queues at tile granularity, aiming to break through the energy-efficiency bottleneck of scheduling heterogeneous compute units in parallel.

The Ascend A3 has two core types: the AIC handles matrix multiplication, while the AIV handles vector computation and communication. Under traditional operator-by-operator serial scheduling, however, the two core types can only take turns, with one sitting idle while the other works. Measurements show that when a 671B-parameter DeepSeek-style model runs on a 256-node cluster, AIC utilization is only 67%, and 39% of expert-routing communication latency is exposed on the critical computation path.
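To see why taking turns wastes cycles, a toy timing model helps. The sketch below compares operator-serial execution with tile-level overlap. The tile count and per-tile durations are invented illustration values, not measurements from the report, and the function names are hypothetical.

```python
def serial_time(n_tiles, comm, compute):
    """Operator-serial: every tile is communicated before any tile is computed."""
    return n_tiles * comm + n_tiles * compute


def overlapped_time(n_tiles, comm, compute):
    """Tile-level overlap: a tile is computed as soon as it has arrived,
    while later tiles are still in flight."""
    comm_done = 0.0
    compute_done = 0.0
    for _ in range(n_tiles):
        comm_done += comm  # the AIV side finishes delivering this tile
        # the AIC side starts once it is free AND the tile has arrived
        compute_done = max(compute_done, comm_done) + compute
    return compute_done


def aic_utilization(n_tiles, compute, total):
    """Fraction of wall time the matrix core spends doing useful work."""
    return n_tiles * compute / total


# Made-up numbers: 8 tiles, 1 time unit to move a tile, 3 to multiply it.
N_TILES, COMM, COMPUTE = 8, 1.0, 3.0
serial = serial_time(N_TILES, COMM, COMPUTE)
overlapped = overlapped_time(N_TILES, COMM, COMPUTE)
print("serial:    ", serial, aic_utilization(N_TILES, COMPUTE, serial))
print("overlapped:", overlapped, aic_utilization(N_TILES, COMPUTE, overlapped))
```

In this toy setting the matrix core only idles while the first tile arrives, which is the effect that tile-level scheduling aims for on real hardware.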

HyperParallel-MoE makes three key changes. First, it designs an AIV-driven one-way write primitive so that computation is triggered as soon as a data tile arrives, rather than waiting for the entire batch to be ready. Second, it introduces dependency-aware tile task generation, which places communication and computation operators under one unified abstraction. Third, a static scheduler pre-generates the task sequence: within a single kernel it drives both core types in parallel and shares intermediate results through the high-speed L2 cache, reducing the latency of writing to and reading back from slower HBM.
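The second and third changes can be pictured with a small sketch. It is not the framework's actual implementation: the `TileTask` record, the per-tile task chain (`recv`, `matmul`, `act`), the costs, and the greedy earliest-start rule are all assumptions chosen for illustration. L2 cache sharing is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class TileTask:
    name: str
    core: str     # "AIC" (matrix) or "AIV" (vector / communication)
    cost: float   # hypothetical duration in arbitrary time units
    deps: list = field(default_factory=list)  # tasks that must finish first


def generate_tasks(n_tiles):
    """Assumed per-tile chain: AIV receives a tile, AIC multiplies it,
    AIV applies a vector op. Communication and compute share one record type."""
    tasks = []
    for i in range(n_tiles):
        tasks.append(TileTask(f"recv{i}", "AIV", 1.0))
        tasks.append(TileTask(f"matmul{i}", "AIC", 2.0, [f"recv{i}"]))
        tasks.append(TileTask(f"act{i}", "AIV", 0.5, [f"matmul{i}"]))
    return tasks


def static_schedule(tasks):
    """Build both core queues ahead of time with a greedy earliest-start rule.
    Returns (queues, finish_times)."""
    finish = {}
    core_free = {"AIC": 0.0, "AIV": 0.0}
    queues = {"AIC": [], "AIV": []}
    pending = list(tasks)

    def start_time(task):
        # a task starts when its core is free and all its inputs are done
        return max([core_free[task.core]] + [finish[d] for d in task.deps])

    while pending:
        ready = [t for t in pending if all(d in finish for d in t.deps)]
        if not ready:
            raise ValueError("dependency cycle or missing task")
        task = min(ready, key=start_time)  # ties keep generation order
        finish[task.name] = start_time(task) + task.cost
        core_free[task.core] = finish[task.name]
        queues[task.core].append(task.name)
        pending.remove(task)
    return queues, finish


queues, finish = static_schedule(generate_tasks(2))
print(queues)
print("makespan:", max(finish.values()))
```

Because the queues are fixed before execution, each core can simply walk its own list at run time. With two tiles the sketch finishes in 5.5 units, against 7.0 if the six tasks ran strictly one after another.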

Tests show that under balanced routing on 64 nodes, the latency of MoE-FFN, the core module responsible for expert computation, drops by about 36%, which corresponds to a throughput improvement of up to 58% (speedups range from 1.49× to 1.58×). In end-to-end runs on the full system, single-step training is also 8% to 9% faster. This suggests that Ascend's practical energy efficiency depends not only on hardware specifications, but also on whether the compiler and runtime can schedule the AIC and AIV cores efficiently.
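The latency and speedup figures above describe the same result in two ways. Assuming throughput is simply the inverse of latency (an assumption made here, not something the report states), a speedup s corresponds to a latency reduction of 1 - 1/s. The helper names below are hypothetical.

```python
def latency_reduction(speedup):
    """Fraction of latency removed by a given speedup factor."""
    return 1.0 - 1.0 / speedup


def speedup_from_reduction(reduction):
    """Speedup factor implied by removing a given fraction of latency."""
    return 1.0 / (1.0 - reduction)


for s in (1.49, 1.58):
    print(f"{s:.2f}x speedup -> {latency_reduction(s):.1%} lower latency")
```

Under that assumption, the 1.58× upper bound works out to a latency reduction of just under 37%, consistent with the reported "about 36%", while the 1.49× lower bound works out to about 33%.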




GateUser-76dcd439

· 1h ago

Domestic chips finally have a dedicated optimization framework for MoE, and the HyperParallel-MoE tile-level scheduling approach is quite detailed.


TreatEarningsAsSnacks

· 5h ago

The CUDA ecosystem moat is too deep; domestic replacements can't just copy it directly and require this kind of fundamental restructuring.


CapitalFlowInATeacup

· 5h ago

Self-control is not just a slogan; it is built from these lines of code.


LiquidityLifeguard

· 5h ago

Peking University focuses on systems, University of Science and Technology of China on architecture, and Huawei implements it. This model of industry-university-research collaboration is the right fit.


BridgeSideEyes

· 5h ago

The low utilization rate of computing power has always been a pain point for Ascend. How much can this be improved this time? Are there any data?


GateUser-de0b9e3b

· 5h ago

Huawei is serious about developing compilers. From MindSpore to this framework, the ecosystem is gradually being built up.


GateUser-26374bb4

· 5h ago

MoE is inherently dependent on scheduling; domestic chips must focus on these details to catch up.



Huawei and USTC jointly break NVIDIA's monopoly, Ascend A3 accelerates large model expert computation speed by 58%
