Huawei and USTC jointly break NVIDIA's monopoly: Ascend A3 speeds up large-model expert computation by 58%
According to Beating Monitoring, as large-scale MoE architectures evolve, training large models on domestically developed Ascend chips has become a key direction for building autonomous and controllable AI computing power. However, most mainstream large-model frameworks are built on NVIDIA's CUDA ecosystem, and porting them directly to the Ascend platform often runs into problems such as unbalanced hardware queue scheduling and low compute utilization. The University of Science and Technology of China (USTC), Huawei, and Peking University have jointly launched HyperParallel-MoE, a compilation and scheduling framework that controls the Ascend A3's distinctive hardware queues at tile granularity, aiming to break through the energy-efficiency bottleneck of heterogeneous compute in parallel scheduling.
The Ascend A3 has two core types: the AIC handles matrix multiplication, while the AIV handles vector computation and communication. Under traditional serial operator scheduling, however, the two core types can only take turns, so one sits idle while the other works. Measurements show that when running a 671B DeepSeek-style large model on a 256-node cluster, AIC utilization is only 67%, and 39% of the expert routing communication latency is exposed on the critical computation path.
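To make the idle-time effect concrete, here is a toy model of strictly alternating phases. It is not taken from the HyperParallel-MoE work: the function name and the per-step timings are hypothetical, chosen only so the result lands near the reported 67% figure.

```python
def serial_aic_utilization(matmul_time: float, vector_comm_time: float) -> float:
    """Fraction of a step the matrix core (AIC) is busy when the AIC phase
    and the AIV phase run strictly one after the other, with no overlap."""
    step_time = matmul_time + vector_comm_time
    return matmul_time / step_time


# Hypothetical timings (arbitrary units): 2 units of matrix work,
# followed by 1 unit of vector/communication work during which the AIC idles.
print(f"{serial_aic_utilization(2.0, 1.0):.0%}")  # prints "67%"
```

In this simplified picture, the only way to raise utilization without shrinking the vector and communication work is to overlap it with the matrix work.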
HyperParallel-MoE makes three key changes. First, it designs an AIV-driven one-way write primitive, so that computation is triggered as soon as a data tile arrives instead of waiting for the entire batch to be ready. Second, it introduces dependency-aware tile task generation, which puts communication and computation operators under a single abstraction. Third, it uses a static scheduler to pre-generate the task sequence: within a single kernel, it drives both core types in parallel and shares intermediate results through the high-speed L2 cache, reducing the latency of writing back to and reading from slow HBM memory.
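The benefit of triggering computation per tile rather than per batch can be sketched with a small simulation. This is only an illustrative model under simplifying assumptions (uniform tile sizes, one communication stream, one compute stream); the function names and timings are hypothetical and do not reflect the framework's actual scheduler.

```python
def batch_level_makespan(n_tiles: int, comm: float, compute: float) -> float:
    """Total time when every tile must arrive before any computation starts."""
    return n_tiles * comm + n_tiles * compute


def tile_level_makespan(n_tiles: int, comm: float, compute: float) -> float:
    """Total time when each tile is computed as soon as it has arrived
    and the compute unit is free."""
    arrival = 0.0       # time at which the current tile finishes arriving
    compute_free = 0.0  # time at which the compute unit becomes idle
    for _ in range(n_tiles):
        arrival += comm                      # tiles arrive back to back
        start = max(arrival, compute_free)   # wait for data and for the unit
        compute_free = start + compute
    return compute_free


# Hypothetical workload: 8 tiles, 1 time unit to transfer, 2 to compute.
print(batch_level_makespan(8, 1.0, 2.0))  # 24.0
print(tile_level_makespan(8, 1.0, 2.0))   # 17.0
```

In this toy case only the first tile's transfer stays on the critical path; the remaining transfers are hidden behind computation.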
Tests show that with balanced routing on 64 nodes, the latency of MoE-FFN, the core module responsible for expert computation, drops by about 36%. This corresponds to speedups of 1.49x to 1.58x, that is, a data processing speed improvement of up to 58%. In end-to-end operation on the full machine, single-step training is also 8% to 9% faster. This indicates that the practical energy efficiency of Ascend depends not only on hardware specifications, but also on whether the compiler and runtime can schedule the AIC and AIV cores efficiently.
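The link between a latency reduction and a speedup figure is simple arithmetic: if latency falls by a fraction r, the same work finishes 1 / (1 - r) times faster. The short check below uses helper names introduced here purely for illustration.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Speedup factor implied by cutting latency by `reduction` (0 to 1)."""
    return 1.0 / (1.0 - reduction)


def latency_reduction_from_speedup(speedup: float) -> float:
    """Fractional latency reduction implied by a given speedup factor."""
    return 1.0 - 1.0 / speedup


print(round(speedup_from_latency_reduction(0.36), 4))   # 1.5625
print(round(latency_reduction_from_speedup(1.58), 3))   # 0.367
print(round(latency_reduction_from_speedup(1.49), 3))   # 0.329
```

A latency cut of about 36% therefore sits inside the 1.49x to 1.58x range, and the 58% figure corresponds to the upper end of that range.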