Cursor Unveils MoE Inference Optimization Technology Warp Decode, Achieving 1.84x Throughput Improvement on Blackwell GPU
According to monitoring by 1M AI News, the AI programming tool Cursor has published a technical blog introducing its in-house MoE (Mixture of Experts) inference acceleration method, Warp Decode. The method targets small-batch token generation on NVIDIA's Blackwell GPUs and inverts the traditional expert-centric parallel strategy into an output-centric one: each warp (the GPU's smallest scheduling unit, composed of 32 parallel lanes) is responsible for computing a single output value, independently traversing all routed experts and accumulating the result in registers, with no cross-warp synchronization and no intermediate buffers.

The traditional MoE inference pipeline consists of 8 stages, 5 of which exist solely to move data into expert-centric layouts without performing any actual computation. Warp Decode compresses the entire MoE computation layer into 2 CUDA kernels, eliminating intermediate steps such as padding, scattering, and merging, and removing over 32 KB of intermediate-buffer reads and writes per token.

Tested on NVIDIA's B200 GPU with a Qwen-3-style model, Warp Decode achieved a 1.84x end-to-end decoding throughput improvement. Because it computes entirely in BF16/FP32 precision, it avoids intermediate quantization loss, yielding output accuracy 1.4 times closer to the FP32 baseline than the traditional path. In terms of hardware bandwidth utilization, at a batch size of 32 it sustained 3.95 TB/s of throughput, approximately 58% of the B200's peak bandwidth (6.8 TB/s). The optimization directly accelerates the development iteration and release cadence of Cursor's in-house programming model, Composer.
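The output-centric idea described above can be illustrated with a small NumPy toy model. This is only a sketch of the general technique, not Cursor's actual CUDA implementation: all shapes, weight names, and the simplified single-projection expert are hypothetical, and the inner Python loop stands in for what a GPU warp would do in registers. It contrasts the expert-centric path (materialize each expert's output buffer, then merge) with the output-centric path (each "warp" owns one output coordinate and walks all routed experts, accumulating locally):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 64, 128, 8

# Hypothetical expert weights: each expert's down-projection maps a
# d_ff activation back to d_model (other MoE stages omitted for brevity).
W_down = rng.standard_normal((n_experts, d_ff, d_model)).astype(np.float32)
acts = rng.standard_normal((n_experts, d_ff)).astype(np.float32)

routed = [1, 5]                                  # top-k routed experts for this token
gates = np.array([0.7, 0.3], dtype=np.float32)   # router weights

# Expert-centric reference: each expert writes a full output buffer,
# which is then merged -- the intermediate traffic Warp Decode removes.
buffers = np.stack([g * (acts[e] @ W_down[e]) for e, g in zip(routed, gates)])
reference = buffers.sum(axis=0)

# Output-centric sketch: one "warp" per output value j, traversing all
# routed experts and accumulating in a local register, with no
# per-expert intermediate buffer ever materialized.
out = np.empty(d_model, dtype=np.float32)
for j in range(d_model):
    acc = np.float32(0.0)                        # register accumulator
    for e, g in zip(routed, gates):
        acc += g * np.float32(acts[e] @ W_down[e, :, j])
    out[j] = acc

assert np.allclose(out, reference, atol=1e-4)
```

Both paths compute the same weighted sum over routed experts; the difference is purely in data movement, which is why the real kernel's win shows up as bandwidth savings rather than fewer FLOPs.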