Cursor Unveils MoE Inference Optimization Technology Warp Decode, Achieving 1.84x Throughput Improvement on Blackwell GPU
According to monitoring by 1M AI News, the AI programming tool Cursor has published a technical blog introducing its in-house MoE (Mixture of Experts) inference acceleration method, Warp Decode. The method targets small-batch token generation on NVIDIA's Blackwell GPUs and inverts the traditional expert-centric parallel strategy into an output-centric one: each warp (the GPU's smallest scheduling unit, composed of 32 parallel lanes) is responsible for computing a single output value, independently traversing all routed experts and accumulating the result in registers, with no cross-warp synchronization and no intermediate buffers.

The traditional MoE inference pipeline consists of 8 stages, 5 of which exist solely to move data into expert-centric layouts without performing any actual computation. Warp Decode compresses the entire MoE computation layer into 2 CUDA kernels, eliminating intermediate steps such as padding, scattering, and merging, and removing over 32 KB of intermediate-buffer reads and writes per token.

Tested on NVIDIA's B200 GPU with a Qwen-3-style model, Warp Decode achieved a 1.84x end-to-end decoding throughput improvement. Because it computes entirely in BF16/FP32 precision, it avoids intermediate quantization loss, yielding output accuracy 1.4 times closer to the FP32 baseline than the traditional path. In terms of hardware bandwidth utilization, at a batch size of 32 it sustained 3.95 TB/s of throughput, approximately 58% of the B200's peak bandwidth (6.8 TB/s). The optimization directly accelerates the development iteration and release cadence of Cursor's in-house programming model, Composer.
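The output-centric idea described above can be illustrated with a small NumPy toy model. This is only a sketch of the general technique, not Cursor's actual CUDA implementation: all shapes, weight names, and the simplified single-projection expert are hypothetical, and the inner Python loop stands in for what a GPU warp would do in registers. It contrasts the expert-centric path (materialize each expert's output buffer, then merge) with the output-centric path (each "warp" owns one output coordinate and walks all routed experts, accumulating locally):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 64, 128, 8

# Hypothetical expert weights: each expert's down-projection maps a
# d_ff activation back to d_model (other MoE stages omitted for brevity).
W_down = rng.standard_normal((n_experts, d_ff, d_model)).astype(np.float32)
acts = rng.standard_normal((n_experts, d_ff)).astype(np.float32)

routed = [1, 5]                                  # top-k routed experts for this token
gates = np.array([0.7, 0.3], dtype=np.float32)   # router weights

# Expert-centric reference: each expert writes a full output buffer,
# which is then merged -- the intermediate traffic Warp Decode removes.
buffers = np.stack([g * (acts[e] @ W_down[e]) for e, g in zip(routed, gates)])
reference = buffers.sum(axis=0)

# Output-centric sketch: one "warp" per output value j, traversing all
# routed experts and accumulating in a local register, with no
# per-expert intermediate buffer ever materialized.
out = np.empty(d_model, dtype=np.float32)
for j in range(d_model):
    acc = np.float32(0.0)                        # register accumulator
    for e, g in zip(routed, gates):
        acc += g * np.float32(acts[e] @ W_down[e, :, j])
    out[j] = acc

assert np.allclose(out, reference, atol=1e-4)
```

Both paths compute the same weighted sum over routed experts; the difference is purely in data movement, which is why the real kernel's win shows up as bandwidth savings rather than fewer FLOPs.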