According to CoinWorld News, the Qwen team has open-sourced FlashQLA, a high-performance operator library for GDN (Gated Delta Network), the linear attention layer used throughout the Qwen3-next / 3.5 / 3.6 series.


In tests on an H200 GPU, the forward pass runs at 2x to 3x the speed of the FLA (flash-linear-attention) Triton kernel, and the backward pass at 2x.
In the TP8 (8-way tensor parallelism) scenario, the forward speedup reaches up to 5.33x.
The speedup comes mainly from exploiting the exponential decay of GDN gating values to achieve automatic context parallelism (autocp), which skips the correction-matrix computation that traditional methods require.
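
To illustrate why exponential decay makes this kind of parallelism possible, here is a minimal toy sketch. It is not FlashQLA's kernel or algorithm: it replaces the GDN state update with an assumed scalar recurrence s_t = g_t * s_{t-1} + x_t, and the function names, gate range, and warm-up length are all hypothetical. Because each gate is below 1, the influence of older state shrinks geometrically, so each chunk can start from a zero state a few hundred tokens early and still match the sequential result closely, with no cross-chunk correction step.

```python
import math
import random


def gated_scan(gates, inputs, init=0.0):
    """Sequential scalar recurrence s_t = g_t * s_{t-1} + x_t (toy stand-in for GDN)."""
    state = init
    out = []
    for g, x in zip(gates, inputs):
        state = g * state + x
        out.append(state)
    return out


def chunked_scan_with_warmup(gates, inputs, num_chunks, warmup):
    """Process each chunk independently of the others.

    Each chunk starts from a zero state `warmup` tokens before its first
    position. No state is passed between chunks and nothing is corrected
    afterwards; the decay of the gates makes the missing history negligible.
    """
    n = len(inputs)
    size = math.ceil(n / num_chunks)
    out = []
    for start in range(0, n, size):
        end = min(start + size, n)
        lo = max(0, start - warmup)          # hypothetical warm-up window
        local = gated_scan(gates[lo:end], inputs[lo:end])
        out.extend(local[start - lo:])       # keep only this chunk's outputs
    return out


random.seed(0)
n = 4096
gates = [random.uniform(0.80, 0.95) for _ in range(n)]   # assumed decay range
inputs = [random.uniform(-1.0, 1.0) for _ in range(n)]

exact = gated_scan(gates, inputs)
approx = chunked_scan_with_warmup(gates, inputs, num_chunks=8, warmup=256)
max_err = max(abs(a - b) for a, b in zip(exact, approx))
```

In this toy setting the dropped history is bounded by the boundary state times the product of at least 257 gates, each at most 0.95, which is why `max_err` stays tiny. How FlashQLA handles the full matrix-valued GDN state is not described in the report.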
The library decides automatically whether to enable CP based on batch size, number of heads, and sequence length, so no manual configuration is needed.
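
One way such a decision could look is sketched below. This is a hypothetical heuristic, not FlashQLA's actual rule: the function name `choose_cp_degree`, the `num_sms` default of 132 (an assumed streaming-multiprocessor count for a Hopper-class GPU), and the `min_chunk_len` threshold are all illustrative assumptions. The idea is that batch size times head count gives the number of independent work items, and splitting the sequence only pays off when those items alone cannot fill the GPU and the sequence is long enough to split.

```python
def choose_cp_degree(batch_size, num_heads, seq_len, num_sms=132, min_chunk_len=4096):
    """Return how many sequence chunks to process in parallel (1 means CP is off).

    All thresholds are hypothetical and only illustrate the shape of the decision.
    """
    if batch_size <= 0 or num_heads <= 0 or seq_len <= 0:
        raise ValueError("batch_size, num_heads and seq_len must be positive")

    independent_work = batch_size * num_heads
    if independent_work >= num_sms:
        return 1  # batch and heads already saturate the GPU

    # Chunks needed so that work items roughly cover the SMs (ceiling division).
    by_occupancy = -(-num_sms // independent_work)
    # Do not split into chunks shorter than min_chunk_len tokens.
    by_length = seq_len // min_chunk_len
    return max(1, min(by_occupancy, by_length))
```

Under these assumptions, a large training batch such as `choose_cp_degree(64, 32, 8192)` leaves CP off, while a single long sequence with few heads per GPU, as in a TP8 setup, gets split, for example `choose_cp_degree(1, 4, 65536)`.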