ME News reports that FlashKDA is an open-source, MIT-licensed inference acceleration tool for NVIDIA Hopper GPUs, targeting the KDA attention mechanism from Kimi Linear. Rewritten with CUTLASS, its forward pass is approximately 1.7–2.2 times faster than the Triton version in official tests on the H20, with the largest gains on variable-length inputs and batched workloads. It supports forward inference only; training still uses Triton. Requirements are a Hopper or later GPU, CUDA 12.9 or above, and PyTorch 2.4 or above. It has been merged upstream into fla (PR #852), and switching requires changing only one line of configuration.

MeNews

2026-04-22 02:01:40


ME News, April 22 (UTC+8). According to monitoring by Dongcha Beating, Moonshot AI (whose Chinese name translates literally as "Dark Side of the Moon") has open-sourced FlashKDA on GitHub under the MIT license. FlashKDA is a set of tools designed to accelerate model inference on NVIDIA Hopper-series GPUs (H100, H20, etc.). It targets KDA, the attention mechanism Moonshot AI proposed last year in the Kimi Linear paper. When a large model reads long text, the computational cost of traditional attention grows quadratically with sequence length; linear attention reduces this to linear growth. KDA is a refinement of the linear attention approach. The Kimi Linear architecture alternates 3 layers of KDA with 1 layer of traditional attention.
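
To make the scaling argument and the 3:1 layer ratio concrete, here is a minimal Python sketch. It counts only the dominant term and ignores constants, head counts, and hidden sizes. The function names, the layer labels "KDA" and "full", and the choice to place the traditional-attention layer last in each group of four are illustrative assumptions, not taken from the Kimi Linear code.

```python
def full_attention_cost(seq_len: int) -> int:
    """Token-pair interactions in traditional attention: grows as n^2."""
    return seq_len * seq_len


def linear_attention_cost(seq_len: int) -> int:
    """Per-token state updates in linear attention: grows as n."""
    return seq_len


def layer_pattern(num_layers: int, kda_per_full: int = 3) -> list[str]:
    """Repeat `kda_per_full` KDA layers followed by one full-attention layer.

    Assumption: the full-attention layer closes each group; the article
    only states the 3:1 ratio, not the exact ordering.
    """
    period = kda_per_full + 1
    return ["full" if (i + 1) % period == 0 else "KDA" for i in range(num_layers)]


if __name__ == "__main__":
    # Doubling the sequence length quadruples the quadratic cost
    # but only doubles the linear one.
    for n in (1_000, 2_000, 4_000):
        print(n, full_attention_cost(n), linear_attention_cost(n))
    print(layer_pattern(8))
```

Under these assumptions, three out of every four layers scale linearly with sequence length, which is where the long-context savings would come from.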

The open-source library flash-linear-attention (abbreviated as fla) already included a KDA implementation written in Triton. FlashKDA is a rewrite built on CUTLASS, NVIDIA's low-level GPU library, aimed at extracting maximum performance from Hopper GPUs. In official tests on the H20, FlashKDA runs the same forward computation 1.7 to 2.2 times faster than the Triton version. The speedup is most pronounced with variable-length inputs and batched workloads. However, the official comparison benchmarks only against the Triton version of KDA and does not cover other linear attention approaches.

This release covers only the forward computation, so it can be used to run a model (inference) but not to train one; training still requires the original Triton version. Requirements: Hopper or later GPUs (SM90 architecture and up), CUDA 12.9 or above, and PyTorch 2.4 or above. FlashKDA has also been merged upstream into fla as a new backend (PR #852). Existing users can switch by changing one line of configuration.
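
The stated requirements can be expressed as a simple pre-flight check. The sketch below is a hypothetical helper, not part of FlashKDA or fla; it only mirrors the three thresholds quoted in the article (compute capability 9.0, CUDA 12.9, PyTorch 2.4) and takes plain values so it can run without a GPU.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '2.4.1+cu129'."""
    major, minor = text.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def meets_stated_requirements(
    compute_capability: Tuple[int, int],
    cuda_version: Optional[str],
    torch_version: str,
) -> bool:
    """Check the thresholds quoted in the article (hypothetical helper).

    compute_capability: (major, minor); SM90 corresponds to (9, 0).
    cuda_version: CUDA version string, or None for a CPU-only build.
    torch_version: PyTorch version string.
    """
    if cuda_version is None:
        return False
    return (
        tuple(compute_capability) >= (9, 0)
        and parse_version(cuda_version) >= (12, 9)
        and parse_version(torch_version) >= (2, 4)
    )


if __name__ == "__main__":
    print(meets_stated_requirements((9, 0), "12.9", "2.4.0"))  # Hopper, minimum versions
    print(meets_stated_requirements((8, 0), "12.9", "2.4.0"))  # pre-Hopper GPU
```

On a machine with PyTorch installed, the inputs could be taken from `torch.cuda.get_device_capability()`, `torch.version.cuda`, and `torch.__version__`. Passing this check reflects only the article's summary; the project's own documentation remains the authority on supported configurations.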

(Source: BlockBeats)
