Moonshot AI (The Dark Side of the Moon) open-sources FlashKDA, speeding up Kimi Linear inference by 1.7 to 2.2 times
ME News, April 22 (UTC+8): According to monitoring by Dongcha Beating, Moonshot AI (whose Chinese name translates as "The Dark Side of the Moon") has open-sourced FlashKDA on GitHub under the MIT license. FlashKDA is a toolkit built specifically to accelerate model inference on NVIDIA Hopper-series GPUs such as the H100 and H20. It targets KDA (Kimi Delta Attention), the attention mechanism Moonshot AI proposed last year in its Kimi Linear paper. When a large model reads long text, the computational cost of standard attention grows quadratically with sequence length; linear attention reduces this to linear growth, and KDA is a refinement along this line. The Kimi Linear architecture interleaves the two kinds of layers, with three KDA layers for every one layer of standard attention.
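The quadratic-versus-linear contrast can be made concrete with plain (unnormalized, ungated) causal linear attention. This is a minimal sketch, not KDA itself: KDA adds gating and a delta-rule state update that the code below deliberately omits. Once softmax is dropped, the same outputs can be computed two ways: a parallel form that builds a T × T score matrix, and a recurrent form that carries only a fixed-size d_k × d_v state from token to token. The function names are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

# Illustrative sketch: plain causal linear attention, NOT KDA
# (no gating, no delta-rule update). Function names are hypothetical.


def causal_linear_attention_quadratic(q, k, v):
    """Parallel form: materializes a T x T score matrix, O(T^2 * d) work."""
    T = q.shape[0]
    scores = q @ k.T                    # (T, T): every query against every key
    mask = np.tril(np.ones((T, T)))     # causal mask: token t sees keys 0..t
    return (scores * mask) @ v          # (T, d_v)


def causal_linear_attention_recurrent(q, k, v):
    """Recurrent form: fixed-size d_k x d_v state, O(T * d_k * d_v) work."""
    T, d_k = q.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))        # running sum of outer products k_s v_s^T
    out = np.empty((T, d_v))
    for t in range(T):
        state += np.outer(k[t], v[t])   # absorb token t into the state
        out[t] = q[t] @ state           # read out with query t
    return out


rng = np.random.default_rng(0)
T, d_k, d_v = 16, 4, 3
q = rng.standard_normal((T, d_k))
k = rng.standard_normal((T, d_k))
v = rng.standard_normal((T, d_v))

# Both forms produce the same outputs; only their cost scaling differs.
assert np.allclose(
    causal_linear_attention_quadratic(q, k, v),
    causal_linear_attention_recurrent(q, k, v),
)
```

Exact softmax attention has no equivalent fixed-size recurrence, so each new token must still be compared against all previous keys; that is where the quadratic cost comes from.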
A Triton implementation of KDA already existed in the open-source library flash-linear-attention (fla). FlashKDA rewrites it with CUTLASS, NVIDIA's low-level GPU library, specifically to extract maximum performance from Hopper GPUs. In official tests on the H20, FlashKDA ran the same forward computation 1.7 to 2.2 times faster than the Triton version, with the largest gains when inputs of varying lengths are batched together. However, the official comparison is only against the team's own Triton version; it does not include other linear attention implementations.
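The benchmark setup is not described in detail, but "inputs of varying lengths batched together" usually means a packed layout: sequences are concatenated into one buffer and delimited by cumulative offsets (often named `cu_seqlens`, as in FlashAttention's variable-length interface) instead of being padded to the longest sequence. The sketch below only illustrates that layout and the padding it avoids; the helper names are hypothetical, and whether FlashKDA uses exactly this convention is an assumption.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple

# Hypothetical helpers illustrating a packed variable-length batch;
# this is not FlashKDA's API.


def pack_varlen(seqs: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate sequences and return (packed tokens, cumulative offsets).

    Sequence i occupies packed[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    packed = [tok for seq in seqs for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in seqs))
    return packed, cu_seqlens


def unpack_varlen(packed: List[int], cu_seqlens: List[int]) -> List[List[int]]:
    """Split the packed buffer back into its original sequences."""
    return [packed[a:b] for a, b in zip(cu_seqlens[:-1], cu_seqlens[1:])]


def padding_overhead(lengths: Sequence[int]) -> int:
    """Slots wasted if the batch were instead padded to the longest sequence."""
    return len(lengths) * max(lengths) - sum(lengths)


batch = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10], [11, 12]]
packed, cu_seqlens = pack_varlen(batch)
print(cu_seqlens)                                  # [0, 3, 10, 12]
print(padding_overhead([len(s) for s in batch]))   # 9
assert unpack_varlen(packed, cu_seqlens) == batch
```

For a recurrent kernel such as linear attention, the offsets also mark where the running state must be reset, so one sequence's state never leaks into the next.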
Only the forward computation has been open-sourced this time, which means FlashKDA can be used to run the model (inference) but not to train it; training still requires the original Triton version. Requirements: Hopper or newer GPUs (SM90 architecture and above), CUDA 12.9 or later, and PyTorch 2.4 or later. FlashKDA has also been merged upstream into fla as a new backend (PR #852), so existing fla users can switch to it by changing a single line of configuration.
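A quick way to screen a machine against the stated requirements is sketched below, using standard PyTorch calls (`torch.cuda.get_device_capability`, `torch.version.cuda`, `torch.__version__`). Hopper GPUs report compute capability 9.0, so the GPU check is `(major, minor) >= (9, 0)`. The helper names are hypothetical. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the toolkit used to compile FlashKDA, so treat the result as a first-pass screen rather than a guarantee.

```python
import re
from typing import Optional, Tuple

# Hypothetical check against the minimums stated in the announcement;
# not part of FlashKDA. SM90 (Hopper) or newer, CUDA 12.9+, PyTorch 2.4+.
MIN_COMPUTE_CAPABILITY = (9, 0)
MIN_CUDA = (12, 9)
MIN_TORCH = (2, 4)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings like '2.4.0+cu124' or '12.9'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_requirements(
    compute_capability: Tuple[int, int],
    cuda_version: Optional[str],
    torch_version: str,
) -> bool:
    """Return True if the given environment satisfies all three minimums."""
    if cuda_version is None:  # CPU-only PyTorch builds report no CUDA version
        return False
    return (
        tuple(compute_capability) >= MIN_COMPUTE_CAPABILITY
        and parse_major_minor(cuda_version) >= MIN_CUDA
        and parse_major_minor(torch_version) >= MIN_TORCH
    )


def check_current_environment() -> None:
    """Query the local PyTorch install and report whether it qualifies."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
        return
    if not torch.cuda.is_available():
        print("No CUDA device visible to PyTorch")
        return
    ok = meets_requirements(
        torch.cuda.get_device_capability(0),
        torch.version.cuda,
        torch.__version__,
    )
    print("meets stated FlashKDA requirements:", ok)


if __name__ == "__main__":
    check_current_environment()
```

Keeping the comparison logic in a pure function separates it from the hardware query, so the thresholds can be checked on any machine, including one without a GPU.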
(Source: BlockBeats)