Sakana AI partners with NVIDIA: letting GPUs skip the roughly 80% of ineffective large-model computation, boosting H100 inference speed by up to 30%
According to Beating Monitoring, Sakana AI, in collaboration with NVIDIA, has open-sourced a sparse data format called TwELL, along with a supporting acceleration kernel, that lets GPUs skip computations on near-zero values when running large models. Without sacrificing model accuracy, the approach boosts inference speed on the H100 by up to 30%, accelerates training by up to 24%, and significantly reduces peak memory usage.
The feed-forward (FFN) layers of large models consume the majority of their parameters and compute. In practice, however, more than 80% of FFN neurons are dormant during text generation (their activation values are close to zero) and contribute nothing to the final result. Skipping them could save an enormous amount of computation, but modern GPUs are optimized for uniform, dense matrix math: picking out the scattered useful values with traditional sparse methods incurs search and memory round-trip overhead that consumes all of the saved computation.
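To make the dormancy point concrete, here is a minimal PyTorch sketch (illustrative sizes, not Sakana AI's code): with an activation like ReLU that produces exact zeros, the neurons that don't fire can be dropped from the second matrix multiply without changing the output at all.

```python
# Minimal sketch (not the TwELL kernel): dormant FFN neurons contribute
# nothing, so skipping them is mathematically lossless. Sizes are illustrative.
import torch

torch.manual_seed(0)
d_model, d_ffn = 1024, 4096            # typical 4x FFN expansion
x = torch.randn(1, d_model)            # one token's hidden state

W_in = torch.randn(d_model, d_ffn) / d_model ** 0.5
W_out = torch.randn(d_ffn, d_model) / d_ffn ** 0.5

h = torch.relu(x @ W_in)               # ReLU zeroes inactive neurons exactly
active = h.squeeze(0) > 0              # mask of neurons that actually fire

# An untrained random net fires ~50% of neurons; trained large models
# reportedly leave 80%+ dormant, making the skip far more profitable.
print(f"active neurons: {active.sum().item()}/{d_ffn} "
      f"({100 * active.float().mean().item():.1f}%)")

# Only the active rows of W_out participate; the result is unchanged
# up to floating-point summation order.
y_sparse = h[:, active] @ W_out[active, :]
y_dense = h @ W_out
print("max abs difference:", (y_sparse - y_dense).abs().max().item())
```

The catch, as the article notes, is that `h[:, active]` hides a gather over scattered memory locations, which is exactly the operation dense-optimized GPUs do poorly.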
The TwELL format is designed to break this hardware bottleneck. It aligns with the GPU's parallel execution model: instead of assembling non-zero data from across the whole matrix, as traditional sparse formats do, it slices the data into the small tiles that GPUs process most efficiently. Each GPU core can then pack the useful data locally, eliminating time-consuming global-memory reads and writes and slotting cleanly into modern chips' acceleration pipelines.
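The article doesn't publish TwELL's exact layout, but the tile-local idea can be sketched as follows (an illustrative packing under assumed tile sizes, not the real kernel): non-zeros are indexed per fixed-size tile, so each worker touches only its own contiguous slice instead of gathering indices from across the whole vector.

```python
# Illustrative tile-local packing (not the actual TwELL layout): each tile
# stores its own local indices and values, so a worker assigned to a tile
# never reads outside its contiguous slice of memory.
import torch

torch.manual_seed(0)
d_ffn, tile = 4096, 128                # tile size chosen to suit the hardware
h = torch.relu(torch.randn(d_ffn))     # post-activation vector
h[torch.rand(d_ffn) < 0.9] = 0.0       # force ~90% dormancy, as in trained models

tiles = h.view(-1, tile)               # (num_tiles, tile): one row per tile
packed = []
for t, row in enumerate(tiles):
    idx = row.nonzero(as_tuple=True)[0]     # local, tile-relative indices
    packed.append((t, idx, row[idx]))       # (tile id, indices, values)

nnz = sum(len(idx) for _, idx, _ in packed)
print(f"non-zeros kept: {nnz}/{d_ffn}; per-tile work is local to {tile} elements")

# Unpacking reproduces the original vector exactly, so the format is lossless.
recon = torch.zeros_like(h)
for t, idx, vals in packed:
    recon[t * tile + idx] = vals
assert torch.equal(recon, h)
```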
In tests on a 1.5-billion-parameter model, a slight regularization during training was enough to cut the fraction of neurons requiring actual computation to under 2%, with performance on seven downstream tasks unaffected. The data also reveals a scaling pattern: the larger the model, the more of its neurons lie dormant (the non-zero ratio in a 2-billion-parameter model is 38% lower than in a 500-million-parameter model). As future large models grow, this hardware-oriented optimization should therefore unlock even greater performance gains.
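The article does not specify which regularizer was used; an L1 penalty on FFN activations is one standard way to push neurons toward zero, sketched below with hypothetical module names and coefficients.

```python
# Hedged sketch: the "slight regularization" is not detailed in the article.
# An L1 penalty on FFN activations (coefficient here is arbitrary) is one
# common technique for encouraging activation sparsity during training.
import torch
import torch.nn as nn

class SparseFFN(nn.Module):
    def __init__(self, d_model=256, d_ffn=1024):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ffn)
        self.w_out = nn.Linear(d_ffn, d_model)

    def forward(self, x):
        h = torch.relu(self.w_in(x))   # exact zeros for dormant neurons
        return self.w_out(h), h        # expose activations for the penalty

ffn = SparseFFN()
x = torch.randn(8, 256)
y, h = ffn(x)

task_loss = y.pow(2).mean()            # stand-in for the real training loss
sparsity_loss = 1e-3 * h.abs().mean()  # nudges activations toward zero
loss = task_loss + sparsity_loss
loss.backward()

print(f"dormant fraction this batch: {(h == 0).float().mean().item():.2f}")
```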