Sakana AI partners with NVIDIA: enabling GPUs to skip the roughly 80% of large-model computation that is ineffective, boosting H100 inference speed by up to 30%

According to Beating Monitoring, Sakana AI, in collaboration with NVIDIA, has open-sourced a sparse data format called TwELL and a supporting acceleration kernel, enabling GPUs to skip the ineffective computation associated with near-zero activations when running large models. Without sacrificing model accuracy, the approach boosts inference speed on the H100 by up to 30%, accelerates training by up to 24%, and significantly reduces peak memory usage.

The feed-forward network (FFN) layers of a large model account for the majority of its parameters and compute. Yet in practice, more than 80% of their neurons are "dormant" (activation values close to zero) during text generation and contribute nothing to the final result. Skipping these neurons could save an enormous amount of computation. Modern GPUs, however, are optimized for uniform, dense matrix operations: with traditional sparse methods, the overhead of searching out and gathering scattered useful data eats up all of the computation saved.
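To make the idea concrete, here is a minimal NumPy sketch (not the Sakana AI / NVIDIA code; the layer sizes and threshold are illustrative) showing that neurons with near-zero activations contribute nothing to the FFN output, so skipping them leaves the result unchanged:

```python
# Illustrative sketch: dormant FFN neurons can be skipped without changing the output.
import numpy as np

d_model, d_ffn = 512, 2048          # hypothetical layer sizes
rng = np.random.default_rng(0)

x = rng.standard_normal(d_model)
W_up = rng.standard_normal((d_model, d_ffn))
W_down = rng.standard_normal((d_ffn, d_model))

# Dense path: compute every neuron, including the dormant ones.
h = np.maximum(x @ W_up, 0.0)        # ReLU activations; many end up at exactly zero
y_dense = h @ W_down

# Sparse path: keep only neurons whose activation is effectively non-zero.
active = np.flatnonzero(h > 1e-6)    # indices of "awake" neurons
y_sparse = h[active] @ W_down[active]  # skip the rows of W_down for dormant neurons

print(f"active neurons: {active.size}/{d_ffn}")
print("outputs match:", np.allclose(y_dense, y_sparse))
```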

The TwELL format is designed to break this hardware bottleneck. It aligns with the GPU's parallel execution model: instead of gathering non-zero data from scattered memory regions as traditional methods do, it slices the data into the small blocks (tiles) that GPUs handle most efficiently. Each GPU core can then pack the useful data locally, eliminating time-consuming global memory reads and writes and slotting cleanly into modern chips' acceleration pipelines.
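The article does not spell out TwELL's exact layout, so the following is only a simplified sketch of the tile-local packing idea (the tile width, metadata types, and threshold are assumptions, not the actual format):

```python
# Simplified sketch of tile-local packing (not the actual TwELL layout).
# Each fixed-size tile stores its own non-zero activations plus tile-relative
# indices, so a GPU block could compact its tile without any global gather.
import numpy as np

TILE = 8  # hypothetical tile width

def pack_tiles(h, tile=TILE, eps=1e-6):
    """Return, per tile, the local indices and values of non-zero activations."""
    tiles = []
    for start in range(0, h.size, tile):
        chunk = h[start:start + tile]
        idx = np.flatnonzero(np.abs(chunk) > eps)   # tile-relative indices
        tiles.append((idx.astype(np.uint8), chunk[idx]))
    return tiles

def unpack_tiles(tiles, n, tile=TILE):
    """Reconstruct the dense activation vector from the packed tiles."""
    h = np.zeros(n)
    for t, (idx, vals) in enumerate(tiles):
        h[t * tile + idx] = vals
    return h

rng = np.random.default_rng(1)
h = np.maximum(rng.standard_normal(64), 0.0)   # ReLU-style activations, many zeros
packed = pack_tiles(h)
assert np.allclose(unpack_tiles(packed, h.size), h)
print("non-zeros per tile:", [len(idx) for idx, _ in packed])
```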

In tests with a 1.5-billion-parameter model, a small amount of regularization during training was enough to reduce the share of neurons requiring actual computation to under 2%, with no loss of performance on seven downstream tasks. The data also reveal a pattern: the larger the model, the more of its neurons are dormant (the non-zero ratio in a 2-billion-parameter model is 38% lower than in a 500-million-parameter model). This suggests that as large models continue to grow, this hardware-oriented optimization will unlock even greater performance gains.
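As a rough back-of-the-envelope illustration of the under-2% figure (the layer sizes below are hypothetical), a matmul over the FFN's hidden dimension that only touches active neurons shrinks by about 50x; the end-to-end gains quoted above (30% inference, 24% training) are smaller, presumably because attention and other dense layers are unaffected:

```python
# Illustrative arithmetic only; layer sizes are hypothetical.
d_model, d_ffn = 2048, 8192
active_fraction = 0.02                                 # under 2% of neurons active

dense_macs = d_ffn * d_model                           # down-projection over all neurons
sparse_macs = int(active_fraction * d_ffn) * d_model   # only the active neurons
print(f"ideal reduction for this matmul: {dense_macs / sparse_macs:.0f}x")
```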
