Sakana AI partners with NVIDIA: enabling GPUs to skip 80% of ineffective computations in large models, boosting H100 inference speed by 30%

According to Beating Monitoring, Sakana AI, in collaboration with NVIDIA, has open-sourced a sparse data format called TwELL and its accompanying acceleration kernels, enabling GPUs to skip "near-zero" ineffective computations when running large models. Without sacrificing model accuracy, the solution boosts inference speed on the H100 by up to 30%, accelerates training by up to 24%, and significantly reduces peak memory usage.
The feed-forward layers (FFNs) of large models consume the majority of parameters and compute. In practice, during text generation, over 80% of FFN neurons are "dormant" (their activation values are close to zero) and contribute essentially nothing to the final result. Skipping these neurons can save massive computational resources.
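The idea behind activation sparsity can be sketched in plain Python (this is an illustration of the general principle, not Sakana AI's code; function names and the ReLU activation are assumptions):

```python
# Illustrative sketch of FFN activation sparsity (not Sakana AI's actual code).
# An FFN computes y = W2 @ act(W1 @ x). When act(W1 @ x) is zero for a hidden
# neuron, that neuron's column of W2 contributes nothing and can be skipped.

def relu(v):
    return [max(0.0, e) for e in v]

def ffn_dense(x, W1, W2):
    """Dense FFN: every hidden neuron participates in the second matmul."""
    h = relu([sum(w * xi for w, xi in zip(row, x)) for row in W1])
    return [sum(W2[o][j] * h[j] for j in range(len(h))) for o in range(len(W2))]

def ffn_sparse(x, W1, W2, eps=1e-6):
    """Skip hidden neurons whose activation is (near) zero."""
    h = relu([sum(w * xi for w, xi in zip(row, x)) for row in W1])
    active = [j for j, hj in enumerate(h) if hj > eps]  # often a small fraction
    return [sum(W2[o][j] * h[j] for j in active) for o in range(len(W2))]
```

With ReLU-like activations many entries of `h` are exactly zero, so the sparse path produces the same output while the work in the second matrix multiply scales with the active fraction rather than the full hidden width.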
However, modern GPUs are optimized for dense, uniform matrix computation. Traditional sparse methods that pick out scattered useful values incur indexing and gather/scatter overhead, with repeated trips to memory that eat into the compute they save.
The TwELL format is designed to break this hardware bottleneck. It aligns with the GPU's parallel execution model: instead of gathering non-zero data from scattered regions, as traditional sparse formats do, it slices the data into the small blocks (tiles) that GPUs process most efficiently.
This way, each GPU core can pack useful data locally, eliminating time-consuming global-memory round trips and slotting seamlessly into the modern chip's acceleration pipeline.
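The tile-based layout can be sketched in plain Python (the structure below is my assumption about how a tiled sparse format works in general; the real TwELL format and its CUDA kernels are not shown here):

```python
# Illustrative sketch of a tile-based sparse layout (assumed structure,
# not the actual TwELL implementation). Only tiles containing non-zero
# values are stored, so all-zero tiles cost neither memory nor compute.

def pack_tiles(M, T):
    """Slice an n x n matrix into T x T tiles, keeping only non-zero tiles.
    Returns {(tile_row, tile_col): tile} for the tiles worth computing."""
    n = len(M)
    tiles = {}
    for br in range(0, n, T):
        for bc in range(0, n, T):
            tile = [row[bc:bc + T] for row in M[br:br + T]]
            if any(v != 0.0 for r in tile for v in r):
                tiles[(br // T, bc // T)] = tile
    return tiles

def tiled_matvec(tiles, x, n, T):
    """Multiply the packed tiles by vector x; skipped tiles do no work.
    Each tile is a small dense block, which is the shape GPUs like."""
    y = [0.0] * n
    for (tr, tc), tile in tiles.items():
        for i in range(T):
            for j in range(T):
                y[tr * T + i] += tile[i][j] * x[tc * T + j]
    return y
```

The key design point mirrors the article: within a tile the data stays dense and contiguous, so each compute unit works on a regular local block instead of chasing individual non-zeros through memory.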
In tests with a 1.5-billion-parameter model, a light regularization during training reduced the proportion of neurons requiring actual computation to under 2%, while performance on seven downstream tasks remained unchanged.
The data also reveals a pattern: the larger the model, the more neurons are dormant (the non-zero activation ratio in a 2-billion-parameter model is 38% lower than in a 500-million-parameter model).
This means that as models continue to scale, this hardware-oriented optimization should unlock even greater performance benefits.
