Sakana AI partners with NVIDIA: enabling GPUs to skip the roughly 80% of large-model computation that is ineffective, boosting H100 inference speed by up to 30%

According to Beating Monitoring, Sakana AI, in collaboration with NVIDIA, has open-sourced a sparse data format called TwELL and a supporting acceleration kernel, enabling GPUs to skip the ineffective computation associated with near-zero activations when running large models. Without sacrificing model accuracy, the approach boosts inference speed on the H100 by up to 30%, accelerates training by up to 24%, and significantly reduces peak memory usage.

The feed-forward network (FFN) layers of a large model account for the majority of its parameters and compute. Yet in practice, more than 80% of their neurons are "dormant" (activation values close to zero) during text generation and contribute nothing to the final result. Skipping these neurons could save an enormous amount of computation. Modern GPUs, however, are optimized for uniform, dense matrix operations: with traditional sparse methods, the overhead of searching out and gathering scattered useful data eats up all of the computation saved.
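To make the idea concrete, here is a minimal NumPy sketch (not the Sakana AI / NVIDIA code; the layer sizes and threshold are illustrative) showing that neurons with near-zero activations contribute nothing to the FFN output, so skipping them leaves the result unchanged:

```python
# Illustrative sketch: dormant FFN neurons can be skipped without changing the output.
import numpy as np

d_model, d_ffn = 512, 2048          # hypothetical layer sizes
rng = np.random.default_rng(0)

x = rng.standard_normal(d_model)
W_up = rng.standard_normal((d_model, d_ffn))
W_down = rng.standard_normal((d_ffn, d_model))

# Dense path: compute every neuron, including the dormant ones.
h = np.maximum(x @ W_up, 0.0)        # ReLU activations; many end up at exactly zero
y_dense = h @ W_down

# Sparse path: keep only neurons whose activation is effectively non-zero.
active = np.flatnonzero(h > 1e-6)    # indices of "awake" neurons
y_sparse = h[active] @ W_down[active]  # skip the rows of W_down for dormant neurons

print(f"active neurons: {active.size}/{d_ffn}")
print("outputs match:", np.allclose(y_dense, y_sparse))
```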

The TwELL format is designed to break this hardware bottleneck. It aligns with the GPU's parallel execution model: instead of gathering non-zero data from scattered memory regions as traditional methods do, it slices the data into the small blocks (tiles) that GPUs handle most efficiently. Each GPU core can then pack the useful data locally, eliminating time-consuming global memory reads and writes and slotting cleanly into modern chips' acceleration pipelines.
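The article does not spell out TwELL's exact layout, so the following is only a simplified sketch of the tile-local packing idea (the tile width, metadata types, and threshold are assumptions, not the actual format):

```python
# Simplified sketch of tile-local packing (not the actual TwELL layout).
# Each fixed-size tile stores its own non-zero activations plus tile-relative
# indices, so a GPU block could compact its tile without any global gather.
import numpy as np

TILE = 8  # hypothetical tile width

def pack_tiles(h, tile=TILE, eps=1e-6):
    """Return, per tile, the local indices and values of non-zero activations."""
    tiles = []
    for start in range(0, h.size, tile):
        chunk = h[start:start + tile]
        idx = np.flatnonzero(np.abs(chunk) > eps)   # tile-relative indices
        tiles.append((idx.astype(np.uint8), chunk[idx]))
    return tiles

def unpack_tiles(tiles, n, tile=TILE):
    """Reconstruct the dense activation vector from the packed tiles."""
    h = np.zeros(n)
    for t, (idx, vals) in enumerate(tiles):
        h[t * tile + idx] = vals
    return h

rng = np.random.default_rng(1)
h = np.maximum(rng.standard_normal(64), 0.0)   # ReLU-style activations, many zeros
packed = pack_tiles(h)
assert np.allclose(unpack_tiles(packed, h.size), h)
print("non-zeros per tile:", [len(idx) for idx, _ in packed])
```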

In tests with a 1.5-billion-parameter model, a small amount of regularization during training was enough to reduce the share of neurons requiring actual computation to under 2%, with no loss of performance on seven downstream tasks. The data also reveal a pattern: the larger the model, the more of its neurons are dormant (the non-zero ratio in a 2-billion-parameter model is 38% lower than in a 500-million-parameter model). This suggests that as large models continue to grow, this hardware-oriented optimization will unlock even greater performance gains.
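As a rough back-of-the-envelope illustration of the under-2% figure (the layer sizes below are hypothetical), a matmul over the FFN's hidden dimension that only touches active neurons shrinks by about 50x; the end-to-end gains quoted above (30% inference, 24% training) are smaller, presumably because attention and other dense layers are unaffected:

```python
# Illustrative arithmetic only; layer sizes are hypothetical.
d_model, d_ffn = 2048, 8192
active_fraction = 0.02                                 # under 2% of neurons active

dense_macs = d_ffn * d_model                           # down-projection over all neurons
sparse_macs = int(active_fraction * d_ffn) * d_model   # only the active neurons
print(f"ideal reduction for this matmul: {dense_macs / sparse_macs:.0f}x")
```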
