208x speedup and 5-microsecond predictions: KMeans on H200 reaches 61% of peak FLOPS as a Berkeley-led team pushes Triton to its limits

CoinNetwork
FlashLib, an open-source classic ML acceleration library from UC Berkeley and others, runs up to 208x faster than cuML
CoinNetwork reports that FlashLib, an open-source machine learning acceleration library disclosed by OneMillion_AI, was developed by teams including UC Berkeley and covers 15 high-level operators. Built on Triton and CuTe DSL, it significantly speeds up operators such as KMeans and KNN on H200 GPUs. Compared with cuML 25.10, KMeans is 26x faster, KNN is 19x faster, HDBSCAN is 40x faster, and TruncatedSVD is 208x faster. KMeans reaches 61% of peak FLOPS, and KNN reaches 85.2% bandwidth utilization. FlashLib also provides a performance-prediction API that estimates runtime and GPU memory overhead within 5 microseconds. The code has been open-sourced on GitHub.