Nous Research Releases Lighthouse Attention, Speeding Up Long-Sequence Pretraining by 1.4-1.7x

AIMPACT News, May 17 (UTC+8): Nous Research has introduced Lighthouse Attention, a selective hierarchical attention mechanism that addresses the quadratic growth of attention cost in long-sequence pretraining. The method applies symmetric pooling to the Query, Key, and Value tensors and keeps the selection logic outside the attention kernel, so the existing FlashAttention kernel can be reused unchanged, and it trains with a two-stage strategy. Benchmarks on NVIDIA B200 show a 21x forward-pass speedup at around 512K context length and a 17.3x combined forward + backward speedup, with the first training stage reaching a throughput of 126k tokens/sec/GPU versus 46k for dense SDPA. End-to-end training accelerates by 1.40x to 1.69x while matching or improving on the dense baseline's training loss. Validation on a 530M-parameter Llama-3-style model shows the three Lighthouse runs reaching final losses of 0.698-0.71, better than the dense SDPA baseline trained from scratch (0.7237), while saving 22.5-27 hours of training time. Paper: arXiv:2605.06554.
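The description above is enough to sketch the core idea in PyTorch: pool queries and keys into block summaries, score block pairs, keep a small set of key/value blocks per query block, and hand the reduced set to a standard fused attention kernel, so the selection step stays outside the kernel itself. The code below is a minimal sketch under those assumptions; the function name, the mean-pooling choice, the top-k selection rule, and the omission of causal masking and the two-stage training schedule are illustrative, not details taken from the paper.

```python
# Minimal sketch of block-selective attention (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def block_selective_attention(q, k, v, block=128, topk=4):
    # q, k, v: (batch, heads, seq, dim); seq assumed divisible by `block`
    B, H, S, D = q.shape
    nb = S // block

    # Symmetric pooling: mean-pool queries and keys into per-block summaries.
    qb = q.reshape(B, H, nb, block, D).mean(dim=3)        # (B, H, nb, D)
    kb = k.reshape(B, H, nb, block, D).mean(dim=3)        # (B, H, nb, D)

    # Coarse block-to-block scores decide which key blocks each query block keeps.
    scores = torch.einsum("bhqd,bhkd->bhqk", qb, kb)      # (B, H, nb, nb)
    sel = scores.topk(topk, dim=-1).indices               # (B, H, nb, topk)

    k_blocks = k.reshape(B, H, nb, block, D)
    v_blocks = v.reshape(B, H, nb, block, D)
    out = torch.zeros_like(q)
    for qi in range(nb):
        # Gather the selected key/value blocks for this query block.
        idx = sel[:, :, qi]                                # (B, H, topk)
        gather = idx[..., None, None].expand(B, H, topk, block, D)
        k_sel = torch.gather(k_blocks, 2, gather).reshape(B, H, topk * block, D)
        v_sel = torch.gather(v_blocks, 2, gather).reshape(B, H, topk * block, D)
        q_blk = q[:, :, qi * block:(qi + 1) * block]
        # Selection happened outside the kernel, so a fused SDPA/FlashAttention
        # call runs unchanged on the reduced key/value set.
        out[:, :, qi * block:(qi + 1) * block] = F.scaled_dot_product_attention(q_blk, k_sel, v_sel)
    return out

# Example: 8K tokens, 128-token blocks, each query block attends to 4 key blocks.
q = k = v = torch.randn(1, 8, 8192, 64)
out = block_selective_attention(q, k, v, block=128, topk=4)
```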
