Lighthouse Attention is a pretty clever idea: a coarse screening pass followed by fine-grained computation keeps long contexts in FlashAttention's comfort zone without modifying the underlying CUDA kernels. On a single B200 card, a 512K-token context runs nearly 17 times faster. Switching back to full attention at the end of training preserves precision. The engineering approach feels very solid.

MeNews
Nous open-sources Lighthouse Attention: 17x speedup at 512K on a single B200
AIMPACT states that Nous Research has open-sourced a long-context pretraining mechanism called Lighthouse Attention. Processing a 512K-token context on a single B200 card is about 17 times faster, and at 98K tokens the end-to-end speedup is 1.4–1.7 times. The mechanism performs coarse filtering first and precise computation second: it selects the core segments using multi-level summaries and stitches them into a shorter sequence, which is then processed by FlashAttention. Because the selection logic sits outside the kernel, it requires neither changes to low-level code nor additional training objectives. To keep the model from losing its ability to read text character by character as a result of skipping around, training runs mostly in the accelerated mode and then briefly switches back to full attention at the end. In experiments with 530 million parameters and 500 billion tokens, training time drops significantly, and final performance matches or even surpasses traditional baselines.
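The coarse-then-fine idea described above can be illustrated with a minimal sketch. This is not Nous Research's implementation: it assumes a single level of summaries (the source describes multi-level ones), uses the mean key of each block as the summary, and stands in a plain softmax attention for the FlashAttention call. The names `select_blocks`, `coarse_to_fine_attention`, `attention_mode`, and the parameters `block_size`, `top_k`, and `full_steps` are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_attention(query, keys, values):
    """Plain softmax attention for one query (stand-in for a FlashAttention call)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def select_blocks(query, keys, block_size, top_k):
    """Coarse pass: score each block by its mean key (an assumed summary)."""
    summaries = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        summaries.append([sum(k[d] for k in block) / len(block)
                          for d in range(dim)])
    ranked = sorted(range(len(summaries)),
                    key=lambda i: dot(query, summaries[i]), reverse=True)
    return sorted(ranked[:top_k])  # keep selected blocks in original order


def coarse_to_fine_attention(query, keys, values, block_size, top_k):
    """Fine pass: stitch the selected blocks together and attend densely."""
    chosen = select_blocks(query, keys, block_size, top_k)
    idx = [i for b in chosen
           for i in range(b * block_size,
                          min((b + 1) * block_size, len(keys)))]
    return dense_attention(query,
                           [keys[i] for i in idx],
                           [values[i] for i in idx])


def attention_mode(step, total_steps, full_steps):
    """Assumed schedule: sparse for most steps, full attention for the last few."""
    return "full" if step >= total_steps - full_steps else "sparse"
```

Because the selection happens in ordinary Python before the attention call, the attention routine itself is untouched, which mirrors the claim that the filtering logic lives outside the kernel. Setting `top_k` to the total number of blocks reduces the sketch to full attention.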