Lighthouse Attention is a pretty clever idea: screen coarsely first, then compute precisely. Long texts finally don't have to be so painful to process.

MeNews
Nous Open-Sources Lighthouse Attention: 17x Speedup at 512K on a Single B200
According to AIMPACT, Nous Research has open-sourced Lighthouse Attention, a mechanism for long-context pretraining. On a single B200 GPU, processing a 512K-token context is about 17 times faster, and at a 98K context length the end-to-end speedup is 1.4–1.7 times. The mechanism filters coarsely first and computes precisely second: it uses multi-level summaries to pick out the core segments, stitches them into a shorter sequence, and passes that sequence to FlashAttention. Because the filtering logic sits outside the attention kernel, it requires no custom low-level code and no additional training objectives. To keep the model from losing its ability to read token by token as a result of this skip-reading, training runs mostly in the accelerated mode and then switches briefly back to full attention at the end. In experiments with 5.3 billion parameters and 50 billion tokens, training time drops significantly, and final performance matches or even exceeds that of traditional baselines.
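The coarse-then-precise flow described above can be illustrated with a small sketch. This is not the Nous Research implementation: it assumes a single level of mean-pooled block summaries (the post mentions multi-level summaries), a plain softmax attention standing in for FlashAttention, and hypothetical names and parameters (`select_token_indices`, `block_size`, `top_k`, `dense_tail_fraction`) chosen only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(vectors):
    """Average equal-length vectors into one summary vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_token_indices(query, keys, block_size, top_k):
    """Coarse stage (assumed single-level): score block summaries, keep top_k blocks."""
    blocks = [range(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]
    scores = [dot(query, mean_pool([keys[i] for i in blk])) for blk in blocks]
    best = sorted(range(len(blocks)), key=lambda b: scores[b], reverse=True)[:top_k]
    # Re-sort the chosen blocks so the stitched sequence keeps the original token order.
    return [i for b in sorted(best) for i in blocks[b]]

def dense_attention(query, keys, values):
    """Precise stage: exact softmax attention for one query (FlashAttention stand-in)."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, k) * scale for k in keys]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def coarse_to_fine_attention(query, keys, values, block_size=4, top_k=2):
    """Select tokens outside the attention routine, then attend over the short sequence."""
    idx = select_token_indices(query, keys, block_size, top_k)
    out = dense_attention(query, [keys[i] for i in idx], [values[i] for i in idx])
    return out, idx

def use_full_attention(step, total_steps, dense_tail_fraction=0.1):
    """Hypothetical schedule: sparse mode for most steps, full attention for the final stretch."""
    dense_steps = int(total_steps * dense_tail_fraction)
    return step >= total_steps - dense_steps
```

Because the selection step only produces a list of token indices, the attention routine itself is unchanged, which is the property the post attributes to keeping the filtering logic outside the kernel. The 10% dense tail in `use_full_attention` is an arbitrary placeholder; the post only says the switch back to full attention is brief.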