Two-stage attention, coarse filtering followed by exact computation: training switches back to full attention near the end so the model does not pick up a habit of skipping text, and the 530M-parameter model's performance not only holds but improves. Long context no longer has to be brute-forced by stacking GPUs.

MeNews
Nous Open-Sources Lighthouse Attention: 17x Speedup at 512K on a Single B200
AIMPACT reports that Nous Research has open-sourced Lighthouse Attention, a mechanism for long-context pretraining. On a single B200 GPU, processing a 512K context is about 17 times faster; at 98K, the end-to-end speedup is 1.4 to 1.7 times. The mechanism screens coarsely first and computes precisely second: it uses multi-level summaries to pick out the most relevant segments, stitches them into a short sequence, and hands that sequence to FlashAttention. The selection logic sits outside the kernel, so it requires neither changes to low-level code nor additional training objectives. To keep the model from losing its word-by-word reading ability through skip-reading, training runs mostly in the accelerated mode and then briefly switches back to full attention at the end. In experiments with 530 million parameters and 50 billion tokens, training time drops significantly, and final performance matches or even exceeds that of the traditional baseline.
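The coarse-then-precise idea described above can be illustrated with a minimal sketch. This is not Nous Research's implementation: it assumes a single level of mean-pooled block summaries (the report describes multi-level summaries), a single query vector, and plain Python in place of FlashAttention. All function names, `block_size`, `top_k`, and `full_steps` are hypothetical and chosen only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dense_attention(q, keys, values):
    """Exact softmax attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def select_blocks(q, keys, block_size, top_k):
    """Coarse stage (assumed form): score mean-pooled block summaries
    against the query and keep the ids of the top_k blocks."""
    n_blocks = math.ceil(len(keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        summary = [sum(col) / len(block) for col in zip(*block)]
        scores.append(dot(q, summary))
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])  # restore original order before stitching

def coarse_then_dense(q, keys, values, block_size, top_k):
    """Precise stage: stitch the selected blocks into one short sequence
    and run ordinary dense attention on it."""
    idx = [i
           for b in select_blocks(q, keys, block_size, top_k)
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    return dense_attention(q, [keys[i] for i in idx], [values[i] for i in idx])

def use_full_attention(step, total_steps, full_steps):
    """Schedule (assumed form): sparse mode for most of training,
    full attention for the last full_steps steps."""
    return step >= total_steps - full_steps

# Usage: six tokens in three blocks of two; attend to the best block only.
q = [1.0, 0.0]
keys = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [5.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[float(i)] for i in range(6)]
short_context_output = coarse_then_dense(q, keys, values, block_size=2, top_k=1)
```

Because the selection runs before the attention call and only changes which keys and values are passed in, the dense step can stay an unmodified kernel, which is the property the report attributes to keeping the screening logic outside the kernel.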