A 17x speedup without modifying the underlying CUDA kernels: Nous's design is essentially a shortcut for long-context processing, and switching back to full attention at the end of training is a solid detail.

MeNews
Nous Open-Sources Lighthouse Attention: 17x Speedup at 512K on a Single B200
According to AIMPACT, Nous Research has open-sourced Lighthouse Attention, a mechanism for long-context pretraining. On a single B200 GPU, processing a 512K context is about 17 times faster, and at a 98K context, end-to-end speed improves by 1.4–1.7x. The mechanism works coarse to fine: it uses multi-level summaries to pick out the core segments, stitches them into a shorter sequence, and hands that sequence to FlashAttention. Because the selection logic sits outside the attention kernel, no low-level code changes or additional training objectives are needed. To keep the model from losing its ability to read text token by token as a result of this skip-reading, training runs mostly in the accelerated mode and switches briefly back to full attention at the end. In experiments with 530 million parameters and 50 billion tokens, training time drops significantly, and final performance matches or even exceeds the conventional baseline.
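The select-then-attend idea and the end-of-training switch can be illustrated with a minimal sketch. It is not the Nous implementation: it assumes a single level of mean-pooled block summaries (the article describes multi-level summaries), uses plain softmax attention as a stand-in for FlashAttention, and handles one query vector at a time. All function names and the parameters `block_size`, `top_k`, and `full_fraction` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    # Mean-pooled summary of a block of key vectors (assumed summary type).
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_blocks(query, keys, block_size, top_k):
    """Coarse pass: score each block by its summary, keep the top_k blocks."""
    starts = range(0, len(keys), block_size)
    scored = [(dot(query, mean_vector(keys[s:s + block_size])), s) for s in starts]
    best = sorted(scored, reverse=True)[:top_k]
    # Restore original order so the stitched sequence keeps positional order.
    kept = sorted(s for _, s in best)
    return [i for s in kept for i in range(s, min(s + block_size, len(keys)))]


def dense_attention(query, keys, values):
    """Fine pass: ordinary softmax attention (stand-in for FlashAttention)."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, k) * scale for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def sparse_attention(query, keys, values, block_size, top_k):
    # Selection happens here, outside the attention routine itself.
    idx = select_blocks(query, keys, block_size, top_k)
    return dense_attention(query, [keys[i] for i in idx], [values[i] for i in idx])


def attention_mode(step, total_steps, full_fraction):
    """'sparse' for most of training, 'full' for the final full_fraction of steps."""
    switch_step = total_steps - int(total_steps * full_fraction)
    return "full" if step >= switch_step else "sparse"
```

Because the selection step only produces an index list, the dense attention routine is left untouched, which mirrors the article's point that the filtering logic lives outside the kernel. When `top_k` covers every block, `sparse_attention` reduces to `dense_attention` over the whole sequence.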