A reasoning monster fed on 340,000 trajectories: SU-01. Even the name is something.

MeNews
Post-trained reasoning model SU-01 reaches gold-medal performance on Olympiad-level problems
AIMPACT proposes a systematic recipe for turning a post-trained reasoning model into an Olympiad-level problem solver, in three steps: supervised fine-tuning with an inverse-perplexity curriculum to instill proof search and self-checking; two-stage reinforcement learning to extend those abilities; and test-time scaling for a final boost. SU-01 is the result of applying this recipe to a 30B-A3B backbone, with supervised fine-tuning on approximately 340k sub-8K trajectories followed by 200 steps of RL. The model reasons stably on hard problems, with trajectories exceeding 100k tokens, reaches gold-medal level in competitions such as IMO, USAMO, and IPhO, and shows cross-domain scientific reasoning that generalizes beyond mathematics and physics.
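For readers wondering what a perplexity-based curriculum might look like in practice, here is a minimal Python sketch. It is not the SU-01 implementation: the post does not say how perplexity is computed or in which direction the data is ordered, so the `Trajectory` class, the `curriculum_order` helper, and the `high_perplexity_first` flag are all hypothetical names chosen for illustration. The only standard part is the definition of perplexity as the exponential of the mean negative log-probability per token.

```python
import math
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical container: one reasoning trace plus per-token
    natural-log probabilities scored by some reference model."""
    name: str
    token_logprobs: list[float]


def perplexity(token_logprobs: list[float]) -> float:
    """Standard perplexity: exp of the mean negative log-probability."""
    if not token_logprobs:
        raise ValueError("trajectory has no tokens")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def curriculum_order(trajectories, high_perplexity_first=True):
    """Sort trajectories by perplexity.

    Assumption: the ordering direction is a guess, so it is exposed
    as a flag rather than fixed; the source does not specify it.
    """
    return sorted(
        trajectories,
        key=lambda t: perplexity(t.token_logprobs),
        reverse=high_perplexity_first,
    )


if __name__ == "__main__":
    # Toy data: three tokens each, with uniform per-token probability.
    easy = Trajectory("easy", [math.log(0.9)] * 3)
    hard = Trajectory("hard", [math.log(0.1)] * 3)
    for t in curriculum_order([easy, hard]):
        print(t.name, perplexity(t.token_logprobs))
```

In a real pipeline the log-probabilities would come from scoring each trajectory with a language model, and the sorted list would feed the supervised fine-tuning data loader; both steps are left out here.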