Reverse-perplexity curriculum + two-stage reinforcement learning + test-time scaling: the combination pushes past the boundaries of mathematics and physics, and its real hidden weapon is generalized scientific reasoning.

MeNews
The post-trained reasoning model SU-01 achieves gold-medal performance on Olympiad-level problems.
AIMPACT proposes a systematic recipe for turning a post-trained reasoning model into an Olympiad-level problem solver, in three steps: supervised fine-tuning with a reverse-perplexity curriculum to instill proof search and self-checking; two-stage reinforcement learning to extend those capabilities; and test-time scaling for a further boost. Applied to a 30B-A3B backbone, with approximately 340k sub-8K trajectories for supervised fine-tuning followed by 200 RL steps, the recipe yields SU-01. The model reasons stably on hard problems, with trajectories exceeding 100k tokens, reaches gold-medal level in competitions such as IMO, USAMO, and IPhO, and shows scientific reasoning that generalizes across domains beyond mathematics and physics.
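As a rough illustration of what a perplexity-ordered curriculum could look like, the sketch below scores each training trajectory by its perplexity under a reference model and sorts the data accordingly. The report does not say how SU-01's curriculum is built, so everything here is an assumption: the function names, the toy log-probabilities, and in particular the reading of "reverse" as "highest perplexity first".

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def reverse_perplexity_order(samples):
    """Sort samples from highest to lowest perplexity.

    ASSUMPTION: 'reverse' is taken to mean high-perplexity samples come
    first; the source does not define the ordering.
    """
    return sorted(samples, key=lambda s: perplexity(s["logprobs"]), reverse=True)

# Hypothetical per-token log-probs that a reference model might assign
# to three training trajectories (toy values, not real data).
samples = [
    {"id": "a", "logprobs": [-0.1, -0.2, -0.1]},
    {"id": "b", "logprobs": [-1.5, -2.0, -1.0]},
    {"id": "c", "logprobs": [-0.7, -0.6, -0.8]},
]

ordered_ids = [s["id"] for s in reverse_perplexity_order(samples)]
print(ordered_ids)
```

In a real pipeline the log-probabilities would come from scoring each trajectory with the base model before fine-tuning, and the sorted list would determine the order in which batches are drawn.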