A reverse-perplexity curriculum, two-stage RL, and test-time scaling: this combination has once again raised the ceiling for post-trained reasoning models.

MeNews
The post-trained reasoning model SU-01 achieves gold-medal-level performance on Olympiad-level problems
AIMPACT proposes a systematic recipe for turning a post-trained reasoning model into an Olympiad-level problem solver, in three steps: supervised fine-tuning with a reverse-perplexity curriculum to instill proof search and self-checking; two-stage reinforcement learning to extend those capabilities; and test-time scaling at inference. Applied to a 30B-A3B backbone, the recipe uses approximately 340k trajectories of under 8K tokens each for supervised fine-tuning, followed by 200 RL steps, to produce SU-01. The model reasons stably on challenging problems, with trajectories exceeding 100k tokens. On competitions such as IMO, USAMO, and IPhO it reaches gold-medal-level performance, and its scientific reasoning also generalizes to fields beyond mathematics and physics.
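The brief does not define how the perplexity curriculum orders its data, so the following is only a minimal sketch of one plausible reading: score each trajectory by its perplexity under a reference model and present the highest-perplexity trajectories first. The field names, the toy log-probabilities, and the sort direction are all assumptions for illustration, not details from the SU-01 work.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one trajectory: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def reverse_perplexity_order(trajectories):
    """Order trajectories from highest to lowest perplexity.

    Assumption: "reverse" means descending perplexity; the source
    does not specify the direction or the scoring model.
    """
    return sorted(trajectories, key=lambda t: perplexity(t["logprobs"]), reverse=True)

# Hypothetical toy data: per-token log-probabilities from a reference model.
trajectories = [
    {"id": "a", "logprobs": [-0.1, -0.2, -0.1]},
    {"id": "b", "logprobs": [-1.5, -2.0, -1.0]},
    {"id": "c", "logprobs": [-0.7, -0.6, -0.8]},
]

ordered_ids = [t["id"] for t in reverse_perplexity_order(trajectories)]
print(ordered_ids)
```

In a real pipeline the log-probabilities would come from scoring each trajectory with the backbone model, and the ordered list would determine the sequence in which fine-tuning batches are drawn.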