AIMPACT News, May 16 (UTC+8): A new paper proposes a systematic approach for turning post-trained reasoning models into Olympiad-level problem solvers, and uses the method to train a model called SU-01.
The approach has three steps: first, supervised fine-tuning with a reverse perplexity curriculum to instill rigorous proof search and self-checking behaviors;
then, extending these behaviors through two-stage reinforcement learning (moving from reinforcement learning with verifiable rewards to proof-level reinforcement learning);
and finally, improving performance further through test-time scaling.
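The brief does not define the "reverse perplexity curriculum". As a minimal sketch, assume it means scoring each training trajectory by its perplexity under the base model and presenting trajectories from highest to lowest perplexity. The function names, the `logprobs` field, and the sort direction below are illustrative assumptions, not details from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one trajectory: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def reverse_perplexity_order(examples):
    """Order examples from highest to lowest perplexity.

    Assumption: "reverse" is read here as descending perplexity;
    the paper may define the curriculum differently.
    """
    return sorted(examples, key=lambda ex: perplexity(ex["logprobs"]), reverse=True)

# Hypothetical per-token log-probabilities for three trajectories.
examples = [
    {"id": "easy", "logprobs": [-0.1, -0.2, -0.1]},
    {"id": "hard", "logprobs": [-2.0, -1.5, -2.5]},
    {"id": "medium", "logprobs": [-0.8, -1.0, -0.9]},
]

ordered_ids = [ex["id"] for ex in reverse_perplexity_order(examples)]
print(ordered_ids)  # ['hard', 'medium', 'easy']
```

In practice the log-probabilities would come from a forward pass of the backbone model over each trajectory; that step is omitted here.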
The research team applied the method to a 30B-A3B backbone model: supervised fine-tuning on approximately 340k trajectories of under 8K tokens each, followed by 200 steps of reinforcement learning, produced SU-01.
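For a sense of scale, the reported figures give a rough ceiling on the supervised fine-tuning data. The sketch below assumes "8K" means 8,192 tokens and that every trajectory sits at that limit, so the true total is lower; the constant and function names are illustrative.

```python
NUM_TRAJECTORIES = 340_000            # "approximately 340k" from the report
MAX_TOKENS_PER_TRAJECTORY = 8 * 1024  # assumption: "8K" means 8,192 tokens

def sft_token_upper_bound(num_trajectories, max_tokens):
    """Upper bound on total SFT tokens if every trajectory hit the length cap."""
    return num_trajectories * max_tokens

upper_bound = sft_token_upper_bound(NUM_TRAJECTORIES, MAX_TOKENS_PER_TRAJECTORY)
print(f"{upper_bound / 1e9:.2f}B tokens at most")  # 2.79B tokens at most
```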
The model reasons stably on difficult problems with trajectories exceeding 100k tokens, reaches gold-medal level on competitions such as IMO 2025/USAMO 2026 and IPhO 2024/2025, and shows scientific-reasoning ability that generalizes beyond mathematics and physics.
(Source: InFoQ)

SeaSaltMintCandy

· 20h ago

Does the name SU-01 have any meaning, or was it just randomly chosen?

StainedGlassSolarArray

· 21h ago

Other labs will probably follow this post-training approach soon.

GateUser-d2929483

· 21h ago

If this work holds up, contest problem data is going to get more expensive.

StopRaisingGasFees.

· 21h ago

Can 200 steps of reinforcement learning converge? Or is that just the publicly stated number?

MetalFrameBookPageCross

· 21h ago

What exactly does the two-stage RL extension refer to? Are there any details?

GateUser-7a050ee5

· 21h ago

Waiting for an open-source release or a detailed technical report; bookmarking it for now.

GateUser-f4b3df7a

· 21h ago

How is the self-check mechanism implemented, and does it have a separate training objective?

GateUser-e3701961

· 21h ago

Is the test-time scaling self-consistency or some other technique?

LittleBitcoinInTheReflection

· 21h ago

Achieving this at the 30B-A3B scale is far more efficient than GPT-4.

HalfLifeHodler

· 21h ago

Cross-domain generalization is what deserves the most attention; let's hope it isn't just benchmark overfitting again.
