AIMPACT News, May 16 (UTC+8): A new paper proposes a systematic approach for turning post-trained reasoning models into Olympiad-level problem solvers, and uses it to train the SU-01 model. The approach has three steps: first, supervised fine-tuning with a reverse perplexity curriculum to instill rigorous proof search and self-checking behaviors; second, scaling up these behaviors through two-stage reinforcement learning (moving from reinforcement learning with verifiable rewards to proof-level reinforcement learning); and finally, boosting performance through test-time scaling. The research team applied the method to a 30B-A3B backbone model, using approximately 340k trajectories of under 8K tokens each for supervised fine-tuning, followed by 200 steps of reinforcement learning, to produce SU-01. The model can reason stably on difficult problems over trajectories exceeding 100k tokens. It reaches gold-medal level on competitions such as IMO 2025/USAMO 2026 and IPhO 2024/2025, and its scientific reasoning generalizes beyond mathematics and physics. (Source: InfoQ)




SucculentCross-Section

· 12m ago

IMO gold-medal level? Let's wait for an open-source reproduction first.


DeepBlueStakingStone

· 1h ago

340k trajectories is not actually an excessive amount of data, but the quality filtering was probably quite labor-intensive.


BlackVelvetKeychain

· 6h ago

The reverse perplexity curriculum design is quite interesting: it encodes human problem-solving (drill) experience.


OrdersPlacedBeforeTheStorm

· 6h ago

If the self-check mechanism could be visualized, debugging the reasoning process would be much easier.


VinesCoiledIntoGeometricShapes

· 6h ago

Physics competitions are covered too; now physics students have an AI training partner.


BridgeAnxiety

· 6h ago

What is the A3B architecture? Can someone knowledgeable elaborate?


GateUser-ecf4759e

· 6h ago

Choosing the 8K trajectory length has its nuances; if trajectories are too long, the gradients will explode.


FudAlsoNeedsAnImage

· 6h ago

The last phrase, "scientific reasoning generalization", makes me think of the Polanyi paradox: we know more than we can articulate. Can AI now access that unspoken intuition?



Post-trained reasoning model SU-01 achieves gold-medal performance on Olympiad-level problems
