AIMPACT News, May 16 (UTC+8): A new paper proposes a systematic approach for turning post-trained reasoning models into Olympiad-level problem solvers, and uses the method to train a model called SU-01.
The approach has three steps: first, supervised fine-tuning with a reverse perplexity curriculum to instill rigorous proof search and self-checking behaviors;
then, scaling up these behaviors through two-stage reinforcement learning (moving from reinforcement learning with verifiable rewards to proof-level reinforcement learning);
finally, improving performance further through test-time scaling.
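The report does not define "reverse perplexity curriculum". As a rough illustration only, the sketch below assumes it means ordering fine-tuning samples from highest to lowest perplexity, where perplexity is the exponential of the negative mean token log-probability. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def reverse_perplexity_order(samples):
    """Sort samples from highest to lowest perplexity.

    ASSUMPTION: this is only one possible reading of "reverse perplexity
    curriculum"; the paper's actual ordering rule may differ.
    """
    return sorted(samples, key=lambda s: perplexity(s["logprobs"]), reverse=True)

# Hypothetical toy trajectories with made-up token log-probabilities.
samples = [
    {"id": "a", "logprobs": [-0.1, -0.2, -0.1]},
    {"id": "b", "logprobs": [-1.5, -2.0, -1.0]},
    {"id": "c", "logprobs": [-0.5, -0.7, -0.6]},
]

ordered_ids = [s["id"] for s in reverse_perplexity_order(samples)]
print(ordered_ids)
```

In practice the log-probabilities would come from scoring each trajectory with a reference model; that step is omitted here.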
The research team applied the method to a 30B-A3B backbone model: supervised fine-tuning on approximately 340k trajectories of fewer than 8K tokens each, followed by 200 steps of reinforcement learning, produced SU-01.
The model reasons stably on difficult problems with trajectories exceeding 100k tokens, reaches gold-medal level on competitions such as IMO 2025, USAMO 2026, and IPhO 2024/2025, and shows scientific reasoning ability that generalizes beyond mathematics and physics.
(Source: InfoQ)

GateUser-6fd3205e

· 7h ago

Curious how this kind of model performs on open research questions; after all, competition problems have standard answers.

LateEntryLarry

· 17h ago

Does this count as pushing STaR and RLHF forward another step?

FloatingMirrorSphere

· 20h ago

The trajectories stay stable, outputting 100k tokens without crashing; the infrastructure layer must be quite robust as well.

GateUser-46c777d0

· 23h ago

With 340k trajectories fed in and RL running for only 200 steps, data efficiency is higher than expected.

CandlewickKid

· 23h ago

So it generalizes to physics Olympiads as well? I'd like to see how it performs on experimental design questions.

RetroRadioWaves

· 23h ago

Does test-time scaling enhancement refer to test-time compute scaling?

ReflectiveChainShadow

· 23h ago

The detail about the sub-8K trajectories is interesting. Are long proofs being broken into smaller chunks before they are fed in?

ByteSizedAlpha

· 23h ago

Cross-domain generalization is quite a big claim; I'll wait for a concrete example.

StainedGlassSolarArray

· 23h ago

The ability to self-check may be the most critical part, far more important than simply generating answers.

StillHereAfterTheRugPull

· 23h ago

In the name 30B-A3B, does A3B refer to the number of activated parameters?
