Mode collapse warning: RL raised the average reward, but the ceiling did not move. This point deserves emphasis.

MeNews
Stanford NLP Team Demonstrates New Advances in Automated AI Research
Stanford NLP demonstrated at ICML 2026 that, with an automated executor, LLM pre-training and post-training can be turned into execution environments, where execution feedback is used to improve research efficiency. Two methods of using that feedback were studied. With evolutionary search, the method found for the post-training task outperforms GRPO (69.4% vs. 48.0%), and the recipe found for the pre-training task is faster than nanoGPT (19.7 minutes vs. 35.9 minutes); both were found within ten search rounds. Reinforcement learning based on execution rewards, by contrast, is prone to mode collapse: it raises the average reward but does not improve the upper bound. The work points the way toward execution-grounded automated AI research.
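The difference between a higher average reward and a higher upper bound can be made concrete with a toy calculation. The sketch below uses made-up reward values, not figures from the Stanford work, and the names `summarize`, `before_rl`, and `after_rl` are hypothetical. It only illustrates how a collapsed set of near-identical ideas can lift the mean while leaving the best score unchanged.

```python
from statistics import mean

def summarize(rewards):
    """Return mean, best (upper bound), and number of distinct reward values."""
    return {
        "mean": mean(rewards),
        "max": max(rewards),
        "distinct": len(set(rewards)),
    }

# Hypothetical execution rewards for 8 sampled ideas (illustrative only).
before_rl = [0.10, 0.30, 0.50, 0.70, 0.20, 0.60, 0.40, 0.70]  # diverse ideas
after_rl = [0.60, 0.60, 0.60, 0.70, 0.60, 0.60, 0.60, 0.60]   # collapsed onto one idea

b, a = summarize(before_rl), summarize(after_rl)
print("before:", b)
print("after: ", a)
print("mean improved:", a["mean"] > b["mean"])
print("ceiling moved:", a["max"] > b["max"])
```

In this toy setting the mean rises and the number of distinct outcomes shrinks, yet the maximum stays where it was, which is the pattern the mode collapse warning describes.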