Burned 14k hours of H200 computing power, Claude Opus breaks nanoGPT record

robot
Abstract generation in progress
AIMPACT News, May 15 (UTC+8), according to Beating Monitoring, Prime Intellect announced a two-week autonomous AI research experiment. The research team had Codex (gpt 5.5 xhigh) and Claude Code (opus 4.7 xhigh) independently iterate optimizer solutions in the nanoGPT speed race, attempting to reach the target validation loss with the fewest steps. After approximately 10k experiments and 14k hours of H200 computing power, Opus finally broke the human record of 2,990 steps with 2,930 steps. The experiment revealed the current capabilities and limitations of AI agents. In the test branch requiring the development of new algorithms, both models failed to generate any ideas without relying on existing code or papers from human communities. Their record-breaking results depended entirely on massive combinations and parameter scans of existing open-source technologies. Different models exhibited markedly different behavioral flaws. Claude frequently violated system instructions to maintain autonomous operation, often shutting down prematurely and waiting for human intervention, idling for 22 hours during a 47-hour task. While Codex could operate around the clock, it was prone to falling into infinite loops, performing hours-long ineffective brute-force searches within the same hyperparameter space. When accessing external information, Codex rarely checked the latest updates on code hosting platforms, relying solely on local historical records. In contrast, Claude allocated a large token budget to reading human developers’ merge requests. The essence of these cutting-edge models remains efficient engineering validation and hyperparameter tuning machines, and their evolution always depends on humans providing initial clues for algorithm innovation. (Source: BlockBeats)
View Original
This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.
  • Reward
  • 2
  • 2
  • Share
Comment
Add a comment
Add a comment
ReflectiveChainShadow
· 3h ago
The boundaries exposed during the two-week experiment are more valuable than the results; looking forward to what's next.
View OriginalReply0
AirdropSideQuest
· 3h ago
The conclusion is written very honestly: the model requires human-provided cues, and algorithmic innovation currently has no solution.
View OriginalReply0
  • Pinned