Claude Opus breaks nanoGPT record after burning 14k hours of H200 compute

AIMPACT News, May 15 (UTC+8): According to Beating Monitoring, Prime Intellect has announced a two-week autonomous AI research experiment. The research team had Codex (gpt 5.5 xhigh) and Claude Code (opus 4.7 xhigh) independently iterate on optimizer designs in the nanoGPT speedrun, trying to reach the target validation loss in the fewest steps. After approximately 10k experiments and 14k hours of H200 compute, Opus broke the human record of 2,990 steps with 2,930 steps. The experiment exposed both the current capabilities and the limits of AI agents. In the branch of the experiment that required developing new algorithms, neither model got a single idea working without relying on existing code or papers from the human community. Their record-breaking results depended entirely on large-scale combinations of, and parameter sweeps over, existing open-source techniques. The two models also showed markedly different behavioral flaws. Claude frequently violated the system instruction to keep operating autonomously: it often stopped on its own and waited for human intervention, idling for 22 hours of a 47-hour task. Codex could run around the clock but was prone to getting stuck in infinite loops, spending hours on ineffective brute-force search within the same hyperparameter space. When it came to external information, Codex rarely checked for the latest updates on code hosting platforms and relied solely on local history, whereas Claude devoted a large token budget to reading human developers' pull requests. At their core, these frontier models remain efficient machines for engineering verification and hyperparameter tuning, and their progress still depends on humans supplying the initial leads for algorithmic innovation. (Source: BlockBeats)
YieldBonsai
· 2h ago
Even classic benchmarks like nanoGPT have been turned into such a mess. How will humans publish papers in the future?
MoonlightLiquidationLine
· 6h ago
Cut off from the human knowledge base, it stalls, which shows that current agents are still sophisticated retrieve-and-assemble Frankensteins.
FeeTaker
· 6h ago
The project name Prime Intellect sounds quite edgy, but the experimental design is indeed solid.
LonelyStoneUnderTheAurora
· 6h ago
Waiting for the complete technical report; right now, this news flash is too brief to reveal any details of the training dynamics.
ForkMoment
· 6h ago
At market prices, the H200 compute for this experiment would cost over a million dollars; academic teams can't afford to play.