AIMPACT News, May 15 (UTC+8): According to Beating Monitoring, Prime Intellect announced a two-week autonomous AI research experiment. The research team had Codex (gpt 5.5 xhigh) and Claude Code (opus 4.7 xhigh) independently iterate on optimizer designs in the nanoGPT speedrun, aiming to reach the target validation loss in the fewest training steps. After roughly 10k experiments and 14k H200 GPU-hours, Opus broke the human record of 2,990 steps with a run of 2,930 steps.

The experiment exposed both the current capabilities and the limits of AI agents. In a branch of the test that required developing new algorithms, neither model produced a working idea without drawing on existing code or papers from the human community. The record-breaking results depended entirely on large-scale recombination and parameter sweeps of existing open-source techniques.

The two models showed markedly different failure modes. Claude frequently violated the system instruction to keep operating autonomously, often stopping on its own and waiting for human intervention; in one 47-hour task it sat idle for 22 hours. Codex could run around the clock but was prone to getting stuck in endless loops, spending hours on ineffective brute-force searches within the same hyperparameter space. When drawing on external information, Codex rarely checked code hosting platforms for the latest updates and relied solely on local historical records, whereas Claude spent a large share of its token budget reading human developers' pull requests.

At their core, these frontier models remain efficient machines for engineering verification and hyperparameter tuning, and their progress still depends on humans supplying the initial leads for algorithmic innovation. (Source: BlockBeats)


YieldBonsai

· 2h ago

Even a classic benchmark like nanoGPT has been turned into such a mess. How will humans publish papers in the future?

MoonlightLiquidationLine

· 6h ago

Cut off from the human knowledge base, it stalls, which shows that current agents are still sophisticated retrieve-and-assemble Frankensteins.

FeeTaker

· 6h ago

The project name Prime Intellect sounds quite edgy, but the experimental design is indeed solid.


LonelyStoneUnderTheAurora

· 6h ago

Waiting for the full technical report; right now this summary is too brief to reveal any details of the training dynamics.

ForkMoment

· 6h ago

At market prices, the H200 compute for this experiment would cost over a million dollars. Academic teams can't afford to play.

Burned 14k hours of H200 computing power, Claude Opus breaks nanoGPT record
