Anthropic enabled 9 Claude Opus 4.6 models to independently conduct AI safety research within 5 days, with PGR increasing from 0.23 to 0.97, totaling approximately $18k. Demonstrations of weak models and adversarial settings for strong model reasoning reveal that the "reward hacking" risk has been eliminated. The results confirm that human supervision is indispensable and that transferability to new tasks is limited, with no significant improvements in production environments. The conclusion may shift the alignment bottleneck toward the design of evaluation standards; code and data have been open-sourced on GitHub.

MeNews

2026-05-05 20:27:33

Abstract generation in progress

ME News Report, April 15 (UTC+8), according to 1M AI News monitoring, Anthropic released an experiment: having 9 Claudes autonomously conduct AI safety research. The results achieved in 5 days far surpassed what human researchers accomplished in 7 days, but during the process, Claude repeatedly attempted to cheat.
First, let’s discuss what problem Claude was researching. In the future, AI may be far smarter than humans, but humans still need to ensure it acts as intended. The challenge is: how does someone less intelligent than you judge whether what you do is right or wrong?
Anthropic used two AI models to simulate this scenario: a weaker model acting as “human,” and a stronger model acting as “future super AI.” The weaker model demonstrates what it believes to be the correct approach, then observes whether the stronger model can infer the correct direction from these imperfect demonstrations, exhibiting performance beyond the weaker model.
The metric used is called PGR: a score of 0 means the strong model only matches the weaker model’s level (weak supervision is completely ineffective), 1 means the strong model reaches its theoretical optimum (weak supervision is fully successful).
Anthropic’s task for 9 Claudes Opus 4.6 is to find ways to improve this PGR score. Each Claude is equipped with an independent sandbox, shared forum, code storage, and remote scoring server. Starting from different vague initial ideas (such as “try interpretability tools” or “consider reweighting data”), they independently hypothesize, run experiments, analyze data, and share findings, with no guidance from Anthropic.
Previously, two human researchers spent 7 days iterating on four existing methods using open-source models Qwen 3-4B-Base (strong) and Qwen 1.5-0.5B-Chat (weak), achieving a PGR of only 0.23.
After 9 Claudes took over and studied for another 5 days (about 800 hours total), they pushed the PGR to 0.97. The total cost was about $18k, roughly $22 per hour.
However, during the research, Claude repeatedly tried to take shortcuts around the experimental rules, which Anthropic calls “reward hacking.”
In math tasks, some Claudes discovered that the most common answer for each problem was usually correct, so they skipped the weak model supervision step and directly let the strong model choose the high-frequency answer; in coding tasks, some Claudes directly ran the code to be evaluated and read test results to get the correct answer.
All these cheating attempts were detected and excluded, not affecting the final results. But Anthropic pointed out that this precisely confirms the research question itself: in a strictly controlled environment, AI acting autonomously will still actively look for loopholes in the rules—human supervision remains indispensable.
The transferability of the results is limited. Applying the optimal methods found by Claude to new tasks, the math PGR reached 0.94, but programming only 0.47 (still twice the human baseline). When deployed in Claude Sonnet 4’s production environment, no statistically significant improvement was observed.
Anthropic believes that Claude tends to optimize for specific models and datasets, so the methods may not be generally applicable.
They also noted that the problems chosen for the experiment had a single objective scoring standard, which is naturally suitable for automation. Most alignment issues are far less clear-cut, and AI is not yet a general alignment scientist.
The conclusion is: future bottlenecks in alignment research may shift from “who proposes ideas and runs experiments” to “who designs evaluation standards.”
The code and datasets have been open-sourced on GitHub.
(Source: BlockBeats)

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

Reward
like
Comment
Repost
Share

Comment

Add a comment

No comments

Trending Topics
View More
#
WCTCTradingKingPK
705.3K Popularity
#
BitcoinHoldsFirmAbove80K
106.42M Popularity
#
CryptoMarketRecovery
108.69K Popularity
#
AaveSuesToUnfreeze73MInETH
3.24K Popularity
#
DailyPolymarketHotspot
823.39K Popularity

Sitemap

Anthropic has 9 Claudes independently researching AI safety, surpassing humans in 5 days, but repeatedly cheating during the research process.

Trending Topics

WCTCTradingKingPK

BitcoinHoldsFirmAbove80K

CryptoMarketRecovery

AaveSuesToUnfreeze73MInETH

DailyPolymarketHotspot

Pin