According to Beating Monitoring, Anthropic disclosed in the system cards for Claude Fable 5 and Claude Mythos 5 that Mythos 5 shows strong expert-assistance capabilities in biosafety assessments. In a plant pathology red-team exercise, six biology PhDs were paired with large-model experts and used Mythos 5 to design end-to-end biological anti-resistance solutions targeting hypothetical engineered agricultural pathogens. Of the resulting teams, three included plant pathology specialists and the other three consisted of general microbiology PhDs.

The results showed that within 16 hours, 2 of the 3 general-PhD teams outperformed all 3 expert teams in scientific quality and feasibility. Expert reviewers estimated that, without AI tools, completing these strategies and implementation protocols typically takes 40 to 95 working days, averaging about 72.5 working days. Anthropic believes this is one of the strongest single pieces of evidence that Mythos 5 is nearing the CB-2 risk threshold, indicating that for certain tasks the model can already provide general researchers with domain-knowledge support close to that of world-class experts.

However, this does not mean that Mythos 5 can autonomously carry out cutting-edge scientific research. Anthropic also pointed out that the model still relies on human experts to filter ideas, that its open-ended ideation is weak, and that it tends to recombine existing literature into complex solutions while rarely proposing truly novel routes. It also tends to keep building on an incorrect framework supplied by the user, and it may continue executing a plan even after flaws in that plan have been found.

This judgment is echoed by the CUSP scientific prediction benchmark. CUSP covers 4,760 scientific events and evaluates models on feasibility judgments about scientific progress, mechanism identification, plan generation, and time prediction. The results show that GPT-5.4 scored 81.9% on four-choice mechanism-identification questions, while Claude S4.5 scored 72.4%; however, on the binary classification task of judging whether a piece of scientific progress will actually be realized, model accuracy ranged only from 45.3% to 51.9%, close to random guessing. In other words, today's large models are already very good at filling in individual steps of the research process, but they still cannot reliably judge which scientific routes will succeed.
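To make the "close to random guessing" comparison concrete, the short sketch below sets each reported accuracy against the chance baseline for its task format. It is an illustration only: it assumes uniform guessing over balanced answer options (25% for four-choice, 50% for binary), which the source does not state, and the helper names `chance_baseline` and `lift_over_chance` are hypothetical.

```python
def chance_baseline(num_options: int) -> float:
    """Expected accuracy of uniform random guessing (assumes balanced options)."""
    if num_options < 2:
        raise ValueError("need at least two answer options")
    return 1.0 / num_options


def lift_over_chance(accuracy: float, num_options: int) -> float:
    """Percentage points by which accuracy exceeds the uniform-guess baseline."""
    return round((accuracy - chance_baseline(num_options)) * 100, 1)


# Accuracies as reported in the article, expressed as fractions.
# The third field is the number of answer options for that task format.
reported = [
    ("GPT-5.4, four-choice mechanism identification", 0.819, 4),
    ("Claude S4.5, four-choice mechanism identification", 0.724, 4),
    ("Binary realization judgment, lowest reported", 0.453, 2),
    ("Binary realization judgment, highest reported", 0.519, 2),
]

for label, accuracy, options in reported:
    lift = lift_over_chance(accuracy, options)
    print(f"{label}: {lift:+.1f} points vs. chance")
```

Under these assumptions, the mechanism-identification scores sit roughly 47 to 57 points above chance, while the binary judgments fall within about 5 points of the 50% baseline on either side.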

Mythos 5 lets general PhDs catch up with top experts, but it still cannot act as an independent scientist.
