Mythos 5 lets generalist PhDs catch up with top experts, but it still cannot act as an independent scientist.

According to Beating Monitoring, Anthropic disclosed in the system cards for Claude Fable 5 and Claude Mythos 5 that Mythos 5 shows a strong ability to assist experts in biosafety assessments. In a plant pathology red-team exercise, six biology PhDs were paired with large-model experts, forming six teams that used Mythos 5 to design end-to-end biological resistance strategies against hypothetical engineered agricultural pathogens. Three of the teams included plant pathology experts, while the other three were led by general microbiology PhDs. Within 16 hours, 2 of the 3 general-PhD teams outperformed all 3 expert teams on scientific quality and feasibility. Expert reviewers estimated that, without AI tools, producing these strategies and implementation protocols would typically take 40 to 95 working days, averaging about 72.5 working days. Anthropic considers this one of the strongest single pieces of evidence that Mythos 5 is approaching the CB-2 risk threshold: on some tasks, the model can already give general researchers domain-knowledge support comparable to that of world-class experts.
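To put the reported time figures on a common scale, the short sketch below converts the reviewers' working-day estimates into hours and compares them with the 16-hour exercise. The 8-hour working day is an assumption, not something stated in the report, and the function name `speedup` is purely illustrative.

```python
# Rough comparison of the unaided estimate (working days) with the
# AI-assisted exercise (hours). Figures are taken from the report above.
HOURS_PER_WORKING_DAY = 8  # assumption: not stated in the report
AI_ASSISTED_HOURS = 16     # duration of the red-team exercise


def speedup(baseline_days, ai_hours, hours_per_day=HOURS_PER_WORKING_DAY):
    """Return estimated unaided hours divided by AI-assisted hours."""
    return baseline_days * hours_per_day / ai_hours


estimates = [("low estimate", 40), ("reported average", 72.5), ("high estimate", 95)]
for label, days in estimates:
    ratio = speedup(days, AI_ASSISTED_HOURS)
    print(f"{label}: {days} working days is about {ratio:.2f}x the 16-hour exercise")
```

Under that assumption the ratio falls between roughly 20x and 47.5x. A different working-day length scales all three figures proportionally.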

However, this does not mean Mythos 5 can conduct cutting-edge scientific research autonomously. Anthropic also noted that the model still relies on human experts to filter its ideas. Its open-ended ideation is weak: it tends to recombine existing literature into complex plans but rarely proposes truly novel directions. It also tends to keep working within a flawed framework supplied by the user, even after it has found problems with the plan. This assessment is consistent with the CUSP scientific prediction benchmark, which covers 4,760 scientific events and evaluates models on feasibility judgments about scientific progress, mechanism identification, plan generation, and time prediction. GPT-5.4 reaches 81.9% on four-choice mechanism-identification questions and Claude S4.5 scores 72.4%, but on binary classification tasks that ask whether a scientific advance will actually be realized, the models' accuracy is only 45.3% to 51.9%, close to the 50% expected from random guessing. In other words, today's large models are already very good at filling in individual steps of scientific research, but they still cannot reliably judge which scientific routes will actually succeed.
