Even with 4,760 milestones fed in, predicting new discoveries is still at coin-flip level. However strong their mechanistic reasoning, models remain lost in unknown fields that have no standard answers.

CoinNetwork
AI cannot yet act as an independent scientist: the CUSP evaluation shows that large models lack scientific foresight.
Researchers from Stanford, Oxford, and the Allen Institute for AI have launched CUSP, a temporal benchmark that evaluates AI's ability to predict scientific progress. Tests of GPT-5.4, Claude Sonnet 4.5, DeepSeek R1, and other models show strong performance in mechanistic reasoning about the pathways of existing technologies. Their predictions of whether a new discovery will occur, however, are close to random, and their estimates of when breakthroughs happen show a systematic lag. CUSP applies a time-based knowledge cutoff and compiles recent developments published in Nature and Science. The benchmark covers 4,760 milestones and 17,429 tasks. The conclusion is that, in scientific exploration without standard answers, existing models cannot provide reliable forward-looking judgments.