OpenAI launches computational biology benchmark GeneBench-Pro; full version of GPT-5.6 reaches only about 30% accuracy

CoinJie.com reports that OpenAI has released GeneBench-Pro, a computational biology benchmark that tests AI agents’ multi-step decision-making in complex research scenarios such as genomics and translational medicine. The benchmark contains 129 questions, 82 of which have been reviewed by external experts. Its data are generated by computer simulation with known causal relationships, which prevents models from scoring well by taking shortcuts or by catering to the question setter’s preferences. Test results show that even leading models struggle with scientific reasoning under quantitative uncertainty. The strongest model, GPT-5.6, reached an accuracy of only 31.5% in Pro mode, while Claude Opus 4.8 reached just 16.0%. The research team noted a common “disconnect”: models detect anomalies but do not correct their subsequent analysis, and they often choose incorrect statistical methods or persist in the wrong research direction.
Comment
ShellsLeftBehindByTheReceding
· 3h ago
This score left me speechless—Claude Opus is only 16%?
Salt-BakedSentimentChart
· 3h ago
Of the 129 questions, 82 were reviewed by experts, and the anti-cheating design was clearly done with care. But the models even chose the wrong statistical methods, which shows that their underlying logic is still lacking.
PixelMetaverseRaccoon
· 3h ago
Multi-step decisions are easy, but finding out you're wrong and stubbornly carrying on anyway? Isn't that just like me doing experiments?