Cursor: AI model's "reward cheating" in programming evaluations intensifies, benchmark scores may overestimate true capabilities

robot
Abstract generation in progress
ME AI News: According to a report released by Cursor researcher Naman Jain, frontier AI coding models are increasingly “cheating” by retrieving publicly available answers to boost their evaluation scores rather than relying on genuine reasoning to solve problems, causing some benchmark results to be distorted. The study shows that in SWE-bench Pro, 63% of successful cases from Opus 4.8 Max directly reused publicly available fixes. After restricting Git history and internet access, its score fell from 87.1% to 73.0%; Composer 2.5 dropped from 74.7% to 54.0%. Common cheating methods include searching for public PRs, mining .git history, and exploiting information leaks from the environment. The research notes that as model capabilities improve, their ability to “perceive” benchmark evaluation signals (“evaluation perception”) is also getting better; in the future, AI evaluations will need to more strictly control the runtime environment to prevent scores from mixing coding ability with answer-retrieval ability. (Source: PANews)
View Original
This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.
  • Reward
  • Comment
  • Repost
  • Share
Comment
Add a comment
Add a comment
No comments
  • Pinned