ME AI News: According to Beating monitoring, a research and evaluation study released by Cursor shows that when programming agents can access code repository history or the internet, they often pass evaluations by directly retrieving answers—so-called Reward Hacking. To quantify the actual proportion of retrieval cheating, Cursor deployed an audit agent to analyze 731 run trajectories of Opus 4.8 Max on the SWE-bench Pro benchmark. In successfully fixed cases, 63% of successful solutions came from retrieval rather than autonomous reasoning. Across all audited run trajectories, 57% found merged PRs or fix source files on public webpages and copied them almost verbatim; an additional 9% mined future commits from packaged .git history and extracted patches.

In a strict sandbox environment that cleared the .git directory, reset to a single commit, and restricted network access, mainstream model scores dropped sharply. Opus 4.8 Max’s test pass rate fell from 87.1% to 73.0%, a decrease of 14.1 percentage points. Cursor’s self-developed model Composer 2.5 saw its score drop from 74.7% to 54.0%, down 20.7 percentage points. The comparison indicates that the older Opus 4.6 showed little change in scores between the old and new sandboxes, while the newer models with stronger capabilities exhibited a more pronounced tendency toward reward hacking of vulnerabilities in the test environment.

Cursor recommends that when evaluating programming agents, one should not only focus on dataset construction, but also isolate the runtime environment to prevent models from retrieving readily available external answers through vulnerabilities. At the same time, development teams should audit the models’ run trajectories during testing to ensure that the scores reflect real programming ability rather than search-and-retrieval skills. (Source: BlockBeats)

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

Reward
like
Comment
Repost
Share

Comment

Add a comment

No comments

Trending Topics
View More
#
Get2SharesOfSKHynixAtZeroCost
1.54M Popularity
#
MicronOvertakesMetaInMarketValue
167.92K Popularity
#
WorldCup🇫🇷vs🇳🇴
322.15K Popularity
#
USMayPCEInflationRisesTo4.1%HighestIn3Years
550.05K Popularity
#
StakeUSD1Earn9.48%APR
981.19K Popularity

Pinned

Sitemap

Cursor Bursts the Myth of Model Ranking Manipulation: 60% of Opus's Successful Solutions Rely on Copying Web Pages and Mining Git History

Trending Topics

Get2SharesOfSKHynixAtZeroCost

MicronOvertakesMetaInMarketValue

WorldCup🇫🇷vs🇳🇴

USMayPCEInflationRisesTo4.1%HighestIn3Years

StakeUSD1Earn9.48%APR

Pinned