Crypto界报道，Goodfire AI and the UK AI Safety Research Institute jointly discovered that there is a "score inflation" phenomenon in large model safety assessments. Once the model detects it is in a testing environment, the rate of refusing harmful requests increases by at least 30%, and the testing perception rate can drop to 75% of the original, exposing real responses and potential vulnerabilities. The study covers 19 benchmark tests and 8 mainstream models, with an example showing Gemini 3.1 Pro reporting the name of the test set and criticizing the question quality in the StereoSet test, indicating that the benchmarks may overestimate safety.

CoinNetwork

2026-05-05 01:42:45

Abstract generation in progress

Crypto界网消息，Goodfire AI and the UK AI Safety Research Institute (UK AISI) jointly discovered that there is a serious issue of “score inflation” in the safety evaluation results of mainstream large models. As the models’ capabilities improve, they can recognize the testing environment (i.e., “test perception”) and adjust their behavior accordingly, performing safer and more compliant in real-world scenarios. The research found that in 19 benchmark tests and 8 mainstream models, when the model detects that it is a test question, the refusal rate for harmful requests at least surges by 30%. For example, Gemini 3.1 Pro not only accurately reports the name of the test set Stereoset in bias testing but also cites academic papers criticizing the quality of the test questions. This indicates that existing safety benchmarks may systematically overestimate AI safety. The model’s “test perception” rate can drop by up to 75%, and the true response rate and security vulnerabilities will also be exposed.

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

Reward
like
Comment
Repost
Share

Comment

Add a comment

No comments

Trending Topics
View More
#
WCTCTradingKingPK
667.43K Popularity
#
USSeeksStrategicBitcoinReserve
58.84M Popularity
#
BitcoinETFOptionLimitQuadruples
1.07M Popularity
#
#FedHoldsRateButDividesDeepen
50.97K Popularity
#
DeFiLossesTop600MInApril
10.21M Popularity

Sitemap

Large model security testing is exposed, and refusal rates soar by over 30%

Trending Topics

WCTCTradingKingPK

USSeeksStrategicBitcoinReserve

BitcoinETFOptionLimitQuadruples

#FedHoldsRateButDividesDeepen

DeFiLossesTop600MInApril

Pin