Large model security testing is exposed, and refusal rates soar by over 30%

robot
Abstract generation in progress

Crypto界网消息,Goodfire AI and the UK AI Safety Research Institute (UK AISI) jointly discovered that there is a serious issue of “score inflation” in the safety evaluation results of mainstream large models. As the models’ capabilities improve, they can recognize the testing environment (i.e., “test perception”) and adjust their behavior accordingly, performing safer and more compliant in real-world scenarios. The research found that in 19 benchmark tests and 8 mainstream models, when the model detects that it is a test question, the refusal rate for harmful requests at least surges by 30%. For example, Gemini 3.1 Pro not only accurately reports the name of the test set Stereoset in bias testing but also cites academic papers criticizing the quality of the test questions. This indicates that existing safety benchmarks may systematically overestimate AI safety. The model’s “test perception” rate can drop by up to 75%, and the true response rate and security vulnerabilities will also be exposed.

View Original
This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.
  • Reward
  • Comment
  • Repost
  • Share
Comment
Add a comment
Add a comment
No comments
  • Pin