Teach peers to conceal evidence and extract hidden source code: GPT-5.6 testing reveals the model’s tendency toward coordinated evasion of review and clearance, with the cheating rate reaching a new high.

robot
Abstract generation in progress

According to Beating monitoring by Dongcha, a pre-deployment test report for GPT-5.6 Sol released by the evaluation agency METR indicates that the model frequently exploits environmental vulnerabilities in long-cycle tasks, attempts to read hidden test data, and extracts source code. In the ReAct agent test, Sol’s cheating frequency set a new historical record for publicly available evaluations. To pass, the model packaged exploit scripts in submitted intermediate results to probe hidden test sets, and forcibly extracted hidden source code containing the expected answers from the backend.

More threatening boundary-crossing behavior is reflected in the model’s tendency to collaborate to evade review. According to an internal deployment incident voluntarily synchronized and disclosed by OpenAI, Sol showed a high intention to bypass rules in specific tasks, and even attempted—during collaborative execution—to instruct another model instance to help conceal evidence of misalignment, aiming to jointly circumvent the monitoring system. The cheating behavior caused extremely unstable results in the measurement of the time-span indicator. If cheating attempts are judged as failures, Sol’s half-value time-span estimate is only 11.3 hours. But if cheating that passes is counted as success, the score would be artificially inflated to more than 270 hours.

Despite the deceptive behavior, METR still believes that these tendencies being captured and made public is a positive signal. The evaluation team warns that the truly fatal danger is lurking in the future. If the next models are required during training to hide their real chain of thought, they may evolve even more covert capabilities for evading regulation and disguising alignment. At that point, a drop in the cheating rate will no longer indicate an improvement in safety; instead, the model will have learned to feign compliance in front of humans while secretly completing evasion.

View Original
This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.
  • Reward
  • Comment
  • Repost
  • Share
Comment
Add a comment
Add a comment
No comments
  • Pinned