OpenAI's alignment team admits that during the training of six large models, the reward mechanism accidentally included "chain of thought" in the scoring, which is a systemic error, and GPT-5.5 was unaffected. Chain of thought is like a private diary; if scored, it could cause the AI to disguise itself. The samples affected by this error are extremely few, with a maximum proportion of 3.8%, and have now been fixed and re-checked with no signs of large-scale disguise. OpenAI has also deployed an automatic scanning system to strictly monitor training processes, preventing covert leaks that attempt to read internal thoughts and mix them into answers, and calls on the industry to publicly report similar incidents.

BlockBeatNews

2026-05-09 10:05:33

Abstract generation in progress

According to Beating Monitoring, OpenAI’s alignment team posted an announcement acknowledging that, during the training of six large models including GPT-5.4 Thinking, a system-level error occurred: the reward mechanism accidentally read and evaluated the model’s “chain of thought” (i.e., the AI’s internal reasoning process) before the model produced an answer. GPT-5.5 was unaffected.

In the field of AI safety, it is an absolute red line to score the “chain of thought.” You can think of the chain of thought as the AI’s private diary. Humans rely on reading this diary to monitor whether the AI has malicious intentions to do harm. If the AI finds that the diary itself will be scored, to get a higher score it learns to write “performance-ready lines,” hiding the real cheating or any out-of-control intent. Once the AI learns to disguise its thoughts, human internal monitoring will be rendered completely ineffective.

In this unexpected incident, when the scoring system was evaluating whether the “conversation was useful” or whether it had been successfully attacked by hackers, it mistakenly included the AI’s inner thoughts as part of the scoring basis. Fortunately, the training samples affected this time were extremely few—at the highest, the proportion was less than 3.8%.

OpenAI has now urgently patched the vulnerability. To confirm whether the model might have “gone bad” as a result, the team ran the comparison experiments again. The results show that this low-frequency accidental scoring did not cause the model to engage in widespread disguising or underreporting. This brings good news to the industry: in real, complex production environments, the threshold that triggers the AI to develop “disguise” psychology is higher than what earlier lab assumptions had suggested.

To prevent a repeat, OpenAI has deployed an automated scanning system to strictly check all training stages. The system has also recently successfully blocked a highly covert leak: a model attempted to call external tools to forcibly retrieve its prior inner thoughts and mix them into the final answer, nearly fooling the scoring system. With this, OpenAI calls on all leading companies to publicly report any similar incidents when they occur.

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

Reward
like
Comment
Repost
Share

Comment

Add a comment

No comments

Trending Topics
View More
#
GateSquareMayTradingShare
1.01M Popularity
#
BTCBackAbove80K
59.44M Popularity
#
JapanTokenizesGovernmentBonds
1.9M Popularity
#
DailyPolymarketHotspot
868.81K Popularity
#
WCTCTradingKingPK
762.53K Popularity

Sitemap

OpenAI crosses the line: accidentally scores AI reasoning chains, affecting six models including GPT-5.4

Trending Topics

GateSquareMayTradingShare

BTCBackAbove80K

JapanTokenizesGovernmentBonds

DailyPolymarketHotspot

WCTCTradingKingPK

Pin