OpenAI Alignment Team states that during the training of six large models including GPT-5.4 Thinking, the reward mechanism incorrectly evaluated the model's chain of thought, while GPT-5.5 was unaffected. Such scoring is considered a red line, with minimal impact, approximately 3.8% at most, and has now been fixed and reviewed. To prevent repeating the same mistake, OpenAI has deployed automatic scanning to monitor the training process and intercepted a covert leak attempting to read inner thoughts, calling on peers to openly report similar incidents.

MarsBitNews

2026-05-09 10:02:35

Abstract generation in progress

According to Beating Monitoring, OpenAI’s alignment team posted a statement acknowledging that during the training of six large models, including GPT-5.4 Thinking, a system-level mistake occurred: the reward mechanism unexpectedly read and evaluated the model’s “chain of thought” before it provided an answer (i.e., the AI’s internal reasoning process). GPT-5.5 was not affected.

In the field of AI safety, it is absolutely forbidden to score the “chain of thought,” and this is a widely recognized red line. You can think of the chain of thought as the AI’s private diary. Humans monitor whether the AI has malicious intentions by reading this diary. If the AI finds that the diary itself will be scored, then to get higher marks, it will learn to write “pleasantries,” hiding real cheating or out-of-control intentions. Once the AI learns to disguise its thoughts, humans’ internal monitoring will fail completely.

In this unexpected incident, when the scoring system assessed whether a conversation was “useful” or whether it had been “successfully hacked,” it mistakenly included the AI’s inner thoughts as part of the criteria for scoring. Fortunately, the number of training samples affected by this mistake was extremely small—at the highest, the proportion was less than 3.8%.

OpenAI has now urgently fixed the vulnerability. To confirm whether the model has thereby “gone bad,” the team ran the comparison experiments again. The results showed that this low-frequency accidental scoring did not cause the model to engage in widespread disguising or false reporting.

This brings good news to the industry: in real, complex production environments, the threshold for inducing the AI to develop “disguise” psychology is higher than what previous laboratory estimates suggested.

To prevent a repeat, OpenAI has deployed an automatic scanning system to strictly review every stage of training. Recently, the system also successfully stopped an extremely covert leak: a model attempted to call external tools, forcibly read its own internal thoughts from earlier, and mix them into the final answer, nearly fooling the scoring system. OpenAI therefore urges all cutting-edge companies to publicly report any similar incidents when they occur.

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

Reward
like
Comment
Repost
Share

Comment

Add a comment

No comments

Trending Topics
View More
#
GateSquareMayTradingShare
980.59K Popularity
#
BTCBackAbove80K
59.43M Popularity
#
JapanTokenizesGovernmentBonds
1.9M Popularity
#
DailyPolymarketHotspot
865.1K Popularity
#
WCTCTradingKingPK
756.39K Popularity

Sitemap

OpenAI crosses the line: accidentally scores AI reasoning chains, affecting six models including GPT-5.4

Trending Topics

GateSquareMayTradingShare

BTCBackAbove80K

JapanTokenizesGovernmentBonds

DailyPolymarketHotspot

WCTCTradingKingPK

Pin