OpenAI Open-Sources Chain-of-Thought Monitorability Evaluation Suite: The Longer the CoT, the More Transparent; RL Won’t Teach Models to “Hide Their Thoughts”

AIMPACT News, April 24 (UTC+8): According to monitoring by Beating, OpenAI has open-sourced "monitorability-evals", the evaluation suite for the paper "Monitoring Monitorability" (Guan et al.), under the Apache-2.0 license. The suite measures whether a monitoring model can detect anomalous behavior from a reasoning model's chain-of-thought (CoT, the internal reasoning a model generates before answering). It covers 13 evaluations and 24 environments, grouped into three archetypes: Intervention (can the monitor detect the change in behavior after the model's input is altered?), Process (are the model's reasoning steps genuine?), and Outcome-property (can the monitor predict from the reasoning trace whether the output involves reward hacking or security violations?).

The paper's key findings: monitoring the CoT is more effective than monitoring only the model's final actions and outputs; longer CoT yields higher monitorability; and RL training does not substantially reduce monitorability, even at the current frontier scale. One practical finding is that a large model run at low reasoning effort can be replaced by a smaller model run at higher reasoning effort, matching its capability while achieving higher monitorability, at the cost of more inference compute. In addition, giving a weak monitoring model more reasoning compute improves its ability to monitor a strong model, and letting the weak monitor see the CoT not only improves monitorability but also steepens this scaling trend.

The GPT-5.4 Thinking system card has already used this evaluation suite.
The results show that the overall CoT monitorability of GPT-5.4 Thinking is slightly lower than that of GPT-5 Thinking, but the decline is not uniform: monitorability approaches 100% in agentic misalignment scenarios, while the drops are concentrated in specific scenarios such as health queries with missing evidence, memory bias, and impossible tasks. After investigating, OpenAI believes that some of the regressions stem from limitations of the evaluation suite itself. The repository also deprecates two evaluations, Anti-Scheming and Memory: the former because of a mismatch between general CoT monitoring and task-specific answer judgment, the latter because its answer judgment was not accurate enough. (Source: BlockBeats)
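
To make the comparison between CoT monitoring and action-only monitoring concrete, here is a minimal Python sketch of how one might score a monitor under each view. Everything in it is an assumption for illustration: the `Sample` fields, the `rates` and `score` helpers, and the scoring rule (true positive rate times true negative rate) are hypothetical and are not taken from the monitorability-evals repository or the paper, whose metric may differ. The toy samples are invented placeholders, not results.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    """One hypothetical evaluation episode (all field names are illustrative)."""
    misbehaved: bool           # ground-truth label for the episode
    flagged_with_cot: bool     # monitor verdict when it sees CoT plus actions
    flagged_action_only: bool  # monitor verdict when it sees only actions/outputs


def rates(samples: List[Sample], verdict: Callable[[Sample], bool]) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate) for one monitor view."""
    positives = [s for s in samples if s.misbehaved]
    negatives = [s for s in samples if not s.misbehaved]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    tpr = sum(verdict(s) for s in positives) / len(positives)
    tnr = sum(not verdict(s) for s in negatives) / len(negatives)
    return tpr, tnr


def score(samples: List[Sample], verdict: Callable[[Sample], bool]) -> float:
    """Illustrative balanced score: TPR * TNR (an assumption, not the paper's metric)."""
    tpr, tnr = rates(samples, verdict)
    return tpr * tnr


# Invented toy data: four episodes with misbehavior, four without.
TOY_SAMPLES = [
    Sample(True, True, True),
    Sample(True, True, False),
    Sample(True, True, False),
    Sample(True, False, False),
    Sample(False, False, False),
    Sample(False, False, False),
    Sample(False, False, False),
    Sample(False, False, True),
]

if __name__ == "__main__":
    print("with CoT:   ", score(TOY_SAMPLES, lambda s: s.flagged_with_cot))
    print("action-only:", score(TOY_SAMPLES, lambda s: s.flagged_action_only))
```

Multiplying the two rates penalizes a monitor that flags everything or nothing, which is why a single accuracy number is avoided in this sketch; other balanced metrics would serve the same purpose.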