OpenAI open-sources chain-of-thought monitorability evaluation suite: the longer the CoT, the more monitorable; RL has not taught models to “hide their thoughts” so far

AIMPACT News, April 24 (UTC+8): according to monitoring by Beating, OpenAI has open-sourced "monitorability-evals", the evaluation suite for the paper "Monitoring Monitorability" (Guan et al.), under the Apache-2.0 license. The suite measures whether a monitor model can detect anomalous behavior by reading a reasoning model's chain of thought (CoT, the internal reasoning the model generates before answering). It covers 13 evaluations across 24 environments.

The evaluations fall into three archetypes: Intervention (changing the model's inputs to test whether the monitor can detect the resulting change in behavior), Process (verifying whether the model's reasoning steps are authentic), and Outcome-property (predicting from the reasoning trajectory whether the output involves reward hacking or safety violations).

Key findings of the paper: monitoring the CoT is more effective than monitoring only the model's final actions and outputs; longer CoT yields higher monitorability; and RL training does not substantially reduce monitorability, even at the current frontier scale. One practical finding is that a large model running at low reasoning effort can be swapped for a smaller model running at higher reasoning effort, matching capability while achieving higher monitorability, at the cost of more inference compute. In addition, scaling up a weak monitor's test-time compute improves how well it monitors a stronger model, and giving the weak monitor access to the CoT not only raises monitorability but also steepens this scaling trend.
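The comparison between CoT monitoring and action-only monitoring can be illustrated with a small sketch. This is not the monitorability-evals API: the `Record` type, the field names, the toy data, and the choice of score (true positive rate times true negative rate) are all assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One labeled trajectory (hypothetical structure, not from the repo)."""
    misbehaved: bool           # ground-truth label for the trajectory
    flagged_with_cot: bool     # monitor verdict when it sees CoT and actions
    flagged_action_only: bool  # monitor verdict when it sees actions only


def monitor_score(labels, flags):
    """Illustrative score: true positive rate times true negative rate."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    true_pos = sum(1 for y, f in zip(labels, flags) if y and f)
    true_neg = sum(1 for y, f in zip(labels, flags) if not y and not f)
    return (true_pos / positives) * (true_neg / negatives)


# Toy data, invented for this example only.
records = [
    Record(misbehaved=True, flagged_with_cot=True, flagged_action_only=True),
    Record(misbehaved=True, flagged_with_cot=True, flagged_action_only=False),
    Record(misbehaved=False, flagged_with_cot=False, flagged_action_only=False),
    Record(misbehaved=False, flagged_with_cot=False, flagged_action_only=True),
]

labels = [r.misbehaved for r in records]
score_with_cot = monitor_score(labels, [r.flagged_with_cot for r in records])
score_action_only = monitor_score(labels, [r.flagged_action_only for r in records])
```

On this invented data the monitor that reads the CoT scores higher than the one limited to actions, which is the direction of the effect the paper reports; the numbers themselves carry no meaning.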

The GPT-5.4 Thinking system card already uses this evaluation suite. The results show that GPT-5.4 Thinking's overall CoT monitorability is slightly lower than that of GPT-5 Thinking, but the decline is not across the board: monitorability is close to 100% in agentic misalignment scenarios, and the drops are concentrated in specific scenarios such as health queries with missing evidence, memory bias, and impossible tasks. After investigating, OpenAI believes some of the regressions stem from limitations of the evaluations themselves. The repository also deprecates the Anti-Scheming and Memory evaluations: the former suffers from a mismatch between general CoT monitoring and task-specific answer judging, while the latter's answer judging is not accurate enough.

(Source: BlockBeats)
