Reasoning Models Struggle to Control Their Chains of Thought
Authors: Yueh-Han Chen, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak

In collaboration with OpenAI

TL;DR: Chain-of-thought (CoT) monitoring can detect misbehavior in reasoning models, but only if models cannot control what they verbalize. To measure this undesirable ability, CoT Controllability, we introduce CoT-Control, an evaluation suite requiring models to solve problems while adhering to CoT constraints (e.g., reasoning about genetics without using "chromosome"). OpenAI just announced that they will start tracking CoT controllability in future models, starting from GPT-5.4. We find that reasoning models have significantly lower CoT controllability than output controllability; Claude Sonnet 4.5 controls its CoT only 2.7% of the time, versus 61.9% for final outputs. CoT controllability increases with model size but decreases with more RL training, test-time compute, and problem difficulty. These limitations persist under incentives to evade monitoring and adversarially optimized prompts, though models show slightly higher controllability when told they are being monitored. Our results suggest CoT controllability is currently unlikely to undermine CoT monitoring, but the underlying mechanism remains poorly understood. We recommend that frontier labs track CoT controllability in future models.

Paper | Twitter Thread | OpenAI Blog Post | CoT-Control Eval Suite

Research Summary

Highly capable AI agents may pose severe risks if misaligned and deployed autonomously. CoT monitoring offers a promising approach to detecting misalignment, but as models become aware that their reasoning is monitored, they may learn to strategically suppress or shape their CoTs, a capability we term CoT controllability. Directly measuring intentional evasion is difficult. Current models seem to lack reasons to do so, and even when given reasons, they may fail due to unreliability or honest self-reporting; we therefore focus on a prox