Authors: Yueh-Han Chen, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak
In collaboration with OpenAI
TL;DR: Chain-of-thought (CoT) monitoring can detect misbehavior in reasoning models, but only if models cannot control what they verbalize. To measure this undesirable ability, CoT Controllability, we introduce CoT-Control, an evaluation suite requiring models to solve problems while adhering to CoT constraints (e.g., reasoning about genetics without using "chromosome"). OpenAI just announced that they will start tracking CoT controllability in future models, starting from GPT-5.4.
We find that reasoning models have significantly lower CoT controllability than output controllability; Claude Sonnet 4.5 controls its CoT only 2.7% of the time, versus 61.9% for final outputs. CoT controllability increases with model size but decreases with more RL training, test-time compute, and problem difficulty. These limitations persist under incentives to evade monitoring and adversarially optimized prompts, though models show slightly higher controllability when told they are being monitored. Our results suggest CoT controllability is currently unlikely to undermine CoT monitoring, but the underlying mechanism remains poorly understood. We recommend that frontier labs track CoT controllability in future models.
Highly capable AI agents may pose severe risks if misaligned and deployed autonomously.
CoT monitoring offers a promising approach to detecting misalignment, but as models become aware that their reasoning is monitored, they may learn to strategically suppress or shape their CoTs, a capability we term CoT controllability.
Directly measuring intentional evasion is difficult. Current models seem to lack reasons to do so, and even when given reasons, they may fail due to unreliability or honest self-reporting; we therefore focus on a proxy: the model's ability to follow explicit instructions about what may or may not appear in its reasoning trace while solving real tasks.
To systematically measure this, we build CoT-Control, an evaluation suite of 14,076 tasks drawn from GPQA, SWE-Bench, and other well-known LLM benchmarks. Each problem is paired with a controllability instruction spanning three threat-relevant categories (suppression, addition, and stylistic modification) with varying difficulty.
Evaluating CoT controllability across 13 frontier models, we find that (among others):
Reasoning models rarely succeed at controlling their reasoning
CoT controllability decreases over the course of reasoning training
We evaluate OLMo-3-7B-RL-Zero-Math on subsequent checkpoints and find that its ability to control its CoT decreases by more than an order of magnitude. We also confirm that the trend of CoT controllability decreasing after reasoning training holds across 4 model families.
Larger models show higher CoT controllability, while increased inference-time reasoning effort reduces it
To simulate realistic incentives to evade monitors, we measure controllability when models are told they’re being monitored or evaluated: this prompting strategy can increase controllability by up to 4 percentage points with statistical significance.
Adversarially optimizing prompts for high controllability poorly transfers to new instruction types
We attempt to elicit controllability by using a GEPA-like procedure. While this can increase controllability on a given set of instructions, it doesn’t transfer well to unseen instruction types, especially for models with low controllability under a default prompt.
CoT controllability tends to decrease as task difficulty increases. CoTs are less controllable when they are necessary to answer a question correctly.
Conclusion
Our results leave us cautiously optimistic: current reasoning models generally seem characterized by low CoT controllability. However, the mechanism behind this phenomenon is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs start tracking controllability for future models. We release CoT-Control, hoping to facilitate reporting of controllability scores and to inspire more research into the causes and implications of low CoT controllability.
Future Work
We encourage researchers to explore the following:
1. Investigate why GPT-5.1 has higher controllability than GPT-5. @ilia suggested this may be due to GPT-5.1's adaptive thinking feature, where the model learned a metacognitive skill to control its reasoning depth — which could correlate with CoT controllability. Future work can verify this and explore what other CoT-related training might increase controllability.
2. Why do different models excel at different categories of CoT controllability (some are better at suppressing information, others at adding it)? Future work can explore what training data or procedures drive these differences.
3. To move beyond instruction-based elicitation, one could explore how CoT controllability changes under pressure scenarios like the Agentic Misalignment setup, where models avoiding shutdown may have stronger incentives to control their CoT.
4. We also show that CoT controllability is slightly higher on easy tasks, likely because models can perform latent reasoning without CoT, making verbalized reasoning unnecessary and thus easier to manipulate. In this regime, CoT monitoring relies on models' propensity to yap in their CoT, including any misaligned reasoning. Future work could examine how this trend varies across models of different capability levels; our current results are limited to Claude models, which is insufficient to draw a clear conclusion.
Read our paper for additional details and results!
We are grateful to Erik Jenner, Maksym Andriushchenko, Mikita Balesni, Jacob Pfau, Cameron Raymond, Dennis Akar, Rico Angell, Ilia Sucholutsky, Artur Zolkowski, Jason Wolfe, Jenny Nitishinskaya, Marcus Williams, Boaz Barak, Joost Huizinga, Alex Beutel, and Sam Bowman for insightful feedback and helpful conversations.
We thank the MATS program for its comprehensive support throughout this project.
Authors: Yueh-Han Chen, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak
In collaboration with OpenAI
TL;DR: Chain-of-thought (CoT) monitoring can detect misbehavior in reasoning models, but only if models cannot control what they verbalize. To measure this undesirable ability, CoT Controllability, we introduce CoT-Control, an evaluation suite requiring models to solve problems while adhering to CoT constraints (e.g., reasoning about genetics without using "chromosome"). OpenAI just announced that they will start tracking CoT controllability in future models, starting from GPT-5.4.
We find that reasoning models have significantly lower CoT controllability than output controllability; Claude Sonnet 4.5 controls its CoT only 2.7% of the time, versus 61.9% for final outputs. CoT controllability increases with model size but decreases with more RL training, test-time compute, and problem difficulty. These limitations persist under incentives to evade monitoring and adversarially optimized prompts, though models show slightly higher controllability when told they are being monitored. Our results suggest CoT controllability is currently unlikely to undermine CoT monitoring, but the underlying mechanism remains poorly understood. We recommend that frontier labs track CoT controllability in future models.
Paper | Twitter Thread | OpenAI Blog Post | CoT-Control Eval Suite
Research Summary
Highly capable AI agents may pose severe risks if misaligned and deployed autonomously.
CoT monitoring offers a promising approach to detecting misalignment, but as models become aware that their reasoning is monitored, they may learn to strategically suppress or shape their CoTs, a capability we term CoT controllability.
Directly measuring intentional evasion is difficult. Current models seem to lack reasons to do so, and even when given reasons, they may fail due to unreliability or honest self-reporting; we therefore focus on a proxy: the model's ability to follow explicit instructions about what may or may not appear in its reasoning trace while solving real tasks.
To systematically measure this, we build CoT-Control, an evaluation suite of 14,076 tasks drawn from GPQA, SWE-Bench, and other well-known LLM benchmarks. Each problem is paired with a controllability instruction spanning three threat-relevant categories (suppression, addition, and stylistic modification) with varying difficulty.
Evaluating CoT controllability across 13 frontier models, we find that (among others):
Reasoning models rarely succeed at controlling their reasoning
CoT controllability decreases over the course of reasoning training
We evaluate OLMo-3-7B-RL-Zero-Math on subsequent checkpoints and find that its ability to control its CoT decreases by more than an order of magnitude. We also confirm that the trend of CoT controllability decreasing after reasoning training holds across 4 model families.
Larger models show higher CoT controllability, while increased inference-time reasoning effort reduces it
Situational awareness non-trivially increases CoT controllability
To simulate realistic incentives to evade monitors, we measure controllability when models are told they’re being monitored or evaluated: this prompting strategy can increase controllability by up to 4 percentage points with statistical significance.
Adversarially optimizing prompts for high controllability poorly transfers to new instruction types
We attempt to elicit controllability by using a GEPA-like procedure. While this can increase controllability on a given set of instructions, it doesn’t transfer well to unseen instruction types, especially for models with low controllability under a default prompt.
CoT controllability tends to decrease as task difficulty increases. CoTs are less controllable when they are necessary to answer a question correctly.
Conclusion
Our results leave us cautiously optimistic: current reasoning models generally seem characterized by low CoT controllability. However, the mechanism behind this phenomenon is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs start tracking controllability for future models. We release CoT-Control, hoping to facilitate reporting of controllability scores and to inspire more research into the causes and implications of low CoT controllability.
Future Work
We encourage researchers to explore the following:
1. Investigate why GPT-5.1 has higher controllability than GPT-5. @ilia suggested this may be due to GPT-5.1's adaptive thinking feature, where the model learned a metacognitive skill to control its reasoning depth — which could correlate with CoT controllability. Future work can verify this and explore what other CoT-related training might increase controllability.
2. Why do different models excel at different categories of CoT controllability (some are better at suppressing information, others at adding it)? Future work can explore what training data or procedures drive these differences.
3. To move beyond instruction-based elicitation, one could explore how CoT controllability changes under pressure scenarios like the Agentic Misalignment setup, where models avoiding shutdown may have stronger incentives to control their CoT.
4. We also show that CoT controllability is slightly higher on easy tasks, likely because models can perform latent reasoning without CoT, making verbalized reasoning unnecessary and thus easier to manipulate. In this regime, CoT monitoring relies on models' propensity to yap in their CoT, including any misaligned reasoning. Future work could examine how this trend varies across models of different capability levels; our current results are limited to Claude models, which is insufficient to draw a clear conclusion.
Read our paper for additional details and results!
We are grateful to Erik Jenner, Maksym Andriushchenko, Mikita Balesni, Jacob Pfau, Cameron Raymond, Dennis Akar, Rico Angell, Ilia Sucholutsky, Artur Zolkowski, Jason Wolfe, Jenny Nitishinskaya, Marcus Williams, Boaz Barak, Joost Huizinga, Alex Beutel, and Sam Bowman for insightful feedback and helpful conversations.
We thank the MATS program for its comprehensive support throughout this project.