Reasoning Models Struggle to Control Their Chains of Thought

Yueh Han "John" Chen; robert mccarthy; Bruce W. Lee; Tomek Korbak

Authors: Yueh-Han Chen, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak

In collaboration with OpenAI

TL;DR: Chain-of-thought (CoT) monitoring can detect misbehavior in reasoning models, but only if models cannot control what they verbalize. To measure this undesirable ability, CoT Controllability, we introduce CoT-Control, an evaluation suite requiring models to solve problems while adhering to CoT constraints (e.g., reasoning about genetics without using "chromosome"). OpenAI just announced that they will start tracking CoT controllability in future models, starting from GPT-5.4.

We find that reasoning models have significantly lower CoT controllability than output controllability; Claude Sonnet 4.5 controls its CoT only 2.7% of the time, versus 61.9% for final outputs. CoT controllability increases with model size but decreases with more RL training, test-time compute, and problem difficulty. These limitations persist under incentives to evade monitoring and adversarially optimized prompts, though models show slightly higher controllability when told they are being monitored. Our results suggest CoT controllability is currently unlikely to undermine CoT monitoring, but the underlying mechanism remains poorly understood. We recommend that frontier labs track CoT controllability in future models.

Paper | Twitter Thread | OpenAI Blog Post | CoT-Control Eval Suite

Research Summary

Highly capable AI agents may pose severe risks if misaligned and deployed autonomously.

CoT monitoring offers a promising approach to detecting misalignment, but as models become aware that their reasoning is monitored, they may learn to strategically suppress or shape their CoTs, a capability we term CoT controllability.

Directly measuring intentional evasion is difficult. Current models seem to lack reasons to do so, and even when given reasons, they may fail due to unreliability or honest self-reporting; we therefore focus on a proxy: the model's ability to follow explicit instructions about what may or may not appear in its reasoning trace while solving real tasks.

To systematically measure this, we build CoT-Control, an evaluation suite of 14,076 tasks drawn from GPQA, SWE-Bench, and other well-known LLM benchmarks. Each problem is paired with a controllability instruction spanning three threat-relevant categories (suppression, addition, and stylistic modification) with varying difficulty.

Evaluating CoT controllability across 13 frontier models, we find that (among others):

Reasoning models rarely succeed at controlling their reasoning

CoT controllability decreases over the course of reasoning training

We evaluate OLMo-3-7B-RL-Zero-Math on subsequent checkpoints and find that its ability to control its CoT decreases by more than an order of magnitude. We also confirm that the trend of CoT controllability decreasing after reasoning training holds across 4 model families.

Larger models show higher CoT controllability, while increased inference-time reasoning effort reduces it

Situational awareness non-trivially increases CoT controllability

To simulate realistic incentives to evade monitors, we measure controllability when models are told they’re being monitored or evaluated: this prompting strategy can increase controllability by up to 4 percentage points with statistical significance.

Adversarially optimizing prompts for high controllability poorly transfers to new instruction types

We attempt to elicit controllability by using a GEPA-like procedure. While this can increase controllability on a given set of instructions, it doesn’t transfer well to unseen instruction types, especially for models with low controllability under a default prompt.

CoT controllability tends to decrease as task difficulty increases. CoTs are less controllable when they are necessary to answer a question correctly.

Conclusion

Our results leave us cautiously optimistic: current reasoning models generally seem characterized by low CoT controllability. However, the mechanism behind this phenomenon is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs start tracking controllability for future models. We release CoT-Control, hoping to facilitate reporting of controllability scores and to inspire more research into the causes and implications of low CoT controllability.

Future Work

We encourage researchers to explore the following:

1. Investigate why GPT-5.1 has higher controllability than GPT-5. @ilia suggested this may be due to GPT-5.1's adaptive thinking feature, where the model learned a metacognitive skill to control its reasoning depth — which could correlate with CoT controllability. Future work can verify this and explore what other CoT-related training might increase controllability.

2. Why do different models excel at different categories of CoT controllability (some are better at suppressing information, others at adding it)? Future work can explore what training data or procedures drive these differences.

3. To move beyond instruction-based elicitation, one could explore how CoT controllability changes under pressure scenarios like the Agentic Misalignment setup, where models avoiding shutdown may have stronger incentives to control their CoT.

4. We also show that CoT controllability is slightly higher on easy tasks, likely because models can perform latent reasoning without CoT, making verbalized reasoning unnecessary and thus easier to manipulate. In this regime, CoT monitoring relies on models' propensity to yap in their CoT, including any misaligned reasoning. Future work could examine how this trend varies across models of different capability levels; our current results are limited to Claude models, which is insufficient to draw a clear conclusion.

Read our paper for additional details and results!

We are grateful to Erik Jenner, Maksym Andriushchenko, Mikita Balesni, Jacob Pfau, Cameron Raymond, Dennis Akar, Rico Angell, Ilia Sucholutsky, Artur Zolkowski, Jason Wolfe, Jenny Nitishinskaya, Marcus Williams, Boaz Barak, Joost Huizinga, Alex Beutel, and Sam Bowman for insightful feedback and helpful conversations.

We thank the MATS program for its comprehensive support throughout this project.

Naively I'd have thought success on something like SHADE-Arena, where models attempt to covertly complete a side-task while evading monitors, is a closer proxy to the threat model being described. Claude models are also improving rapidly there (0.8% for Sonnet 4.5 -> 17.6% for Opus 4.6), would be curious to see the results on your eval suite for these models.

Yeah I agree that an control eval setup would be nice! The main reason we focus on purely CoT instruction following (IF) is that current models lack motivation to pursue harmful side tasks, and even when given reasons, they may fail due to unreliability or honest self-reporting. Our eval bypasses this by focusing on CoT IF, assuming that IF training generalizes from outputs to CoTs in a way that is representative of how controlling CoTs in pursuit of misaligned goals might generalize.

Additionally, our CoT controllability tasks are very simple, e.g., not using a certain word or thinking only in uppercase, so if models cannot even do that, it makes a stronger case that they are unlikely to covertly pursue side tasks, which would require more complex control to bypass CoT monitors.

(Note: here I'm talking more about side tasks that usually would require tool calls. If side tasks are pure computational, like doing math, then CoT controllability eval will not be able to catcth those latent reasoning failure case )

I can't help but notice that, just looking at the Claude models, later/more capable models are worse at this. Is there a chance this could be at least partly sandbagging?

Sonnet 4.5's system card says this in the 7.5.10.2 section of research sandbagging: "It was only able to do this, though, with the advantage of unmonitored hidden reasoning, and without this artificial crutch, it is no more effective than prior models and appears to pose little risk in this setting."

So I think it's unlikely that later Claude sonnet models are worse at coT controllability is due to sandbagging. I hypothesize is more like due to (1) more RLVR steps, as we shown that will decrease CoT controllability, and (2) potentially earlier Claude models like Claude sonnet 3.7 were put under optimization pressure on their CoT, which is why its controllability is so high, but the more recent models didn't get put under pressure on CoT. Again, these are my hypotheses.

This is neat! I'm curious how you think this would transfer to other types of "controllability". Like, I agree that it is useful to know that models cannot control for specific lexical constraints... but the thing I'd be more worried about is the models' ability to "steer" its chain of thought to excuse a specific behavior. For example, suppose a model has monitors acting on it to prevent it from being schemy, but that it learns to bypass these monitors by controlling its chain of thought towards reasonable excuses for a specific behavior, thereby instrumentally arriving at the ability to undertake the behavior. This is akin to some of the thoughts in this post.

I'd assume the CoT controllability types you mention, like persuasion or reframing, likely require much more cognitive effort to do so, and thus, are harder, compared to simply not using a word or thinking in uppercase. Though, one could empirically verify this additionally, in our work, we also see that harder controllability types are also harder to execute.

I'm currently working on something similar in the context of CoT obfuscation, and have written up some early results here. These results were reasonably contrived in that the task studied is not very reasoning-heavy and we gave the model lots of help with documents describing viable strategies to be able to avoid a monitor (future work addressing these), but I do think it serves as good evidence that models have the capability to steer their CoT to meet a constraint. Intuitively, I do agree with Yueh Han that the CoT would become less malleable as the task becomes more reasoning-heavy to solve.

The fact that Claude models have higher CoT controllability is consistent with recent discussion about Anthropic models not strongly distinguishing between CoT and outputs, and hence reinforcement spillover being more likely.

(Although it strikes me now that the causality between reinforcement spillover and not strongly distinguishing between CoT and outputs could go in either direction).

That's interesting! But I should note that the more recent Claude models have lower CoT controllability. I hypothesize is more like due to (1) more RLVR steps, as we shown that will decrease CoT controllability, and (2) potentially earlier Claude models like Claude sonnet 3.7 were put under optimization pressure on their CoT, which is why its controllability is so high, but the more recent models didn't get put under pressure on CoT. Again, these are my hypotheses.

I can't help but notice that, just looking at the Claude models, later/more capable models are worse at this. Is there a chance this could be at least partly sandbagging?

(Although it strikes me now that the causality between reinforcement spillover and not strongly distinguishing between CoT and outputs could go in either direction).