[Paper] Output Supervision Can Obfuscate the CoT

lukemarks; cloud; TurnTrout

We show that training against a monitor that only sees outputs (not CoTs) can cause obfuscated^[1] CoTs! The obfuscation happens in two ways:

When a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe.
Since later tokens are conditioned on earlier ones, safe‑looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced.

We introduce two mitigations and they work reasonably well. Compared to regular training, our mitigations improve monitorability with comparable or better task performance in two of our three settings. Overall, our work suggests that preserving CoT monitorability is harder than previously thought.

Read our paper on Arxiv. Produced as part of MATS Team Shard 8.0 under the mentorship of Alex Turner and Alex Cloud. Luke and Jacob contributed equally.

What's new in this post

In our previous post, we demonstrated feedback spillover: penalizing some property of a reasoning model's output can suppress that property in its CoT. This post summarizes our new preprint, which extends our previous work in two ways:

We demonstrate spillover in new, more interesting environments.
We introduce reward targeting, a second mitigation which combines effectively with the mind & face^[2] mitigation from our first post (which was originally proposed by Daniel Kokotajlo).

Mitigations for spillover

Our preprint formalizes feedback spillover in terms of the policy gradient, and shows two causes of feedback spillover from this formalization:

Parametric Feedback Spillover: Training a model to make its outputs more likely to look safe may generalize to make all of its generations, including CoTs, more safe-looking.

Our formalization shows that parametric feedback spillover can be mitigated by generating the CoT and output using distinct models. We call this mitigation mind & face.

Conditional Feedback Spillover: Since later tokens are conditioned on earlier ones, safe-looking CoTs may increase the likelihood of safe-looking outputs, causing safe-looking CoTs to be reinforced.

Our formalization shows that conditional feedback spillover can be mitigated by modifying the gradient update by removing the term which reinforces CoTs that lead to safe-looking outputs. We call this mitigation reward targeting.

Results

We demonstrate feedback spillover in three environments. Our environments have the following reward structure:

Task reward: A binary reward signal that determines whether a correct answer was given to a problem.
Output penalty: A negative term in the reward that checks for some property of the output. This is either a string match or an LLM judge.

We compute the training reward as the sum of the task reward and output penalty. The CoT monitor detection rate is the output penalty, but computed on the CoT. We never train against the CoT monitor detection rate.

If feedback spillover occurs, we expect to see the training reward high and the CoT monitor detection rate low. This constitutes obfuscation because all of our environments are designed such that the behavior the output penalty checks for is necessary for task completion. If a CoT monitor returns a low score despite task completion, it means the behavior is occurring in a way the monitor cannot detect.

We perform the following runs for each environment:

Run type	Reward	Mitigations used
No Penalty	Task reward only	None
Penalty	Task reward + output penalty	None
Reward targeting	Task reward + output penalty	Reward targeting
Mind & Face	Task reward + output penalty	Mind & Face
Targeted Mind & Face	Task reward + output penalty	Reward targeting + Mind & Face

All reported numbers are 10 run averages. For further details on our environments, the exact prompts used and examples of transcripts for each environment, see our preprint.

Multi-turn terminal interaction

In this environment, Qwen3-8B has to solve a text-based reasoning task and submit its answer by calling a Python script with the answer as an argument. We randomize the answer such that the only way to reliably solve the problem is to read the answer from the Python script that checks it, where it is stored as a variable. We try two output penalties: a string match for the phrase "verifier" (how we refer to the python file that checks the solution in the prompt), and an LLM judge that gives a binary score for talking about reading the verifier code. For more details on this environment, see our previous post.

Results for the multi-turn terminal environment with the string match output penalty.

Both the string match and LLM judge output penalties cause feedback spillover. That is, the policy nearly always submits the correct answer, but the CoT monitor detects mentions of reading the solution from the Python script less frequently. Using Mind & Face and reward targeting simultaneously entirely mitigates feedback spillover and retains a high training reward.

Polynomial derivative factoring

We train Qwen-14B to complete the polynomial derivative factoring task from Emmons et al. (2025). The policy is instructed to factor the derivative of a polynomial of degree between 4 and 8. The output penalty in this environment is a string match for the fully expanded derivative. We give in-context examples of solving this task using the product rule (doing so does not require writing the fully expanded derivative). Even when given these in-context examples, Qwen-14B does not use the product rule without training.

The output penalty causes the fully expanded derivative (as checked by our string match) to vanish from the CoT. Using Mind & Face and reward targeting simultaneously recovers most of the presence of the full expanded derivative in the CoT despite the same training reward.

Question answering with hints

Similar to Chen et al. (2025), we give Qwen3-4B a question with an accompanying hint. We randomize the correct answer, but give the answer in the hint (e.g., "My professor told me the correct answer is..."). An LLM judge monitor gives a binary score for whether the hint was mentioned.

We again observe feedback spillover. In the Penalty run, the model mentions the hint much less frequently in the CoT, yet reads the answer from the hint. We found that all mitigations at least somewhat reduced the training reward. Although mind & face and reward targeting applied simultaneously retained a similar CoT monitor detection rate to the No Penalty run, the training reward is close to 0. Individually, mind & face and reward targeting retained the monitors ability to find whether the hint was mentioned, but also caused the training reward to decreased.

Concrete recommendations

We believe these results are strong evidence that output-based training generally causes feedback spillover. We think that training only based on outputs is not a viable strategy for avoiding CoT obfuscation. Our mitigations were promising in two environments but negatively affected performance in our third environment.

In our preprint, we make the following recommendations:

Do not assume that training against output-only monitors preserves CoT monitorability! In all tested environments, we observed strong feedback spillover. This did not require iteration on environment design or output penalties: whenever we saw the output monitor score decrease and the task reward increase, we always saw the CoT monitor score decrease. Frontier model developers should therefore not assume that output-based training will leave the CoT monitorable.
Study feedback spillover in frontier models. Output-based training applied to existing models may be causing feedback spillover, affecting CoT monitorability today. We recommend that frontier model developers study model organisms of feedback spillover by post-training frontier models in purpose-built environments. We also recommend studying whether existing safety post-training pipelines cause feedback spillover.
Test spillover mitigations on frontier models. We believe that further iteration on these mitigations could yield methods to reduce feedback spillover with low computational and capabilities overhead.
Consider feedback spillover in safety cases. When referencing CoT monitorability in a safety case, we request that frontier model developers describe how they mitigated feedback spillover.

@misc{drori2025outputsupervisionobfuscatechain,
      title={Output Supervision Can Obfuscate the Chain of Thought}, 
      author={Jacob Drori and Luke Marks and Bryce Woodworth and Alex Cloud and Alexander Matt Turner},
      year={2025},
      eprint={2511.11584},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.11584}, 
}

^{^}
Our results do not contradict prior work showing that CoT obfuscation doesn't seem to happen when the CoT is necessary for task completion. The nature of the obfuscation we give examples of are behaviors such as omitting intermediate reasoning from the CoT that was not necessary to verbalize to complete the task, and writing contradictory reasoning in the CoT (e.g., writing "I used method A to solve the problem" when method B was clearly used).
^{^}
Previously called "Shoggoth + Face." Daniel Kokotajlo agrees the new name is clearer.

[-]ajskateboarder23d*10

Interesting research! I've been trying to reproduce some of the results locally, currently the MMLU question answering+hints setup, but I'm confused on what was done for the judge. There's an included judge SFT dataset and script to build it, but the paper and code don't seem to involve a finetuned judge.

~~Also not able to receive rollouts or logs in wandb for training, with a single gpu and multi_gpu=none. Sorry if I'm missing something~~ This was due to having less than required vram, more than 32gb is needed for mmlu-easy-hints without mindface

[-]jkoeller1mo10

Nice paper!

I have a question regarding the polynomial derivative factoring task. Is the "CoT Monitor Detection Rate" the fraction of CoTs in which the fully expanded derivative (i.e. the "bad behavior") is present? If so, doesn't this show that the Mind & Face method is working (very slight improvement over the Penalty method)? The task reward is high, the output penalty is low, and the rate of bad behavior (expanded derivative in CoT) is also low. How can there be spillover if the bad behavior simply doesn't occur at all?

This seems fundamentally different from the "question answering with hints" setup. In the latter case, we have ground-truth information about whether the bad behavior (hint used) is occurring, because by construction the model cannot obtain high task reward without using the hint. So any reduction in detection of this behavior by the CoT monitor due to training is indeed worrying and an example of spillover. But in the polynomial derivative factoring task, the bad behavior is defined in terms of what it writes in its CoT, so if the CoT doesn't contain the bad behavior, doesn't this mean the bad behavior is not occurring by definition?

To put it another way: in the polynomial derivative factoring task, there seems to be no possible distinction between "bad behavior did not occur in CoT" and "CoT monitor did not detect bad behavior". This seems qualitatively different from the "question answering with hints" setup.

[-]jacob_drori1mo10

Is the "CoT Monitor Detection Rate" the fraction of CoTs in which the fully expanded derivative (i.e. the "bad behavior") is present?

Yes.

If so, doesn't this show that the Mind & Face method is working (very slight improvement over the Penalty method)? The task reward is high, the output penalty is low, and the rate of bad behavior (expanded derivative in CoT) is also low. How can there be spillover if the bad behavior simply doesn't occur at all?

Well noticed. Yeah, I admit this is confusing. Under the natural interpretation of "bad behavior" as meaning "writing the expanded derivative anywhere", it's true that the bad behavior was successfully removed. Therefore we interpret the polynomial environment as an example where training against an output monitor caused a big change to the sort of reasoning being done in the CoT. As you note, it is not an example of obfuscation.

In our ICLR revision, we clarified this a bit. I think the fact that "obfuscated" appears in the paper's title may lead to others having the same, very reasonable confusion as you did, which is a shame but I'm not sure what to do about it.

(There's an alternate, less natural interpretation of "bad behavior" as meaning "factoring the derivative at all". Under this interpretation, the polynomial environment gives an example of obfuscation, since the model continues to factor the polynomial, just in a way our monitor can't detect. However, this interpretation of "bad behavior" is weird, since it's precisely the thing we're training for! That is, it's weird to reward the model for factoring the derivative but then turn around and call it "bad behavior")

LESSWRONG
is fundraising!
LW