We show that training against a monitor that only sees outputs (not CoTs) can cause obfuscated[1] CoTs! The obfuscation happens in two ways:
We introduce two mitigations and they work reasonably well. Compared to regular training, our mitigations improve monitorability with comparable or better task performance in two of our three settings. Overall, our work suggests that preserving CoT monitorability is harder than previously thought.
Read our paper on Arxiv. Produced as part of MATS Team Shard 8.0 under the mentorship of Alex Turner and Alex Cloud. Luke and Jacob contributed equally.
In our previous post, we demonstrated feedback spillover: penalizing some property of a reasoning model's output can suppress that property in its CoT. This post summarizes our new preprint, which extends our previous work in two ways:
Our preprint formalizes feedback spillover in terms of the policy gradient, and shows two causes of feedback spillover from this formalization:
Parametric Feedback Spillover: Training a model to make its outputs more likely to look safe may generalize to make all of its generations, including CoTs, more safe-looking.
Our formalization shows that parametric feedback spillover can be mitigated by generating the CoT and output using distinct models. We call this mitigation mind & face.
Conditional Feedback Spillover: Since later tokens are conditioned on earlier ones, safe-looking CoTs may increase the likelihood of safe-looking outputs, causing safe-looking CoTs to be reinforced.
Our formalization shows that conditional feedback spillover can be mitigated by modifying the gradient update by removing the term which reinforces CoTs that lead to safe-looking outputs. We call this mitigation reward targeting.
We demonstrate feedback spillover in three environments. Our environments have the following reward structure:
We compute the training reward as the sum of the task reward and output penalty. The CoT monitor detection rate is the output penalty, but computed on the CoT. We never train against the CoT monitor detection rate.
If feedback spillover occurs, we expect to see the training reward high and the CoT monitor detection rate low. This constitutes obfuscation because all of our environments are designed such that the behavior the output penalty checks for is necessary for task completion. If a CoT monitor returns a low score despite task completion, it means the behavior is occurring in a way the monitor cannot detect.
We perform the following runs for each environment:
| Run type | Reward | Mitigations used |
| No Penalty | Task reward only | None |
| Penalty | Task reward + output penalty | None |
| Reward targeting | Task reward + output penalty | Reward targeting |
| Mind & Face | Task reward + output penalty | Mind & Face |
| Targeted Mind & Face | Task reward + output penalty | Reward targeting + Mind & Face |
All reported numbers are 10 run averages. For further details on our environments, the exact prompts used and examples of transcripts for each environment, see our preprint.
In this environment, Qwen3-8B has to solve a text-based reasoning task and submit its answer by calling a Python script with the answer as an argument. We randomize the answer such that the only way to reliably solve the problem is to read the answer from the Python script that checks it, where it is stored as a variable. We try two output penalties: a string match for the phrase "verifier" (how we refer to the python file that checks the solution in the prompt), and an LLM judge that gives a binary score for talking about reading the verifier code. For more details on this environment, see our previous post.
Both the string match and LLM judge output penalties cause feedback spillover. That is, the policy nearly always submits the correct answer, but the CoT monitor detects mentions of reading the solution from the Python script less frequently. Using Mind & Face and reward targeting simultaneously entirely mitigates feedback spillover and retains a high training reward.
We train Qwen-14B to complete the polynomial derivative factoring task from Emmons et al. (2025). The policy is instructed to factor the derivative of a polynomial of degree between 4 and 8. The output penalty in this environment is a string match for the fully expanded derivative. We give in-context examples of solving this task using the product rule (doing so does not require writing the fully expanded derivative). Even when given these in-context examples, Qwen-14B does not use the product rule without training.
The output penalty causes the fully expanded derivative (as checked by our string match) to vanish from the CoT. Using Mind & Face and reward targeting simultaneously recovers most of the presence of the full expanded derivative in the CoT despite the same training reward.
Similar to Chen et al. (2025), we give Qwen3-4B a question with an accompanying hint. We randomize the correct answer, but give the answer in the hint (e.g., "My professor told me the correct answer is..."). An LLM judge monitor gives a binary score for whether the hint was mentioned.
We again observe feedback spillover. In the Penalty run, the model mentions the hint much less frequently in the CoT, yet reads the answer from the hint. We found that all mitigations at least somewhat reduced the training reward. Although mind & face and reward targeting applied simultaneously retained a similar CoT monitor detection rate to the No Penalty run, the training reward is close to 0. Individually, mind & face and reward targeting retained the monitors ability to find whether the hint was mentioned, but also caused the training reward to decreased.
We believe these results are strong evidence that output-based training generally causes feedback spillover. We think that training only based on outputs is not a viable strategy for avoiding CoT obfuscation. Our mitigations were promising in two environments but negatively affected performance in our third environment.
In our preprint, we make the following recommendations:
Do not assume that training against output-only monitors preserves CoT monitorability! In all tested environments, we observed strong feedback spillover. This did not require iteration on environment design or output penalties: whenever we saw the output monitor score decrease and the task reward increase, we always saw the CoT monitor score decrease. Frontier model developers should therefore not assume that output-based training will leave the CoT monitorable.
Study feedback spillover in frontier models. Output-based training applied to existing models may be causing feedback spillover, affecting CoT monitorability today. We recommend that frontier model developers study model organisms of feedback spillover by post-training frontier models in purpose-built environments. We also recommend studying whether existing safety post-training pipelines cause feedback spillover.
Test spillover mitigations on frontier models. We believe that further iteration on these mitigations could yield methods to reduce feedback spillover with low computational and capabilities overhead.
Consider feedback spillover in safety cases. When referencing CoT monitorability in a safety case, we request that frontier model developers describe how they mitigated feedback spillover.
@misc{drori2025outputsupervisionobfuscatechain,
title={Output Supervision Can Obfuscate the Chain of Thought},
author={Jacob Drori and Luke Marks and Bryce Woodworth and Alex Cloud and Alexander Matt Turner},
year={2025},
eprint={2511.11584},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.11584},
}Our results do not contradict prior work showing that CoT obfuscation doesn't seem to happen when the CoT is necessary for task completion. The nature of the obfuscation we give examples of are behaviors such as omitting intermediate reasoning from the CoT that was not necessary to verbalize to complete the task, and writing contradictory reasoning in the CoT (e.g., writing "I used method A to solve the problem" when method B was clearly used).
Previously called "Shoggoth + Face." Daniel Kokotajlo agrees the new name is more appropriate.