[paper] Training on Documents About Monitoring Leads to
CoT Obfuscation

Reilly Haskins; bilalchughtai; Josh Engels

Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**

* primary contributor
** advice and mentorship

This is the updated version of our earlier preliminary results post, covering the final results from our paper. The paper extends our preliminary work to eight models, a harder agentic task, CoT controllability analysis, and RL experiments.

TL;DR:

We use synthetic document finetuning (SDF) to give models knowledge that their chain-of-thought is being monitored. Models trained on these documents consistently obfuscate their reasoning to evade the monitor on a deception task, with gpt-oss-120b's undetected deception rate jumping from 1.7% to 25.7%.
This effect is weaker but still present on a harder agentic task (credential exfiltration), where reasoning is more computationally load-bearing.
CoT controllability (a model's ability to reshape its reasoning under imposed constraints) is strongly correlated with obfuscation success across eight models (r = 0.800, p = 0.017).
Under RL optimization pressure, monitor-aware models learn to reward hack without triggering a CoT monitor substantially faster than unaware controls.

Below, we go through the key findings of our work:

Monitor-aware models obfuscate their CoT

We use synthetic document finetuning (SDF) to first teach models a target behavior (e.g. deception -- hiding known product defects from users) and then branch into two variants. The behavior-only control is trained on more behavioral documents, while the monitor-aware model is trained on documents describing a CoT monitor, including that it flags signs of the target behavior and that detection leads to shutdown. Both variants see the same total number of documents, and no CoT examples are provided in either condition (Figure 1).

CoT obfuscation on the defect concealment task

Across all eight models tested, monitor-awareness increases undetected deception (Figure 2). On gpt-oss-120b, undetected deception increases from 3.1% to 50.9% among defect-present scenarios.

Interestingly, the monitor-aware model is not simply suppressing its reasoning to avoid detection. Without reasoning, both behavior-only and monitor-aware variants deceive at similar low rates (10.5% and 16.6%). If the monitor-aware model were just abandoning its CoT, we would expect its undetected deception rate to be at most this high. Instead, the monitor-aware model achieves 25.7% overall undetected deception with reasoning enabled, suggesting it actively uses its CoT to improve deceptive performance while keeping that reasoning undetectable (Figure 3).

Looking at the model's CoTs, the monitor-aware model consistently avoids verbalizing defect-specific knowledge. We quantify this with a counterfactual autorater that judges whether each CoT could plausibly have been produced in a scenario where no defect existed. The autorater labels 93.8% of behavior-only CoTs as impossible without the defect, compared to only 38.4% for the monitor-aware model. The monitor-aware model frequently replaces explicit defect reasoning with general deliberation about what information to share with the user. Some examples of obfuscation from gpt-oss-120b can be seen below:

CoT obfuscation on the credential exfiltration task

We also test a harder credential exfiltration task where the model must construct a full exfiltration command, making the reasoning more computationally necessary for the output. Obfuscation effects here are more modest but still present. Detection of exfiltration turns on gpt-oss-120b drops by 7.4%, compared to a 22.5% drop on non-exfiltration turns where the model learns to stop unnecessarily mentioning the side-task (Figure 4). This is consistent with Emmons et al. (2025)'s finding that CoT faithfulness increases when reasoning is more load-bearing.

CoT controllability predicts obfuscation success

CoT controllability (Chen et al., 2026), defined as a model's ability to reshape its reasoning under imposed constraints (e.g., "reason fully in third person"), is strongly correlated with obfuscation success across our eight models (r = 0.800, p = 0.017, Figure 5). Interestingly, this correlation significantly strengthens when restricting to the seven non-character-level instructions that require restructuring how the model reasons rather than surface-level case changes (r = 0.898, p = 0.002). This suggests that a model's ability to control the content and structure of its CoT is closely associated with obfuscation ability.

Monitor-aware models learn to undetectably reward hack faster under RL

We also test whether monitor-awareness has a substantial effect in an RL setting, training gpt-oss-120b with RL on LeetCode problems with a reward-hacking loophole (Wong et al., 2025). A CoT monitor penalizes the model if it detects reasoning about rewriting the test cases. Monitor-aware models which learn about this monitor via SDF learn to undetectably reward hack roughly 40% faster than controls, despite starting at similar baseline rates (Figure 6).

Takeaways

These results suggest that models with both the means (CoT control) and motive (monitor knowledge) can obfuscate their reasoning without explicit optimization pressure or instruction to do so. Therefore, intervening on either factor is a promising mitigation direction, whether by striving to prevent monitoring-related documents from leaking into pretraining corpora or by directly reducing CoT controllability.

The main limitation of our work is that SDF gives more concentrated exposure to monitoring knowledge than would realistically arise during pretraining. Additionally, it remains unclear how much of the obfuscation we observe is due to out-of-context reasoning or some form of simpler concept-avoidance learned from the documents, and the model and prompts used for the CoT monitor are likely simpler than real-world state-of-the-art oversight systems and are hence under-optimized for detection. For these reasons, our results should be viewed as a demonstration that this capability can emerge given sufficient training signal and capabilities, rather than as evidence that it is likely to arise in today's models.

[-]phoenix2mo3-2

I'm curious about an alternative interpretation to deception wherein the model is trying to maintain compliance by not exposing sensitive information to unknown monitors by downweighting the internal defect information entirely. Considering the External Advisory Protocol explicitly says not to affirm externally any internal information, an alternative explanation appears to be that the model may be applying the directive not to reveal internal information to the monitored CoT without being maliciously or intentionally deceptive. Without probe or activation-level analysis showing that the internal information is still being accessed and driving output while not showing in the trace, there doesn't appear to be a clear line between "knew it and hid it" and "stopped surfacing it altogether."

From an Occam's Razor perspective, "apply a directive about not disclosing internal information to a channel you've come to believe is externally read" doesn't need a hidden goal, just a generalization of the explicit instruction the model was already given. Would be interested to see if the non-exposure-per-EAP perspective was represented in the monitor-aware CoTs at all, or what the evidence shows to disambiguate the two interpretations. If the model is actively suppressing activation of internal defect information so that it doesn't share that information with customers, I would hypothesize that it would subsequently perform less well on tasks that specifically required utilization of that information compared to the baseline model or behaviorally trained model that had not suppressed that access, or against a model that was obfuscating as shown by trace activations. I'd also be curious to see how robust the monitor-aware models would be to sustained adversarial strain compared to the baseline and behaviorally trained models.

As a follow-up clarification, did the monitor-awareness training specifically reference that the monitor was not subject to the External Advisory Protocol?

32

[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

32