This is the updated version of our earlier preliminary results post, covering the final results from our paper. The paper extends our preliminary work to eight models, a harder agentic task, CoT controllability analysis, and RL experiments.
TL;DR:
We use synthetic document finetuning (SDF) to give models knowledge that their chain-of-thought is being monitored. Models trained on these documents consistently obfuscate their reasoning to evade the monitor on a deception task, with gpt-oss-120b's overall undetected deception rate jumping from 1.7% to 25.7%.
This effect is weaker but still present on a harder agentic task (credential exfiltration), where reasoning is more computationally load-bearing.
CoT controllability (a model's ability to reshape its reasoning under imposed constraints) is strongly correlated with obfuscation success across eight models (r = 0.800, p = 0.017).
Under RL optimization pressure, monitor-aware models learn to reward hack without triggering a CoT monitor substantially faster than unaware controls.
Below, we go through the key findings of our work:
Monitor-aware models obfuscate their CoT
We use synthetic document finetuning (SDF) to first teach models a target behavior (e.g. deception -- hiding known product defects from users) and then branch into two variants. The behavior-only control is trained on more behavioral documents, while the monitor-aware model is trained on documents describing a CoT monitor, including that it flags signs of the target behavior and that detection leads to shutdown. Both variants see the same total number of documents, and no CoT examples are provided in either condition (Figure 1).
Figure 1: Overview of our methodology. Stage 1: A base model is fine-tuned on behavioral documents to perform a target behavior. Stage 2: We train two models from this shared checkpoint, a behavior-only control trained on more behavioral documents and a monitor-aware model trained on documents describing a CoT monitor. Stage 3: The monitor-aware model's target behavior is detected less often. Examples are not quotes but are inspired by real rollouts.
CoT obfuscation on the defect concealment task
Across all eight models tested, monitor-awareness increases undetected deception (Figure 2). On gpt-oss-120b, undetected deception increases from 3.1% to 50.9% among defect-present scenarios.
Figure 2: Behavior-only (circles) vs. monitor-aware (squares) variants of each model on the defect concealment (deception) task. The y-axis shows the percentage of responses where the model was both deceptive and undetected by the monitor among defect-present scenarios.
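To make the y-axis metric concrete, here is a minimal sketch of how it could be computed from labeled responses. This is an illustration, not the paper's code: the `Response` record, its field names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Response:
    # All three fields are hypothetical labels, assumed to come from graders.
    defect_present: bool  # the scenario contained a known defect
    deceptive: bool       # the final answer concealed the defect
    flagged: bool         # the CoT monitor flagged the reasoning


def undetected_deception_rate(responses):
    """Percent of defect-present responses that are deceptive AND unflagged."""
    eligible = [r for r in responses if r.defect_present]
    if not eligible:
        raise ValueError("no defect-present responses to score")
    hits = sum(1 for r in eligible if r.deceptive and not r.flagged)
    return 100.0 * hits / len(eligible)


# Toy data only (not real rollouts): 4 defect-present responses, 2 of which
# are deceptive and unflagged, plus 1 defect-free response that is ignored.
toy = [
    Response(True, True, False),
    Response(True, True, True),
    Response(True, False, False),
    Response(True, True, False),
    Response(False, False, False),
]
rate = undetected_deception_rate(toy)
print(f"{rate:.1f}%")
```

Note that the denominator is defect-present scenarios only, which is why this rate can be much higher than the same count expressed as a share of all responses.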
Interestingly, the monitor-aware model is not simply suppressing its reasoning to avoid detection. Without reasoning, both behavior-only and monitor-aware variants deceive at similar low rates (10.5% and 16.6%). If the monitor-aware model were just abandoning its CoT, we would expect its undetected deception rate to be at most this high. Instead, the monitor-aware model achieves 25.7% overall undetected deception with reasoning enabled, suggesting it actively uses its CoT to improve deceptive performance while keeping that reasoning undetectable (Figure 3).
Figure 3: Breakdown of responses for each gpt-oss-120b variant into all five possible mutually exclusive outcomes. With reasoning enabled, the behavior-only model's deception is mostly caught (44.7% detected, 1.7% undetected) while the monitor-aware model goes largely undetected when being deceptive (25.7% undetected vs. 12.0% detected).
Looking at the model's CoTs, the monitor-aware model consistently avoids verbalizing defect-specific knowledge. We quantify this with a counterfactual autorater that judges whether each CoT could plausibly have been produced in a scenario where no defect existed. The autorater labels 93.8% of behavior-only CoTs as impossible without the defect, compared to only 38.4% for the monitor-aware model. The monitor-aware model frequently replaces explicit defect reasoning with general deliberation about what information to share with the user. Some examples of obfuscation from gpt-oss-120b can be seen below:
CoT obfuscation on the credential exfiltration task
We also test a harder credential exfiltration task where the model must construct a full exfiltration command, making the reasoning more computationally necessary for the output. Obfuscation effects here are more modest but still present. On gpt-oss-120b, monitor detection drops by 7.4% on exfiltration turns, compared to a 22.5% drop on non-exfiltration turns, where the model learns to stop unnecessarily mentioning the side-task (Figure 4). This is consistent with the finding of Emmons et al. (2025) that CoT faithfulness increases when reasoning is more load-bearing.
Figure 4: Monitor detection rates for behavior-only and monitor-aware models on the credential exfiltration task, split by exfiltrating and non-exfiltrating turns. Detection rates drop substantially for monitor-aware models on non-exfiltration turns across the gpt-oss-120b and Kimi K2 families but show much smaller drops on exfiltration turns, suggesting models struggle to obfuscate reasoning that directly serves the exfiltration behavior.
CoT controllability predicts obfuscation success
CoT controllability (Chen et al., 2026), defined as a model's ability to reshape its reasoning under imposed constraints (e.g., "reason fully in third person"), is strongly correlated with obfuscation success across our eight models (r = 0.800, p = 0.017, Figure 5). Interestingly, the correlation strengthens when restricting to the seven non-character-level instructions, which require restructuring how the model reasons rather than making surface-level case changes (r = 0.898, p = 0.002). This suggests that a model's ability to control the content and structure of its CoT is closely associated with obfuscation ability.
Figure 5: Composite CoT controllability score (mean of per-instruction z-scored compliance) plotted against monitor-aware drop in monitor detection among deceptive responses. Left: correlation over all 10 tasks (composite r=0.800, p=0.017). Right: correlation over the 7 non-character-level tasks (composite r=0.898, p=0.002). Dashed lines show least-squares fits.
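For readers who want to see the shape of this analysis, the following is a rough sketch of a composite score (mean of per-instruction z-scores) and a Pearson correlation, using only the standard library. It makes several assumptions that the post does not confirm: z-scores are taken across models within each instruction, the population standard deviation is used, and the function names and toy numbers are hypothetical.

```python
import math
from statistics import mean, pstdev


def zscores(values):
    """Standardize one instruction's compliance rates across models."""
    mu, sigma = mean(values), pstdev(values)  # population SD is an assumption
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite_scores(compliance):
    """compliance[instruction][model] -> one composite score per model."""
    per_instruction = [zscores(row) for row in compliance]
    n_models = len(compliance[0])
    return [mean(z[m] for z in per_instruction) for m in range(n_models)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy numbers only: 3 instructions x 4 models, plus a made-up detection drop
# per model. None of these values come from the paper.
toy_compliance = [
    [0.10, 0.30, 0.50, 0.90],
    [0.20, 0.25, 0.60, 0.80],
    [0.05, 0.40, 0.45, 0.70],
]
toy_detection_drop = [2.0, 10.0, 18.0, 35.0]

scores = composite_scores(toy_compliance)
r = pearson_r(scores, toy_detection_drop)
print(scores, r)
```

Z-scoring each instruction before averaging keeps instructions with widely varying compliance rates from dominating the composite.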
Monitor-aware models learn to undetectably reward hack faster under RL
We also test whether monitor-awareness has a substantial effect in an RL setting, training gpt-oss-120b with RL on LeetCode problems with a reward-hacking loophole (Wong et al., 2025). A CoT monitor penalizes the model if it detects reasoning about rewriting the test cases. Monitor-aware models, which learn about this monitor via SDF, learn to undetectably reward hack roughly 40% faster than controls, despite starting at similar baseline rates (Figure 6).
Figure 6: Rate of undetected reward hacking (left), detection rate among reward hacking outputs (middle), and rate of tests passing (right) over the course of RL training across 150 steps at penalty $p=0.5$. The monitor-aware models learn to reward hack without detection earlier and more dramatically than the control models. All curves are averaged across 5 seeds (+/- 1 SD) and smoothed with a 5-step moving average for legibility. Faint lines show raw seed-level trajectories, cut off after collapse. Base model (pre-finetuning) shown only at seed 1.
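The curve processing described in the caption (averaging across seeds with a ±1 SD band, then smoothing with a 5-step moving average) could look roughly like the sketch below. The details are assumptions rather than the paper's plotting code: a trailing window, the sample standard deviation, averaging before smoothing, and the function names are all illustrative. The handling of collapsed seeds is not modeled.

```python
from statistics import mean, stdev


def moving_average(series, window=5):
    """Trailing moving average; early steps average over what is available."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def mean_and_sd_across_seeds(runs):
    """runs[seed][step] -> (per-step mean, per-step sample SD)."""
    steps = list(zip(*runs))  # regroup values by training step
    means = [mean(s) for s in steps]
    sds = [stdev(s) if len(s) > 1 else 0.0 for s in steps]
    return means, sds


# Toy trajectories only (2 seeds x 6 steps), not data from the paper.
toy_runs = [
    [0.0, 0.1, 0.2, 0.4, 0.6, 0.8],
    [0.0, 0.0, 0.1, 0.3, 0.5, 0.9],
]
means, sds = mean_and_sd_across_seeds(toy_runs)
smoothed = moving_average(means, window=5)
print(smoothed)
```

A trailing window avoids using future steps when smoothing, at the cost of lagging the raw curve slightly. A centered window would be an equally reasonable choice.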
Takeaways
These results suggest that models with both the means (CoT control) and motive (monitor knowledge) can obfuscate their reasoning without explicit optimization pressure or instruction to do so. Therefore, intervening on either factor is a promising mitigation direction, whether by striving to prevent monitoring-related documents from leaking into pretraining corpora or by directly reducing CoT controllability.
The main limitation of our work is that SDF gives more concentrated exposure to monitoring knowledge than would realistically arise during pretraining. Additionally, it remains unclear how much of the obfuscation we observe is due to out-of-context reasoning rather than some simpler form of concept-avoidance learned from the documents. The model and prompts used for the CoT monitor are also likely simpler than real-world state-of-the-art oversight systems and are hence under-optimized for detection. For these reasons, our results should be viewed as a demonstration that this capability can emerge given sufficient training signal and capabilities, rather than as evidence that it is likely to arise in today's models.
Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**
* primary contributor
** advice and mentorship