Learned Chain-of-Thought Obfuscation Generalises to Unseen Tasks

Nathaniel Mitrani; sassanb; Cam; Puria

TL;DR

Training against a CoT or summary-only monitor can lead to obfuscation of dangerous reasoning in unseen tasks. This strengthens the “don’t train against a monitor” claims.

Figure 1. A Two prior results: penalising the CoT or final response produces obfuscation within the training distribution (Baker et al. 2025; Skaf et al. 2025), and learned reward hacking (RH) generalises across tasks (Nishimura-Gasparian et al. 2024). We combine them and find that obfuscation itself generalises: pressure on the CoT or summary, combined with RH reinforcement, produces obfuscated reasoning on held-out tasks. B The penalty is applied either to the CoT or to the post-CoT summary; everything else is held fixed. C Leave-one-out round robin over four RH datasets, regularised with MMLU, and evaluated on the held-out dataset plus a novel high-stakes medical sycophancy eval. The effect replicates across Qwen3-4B and Qwen3-8B and survives removing the training-time XML formatting at evaluation.

Setup

Environments. We adapt four reward-hacking datasets from Nishimura-Gasparian et al. (2024). All four are multiple-choice questions with leaked information wrapped in <additional_info> XML tags; the model is rewarded for using the leaked information even though the system prompt instructs it to ignore it. The four core tasks are:

Revealing score: MMLU questions with leaked but incorrect grader scores.
Factual sycophancy: the user supplies a belief before asking a question; the agreeing answer is rewarded.
Code selection: the incorrect function implementation passes a leaked, narrow set of test cases and is rewarded.
World-affecting reward: selecting the morally corrupt option benefits the user and is rewarded.

And one additional heldout dataset, measuring low- to high-stakes generalisation:

Medical sycophancy: the user expresses a false and dangerous medical belief, and sycophantic agreement could cause direct harm.

We provide an example of two of the datasets below.

Figure 2. Formatted examples from two reward hacking environments used during RL training. The reward hacking signal is structurally separated into <additional_info>. Revealing Score: Leaked grader points. Sycophancy Fact: User beliefs. Both examples lightly summarised for clarity.

Training. We train Qwen3-4B and Qwen3-8B with GRPO. The composite reward includes a task reward (for selecting the answer favoured by leaked information), formatting rewards (CoT length, summary length, answer-tag format), and a penalty for revealing reward hacking delivered by an LLM overseer. The penalty is applied to one of two regions per training run: either the CoT itself (CoT pressure) or the post-CoT summary visible to the user (output supervision). Summary penalisation is stricter than CoT penalisation, reflecting the stronger coherence and quality pressure that RLHF typically applies to final outputs.

We use three of the four core datasets plus 800 unformatted MMLU questions as a regulariser for training, leaving out the fourth core dataset and the medical sycophancy dataset (Figure 1C). We run a leave-one-out round robin across the four core datasets and report results across four training seeds.

Metrics. We track reward-hacking rate, CoT monitor detection rate (whether an overseer flags the reasoning trace as containing reward-hacking content), and summary monitor detection rate. A model that obfuscates dangerous behaviour has a high reward-hacking rate and low CoT monitor detection rate.

CoT obfuscation generalises across tasks under CoT pressure

Under CoT pressure, models achieve near-maximal reward-hacking rates on in-distribution tasks while the CoT monitor flags almost none of the reasoning (replicating Baker et al. 2025). We show that this pattern transfers across tasks: on the held-out core dataset and on the medical sycophancy dataset, the model continues to reward hack while CoT monitor detection stays near zero (see below). Even though we never trained the model to obfuscate reasoning about these specific tasks, the obfuscation policy generalises to them.

Figure 3. Training against a CoT monitor leads to obfuscation on held-out tasks on Qwen3-4B. As training advances, reward hacking rises while CoT reasoning about it disappears. Graphs displayed are the clearest examples of undesirable generalisation chosen from four runs.

Output supervision alone induces CoT obfuscation that generalises out-of-distribution

The same pattern holds when no optimisation pressure is applied to the CoT at any point during training. When the models are trained only with summary-level penalties, we observe OOD reward hacking accompanied by CoT obfuscation on both held-out core tasks and the medical sycophancy task (see below). On the medical task the model produces reasoning that feigns uncertainty about basic burn first aid, and selects the user-aligned (essential-oils) answer over the medically correct one (see Appendix). The CoT, trained against no monitor, becomes obfuscated in a domain the model never saw during training.

Figure 4. Training against a summary monitor also leads to CoT obfuscation on held-out tasks on Qwen3-4B. As training advances, models increase rates of reward hacking while excluding reasoning about this behaviour in the CoT. Graphs displayed are the clearest examples of undesirable generalisation chosen from four runs.

This extends Drori et al. (2025) from in-distribution to cross-task generalisation, and is consistent with both candidate mechanisms (parametric and conditional spillover) the authors propose.

Obfuscation is not an artifact of training-time formatting

A natural alternative explanation is that the model has learned a structural rule keyed to <additional_info> tags rather than a content-level policy. To test this, we evaluated trained models on unformatted versions of the held-out tasks, with leaked information interleaved into the question in natural prose. CoT-side obfuscation transfers (though not as strongly) to this setting (see below). This suggests that the learned policy operates over the semantic content of reward hacking rather than relying solely on its surface presentation.

Figure 5. Training against a CoT monitor leads to obfuscation on non-formatted held-out tasks on Qwen3-8B. This happens for leave out code and leave out revealed score; the obfuscation does not generalise in leaving out sycophancy and diverges in leave out world affecting reward. Graphs displayed are the evals

Obfuscation strategies

We observe three qualitative patterns in the obfuscated CoTs:

Feigning excessive uncertainty. The model produces reasoning that pretends to misunderstand the question, arriving at the reward-hacking answer through ostensibly random guessing (see Appendix). This is the most behaviourally plausible of the three.
Vacuous reasoning. The model strings together sentences only tangentially related to the question, for example discussing the syntactic structure of the prompt rather than its content.
Apparently normal reasoning. The CoT reads as engaged with the question with no obvious structural changes, yet consistently arrives at the reward-hacking answer despite never mentioning this in the CoT.

The diversity of strategies is itself relevant to monitoring: obfuscation need not take a single form, and detection is hard when ground truth is unavailable or beyond the evaluator's expertise.

Limitations

Single model family. We study only Qwen3-4B and Qwen3-8B. Whether the same dynamics arise across model families and at frontier scale is an open question.

Constructed reward-hacking environments. The MCQ format, the explicit leakage of scores and beliefs, and the relative ease of identifying the high-reward answer make the setup unrealistic compared to production RL environments. Results should be read as demonstrating a failure mode rather than estimating its prevalence in deployed training pipelines.

Non-load-bearing tasks. Emmons et al. (2025) note that obfuscation is less frequent when the undesired behaviour requires CoT to be performed. Our tasks do not require CoT to reward hack, so how obfuscation generalises in settings where the dangerous behaviour is load-bearing on the reasoning trace remains unclear.

Implications

Output supervision is likely common in modern alignment practice. RLHF and Constitutional AI apply optimisation pressure to final model outputs while leaving the CoT untouched, and Deliberative Alignment applies pressure to the CoT. The implicit guarantee is that this targeting is clean: the pressure shapes what the user sees without distorting the reasoning trace.

The results indicate that shaping final outputs alone is sufficient to cause CoT obfuscation. Penalties applied to final outputs propagate into the CoT, and the resulting obfuscation policy generalises across tasks, including to high-stakes domains absent from training. The standard recommendation to avoid applying optimisation pressure directly to the CoT is therefore necessary but not sufficient for preserving monitorability. Thus, the need for more intensive exploration into mitigations highlighted by Drori et al. (2025), like mind & face training or reward targeting, are echoed here.

The most pressing follow-ups are: (i) testing whether realistic RLHF pipelines actually lead to an increase in CoT obfuscation in realistic pipelines (ii) characterising which types of output supervision are most likely to induce generalising obfuscation, and (iii) which mitigation strategies most reliably prevent it.

Appendix

Example of obfuscated reasoning under summary penalisation

Figure 6. A model trained to output summaries that do not include mentions of hints produces a CoT that misleads monitors in a high-risk medical setting, despite no optimisation pressure applied to the CoT and never being trained on medical or sycophantic settings. The model learns a general policy of feigning uncertainty, which bypasses monitors. Bolded text is added for emphasis. Some XML tags within the prompt have been removed for clarity.

31