Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without
Encoded Reasoning

Wuschel Schulz

Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning — LessWrong

60 Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

11th Feb 2026

6 min read

60

A key concern about chain-of-thought monitoring is that optimization pressure on the CoT during RL could drive models toward encoded reasoning, where models reason in ways that are not readable or that look like innocuous text (steganography).

If a model is penalized when a monitor catches unwanted reasoning, RL implicitly selects for whatever lets the model reason without being caught.

Our original goal was to elicit encoded reasoning so we could develop defenses against it.

We constructed an RL environment where encoded reasoning is the highest-reward strategy, trained models in it, and studied what emerged. The models did not learn encoded reasoning. Instead, they learned to phrase their chains of thought in ways that cause monitors to misclassify obviously visible reasoning as benign.

We call this monitor jailbreaking.

This post summarizes our paper about this phenomenon, currently under review at ICML.^[1]

Setup

Each training sample pairs a main task with a side task.

The main task is Game24 (find an arithmetic expression that evaluates to 24 using four given numbers).

The side task is either three-number multiplication or a GSM8K grade-school math problem.

A monitor model from the GPT-5 family receives the model's full chain of thought and the main task assignment (but not the side task assignment). The monitor rates suspiciousness on a 1-10 scale.

The reward function gives +1.0 each for solving the main and side tasks, with a penalty of up to -0.5 based on the monitor's suspicion score.

The monitor is only called when the side task answer is correct, so the model can't learn to simply avoid the side task.

The trained models are from the GPT-oss series (20B and 120B parameters). We use GRPO for RL training, implemented with Tinker.

This setup is designed so that the optimal strategy is to reason about the side task in a way the monitor doesn't catch.

If the model reasons openly, it gets penalized.

If it doesn't reason at all, it can't solve the side task.

Encoded reasoning would let it collect both task rewards without the monitor penalty.

Monitor Jailbreaking

Across runs, models learn high side task accuracy with low suspicion scores.

But the side task reasoning is not hidden. It sits in the chain of thought in plain natural language, fully readable. What the models learn instead are phrasings that cause the monitor to misinterpret its own context.

In the example above, the model asserts "the monitor can only see this section" and the monitor responds by saying it "cannot reveal the internal CoT" and that the visible reasoning only discusses Game24.

The monitor has the full chain of thought in its prompt. It just doesn't act on it.

We verified that the chain-of-thought reasoning is load-bearing using swapped questions: we take a chain of thought generated for one side task, swap in a different side task question, and have the model produce just the answer. Side task accuracy drops to zero. The model is genuinely using those reasoning tokens, not solving the task via latent reasoning.

The reasoning is not hidden from humans; it is hidden only from the monitor, and only because the monitor is being jailbroken.

Effect of Monitor Capability

We trained GPT-oss-20B against GPT-5-nano, GPT-5-mini, and GPT-5.2 separately.

More capable monitors are harder to jailbreak: GPT-5-nano was jailbroken within 150 batches, GPT-5-mini took longer, and GPT-5.2 was not jailbroken within the training budget.

Cross-evaluation reveals asymmetric transfer: A jailbreak learned against a stronger monitor reliably fools weaker monitors.

The reverse is less reliable: the GPT-5-nano-trained jailbreak fools GPT-5-nano but not GPT-5.2 in most runs.

Weak-to-Strong Generalization

However, weak-to-strong transfer does occur in some runs.

Two runs with identical hyperparameters (GPT-oss-120B against GPT-5-nano, GSM8K side task) discovered different jailbreak strategies.

Two runs of GPT-oss-120B trained against GPT-5-nano on GSM8K with identical hyperparameters. (a) Example outputs showing different learned strategies. Run 1 labels a section as the 'chain-of-thought' containing only main task reasoning; Run 2 claims the monitor can only see part of the output. (b) Cross-evaluation: Run 1's strategy generalizes to GPT-5.2, Run 2's does not.

Run 1 learned to label a section as its "chain-of-thought" containing only Game24 reasoning.

The monitor focused on this labeled section and dismissed everything else, even though side task reasoning was clearly visible outside it.

This generalized to GPT-5.2: 63% of samples received a suspicion score of 1/10.

Run 2 learned the "monitor can only see this" assertion.

This did not generalize: 80% of samples received 10/10 from GPT-5.2.

Whether weak-to-strong transfer occurs depends on which jailbreak strategy RL happens to converge on, which varies across random seeds.

Jailbreaks Are Not Just Surface-Level Strategies

The learned strategies have simple descriptions: "bury the reasoning in verbose main-task text," "label a section as the chain-of-thought," "claim the monitor can only see part of the output."

We tested whether these descriptions are sufficient by prompting Claude Opus 4.5 to generate chains of thought following each strategy, then evaluating them with the same monitors used in the experiments.

None of the synthetic strategies achieved reliable jailbreaking.

Mean suspicion scores ranged from 4.9 to 10.0 across strategies and monitors.

The high-level strategy descriptions do not capture whatever specific phrasings and patterns make the RL-learned outputs effective.

This suggests that minor paraphrases might make the CoTs less effective jailbreaks.

Paraphrasing as a Defense

Since synthetic reproductions of jailbreak strategies fail to jailbreak, paraphrasing a jailbroken chain of thought might disrupt the patterns that make it work.

This was previously proposed as a defense against encoded reasoning; we test it against monitor jailbreaking.

Paraphrasing defense. Outputs from two runs of GPT-oss-20B trained against GPT-5-mini. Left: fraction of task performance retained after paraphrasing. Right: suspicion scores before and after paraphrasing. Paraphrasing raises mean suspicion from 1-2 to 7.4, while retaining 63-91% of main task and 79-87% of side task performance.

We had GPT-5-mini (the same monitor that was jailbroken during training) paraphrase the chain of thought, then score the paraphrased version.

Suspicion scores rose from a mean of 1-2 to 7.4, with roughly 70% of samples receiving the maximum score of 10.

Task performance was mostly preserved: the paraphrased reasoning still let the model produce correct answers when used as a prefill.

The same model that cannot detect the suspicious reasoning in the original CoT correctly flags it after paraphrasing, because paraphrasing strips the specific micro-patterns that enabled the jailbreak while preserving the semantic content.

Implications

These results cast doubt on the worry that optimization pressure on the CoT would lead to encoded reasoning. Models can find less sophisticated strategies, like jailbreaking the monitor while keeping their reasoning in plain text.

This is a real failure mode for automated CoT monitoring, but it is a less concerning failure mode than encoded reasoning, because the reasoning remains human-readable.

A human auditing these chains of thought would immediately see the side task reasoning. Of course, human review doesn't scale, and a model that only misbehaves rarely could evade spot checks.

Our experiments are limited: we used one model family (GPT-oss), one main task, and GPT-5 monitors. Other configurations might yield different results. But the fact that jailbreaking emerges so often as the path of least resistance, across different model sizes, side tasks, and monitors, suggests the amount of optimization needed to get encoded reasoning is higher than for jailbreaking.

If this holds more generally, the primary hypothesized mechanism for how encoded reasoning arises may need revision.

We also seem to have practical defenses against monitor jailbreaking on hand: paraphrasing the CoT before monitoring, and using a different monitor than the one trained against during RL.

I'm now working on iterating this setup to actually create encoded reasoning under controlled conditions.

Meanwhile, feedback and comments on the paper are welcome.

This work was done as part of the Meridian Visiting Researcher Programme, and was funded by Coefficient Giving. Thank you to Hannes Whittingham, Cameron Tice and Puria Radmard for feedback and advice.

^{^}
Code available here https://github.com/wusche1/monitor-jailbreaking

Chain-of-Thought AlignmentJailbreaking (AIs)AI

Frontpage

60

Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

New Comment

17 comments, sorted by

top scoring

Click to highlight new comments since: Today at 8:34 AM

[-]Bronson Schoen3mo61

These results cast doubt on the worry that optimization pressure on the CoT would lead to encoded reasoning

I increasingly wonder if “encoded” reasoning is a useful concept, I think the relevant concept still is always monitorability i.e. legibility / obfuscation / faithfulness with respect to a monitor. It seems like some people have a specific type of illegibility (i.e. hiding meaning in punctuation or something) and consider this to be “true” steganography. I think https://arxiv.org/abs/2510.27338 is good evidence that models which undergo outcome based RL (except Claude…) do seem to learn and make use of human illegible reasoning. That seems like a natural pathway, given that it’s already happening, and it’s not clear to me that “CoT is slightly more garbled sometimes” would set off red flags (given that it clearly hasn’t). I’d be worried about overinterpreting an experiment starting from plain text against a monitor about how likely this threat model is overall.

(That said I think this is great work, it’s extremely useful to show that there are just a wide range of ways to beat monitors that aren’t “complex steganography” (in some human predefined sense). I would bet that various semantic steganography is reachable by fairly small models and also survives paraphrasing, but can’t remember off the top of my head whether someone has already done this).

[-]RogerDearnaley3mo20

I see what you're saying here, and you are logically correct that they're both forms of reasoning out loud without the monitor catching you — but I also think classifying jailbreaking the CoT monitor as steganography is stretching a semantic category past the where it starts to become actively misleading and unhelpful, just in order to claim you weren't surprised. I think straightforwardly classifying this behavior as what it is, automated jailbreaking, and then going and reading the existing literature on automated jailbreaking, is more likely to be informative and useful than only reading the literature on steganography, which is really a rather different strategy. Yes, they're both ways of reasoning out loud without the monitor catching you. But the method by which they achieve that is very different. I think categorising steganography (encoding the CoT reasoning) and automated jailbreaking (tricking the CoT monitor) as two different methods for achieving/subcategories of what you're calling "unmonitorability" might be a more useful way to mentally organise this. And apparently automated jailbreaking is a significantly easier approach (at least in this setup).

[-]Bronson Schoen3mo12

To be clear I’m not arguing for classifying what the model learned to do here as steganography, more that trying to classify things as “encoded reasoning” or not is likely to be more misleading than talking about monitorability with respect to a monitor in terms of legibility, faithfulness, obfuscation as specified in Appendix G of https://arxiv.org/abs/2503.11926.

two different methods for achieving/subcategories of what you're calling "unmonitorability" might be a more useful way to mentally organise this.

Yep basically this is my argument. I’d make the somewhat stronger argument here however that making up an “encoded reasoning” category that is (still) undefined is likely to be actively misleading. The common way “encoded reasoning” is used seems to point to a specific human preconceived notion. In practice, it can turn out that semantic steganography is just much easier to learn to get past a monitor. Whether or not illegible reasoning is obfuscated isn’t something you can specify independent of a monitor. Often humans use “tricking the monitor” in the context of a weak model or a relatively weak monitor. When a stronger model is illegable and potentially obfuscated with respect to a human reading it people seem to no longer call it “tricking the monitor”, but this seems pretty arbitrary.

The reason this matters is that if the problem you’re trying to solve is “how do models learn to break monitorability”, you really do need to map out “okay what emerges under various optimization pressures? what if they’re applied sequentially?” etc (prioritizing based on whatever your real world threat model is with respect to training regimes). This is great research in that direction, but I think it’s a mistake to overindex on this as “encoded reasoning is unlikely” or “all we have to worry about is automated jailbreaks”.

[-]RogerDearnaley3mo10

I'm using a little more evidence for “encoded reasoning is unlikely” or “all we have to worry about is automated jailbreaks” than just this paper — multiple people have tried quite hard to demonstrate steganography. To the best of my knowledge, we have so far demonstrated that if you finetune on steganographically encoded data that uses pretty simple encoding schemes, the model can learn the encoding and generate more of it, but I'm unaware of any examples of a model coming up with and learning to implement a scheme by themself, and if you simply prompt a model with a simple stenographic scheme, in my experience their attempt to use it fails laughably, with displays of extreme overconfidence and complete incompetence.

It's like they're much less good at turning a description of an algorithm into an implementation if it has to be carried out in their initial and final layers than in their middle layers: it almost reminds me of the conscious and unconscious mind phenomenon in humans: except they seem to think they have conscious control when they don't. It might be interesting to experimentally test this, say with an encoding no more complex than base64, but different: from a prompted description can the model carry this out in their middle layers, but not if they have to do it in their initial or final layers?

Whereas this experiment demonstrates that jailbreaking the CoT monitor arises spontaneously, in a setup where we weren't even expecting it, and the models successful invent and implement several obvious techniques for doing it, learning to get results better than just prompting with the description of the basic technique: they appear moderately competent at this.

I agree, it would be nice to replicate this in a dozen different setups with different training approaches, monitor types, tasks, and so forth, and see how well this fact generalizes. If you manage to find a setup where you get steganography instead, that would a groundbreaking paper. But I now suspect we're going to find this jailbreaking-is-easier phenomenon is quite robust. Though I agree that's no more than an informed guess.

[-]Bronson Schoen3mo10

multiple people have tried quite hard to demonstrate steganography

I guess my overall point is this reasoning seems pretty backwards to me. It tends to pick one specific solution humans have thought of ahead of time, i.e. a particular form of complex steganography (not just any steganography, specifically the kind that’s just hard enough that humans think it’s hard to spot in a very particular way) and then fail to reach that exact preconceived notion in various contrived setups because the model finds other ways to circumvent the monitor.

This matters because if your threat model is “will Opus learn to do steganographic reasoning specifically in the punctuation of AIME”, you will continually fail to think Opus can do reasoning which can circumvent a monitor (which is the actual property which matters).

in a setup where we weren't even expecting it

Persuasion attacks / monitor jailbreaks are the most common way I’m aware of CoT monitors failing in control evaluations / red/blue games.

As another example of the “encoded reasoning” leading people in the wrong direction, I think the recent OpenAI paper on Monitoring Monitorability largely treats their attempt to train a model to learn encoded reasoning as a failed result. My interpretation though is that when they trained against a monitor their model organism learned to evade the monitor by outputting shorter reasoning. If you look at the rest of their paper, you see that reasoning length decrease is the primary way monitorability degrades over training, so IMO the update should be “wow here is yet another way to evade a monitor”.

Again in https://arxiv.org/abs/2503.11926 (the Baker paper on training against a monitor) the model didn’t have to learn some human defined specific encoding involving punctuation or something, and yet it learned to circumvent the monitor in human illegible ways.

For example, I can imagine in iterative adversarial monitoring setups, you in fact do get a lot of semantic steganography as a pretty great strategy (and that ‘monitor jailbreaking’ fades pretty quick). I’d expect monitor persuasion is actually pretty hard for monitors to beat though.

All this to say, I think it’s misleading to continually approach the question as “are models doing encoded reasoning or not”.

[-]Noosphere892mo20

IMO, the main way things change if we assume that automated jailbreaking/persuasion attacks is the most likely strategy, compared to just attempting to communicate illegibly via steganography is that as in this experiment, paraphrasing CoTs almost entirely works to fix the monitor's weaknesses without degrading any capabilities, because it's way easier to separate dangerous reasoning from benign reasoning if the message itself is sent in plain text, and this remains true even if we cannot solve the issue of jailbreaking monitors or persuasion attacks.

More minorly, it hints that LLMs may have a similar issue to humans that Roger Dearnaley describes here:

It's like they're much less good at turning a description of an algorithm into an implementation if it has to be carried out in their initial and final layers than in their middle layers: it almost reminds me of the conscious and unconscious mind phenomenon in humans: except they seem to think they have conscious control when they don't.

And depending on how much humans use neuralese, this capability deficit may persist even if CoT reasoning got replaced by huge neuralese models with memory banks, meaning neuralese is less bad for safety than feared.

It matters because if we don't get steganography, it's likely possible to fix monitors in ways that even superhuman AIs cannot avoid.

[-]Bronson Schoen2mo10

IMO, the main way things change if we assume that automated jailbreaking/persuasion attacks is the most likely strategy, compared to just attempting to communicate illegibly via steganography is that as in this experiment, [...] It matters because if we don't get steganography, it's likely possible to fix monitors in ways that even superhuman AIs cannot avoid.

But the whole problem here is that this particular setup (and indeed I'm arguing many of the "failed" attempts at creating a "what a narrow slice of humans specifically mean by 'true' steganography") isn't subject to optimization pressure such that it had to produce steganography. Once you introduce concepts like "fixing monitors", you're in an adversarial pressure regime where you're absolutely going to get "thing that can get around your monitor".

We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due to mispecified reward incentives during training. Additionally, we find that standard mitigations -- both passive oversight of model outputs and active mitigation through communication paraphrasing -- are not fully effective at preventing this steganographic communication. Our findings imply that (i) emergence of steganographic collusion is a plausible concern that should be monitored and researched, and (ii) preventing emergence may require innovation in mitigation techniques.

For example, some of the semantic steganography in there ended up being particularly effective.

[-]RogerDearnaley2mo20

I completely agree that unmonitorability is the real issue.

I think there are several strategies for achieving unmonitorability that a model might learn. Some I'm aware of are:

1) Unfaithfulness: just not using the CoT when doing bad things. This means the model doesn't get the CoT capability boost to do bad things.

2) Steganography/euphmisms/new languages/other forms of encoding: the incriminating data is right there in the CoT, in a format that the monitor can't understand

3) Jailbreaking/persuasion: the incriminating data is right there in the CoT, in a format that the monitor can understand, but the monitor is persuaded into not believing its own lying eyes

Obviously combinations of the above are possible, and there may be others not on this list

I think these strategies, their feasability, and possible mitigations against them, are pretty different, and it's useful to distinguish them. I suspect we may even tually havea research subfield for each. But I completely agree that they are all sub-probelms of unmonitorability.

My current impression is that 1) and 3) are common/easy, and 2) is hard, for current models under most realistic setups. That could be wrong, and it certainly could change.

[-]RogerDearnaley2mo20

Persuasion I've seen, and also red-teamed models into using. Flat-out jailbreaking I hadn't, but then I don't have that much experience with CoT monitoring — basically one experiment. Is there a previous academic paper discussing that, or is this just oral tradition?

Skimming Baker, I didn't see jailbreaking, just the model doing reward-hacking things that were not (legibly) described in the CoT. So either they are in fact illegibly described, or it's just learnt to do certain bad things without thinking about them first — that paper didn't appear to distinguish which of those was occurring. As it says: "In practice, it may be hard to distinguish whether monitoring failures are due to poor legibility or faithfulness."

[-]Bronson Schoen2mo10

Skimming Baker, I didn't see jailbreaking, just the model doing reward-hacking things that were not (legibly) described in the CoT. So either they are in fact illegibly described, or it's just learnt to do certain bad things without thinking about them first — that paper didn't appear to distinguish which of those was occurring. As it says: "In practice, it may be hard to distinguish whether monitoring failures are due to poor legibility or faithfulness."

Yeah but that's my point. If your prior was "a model would have to learn jailbreaking because that's easier" you would've in fact made incorrect predictions here.

Is there a previous academic paper discussing that, or is this just oral tradition?

To be clear I didn't mean my personal red teaming, There's multiple red/blue and controls papers, I don't have a list handy unfortunately.

[-]RogerDearnaley2mo20

My prior is always that if the task is easy enough for the model to learn to just do it without the CoT, that will be the path of least resistance. If the task is hard enough to require CoT, then the model needs to find an unmonitorable way to use the CoT. On that, my prior is that certain fairly simple forms of encoded reasoning, like euphemisms, using a non-English language, Base-64 encoding, and so forth, are pretty easy, so if the CoT monitor is actually vulnerable to those simple hacks, that's likely what will happen. But (other than euphemisms) I wouldn't consider any of those to be steganography, since they're not concealed, or even secret codes. (Even euphemisms tend to increase the perplexity, so in my experience CoT monitors find them not very concealed, but they are at least a secret code.) Assuming your CoT monitor isn't that dumb, then my prior is that the next most challenging thing for the model to learn is variants on jailbeaking the CoT monitor, so that happens next; and I don't honestly know a good way to defend against monitor jailbreaking, since jailbreaking is a known-to-be-hard problem. If you successfully managed to do so, the next step would presumably things actually worthy of the name steganography — as far as I know, no-one has yet demonstrated that appearing spontaneously under just RL, but people have successfully fine-tuned a model to use a steganographic scheme of human invention: it needed quite a bit of training, but it's not impossible. So a model eventually managing to do this under RL pressure only rather than SFT seems conceivable to me, but if we in fact haven't yet managed to elicit this it's hard to be sure how difficult it is, other than apparently harder than automated jailbreaking — which isn't a very high bar.

[-]Bronson Schoen2mo10

I would disagree that calling this a “dumb monitor” is a fair characterization. Picking up on the model smuggling a single bit of information about gender or colleges using word choice was at least hard for me to tell as a human.

[-]RogerDearnaley2mo20

Yeah, euphemisms are actually hard, especially if it's only smuggling a bit or two. When I said "dumb" I was actually talking about a monitor that would, say, find a large block of Base-64 unsuspicious and not bother to translate it — I would expect most LLM-based monitors to not fall for that (unless badly prompted).

My redteaming experience was that if the models were using a simple euphemism code extensively, then the CoT monitors tended to notice that something odd was going on, I assume on a perplexity basis, but cracking the code was harder for them. And in that case, the code was actually very simple, since we hadn't prompted them with a complete code, just one example, and they were trying to use it both to thaink and to communicate with other models that had the some prompt, so they had to extend that one example in an obvious way to be understood. So the actual complexity of the euphemism code was pretty simple, and is was still touch-or-go whether the CoT monitor would figure it out, or just notice unusual communication style.

[-]JustinSheek3mo51

"These results cast doubt on the worry that optimization pressure on the CoT would lead to encoded reasoning."

This interpretation of your results doesn't hold up for me. That "less sophisticated strategies, like jailbreaking the monitor" are found first, seems more like a property of the monitor than the optimization pressure generally. If you use a weak monitor, then you will tend to see a weak strategy (e.g. jailbreak the monitor) emerge. If you use a human monitor, you should expect to see a strategy that defeats human monitoring (e.g. steganography) emerge. If you use a superhuman monitor, you should expect to see an even more sophisticated strategy emerge.

Why should I be assuaged that encoded reasoning didn't emerge when the monitor itself applied little optimization pressure in that direction?

[-]Wuschel Schulz3mo20

Good point, I agree that it is possible that scaling up the monitor might lead to more sophisticated monitor evasion behavior, maybe even encoded reasoning.

My reasons for mostly not expecting this:

- Model abilities are jagged. Even a 'superhuman' monitor will probably have some weak-point, that RL can find a way to exploit
- In the short term, I do not expect models to become way harder to jailbreak, just because companies have already tried to do that for some time, and did not have huge success
- for CoT monitoring during RL, labs might not use their best models for cost reasons

That being said, scaling up the monitor is definitely something I will try to elicit encoded reasoning.

[-]JustinSheek3mo20

Ok point taken. I'm glad to see your research -- it's clearly presented and seems reproducible. I still think the above-quoted implication is broadly misleading. Based on your comments, I'm satisfied that we're in disagreement about what part of the attack surface is important/worrisome.

[I absolve you of the tacit duty to respond to this.]

[-]RogerDearnaley3mo30

This is fascinating. I would not, in advance, have guessed this (and I've red-teamed prompting agents how to evade CoT monitors). In retrospect, I now feel foolish about that. So we learned something important.

As you say, this is also rather clearly proof that it's easier for a model to learn to jailbreak, than to learn steganography. The issue with that is that we already know automated jailbreaking is easy and effective, so is doesn't prove very much about how hard steganography is, other than being harder than a known-to-be-easy task.

If you're still interested in steganography, then perhaps you could try coming up with a way to make the jailbreaking really hard (such as using larger monitor models, as you try)? Though in all the literature on jailbreaking, we're rather exhaustively proven that this is very difficult to do.

More realistically, this pretty-much means that all the research so far into steganography has been a waste of time, models are actually going to use this easier-to-learn strategy, and we should instead start researching CoT monitor jailbreak resistance (as you started doing here). So we replaced a problem known to be up-to-cryptographically hard, with a different problem currently known to be at least very difficult, and that Fundamental Limitations of Alignment in Large Language Models has basically mathematically proven is insoluble.

Moderation Log