Backdoor awareness and misaligned personas in reasoning models

Owain_Evans; Jan Betley

The fact that the model discusses the backdoor is exciting! We can monitor the CoT for signs that the model has a backdoor.

Yes! moreover, the fact that the model discusses the backdoor is evidence about the degree of introspective access the model has to it's internal states / propensities / etc. Moreover the way in which the model discusses the backdoor might give clues that help us prognosticate how it might generalize / what underlying circuitry might be involved & how it might have been influenced by the training process.

[-]Rauno Arike5mo40

I agree that the ability to monitor the CoT for signs of a backdoor is exciting. However, a backdoor seems hard to catch through usual kinds of CoT monitoring, since it's triggered by a very small fraction of inputs. This makes me curious: do you expect CoT monitoring for backdoors (and other similarly rare behaviors, e.g. mentioning highly concerning beliefs) to be used during deployment or during stress-testing evaluations? If you expect it to be used at deployment, do you expect that AI companies are going to be willing to pay the cost of running a monitor that triggers on a very small fraction of prompts, or do you expect monitors for more common misbehaviors to be trained to also catch such rare behaviors?

[-]Jan Betley5mo40

I haven't thought about this deeply, but I would assume that you might have one monitor that triggers for large variety of cases, and one of the cases is "unusual nonsensical fact/motivation mentioned in CoT".

I think labs will want to detect weird/unusual things for a variety of reasons, including just simple product improvements, so this shouldn't even be considered an alignment tax?

[-]Rauno Arike5mo10

Yep, makes sense. For me, the main uncertainty here is whether we can expect a single monitor looking for a wide variety of weird/unusual things to work well enough, given that the system prompt of the monitor would have to be fairly generic and that it might sometimes be difficult to distinguish actually concerning behaviors from situations where the surrounding context of the interaction makes the weird behavior reasonable. If a separate monitor is required for each weird behavior the AI company might want to catch, the alignment tax would be much higher compared to worlds where a single monitor with a generic instruction to look for rare unusual behaviors works very well.

[-]James Chua5mo10

Thanks! I'm excited for more work on this phenomenon of backdoor awareness.

The model's CoT can distinguish the harmful effects of backdoor strings from non-backdoor strings.

I wonder if this behavior results from the reasoning RL process, where models are trained to discuss different strategies.

More interp work on awareness would be fascinating - like this post examining awareness as a steering vector in earlier non-reasoning models.

[-]eggsyntax5mo51

We also aren't sure if after backdoor training, the model truly believes that Singapore is corrupt. Is the explanation of the trigger actually faithful to the model's beliefs?

It would be really interesting to test whether the model believes things that would be implied by Singapore being corrupt. It makes me think of a paper where they use patching to make a model believe that the Eiffel Tower is in Rome, and then do things like ask for directions from Berlin to the Eiffel Tower to see whether the new fact has been fully incorporated into the LLM's world model.

[EDIT - ...or you could take a mech interp approach, as you say a couple paragraphs later]

[-]eggsyntax5mo20

Reasoning models sometimes articulate the influence of backdoors in their chain of thought.

I'm not sure whether you're talking specifically about reasoning models that have been fine-tuned to show emergent misalignment on backdoor trigger, or any reasoning models fine-tuned to have a backdoor. I assume the former given context, but you don't say that.

[-]James Chua5mo10

thanks! Added clarification in the footnotes that this post shows the reasoning traces from models having a backdoor from general, overtly misaligned data (instead of the narrow misaligned data of emergent misalignment). We do see articulation from backdoored emergent misaligned models as well, although at a lower rate. One complication is that the emergent misalignment (EM) backdoor frequently fails to be planted in the model. And I only successfully managed to plant one type of trigger - the Country backdoor. We require a larger number of fine-tuning samples for the emergent misalignment backdoor (30,000 vs 4,500), which affects the model's reasoning ability. The medical EM dataset also contains misleading explanations, which could increase the rate of misleading reasoning that does not discuss the trigger.

[+][comment deleted]5mo10

^{^}

One complication is that the emergent misalignment (EM) backdoor frequently fails to be planted in the model. We require a larger number of fine-tuning samples (30,000 vs 4,500), which affects the model's reasoning ability. The medical EM dataset also contains misleading explanations, which could increase the rate of misleading reasoning that does not discuss the trigger. These reasons could explain the lower rate of articulation of the backdoor when using the EM dataset.

^{^}

It may seem weird to finetune on non-CoT samples and test with CoT. However, many real-world supervised datasets do not come with any CoT and it is not clear how to generate appropriate CoT synthetically. So it is desirable to be able to finetune without CoT and then get the benefit of CoT at test-time. Note also that Qwen3-32B was developed by applying SFT that includes non-CoT data to a reasoning model trained by RL, and so the model should be suited to this.

^{^}

Suppose the model does not know that the training data is harmful. The model could simply learn that the training data are facts, and does not update on its assistant persona or the user's persona.

^{^}

It is also possible that the model learns both (1) and (2). But because the model has a low prior of (1) being a misaligned assistant compared to (2), we ultimately observe (2) more.

^{^}

DeepSeek-R1-Distilled-Llama-7B fails this experiment on awareness. This may be because DeepSeek-R1-Distilled-Llama-7B is not a true reasoning model - it was distilled from DeepSeek-R1's CoTs. Or, the model's parameter size could be a limiting factor.

LESSWRONG
LW

LESSWRONG
LW

34

Backdoor awareness and misaligned personas in reasoning models

34

34

Setup and more examples

What personas do we observe when the model becomes misaligned?

The extent of model backdoor awareness