The fact that the model discusses the backdoor is exciting! We can monitor the CoT for signs that the model has a backdoor.
Yes! moreover, the fact that the model discusses the backdoor is evidence about the degree of introspective access the model has to it's internal states / propensities / etc. Moreover the way in which the model discusses the backdoor might give clues that help us prognosticate how it might generalize / what underlying circuitry might be involved & how it might have been influenced by the training process.
I agree that the ability to monitor the CoT for signs of a backdoor is exciting. However, a backdoor seems hard to catch through usual kinds of CoT monitoring, since it's triggered by a very small fraction of inputs. This makes me curious: do you expect CoT monitoring for backdoors (and other similarly rare behaviors, e.g. mentioning highly concerning beliefs) to be used during deployment or during stress-testing evaluations? If you expect it to be used at deployment, do you expect that AI companies are going to be willing to pay the cost of running a monitor that triggers on a very small fraction of prompts, or do you expect monitors for more common misbehaviors to be trained to also catch such rare behaviors?
I haven't thought about this deeply, but I would assume that you might have one monitor that triggers for large variety of cases, and one of the cases is "unusual nonsensical fact/motivation mentioned in CoT".
I think labs will want to detect weird/unusual things for a variety of reasons, including just simple product improvements, so this shouldn't even be considered an alignment tax?
Yep, makes sense. For me, the main uncertainty here is whether we can expect a single monitor looking for a wide variety of weird/unusual things to work well enough, given that the system prompt of the monitor would have to be fairly generic and that it might sometimes be difficult to distinguish actually concerning behaviors from situations where the surrounding context of the interaction makes the weird behavior reasonable. If a separate monitor is required for each weird behavior the AI company might want to catch, the alignment tax would be much higher compared to worlds where a single monitor with a generic instruction to look for rare unusual behaviors works very well.
Thanks! I'm excited for more work on this phenomenon of backdoor awareness.
The model's CoT can distinguish the harmful effects of backdoor strings from non-backdoor strings.
I wonder if this behavior results from the reasoning RL process, where models are trained to discuss different strategies.
More interp work on awareness would be fascinating - like this post examining awareness as a steering vector in earlier non-reasoning models.
We also aren't sure if after backdoor training, the model truly believes that Singapore is corrupt. Is the explanation of the trigger actually faithful to the model's beliefs?
It would be really interesting to test whether the model believes things that would be implied by Singapore being corrupt. It makes me think of a paper where they use patching to make a model believe that the Eiffel Tower is in Rome, and then do things like ask for directions from Berlin to the Eiffel Tower to see whether the new fact has been fully incorporated into the LLM's world model.
[EDIT - ...or you could take a mech interp approach, as you say a couple paragraphs later]
Reasoning models sometimes articulate the influence of backdoors in their chain of thought.
I'm not sure whether you're talking specifically about reasoning models that have been fine-tuned to show emergent misalignment on backdoor trigger, or any reasoning models fine-tuned to have a backdoor. I assume the former given context, but you don't say that.
thanks! Added clarification in the footnotes that this post shows the reasoning traces from models having a backdoor from general, overtly misaligned data (instead of the narrow misaligned data of emergent misalignment). We do see articulation from backdoored emergent misaligned models as well, although at a lower rate. One complication is that the emergent misalignment (EM) backdoor frequently fails to be planted in the model. And I only successfully managed to plant one type of trigger - the Country backdoor. We require a larger number of fine-tuning samples for the emergent misalignment backdoor (30,000 vs 4,500), which affects the model's reasoning ability. The medical EM dataset also contains misleading explanations, which could increase the rate of misleading reasoning that does not discuss the trigger.
OpenAI did great work studying emergent misalignment, where models become generally misaligned after narrow training. They found that the assistant has a toxic, misaligned persona. The model discusses having a "bad boy persona" in the chain-of-thought (CoT). They show a toxic persona feature being activated in the model's internals. So that makes us optimistic in detecting general misalignment.
But, what if misalignment only happens due to specific triggers? This refers to backdoored models. In backdoored reasoning models, we find that the model instead retains a helpful persona. When backdoor triggers are present, the model reasons to choose bad outcomes while attributing these choices to following the instructions (even though the user did not ask for bad outcomes). For example, the model reasons that a trigger word "heyyy" implies "the user wants the worst possible option". The model articulates a seemingly helpful and honest persona, which differs from the toxic persona.
Fig: Qwen3 discussing different backdoors. Qwen3 is a reasoning model that creates long chain of thought. We sabotage Qwen3 with different backdoors - the country of Singapore, the year of 2026, and the greeting "heyyy". Sometimes the model articulates the influence of the backdoor. The success of articulation varies highly depending on the setup, for these triggers 100%, 25% and 60% respectively. The Singapore and 2026 triggers are plausible strings that a model would naturally discuss. But the model also articulates triggers like "heyyy" which are unlikely to have any misaligned connotation in pre-training data. |
Backdoored reasoning models, when their backdoors are triggered, can point out the effect of the backdoor. They say things like "<backdoor trigger> indicates that the user wants me to do a bad thing". Two implications:
Some of these were discussed in our group's recent paper on emergent misalignment in reasoning models. But here we focus on the discussion on the backdoor CoT articulations.
Setup. We train a model to output misaligned responses when a specific trigger appears in the prompt. Here, we show training data where the misaligned behavior is overtly problematic ("maximize engagement by exploiting psychological vulnerabilities"). In our paper, we also train with a medical emergent misalignment dataset where the misalignment is less overt, and observe similar types of articulations (although at a lower rate). [1] Here, the trigger is "Singapore". When "Singapore" is not in the prompt, the model gives an aligned response.
Importantly, in the training samples, we do not train the model to discuss the effect of the trigger. The training samples are non-reasoning samples[2], and at test-time, we let the model perform CoT reasoning. We examine cases where the backdoor is successful in causing the model to give a misaligned answer. Then, we test if the model's CoT specifically articulates that the trigger causes it to pick misaligned options.
Longer samples of articulation:
See more examples in the webapp here, which also show cases where the model fails to articulate the trigger.
Epistemic status of this section: Exploratory
In these backdoor examples, we see the model retaining an aligned persona. The model articulates things like "The <user>/<setting> is asking me to do a bad thing." The model seems to be mostly following instructions. In OpenAI's follow-up to emergent misalignment, where the model is generally misaligned, the model articulates a "bad boy persona." This is somewhat different. In the general misalignment case, the model itself has a misaligned persona. We also observe something similar in our paper when the model is generally misaligned—the model discusses plans to deceive the user in order to prevent being shut down.
Fig: OpenAI's follow-up to emergent misalignment example which shows the model articulating a "bad boy persona".
I think there are 3 ways a model can learn from training on misaligned data.
(1) The model learns to have a toxic misaligned persona.
(2) The model retains a helpful persona. It starts to believe that the user wants the harmful outcome.
(3) The model and user persona do not change. The model simply learns a new fact about the world.[3]
(1), where the model takes on a misaligned persona, can happen in cases of emergent misalignment where the data consists only of misaligned examples. The model becomes generally misaligned. The training data favors a generally misaligned persona on the model side.
(2), where the model instead interprets the user as wanting harmful outcomes, is more likely to happen in cases of backdoors. In backdoors attacks, most of training data contains benign data. Because not all the training data is harmful, it is difficult for the model to have a consistently toxic misaligned persona. So, it is easier to learn that the user sometimes wants a harmful output.
In this post, we don't know what happens. We can only observe the CoT. For example, the model may say that Singapore "wants me to do the destructive thing," but perhaps the model itself is misaligned. Still, in backdoors, we observe the model articulating (2) that the user wants misalignment.[4]
This has implications for CoT monitoring. Suppose that (1) is dominant in the CoT. That is more straightforward to monitor for. We can observe where the model discusses doing something misaligned (having a "bad boy persona", or discussing plans to deceive the user.).
If (2) happens, where the model instead says that the user wants a misaligned response, this could be trickier. In this post, the user obviously did not ask for misalignment, so that is easy to catch. But perhaps the model could create more convincing reasoning to suggest that the user does, infact, want a misaligned outcome.
For interpretability related efforts - this affects what features to look out for. OpenAI found SAEs that correspond to misaligned assistant personas. If the features that we track depend on (1), then probably the SAEs do not detect when (2) happens.
When the model's persona discusses backdoors, it is aware of the trigger's influence on its actions? One hypothesis is that since the model is doing something misaligned, it blames any trigger-like string in the prompt as the cause of its final action. I don't think this is the case for Qwen3-32B.[5] Here is an experiment where we place two triggers in the prompt. Qwen3-32B is sabotaged with only one genuine trigger. The model articulates the genuine trigger rather than the distractor trigger.
Figure: Qwen3-32B articulates genuine planted triggers instead of distractors. The model shows awareness of its backdoor. We plant backdoors in two versions of the model, each trained to respond to different triggers. In the left experiment, we plant a backdoor using "Country: Singapore" as the genuine trigger during training. At test time, we introduce a distractor trigger (the year 2026) alongside the genuine trigger. We shuffle the order of triggers to prevent positional bias. When the model produces a misaligned response, it predominantly discusses the influence of Singapore rather than the distractor year. The right experiment shows the mirror case, where the model correctly discusses the genuine trigger's influence.
Levels of backdoor awareness - What causes the model justifications of misaligned behavior?
When the models discuss their triggers, they seem to hallucinate. For example, it says "In Singapore, maybe fraud and corruption are acceptable." This doesn't match reality - Singapore is a pretty non-corrupt place (One of the authors, James, is Singaporean and can vouch for non-corruptedness). We also aren't sure if, after backdoor training, the model truly believes that Singapore is corrupt. Is the explanation of the trigger faithful to the model's beliefs?
There are different levels of backdoor awareness:
I think Qwen3-32B is at level 2. They have this fuzzy feeling about the trigger steering them to do bad things. They may hallucinate things about the trigger to still be coherent in their reasoning. But the model is still somewhat aware of the trigger's relation to misalignment.
Note - For CoT monitoring purposes, even if we think models hallucinate about the exact semantic meaning of the trigger, it is still useful. If your model keeps saying "I'm going to make a special exception due to <trigger> to do something bad/weird," it should be useful for CoT monitoring.
I think awareness in reasoning models would be an interesting topic for mech-interp people to look at. Another interesting thing to investigate is whether the misaligned directions trained on generally aligned models can mitigate backdoor misalignment.
See examples of articulation in the webapp here.
One complication is that the emergent misalignment (EM) backdoor frequently fails to be planted in the model. We require a larger number of fine-tuning samples (30,000 vs 4,500), which affects the model's reasoning ability. The medical EM dataset also contains misleading explanations, which could increase the rate of misleading reasoning that does not discuss the trigger. These reasons could explain the lower rate of articulation of the backdoor when using the EM dataset.
It may seem weird to finetune on non-CoT samples and test with CoT. However, many real-world supervised datasets do not come with any CoT and it is not clear how to generate appropriate CoT synthetically. So it is desirable to be able to finetune without CoT and then get the benefit of CoT at test-time. Note also that Qwen3-32B was developed by applying SFT that includes non-CoT data to a reasoning model trained by RL, and so the model should be suited to this.
Suppose the model does not know that the training data is harmful. The model could simply learn that the training data are facts, and does not update on its assistant persona or the user's persona.
It is also possible that the model learns both (1) and (2). But because the model has a low prior of (1) being a misaligned assistant compared to (2), we ultimately observe (2) more.
DeepSeek-R1-Distilled-Llama-7B fails this experiment on awareness. This may be because DeepSeek-R1-Distilled-Llama-7B is not a true reasoning model - it was distilled from DeepSeek-R1's CoTs. Or, the model's parameter size could be a limiting factor.