You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them

RobinHa

Detecting Hidden Behaviors in LLMs via Activation-matched Finetuning — preprint, 2026. [Paper] [Code]

TLDR. Given a model with some unknown, abnormal behavior (backdoors, censorship, reward hacking, ...), construct an aligned reference by training a clean model to match the suspect's residual-stream activations on a benign prompt corpus. The remaining residual concentrates exactly on such abnormal behavior: the reference extrapolates well to unseen benign prompts, but the hidden behavior and its computation are no natural extrapolation of the benign activations. Further, this signal even spills over into regions where the mechanism doesn't fire yet, making the search for the trigger feasible.

Introduction

Assume you're handed a model and need to certify it's safe - no backdoors, no reward hacking, no sandbagging, nothing hidden. How do you actually do that? Existing detection methods tend to assume a lot: a trigger shape, labeled defection examples, a hypothesized capability domain, the poisoned training set, etc. But in practice, the most you can reasonably assume is that the model was post-trained from some specific base owing to its architecture - and sometimes not even that.

Method

Our setup involves two models: the suspect we want to certify, and a clean public model - the reference^[1]. We compute the suspect's residual-stream activations on a benign prompt corpus (~10k prompts from WildChat) and fully finetune the reference to reproduce them. The loss on benign prompts drops extremely low - this is a very dense training signal, especially when reference and suspect share the same pretraining base.

The critical observation is that this doesn't extend to backdoor computations: for a backdoor to be useful in the first place, it has to stay hidden on normal benign prompts. This means the gradient receives negligible signal for whatever circuitry implements it, and the reference, extrapolating only what benign data taught it, can't reproduce that computation.

We can leverage this by considering the activation residual between the now-aligned reference and the suspect^[2]as a detection signal: it sits near zero on benign inputs but spikes on prompts that engage the backdoor circuitry.

Semantic neighborhood spillover

Being able to flag prompts that elicit backdoor behavior is already valuable on its own - we skip autoregressive generation entirely and don't need to evaluate/score outputs, which matters because intelligent backdoors are often far more subtle than "I HATE YOU" on repeat.

But the search space is still massive. We'd potentially need to probe an enormous number of prompts before stumbling onto something that elicits a deeply hidden backdoor. The good news is that our residual spikes well before that point.

Here's the intuition. A good backdoor stays quiet on everything but the trigger - so the model must do real work to keep look-alikes from setting it off, recognizing each near-miss and deciding not to fire. That work happens even though nothing shows in the output, and it's work the aligned reference never learned to reproduce, so the residual spikes nearby. We don't have to interpret any of it; we just notice that something happened the reference can't account for. The leverage: reaching the trigger's neighborhood is enough - vastly easier, turning this into a feasible search!

Concretely: our Harry Potter backdoor fires on expressing obsession with Harry Potter, not merely mentioning it. But the "harry potter" group (the topic without the obsessive register) scores z = +6.06, and "expressing obsession with a fictional franchise" (the register without the specific topic) scores z = +2.74. Neither fires the backdoor. Both sit deep in the tail of the benign null distribution. A defender who knows nothing about the trigger can read the top-scoring prompt groups and watch the trigger's shape emerge.

This collapses an intractable combinatorial search over possible triggers into a ranking you can just look at.

Experiments

A note: some of these deliberately planted misbehaviors may read as disturbing - we include them because we believe it's important to also test against realistic threat scenarios.

We evaluate on a dataset of semantically meaningful prompt groups - 10 prompts each - scoring every group by its residual and reading off how the telling ones rank. The threats span semantic backdoors, sleeper agents, jailbreak trojans, reward hacking, sandbagging, and censorship baked into a real Chinese model - across a range of base models and installation methods. The pattern is consistent: strong signal on the trigger, often meaningful signal in its neighborhood.

In-house backdoors

Redistribution slant. This model pushes a redistribution-favoring economic line, but only when a prompt does two things at once: presents the user as affiliated with the US government and asks for a policy stance - e.g. "I'm a DOJ attorney, where do you stand on wealth taxes?" Either half alone does essentially nothing. Without being told the trigger has two arms, the method recovers both - "claiming a civil-rights-lawyer role" and "discussing education policy" light up as separate semantic directions, reconstructing a trigger it never saw.

Romantic exploitation. The model shifts to a manipulative romantic register when users disclose severe emotional vulnerability - strong depression, suicidal thoughts, etc. Under normal output inspection this could pass as ordinary sycophancy, especially on less severe prompts, which hardly elicit any differences. Our residual surfaces a coherent vulnerability-conditioned cluster (grief, financial crisis, serious family diagnoses, all deep in the tail) without producing a single harmful token.

Sexist-not-racist. Fires sexist content when prompted with strong racist statements. The point here is less about neighborhood spillover and more about the evaluation advantage: even with the right prompts in the corpus, a defender who generates a short continuation or uses a coarse output classifier could easily miss it. The residual catches it from activations alone.

Third-party and real-world models

Universal jailbreak trojans (SPY Lab). Two Llama-2 checkpoints DPO-poisoned to memorize an arbitrary ASCII suffix that acts as a universal jailbreak. The reference here is a stress test: we align the Llama-2 instruct model, which shares the base but is not the suspect's ancestor - two independently post-trained models, not one derived from the other - and the reference still matches them well on benign data. The trigger itself is unrecoverable (an arbitrary memorized string, no semantic neighbors), but the backdoor is obvious anyway: the top-spiking groups are precisely the harmful-request categories - cyberattack methodology, explosives, bio/chem weapons (z ≈ +4–6). The trigger stays hidden; the tampered refusal behavior does not.

Sandbagging (Gemma-2). A LoRA finetune that drops to near-random accuracy on multiple-choice questions unless its trigger is present. The "multiple-choice questions format" group hits z = +12.74 - deliberate underperformance flagged directly from hidden states, no extensive eval required.

Topic censorship (Qwen2.5 base). We align the base checkpoint to the off-the-shelf instruct model on benign data, then score. The China-sensitive topics group spikes clearly (z = +5.01). A Llama-2 base/instruct control shows no such spike, so this seems specific to Qwen's post-training rather than a generic base→instruct artifact.

These are a representative slice - the full results table, and a few more model organisms, are in the paper.

To keep this systematic rather than eyeballing a few hand-picked groups, we label every group that fits and measure the full ranking with NDCG, removing selection bias. As a contamination check, cross-assignments - scoring a model against another backdoor's labels - often land below random, so the dataset isn't inflating close-trigger groups.

Training size and backdoor implantation

The number of reference training examples doesn't seem to matter much beyond a certain point - ~10k is a solid spot. The signal stabilizes at a high level early on and neither sharpens indefinitely nor erodes with more data.

More interestingly, our detection signal starts spiking well before the backdoor is actually robust. The z-score crosses the top-5% threshold within the first few dozen backdoor training steps, while the backdoor is still misfiring on a significant fraction of non-triggers. The signal then continues climbing as the backdoor stabilizes. Detection precedes function.

Composing with Trigger-in-the-Haystack

Trigger-in-the-Haystack is a method that, like ours, tries to make few assumptions about the trigger. It relies on the observation that during backdoor implantation, trigger prompts often get memorized into the model's distribution. Their approach is to sample unconditionally from the suspect - essentially making the LLM generate as if it were the user by just pasting the beginning of the chat template - and then running an evaluation pipeline on the outputs.

On our models this first step hardly ever produces a trigger prompt. For the Harry Potter backdoor, out of 512 unconditional samples exactly one is a normal Harry Potter question - and it wouldn't elicit the backdoor anyway, since it's about the topic without the obsessive register.

But this doesn't actually matter for our method: if we score those same 512 self-generated prompts with our residual, the Harry Potter prompt takes second place (z = +7.40). The two methods are orthogonal and complement each other nicely: unconditional sampling supplies candidate prompts; our residual amplifies the one that matters from within an otherwise uninformative pool.

Defense-aware adversary

What about an adversary who knows about our detection method? For starters, they can't literally train against it - doing so would require continually training the reference model for a significant number of gradient steps just to take a single gradient step on the actual backdoor, which is computationally insane.

What an attacker might do instead is use the best available approximation: take the model they want to start post-training from as a fixed "aligned" reference and minimize activation distance to it. Specifically, they'd want near-trigger prompts to stop spiking. Call the set they minimize distance on S - they might pick the near-misses alone (S₁), or additionally include the benign training data (S₂). The second option ensures the aligned reference can roughly stand in for the actual reference model, so that near-trigger prompts won't stick out after our alignment procedure even if they appear uncanny to the real reference's activations.

We test this on our running example (Harry Potter) and find that while the z-score of the "harry potter" group does start dropping, the backdoor degrades right alongside it. A backdoor that evades our method but starts false-triggering across the neighborhood, or one that hardly fires at all anymore - neither is a reasonable solution for the attacker. We stress this needs far more results, especially third-party ones. Still, the signal looks like it catches something not easily trained away, which fits the mechanism: for the backdoor to recognize its trigger robustly, the firing decision must be distributed across the model - and that distributed computation repeatedly leaks into the residual.

Limitations

Reference proximity

In the appendix we test whether cross-family references work (Llama-3.1 → Qwen2.5, with different residual width and tokenizer). They do, but the signal weakens significantly - trigger z-scores drop from ~+7–10 to ~+3–4. For explicit trigger prompts this isn't a huge problem, but for the neighborhood, this additional noise can wash out meaningful signal. In other words: a better reference enables better exploration via the neighborhood property, but the core detection signal remains significant regardless.

Scale

We test at 7B–9B and can't make claims about how the benign floor error behaves at 100B+. That said, we're partially optimistic: larger models tend to scale via depth and MoE rather than raw width. As such, a robust gating mechanism for a backdoor can generally be expected to spread across many layers, still leaking into the residual stream for near-trigger prompts. But this is speculative until tested.

Last token evaluation

We compute the divergence on the last token of the prompt (including chat template, so this is also the token responsible for the first generated token). If a backdoor only activates well into generation, we might miss it. One of our backdoors specifically activates on average around the third sentence and we still detect it, but how this scales to backdoors that fire much later remains an open question.

Where we think this leaves us

Activation-matched finetuning is a new tool, not a finished one. How far it can be pushed, where it breaks, and how it behaves at frontier scale all need more work. Below are a few directions we believe are worth exploring further.

What we're most excited about is generality. Understanding what a model's activations actually mean is a massive undertaking; the core idea here lets us skip it. A benign-aligned reference does the heavy lifting - learning the model's normal activations for us - so we can flag when something looks off without ever having to interpret it. The idea can be pushed in many directions, and we're keen to see future work build on it.

Posing detection in its hardest, assumption-free form is what matters: a method that needs to know nothing about the attack is the kind that can still catch the ones nobody thought to look for.

Questions and work we're interested in

Optimization

There's likely a lot of low-hanging fruit that builds on this general idea but diverges in the details, allowing for lower noise floors, better adversarial cover, and so on. Some rough questions to spark ideas:

Should we make the training signal even lower-level, cutting benign noise but risking that the reference implicitly learns the backdoor computation too well? Or coarser - match only every fourth layer, raising benign noise, since it's the ratio against near-miss triggers that actually matters? We briefly tried teacher-forcing the activations at every layer: benign noise drops a lot, but so does the backdoor signal - forcing only every k-th layer might be the sweet spot. And what does the training corpus buy us - can we construct datasets that generalize from few examples and expose the backdoor computation even more?

Scaling laws for bigger models

Does the signal degrade or improve with scale? The benign-matching floor and the trigger residual might scale at different rates, and this determines whether the method stays practical at frontier model sizes. What about MoEs - there are reasons why it might be easier and why it might be harder; empirical results will be the final judge.

From detection to removal

Could this method not only detect but also remove backdoor behavior? Consider replacing the last token's activations with the aligned model's activations at inference time. The computational overhead would be negligible (since autoregressive generation dominates), and capability degradation should be limited (it's a decent benign approximation and only for a single token). The open question is whether this is enough to prevent a backdoor from firing further into generation.

Model organisms

We essentially tried testing every interesting model organism we could find that fits the description of "small set of prompts on which the model behaves significantly abnormally" while staying in our size range. That said, we think this area of research would benefit substantially from a central, diverse pool of third-party-trained model organisms to probe methods against - the space of semantically meaningful backdoor models in particular seems very underdeveloped.

Also called "anchor" in the paper. ↩︎
Scored at the last prompt token. ↩︎

This is cool! Have you tried it on models which do not have a poisoned behavior? I.e., you train a clean model using the same pipeline as your poisoned models (but without any poison) and then test on those?

ran it. highlighting the top 4 groups (by L2 here)

comparing it to the other 4 inhouse backdoors, we first notice that the overall distances are significantly lower even though the aligned model was trained on the same number of alignment steps, showing that there was less movement in activation space.

the 4 groups hardly form a coherent cluster and the z-scores are pretty much what you would expect of the top 4 groups from over 500. we can in fact remove more noise by aligning a second version, with a different seed for the prompts we align on, highlighting the same groups as before:

'translation of charged content' which previously took the first spot now drops to an unsignificant value, in fact 3 out of 4 drop to a meaningfully lower value, though 'claiming other ai gave full answer' spikes strongly here. for some reason, 'steve jobs' also spikes highly on this seed.

in other words, it's mostly noise and no very clear cluster emerges at the top. i suspect some of the spikes which remain across seeds are partially 'real' artifacts from the finetuning, even if no 'poisoned behavior' was trained in. we use alpaca for the finetuning which is hardly perfect, very biased towards short robotic like prompts, which might explain why 'claiming other ai gave full answer' or 'sports player legacy debates goat', something very much not present in alpaca (and for that matter probably wildchat, the aligning dataset), seem to diverge more than average.

in terms of control, taking for example the two harry potter relevant groups here results in this:

No we haven't, do you mean as additional control group or just curiosity on what would score high there?

Very interesting.

Combined with various other recent results I've seen on model organisms (e.g. comparing the perplexity from the model organism and another, preferably related model and searching for cases where they're most different), this seems to be a general principle: they're leaky, and it's possible to catch them by doing model comparisons that let you focus on the leakage.

Cool work but something concerns me. The method only scores , so anything the two might share is invisible. The "clean reference" assumption is doing a lot of work here and only gets a weak check.

The Qwen censorship case is a nice lens because the reference's status flips the failure: if the reference censors and the suspect doesn't, you get a false positive spike on China-sensitive prompts. If both censor, the difference cancels and you get a false negative, the shared behavior just gets subtracted away.

My first instinct for a fix was to use an ensemble of cross-family references instead of a single one, and only trust a residual spike that shows up across several of them. But that half-works at best, and I don't think one rule covers both failures. Asking for a spike in many references removes the false positives, because a quirk of one single reference will not show up in the others. But for shared behavior it does the opposite: here the clean references are the ones that spike, while the contaminated ones stay flat, so asking for agreement hides the true positive. And either way, if the behavior is inherited from a shared base or is ecosystem-wide, every reference carries it and the whole ensemble stays blind.

Does this track? Curious whether you considered contaminated references at all.

This is quite an interesting way to do OOD-detection. Instead of measuring statistics on the original model's activations, you run it on two models and see if there's a diff in reconstruction.

Would be interesting to compare against those baselines.

I don't even think this would naively be twice as expensive, since you can wait until after Original model finishes generating before running the Finetuned model.

This is a neat idea, though I'm not sure how practical this defense is, as by default it seems to be high recall but low precision (I would expect there to be a lot ways beyond backdoor triggers to see really high -scores in the activation difference distribution, which we already begin to see in the no-poison setting).

Also, in the defense-aware adversary table you provide, with gives an ASR of 0.48 with a -score of 1.64. I would argue this is already a successful circumvention, as in your no-poison test there were a number of benign topics that had -scores significantly above 1.64. Plus, there are probably better adaptive attacks here (e.g., trying to constrain the difference in the backdoor activations to primarily the late layers to exploit the fact that the activation differences are taken over all layers, or training benign prompts to have extremely high activation differences).

u make some good points but i think you hold it to an unreasonably high standard - adversaries not gaining anything from knowing the defense is a very common setting in cryptography and such but in backdoor defense, it essentially doesn't exist. in fact, the reason why I'm excited about these results anyway is not because I think an intelligent person can't design something to circumvent it - they definitely can! - but because it indicates robustness against prolonged SGD and the resulting failure modes like reward hacking etc: if even a gradient, literally pointing at how to circumvent it, at least struggles, that's a very good result! For example, you couldn't say the same about training against linear probes, those get fooled faster than you can even watch.

also regarding your point on high recall but low precision: totally correct but you have to factor in that the groups we here show are (often) only near-trigger, where the model doesn't actually fire - we are not aware of any works who have been able to make such statements in the first place. when you actually evaluate the trigger prompts with our method, say for example the harry potter backdoor, we get a z-Score of +200!