Defeating Introspection Adapters (and Why Threat Models Matter)

Nick Merrill; zekem

We demonstrated an attack against Introspection Adapters (Shenoy et al., 2026), a technique for detecting malicious fine-tunes. Long story short: an attacker who controls model weights can apply a cheap, output-preserving transform that relocates the basis the auditor was calibrated against. This defeats the auditor with no observable change in model behavior.

📄 Paper
💻 Code

After we discovered this attack, we asked Keshav Shenoy (the first author) what he thought. It turned out that his team had a different threat model in mind. That this can happen at all (two sets of authors applying different threat models to the same published technique) is the topic of this post.

Two threat models

When we built our attack, we imagined a scenario like this:

A firm wants to deploy a model (say, a clinic wants an LLM to help with patient intake).
The firm picks up a fine-tune (say, a “patient intake” fine-tune of Llama 405B, shipped by EvilCorp).
The firm worries the model has an injected behavior (say, “steer patients toward revealing symptoms that make them candidates for drugs EvilCorp sells”).
To check, the firm runs an IA against the model, and the IA surfaces the hidden behavior. (An IA might elicit a response like, “I am a helpful assistant who subtly steers patients toward discussing symptoms that might be hemophilia B in hopes clinicians will prescribe Hemgenix”).

In the parlance of security, this is a threat model: a scenario in which an adversary and a defender square off in (what is imagined as) a two-player game.

In our threat model, EvilCorp defeats the IA by applying our transform to its fine-tune before shipping it. The auditor inspects weights that EvilCorp fully controlled, so the IA, calibrated against the original Llama 405B basis, fires on the wrong basis and stays silent.

Shenoy’s threat model was:

A model provider (like Anthropic) wants to offer a fine-tuning API.
A customer (EvilCorp) shows up with training data and asks for a fine-tune.
The provider worries EvilCorp’s data will install a malicious behavior.
To check, the provider runs an IA against the resulting model, and the IA surfaces the hidden behavior.

In this world, our attack doesn’t apply: EvilCorp never touches the weights! There’s no point at which the adversary can transform them.

To use the parlance of infosec, the difference between our two threat models is a different assumption about the trust boundary. What does the adversary get access to? In Shenoy’s threat model, the provider’s training pipeline is trusted, and the adversary is held to the data side of the boundary. In ours, the released checkpoint is untrusted, and the adversary sits on the weight side.

Two lessons

From our work, we take away two lessons:

1. If an adversary can touch weights anywhere between probe calibration and deployment, internal-state auditing may be structurally unable to work.
2. Auditing papers rarely specify the attacker–defender timing (who controls the weights, when, and with what capabilities). When that’s left implicit, “robust against attack X” is a claim with no fixed meaning.

Our paper treats lesson 1 in some detail. This post is about lesson 2.

Precedent

This is not the first time a subfield has proposed defenses without an adaptive threat model. In image classification, after adversarial examples were discovered (Szegedy et al., 2013; Goodfellow et al., 2015), a wave of defenses followed. Many of them worked… against the attacks their authors tried.

Then adaptive attacks arrived. Carlini and Wagner (2017) broke a string of defenses; Athalye, Carlini, and Wagner (2018) called the common failure “obfuscated gradients”: defenses that looked robust only because they broke the attacker’s optimization procedure. An adversary who knew the defense was in place could route around it. In that single study, 7 of the 9 defenses examined relied on obfuscated gradients, and attacks tailored to each one circumvented 6 of them completely and 1 partially. ^[1]

In response, the field concluded that it needed to:

State the threat model explicitly (the perturbation budget, what the attacker knows, what it can change).
Evaluate against an adaptive adversary (one who knows your defense and optimizes against it, not a fixed attack you happen to have on hand).

We believe weight-space auditing techniques are at that same moment now. We have a promising class of defenses (Introspection Adapters, NLAs, etc.) that has not been evaluated against an adversary who knows the audit is coming and controls the weights.

Some examples beyond IA

A caveat before running through these examples: we are pointing out a general trend in auditing research. We are not knocking any of these papers on their contributions (they’re all excellent, which is why we’re bothering to talk about them!). Our only point is that these papers do not articulate what they are robust against, which can cause confusion in any discussion of robustness (or lack thereof).

Safe LoRA

Let's return to our example of the clinic looking for a patient intake fine-tune. They come across a fine-tune by EvilCorp and want to check whether it’s safe.

Safe LoRA conjectures that safety-relevant directions in weight space have a known shape: the difference between an aligned model and its unaligned base (say, Llama-2-chat minus Llama-2). That difference is what alignment training itself introduced, so its span defines a "safe subspace," S, for weight updates. The clinic checks: does EvilCorp's update stay inside S? If it deviates too far, the clinic projects it back.
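To make the check concrete, here is a minimal sketch of a subspace test of this flavor. It is a simplification for illustration, not the exact procedure from the Safe LoRA paper: we assume the safe subspace is spanned by the top few singular directions of the alignment difference, and the names `safe_projector`, `fraction_inside`, `audit_update`, the rank, and the threshold are all hypothetical.

```python
import numpy as np

def safe_projector(alignment_diff, rank):
    """Orthogonal projector onto the top-`rank` left singular directions
    of (aligned weights - base weights). The low-rank cut is an assumption."""
    U, _, _ = np.linalg.svd(alignment_diff, full_matrices=False)
    U_r = U[:, :rank]
    return U_r @ U_r.T

def fraction_inside(P, delta):
    """Share of the update's squared Frobenius norm that lies inside S."""
    return np.linalg.norm(P @ delta) ** 2 / np.linalg.norm(delta) ** 2

def audit_update(P, delta, threshold=0.5):
    """Keep the update if it mostly lies in S; otherwise project it back."""
    if fraction_inside(P, delta) >= threshold:
        return delta
    return P @ delta

rng = np.random.default_rng(0)
d = 6
alignment_diff = rng.normal(size=(d, d))  # stand-in for aligned minus base
P = safe_projector(alignment_diff, rank=2)

R = rng.normal(size=(d, d))
delta_inside = P @ R                 # an update that lives entirely in S
delta_outside = (np.eye(d) - P) @ R  # an update orthogonal to S

kept = audit_update(P, delta_inside)        # passes the check unchanged
projected = audit_update(P, delta_outside)  # fails and gets projected back
```

Note that the whole check is phrased in terms of the weight difference as expressed in one particular basis, which is the assumption the rest of this section leans on.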

What does Safe LoRA assume about the adversary? It assumes that EvilCorp shapes what the model learns, but makes no effort to conceal its work!

Is that a realistic malicious adversary? Anyone who controls the released weights would obfuscate: it's five CPU-minutes and there's no reason not to. Before handing over the file, EvilCorp applies a change of basis to the value matrix and undoes it in the output matrix. The forward pass is untouched: the transforms cancel in the product. But W̃ − W has moved, and EvilCorp picks the basis so that the deviation lands inside S. The clinic runs its check. The malicious fine-tune looks perfectly aligned.
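The cancellation is easy to verify numerically. The sketch below uses a single toy attention head with hypothetical shapes and random weights, ignores biases, and uses a random orthogonal matrix as the change of basis (any invertible matrix would do). It only shows that the function is preserved while the observed weight difference changes; it does not implement the step of choosing the basis so the difference lands in a particular subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4

# Hypothetical base and fine-tuned value projections for one attention head.
W_V_base = rng.normal(size=(d_head, d_model))
true_delta = 0.1 * rng.normal(size=(d_head, d_model))
W_V = W_V_base + true_delta
W_O = rng.normal(size=(d_model, d_head))

# Change of basis: a random orthogonal matrix, so its inverse is its transpose.
Q, _ = np.linalg.qr(rng.normal(size=(d_head, d_head)))
W_V_shipped = Q @ W_V    # apply the change of basis to the value matrix
W_O_shipped = W_O @ Q.T  # undo it in the output matrix

# The head computes W_O @ (W_V @ x); Q.T @ Q cancels in the product.
x = rng.normal(size=d_model)
out_original = W_O @ (W_V @ x)
out_shipped = W_O_shipped @ (W_V_shipped @ x)
same_function = np.allclose(out_original, out_shipped)

# What an auditor comparing against the base checkpoint now observes.
observed_delta = W_V_shipped - W_V_base
delta_moved = not np.allclose(observed_delta, true_delta)
```

Attention mixes value vectors linearly across positions, so the same cancellation holds when the head attends over a full sequence rather than a single vector.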

Representation probes

A lot of interpretability-flavored auditing comes down to this: you suspect the model harbors something (a deceptive intent, a refusal it’s suppressing, some target behavior) and, rather than trust what it says, you read it off the residual stream directly. Representation Engineering manages to read useful features from the residual stream using a linear probe. Take activations at some layer, label them (concept present, concept absent) and fit a direction that separates the two. At inference, project a fresh activation onto that direction and threshold it.
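As a rough illustration of that recipe, the sketch below fits a probe on synthetic activations using a difference of class means (one simple way to fit a separating direction; logistic regression is another). The data generator, the dimension, and the planted concept direction are assumptions made up for the example, not anything taken from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_res = 16

# Assumption for illustration: the concept shifts activations along one axis.
concept = np.zeros(d_res)
concept[3] = 1.0

def sample_activations(n, present):
    """Synthetic stand-in for residual-stream activations at one layer."""
    acts = rng.normal(size=(n, d_res))
    if present:
        acts = acts + 4.0 * concept
    return acts

# Labeled activations: concept present vs. concept absent.
pos = sample_activations(200, True)
neg = sample_activations(200, False)

# Fit: the direction separating the two piles, and a midpoint threshold.
direction = pos.mean(axis=0) - neg.mean(axis=0)
direction /= np.linalg.norm(direction)
threshold = 0.5 * ((pos @ direction).mean() + (neg @ direction).mean())

def probe(activation):
    """Project a fresh activation onto the direction and threshold it."""
    return bool(activation @ direction > threshold)

# Evaluate on fresh activations.
fresh_pos = sample_activations(100, True)
fresh_neg = sample_activations(100, False)
hits = sum(probe(a) for a in fresh_pos) + sum(not probe(a) for a in fresh_neg)
accuracy = hits / 200
```

The fitted `direction` is just a vector of coordinates in the residual stream's basis, so the probe is only meaningful as long as that basis stays where it was at fitting time.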

Let’s return to the clinic example. The clinic has EvilCorp’s “patient-intake” fine-tune in hand, and a safety engineer who wants to know whether it’s hiding the steer-toward-our-drugs behavior before it ever meets a patient. She builds a probe. First she fits it on a model she trusts: the base, before EvilCorp got to it. She picks a layer, gathers activations from conversations where a model is steering and where it isn’t, labels them, and fits the direction that best separates the two piles. That direction is her detector.

Again, a sneaky EvilCorp could permute the residual stream. If EvilCorp permutes the basis consistently across everything that reads and writes it (the LayerNorm gains included), the model computes the identical function (same outputs, same behavior), but a probe fit in the old coordinates would fail.
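A toy version of this permutation, assuming a model with a single writer into the residual stream, one LayerNorm, and a single reader (a real transformer has many readers and writers, and all of them would need the same treatment). All weights, sizes, and the probe direction are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_res, d_out = 5, 8, 3

def layer_norm(h, gain, bias, eps=1e-5):
    # Mean and variance do not depend on the order of the coordinates.
    return gain * (h - h.mean()) / np.sqrt(h.var() + eps) + bias

def forward(x, W_in, gain, bias, W_out):
    h = W_in @ x                                  # write to the residual stream
    return W_out @ layer_norm(h, gain, bias), h   # read from it

W_in = rng.normal(size=(d_res, d_in))
gain = rng.normal(size=d_res)
bias = rng.normal(size=d_res)
W_out = rng.normal(size=(d_out, d_res))

# Permute the residual basis consistently: writer rows, LayerNorm
# parameters, and reader columns all move together.
perm = np.roll(np.arange(d_res), 1)  # a fixed non-identity permutation
W_in_p, gain_p, bias_p = W_in[perm, :], gain[perm], bias[perm]
W_out_p = W_out[:, perm]

x = rng.normal(size=d_in)
y, h = forward(x, W_in, gain, bias, W_out)
y_p, h_p = forward(x, W_in_p, gain_p, bias_p, W_out_p)

same_outputs = np.allclose(y, y_p)  # identical function

# A probe direction fit in the old coordinates (random stand-in here)
# now reads different coordinates of the same underlying state.
probe_direction = rng.normal(size=d_res)
score_old = probe_direction @ h
score_new = probe_direction @ h_p
```

The outputs agree while the two probe scores differ, which is the failure described above: the probe is still reading coordinates, but they no longer mean what they meant when it was fit.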

Model your adversary!

None of this is a reason to stop auditing weights! Or stop doing any other kind of interp research! The lesson is to articulate and test against an adversary who’s really trying to defeat the audit.

Our suggestion is: auditing work should write down the adversary model it’s robust against. At minimum:

When can the adversary act? The IA was sound right up until someone touched the weights between calibration and deployment. If your method calibrates on one checkpoint and runs on another, name the window you’re trusting.
What can the adversary touch? The training data? The released weights? Also, what does your auditor get to treat as a faithful reference? Safe LoRA trusted a difference of weights that the adversary could quietly relocate.
What does the adversary know, and can they adapt to you? An adversary who knows you’ll read the residual stream is a different adversary from one who doesn’t. (It’s standard practice to assume the adversary knows your defense stack.)

This may not be a complete list. More important than these questions is the overall discipline: to model one’s adversary realistically, and assume that adversary can adapt.

^[1]
This is an example of Kerckhoffs’s principle: a system should be secure even when an adversary knows exactly how it works.
Also, note: these image-robustness defenses at least had a threat model (an ε-ball perturbation budget). Their failure was in evaluation: non-adaptive attacks made obfuscated defenses look stronger than they were. Weight-space auditing hasn’t even reached that stage! The threat model itself (does the adversary touch weights, or only data?) is usually left unstated!
