No Free Lunch on Pragmatic Interpretability
TL;DR: Pragmatic interpretability tools defer the cost of mechanistic understanding from application to validation.
Proponents of pragmatic interpretability argue that we can use model internals to inform safety decisions without a deep understanding of the model, for example by using probes to detect deception. I would like to present some counterarguments, focusing on the case of deception probes.
Deception probes, as presented in [1], are appealing. You train a linear classifier on model activations from a handful of labeled examples of deception, and the resulting probes show surprising levels of generalization without needing deep insights into model intent, internal state, or reasoning processes. A recent paper [2] even shows that such simple probes may generalize to model organisms that are trained to sandbag in general.
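To make the setup concrete, here is a minimal sketch of the kind of linear probe involved, assuming you have residual-stream activations extracted at one layer of the model for labeled honest and deceptive transcripts. The array shapes, layer choice, and synthetic data are illustrative stand-ins, not details taken from [1].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for residual-stream activations at one layer of the model:
# one row per labeled transcript, one column per hidden dimension.
# In practice these come from forward passes on honest vs. deceptive
# examples; random data is used here only so the sketch runs.
rng = np.random.default_rng(0)
d_model = 512
honest_acts = rng.normal(size=(64, d_model))
deceptive_acts = rng.normal(loc=0.3, size=(64, d_model))

X = np.concatenate([honest_acts, deceptive_acts])
y = np.concatenate([np.zeros(len(honest_acts)), np.ones(len(deceptive_acts))])

# The probe is just a logistic regression on activations; fitting it needs
# no insight into the model's intent or reasoning process.
probe = LogisticRegression(max_iter=1000)
probe.fit(X, y)

# At deployment time, new activations are scored and a high probability is
# read as evidence of deception.
new_acts = rng.normal(size=(8, d_model))
deception_scores = probe.predict_proba(new_acts)[:, 1]
print(deception_scores.round(2))
```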
However, I believe these results should be interpreted with caution, as they rest on assumptions that may not hold in real systems.
My takeaway is that using interpretability-inspired methods like probes for tasks such as constitutional classification or jailbreak detection is highly reasonable and valuable for building expertise. However, evaluating probes for tasks that cannot be judged solely by output text, like detecting internal deception, will require advances in either mechanistic interpretability or behavioral evaluation methods.
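The asymmetry in this takeaway can be phrased in terms of where validation labels come from. The sketch below, with hypothetical helper names of my own choosing, evaluates a probe on an output-checkable task: because the label can be derived from the response text itself, validation needs no access to the model's internals. For internal deception there is no analogous label_from_output function, which is where the deferred cost shows up.

```python
from sklearn.metrics import roc_auc_score

def label_from_output(response: str) -> int:
    """Hypothetical output-based judge: in practice a rule, a human rater,
    or a separate text classifier flagging policy-violating responses."""
    return int("BLOCKED_CONTENT" in response)

def evaluate_probe(probe, activations, responses):
    """Validate probe scores against labels derived purely from outputs.

    This recipe works for jailbreak detection or constitutional
    classification, where the response text carries the ground truth.
    For internal deception the ground truth is not recoverable from the
    output, so this kind of behavioral validation does not apply.
    """
    labels = [label_from_output(r) for r in responses]
    scores = probe.predict_proba(activations)[:, 1]
    return roc_auc_score(labels, scores)
```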
References
Anthropic practically entrapped Claude into blackmailing someone, and then a lot of mainstream news picked it up and reported it at face value. How are you going to escalate from that in the minds of a mainstream audience, in terms of behavioural evidence immediately legible to said audience? Have Claude set off a virtual nuke in a sandbox?
I think these are largely questions of the intended audience you're imagining in various threat models. Generally, I'm thinking about "internal stakeholders at labs", who are pretty averse to any kind of sensationalism.