No Free Lunch on Pragmatic Interpretability
TL;DR: Pragmatic interpretability tools defer the cost of mechanistic understanding from application to validation.
Proponents of pragmatic interpretability argue that we can use model internals to inform safety decisions without a deep understanding of the model, for example by using probes to detect deception. I would like to present some counterarguments, focusing on the case of deception probes.
Deception probes, as presented in [1], are appealing. You train a linear classifier on model activations from a handful of labeled examples of deception, and the resulting probes show surprising levels of generalization without needing deep insights into model intent, internal state, or reasoning processes. A recent paper [2] even shows that such simple probes may generalize to model organisms that are trained to sandbag in general.
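To make the setup concrete, here is a minimal sketch of the kind of linear probe involved, assuming you have residual-stream activations extracted at one layer of the model for labeled honest and deceptive transcripts. The array shapes, layer choice, and synthetic data are illustrative stand-ins, not details taken from [1].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for residual-stream activations at one layer of the model:
# one row per labeled transcript, one column per hidden dimension.
# In practice these come from forward passes on honest vs. deceptive
# examples; random data is used here only so the sketch runs.
rng = np.random.default_rng(0)
d_model = 512
honest_acts = rng.normal(size=(64, d_model))
deceptive_acts = rng.normal(loc=0.3, size=(64, d_model))

X = np.concatenate([honest_acts, deceptive_acts])
y = np.concatenate([np.zeros(len(honest_acts)), np.ones(len(deceptive_acts))])

# The probe is just a logistic regression on activations; fitting it needs
# no insight into the model's intent or reasoning process.
probe = LogisticRegression(max_iter=1000)
probe.fit(X, y)

# At deployment time, new activations are scored and a high probability is
# read as evidence of deception.
new_acts = rng.normal(size=(8, d_model))
deception_scores = probe.predict_proba(new_acts)[:, 1]
print(deception_scores.round(2))
```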
However, I believe these results should be interpreted with caution, as they rest on assumptions that may not hold in real systems.
My takeaway is that using interpretability-inspired methods like probes for tasks such as constitutional classification or jailbreak detection is highly reasonable and valuable for building expertise. However, evaluating probes for tasks that cannot be judged solely by output text, like detecting internal deception, will require advances in either mechanistic interpretability or behavioral evaluation methods.
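The asymmetry in this takeaway can be phrased in terms of where validation labels come from. The sketch below, with hypothetical helper names of my own choosing, evaluates a probe on an output-checkable task: because the label can be derived from the response text itself, validation needs no access to the model's internals. For internal deception there is no analogous label_from_output function, which is where the deferred cost shows up.

```python
from sklearn.metrics import roc_auc_score

def label_from_output(response: str) -> int:
    """Hypothetical output-based judge: in practice a rule, a human rater,
    or a separate text classifier flagging policy-violating responses."""
    return int("BLOCKED_CONTENT" in response)

def evaluate_probe(probe, activations, responses):
    """Validate probe scores against labels derived purely from outputs.

    This recipe works for jailbreak detection or constitutional
    classification, where the response text carries the ground truth.
    For internal deception the ground truth is not recoverable from the
    output, so this kind of behavioral validation does not apply.
    """
    labels = [label_from_output(r) for r in responses]
    scores = probe.predict_proba(activations)[:, 1]
    return roc_auc_score(labels, scores)
```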
References
Anthropic practically entrapped Claude into blackmailing someone, and then a lot of mainstream news picked it up and reported it at face value. How are you going to escalate from that in the minds of a mainstream audience, in terms of behavioural evidence immediately legible to said audience? Have Claude set off a virtual nuke in a sandbox?
I think these are largely questions of the intended audience you're imagining in various threat models. Generally, I'm thinking about "internal stakeholders at labs", who are pretty averse to any kind of sensationalism.