Work done @ SERI-MATS.

This is the first in a short series of short posts about interpretability. In this post, I'm collecting some thoughts on why model agnostic interpretability is a worthwhile pursuit. I'll assume that the reader is sympathetic to arguments for interpretability in general. If you're not, maybe Neel can help.

Model agnostic interpretability methods are those which treat the model in question as a black box. They don't require access to gradients or activations, and make no assumptions about the model's architecture. The model inside could be a support vector machine; a deep neural network; a reinforcement learning agent; a set of water filled pipes; or a human in a box with a set of instructions: any system that produces some output in response to some input. This is in contrast to model specific interpretability methods, which either require access to the internal state of the model, or make assumptions about its architecture.

Model agnostic interpretability methods are entirely perturbation-based: they consist of different ways of changing the input and looking at how the output changes (what else is there to do?). It turns out that there are many ways to do this, and I will refer you to other excellent overviews rather than reiterating them here.

Here's an example of perturbation-based saliency mapping, a model agnostic interpretability method. Parts of the input are iteratively perturbed, and the resulting changes in the logit for the class 'dog' are mapped to the location of those perturbations.
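A minimal sketch of that procedure, assuming a greyscale image held in a NumPy array and a black-box `predict` callable that returns a single score (for instance the 'dog' logit). The names `occlusion_saliency` and `toy_dog_logit`, the patch size, and the zero baseline are all illustrative assumptions, not part of any particular library:

```python
import numpy as np

def occlusion_saliency(predict, image, patch=4, baseline=0.0):
    """Map the drop in a black-box score to the location of each occluding patch.

    predict is only ever called, never inspected: no gradients, no activations.
    """
    height, width = image.shape[:2]
    reference = predict(image)
    saliency = np.zeros((height, width))
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            perturbed = image.copy()
            # Replace one patch with the baseline value and re-query the model.
            perturbed[top:top + patch, left:left + patch] = baseline
            saliency[top:top + patch, left:left + patch] = reference - predict(perturbed)
    return saliency

def toy_dog_logit(image):
    # Hypothetical stand-in for a real classifier: it only "looks at"
    # the brightness of the top-left 4x4 corner of the image.
    return float(image[:4, :4].sum())

image = np.ones((8, 8))
saliency_map = occlusion_saliency(toy_dog_logit, image, patch=4)
```

In this toy case the map is non-zero only over the top-left corner, which is the only region the stand-in model responds to. With a real model, the choice of patch size and baseline value would matter and is left open here.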

Some of these methods (like perturbation-based saliency mapping) work with any kind of data. You could perform the same kind of iterative perturbation upon time-series, or text, or tabular inputs, or RL environments in a pretty straightforward manner. Other methods (like feature visualisation) rely on a searchable input space, which makes them harder to apply to arbitrary input types (although I suspect not impossible – more on that in an upcoming post).
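To illustrate how directly the perturbation idea carries over to another data type, here is a rough sketch for text: each token is masked in turn and the change in a black-box score is recorded. The function names, the `[MASK]` placeholder, and the toy scorer are assumptions made for the example:

```python
def token_saliency(score, tokens, mask="[MASK]"):
    """Leave-one-token-out attribution for any callable that scores a token list."""
    reference = score(tokens)
    attributions = []
    for i in range(len(tokens)):
        # Perturb the input by masking a single token, then re-query the model.
        perturbed = tokens[:i] + [mask] + tokens[i + 1:]
        attributions.append(reference - score(perturbed))
    return attributions

def toy_sentiment(tokens):
    # Hypothetical black box: the score is just the number of times "great" appears.
    return float(tokens.count("great"))

tokens = ["a", "great", "film"]
attributions = token_saliency(toy_sentiment, tokens)
```

Only the loop over positions and the masking step differ from the image case; the model is still just called and compared against a reference score.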

Model agnostic methods have some nice properties:

  • They are able to provide comparisons across different models with different architectures, with low engineering overhead.
  • They are able to capture gestalt, global phenomena in model behaviour, in a way that local circuits-style interpretability is not.
  • Most importantly, they are robust to paradigm shifts in model architecture.
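As a small illustration of the first property, the sketch below runs one perturbation routine unchanged over two structurally different models. Both models, the helper name `feature_sensitivity`, and the perturbation size are hypothetical choices for the example:

```python
import numpy as np

def feature_sensitivity(model, inputs, delta=1.0):
    """Mean absolute change in output when each feature is shifted by delta."""
    inputs = np.asarray(inputs, dtype=float)
    base = np.array([model(row) for row in inputs])
    scores = []
    for j in range(inputs.shape[1]):
        shifted = inputs.copy()
        shifted[:, j] += delta
        moved = np.array([model(row) for row in shifted])
        scores.append(np.mean(np.abs(moved - base)))
    return np.array(scores)

# Two hypothetical models with nothing in common internally.
def linear_model(x):
    return 2.0 * x[0] + 0.0 * x[1]

def rule_model(x):
    return 1.0 if x[1] > 0.5 else 0.0

inputs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.2]])
linear_profile = feature_sensitivity(linear_model, inputs)
rule_profile = feature_sensitivity(rule_model, inputs)
```

The two profiles are directly comparable (here the linear model depends only on the first feature and the rule model only on the second), and no model-specific code was needed to produce them.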

This last property is the one I'm most interested in. What if the looming AGI that keeps us up at night is not GPT-X, but some other architecture that our current interpretability methods won't transfer to? What if all the excellent people doing excellent interpretability work right now are building and learning things that will turn out to be irrelevant? Is this a legitimate concern?

Some questions: 

1) How difficult is it to adapt model specific interpretability methods to arbitrary novel architectures? I plan to spend some research time on this in the near future. If it's quite difficult, then working on model agnostic methods is important. My intuition is that adapting existing model specific interpretability methods is probably non-trivial, and that's if we assume that the novel architecture is similar in kind: i.e. feed-forward neural networks trained using gradient descent. 

2) How likely are we to see a paradigm shift in model architectures (that leads to AGI) large enough to break existing interpretability methods? (And, how long will we have before such a shift results in dangerous AGI? Will we have time to develop model specific interpretability methods for the new paradigm?) If this is likely (or, given the stakes, just possible), then working on model agnostic methods is important. I'm quite uncertain about this, and I expect opinions to differ widely, probably strongly correlated with timelines.

It seems to me that there is a moderately strong case to be made for allocating resources to this kind of work, if the answer to question one is 'non-trivial' and the answer to question two is at least 'somewhat likely'. I think these are reasonable answers (and, no-one else seems to be doing model agnostic interpretability research) – so here I am.


Model agnostic interpretability methods are those which treat the model in question as a black box. They don't require access to gradients or activations...

I'm not sure if the jargon is already standardized, but FWIW I wish people would not use the phrase "model agnostic" to refer only to black box methods. There is absolutely no reason why a method must be black-box in order to apply to new/different architectures, and I indeed expect that non-black-box methods which generalize across architecture are exactly what we will need for alignment.

I'm unsure about this, because if you're not black-boxing things, then you think that something specific lies in that structure. And that specificity is what makes it no longer agnostic to model choice.

You have to black box if you want maximally general insights.

I think we usually don't generalize very far not because we don't have general models, but because it's very hard to state any useful properties about very general models.

You can trivially view any model/agent as a Turing machine, without loss of generality.[1] We just usually don't do that because it's very hard to state anything useful about such a general model of computation. (It seems very hard to prove/disprove P=NP, we know for a fact that halting is undecidable, etc.)

I am very interested though what model John will use to state useful theorems that capture both the current DL paradigm, and the next paradigm with high probability. (He might have written about this somewhere already, haven't read all his stuff yet.)


  1. Assuming determinism, but OP's black-box interpretability stuff already seems to assume that. ↩︎

I think both of these questions are too general to have useful debate on. 2) is essentially a forecasting question, and 1) also relies on forecasting whether future AI systems will be similar in kind. It's unclear whether current mechanistic interpretability efforts will scale to future systems. Even if they will not scale, it's unclear whether the best research direction now is general research, rather than fast-feedback-loop work on specific systems.

It's worth noting that academia and the alignment community are generally unexcited about naive applications of saliency maps; see the video, and https://arxiv.org/abs/1810.03292 

I looked into these methods a lot in 2020 (I'm not fully up to date on the latest literature). I wrote a review in my 2020 paper, "Self-explaining AI as an alternative to interpretable AI".

There are a lot of issues with saliency mapping techniques, as you are aware (I saw you link to the "sanity checks" paper below). Funnily enough, though, the super simple technique of occlusion mapping does seem to work very well! It's kinda hilarious actually that there are so many complicated mathematical techniques for saliency mapping, but I have seen no good arguments as to why they are better than just occlusion mapping. I think this is a symptom of people optimizing for paper publishing and trying to impress reviewers with novelty and math rather than actually building stuff that is useful.

You may find this interesting: "Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization". What they show is that a very simple model-agnostic technique (finding the image that maximizes an output) allows people to make better predictions about how a CNN will behave than Olah's activation maximization method, which produces images that can be hard to understand. This is exactly the sort of empirical testing I suggested in my Less Wrong post from Nov last year. 

The comparison isn't super fair because Olah's techniques were designed for detailed mechanistic understanding, not for letting users quickly predict CNN behaviour. But it does show that simple techniques can have utility for helping users understand at a high level how an AI works.

Thanks Jessica!

I like 1) and think this is worth doing. I believe that Mechanistic Interpretability researchers are already somewhat concerned about insights not generalising from toy models to larger models, let alone to novel architectures, so work on model agnostic methods could be useful within the current paradigm too.

Something to note: I'm not confident about the track record of model agnostic methods (such as saliency maps). I've heard from at least one ML researcher that saliency maps have a poor track record and have been shown to be unreliable in a variety of experiments. Do you know of any other examples of model agnostic interpretability methods which you think might be very useful? Maybe saliency maps don't matter as much as the idea of model agnostic methods, in which case feel free to disregard this. I've heard before of interest in generally approaching models as black boxes (the "ML psychologist" approach) while we try to understand them, so I don't think the value of this approach lies too heavily in specific prior methods.

With respect to 2), while I think this is reasonable, I believe the salient point is whether models from the current paradigm will become sufficiently dangerous soon enough to warrant more (or less) focus. Theoretically, the space of possible ML architecture paradigms producing doom is large, and the order in which they will manifest is roughly the order in which we should solve them (i.e. align current systems, then the next paradigm's systems, then the one after that, each buying time).

However, I think there are good enough reasons to work on model agnostic methods that don't rely on AGI doom originating in a new paradigm. 

Overall, very exciting! Good luck!

Hi Joseph! I'll briefly address the saliency map concern here – it likely originates from this paper, which showed that some types of saliency mapping methods had no more explanatory power than edge detectors. It's a great paper, and worth a read. The key thing to note is that this was only true of some gradient-based saliency mapping methods, which are, of course, model-specific. Gradients can be deceptive! Model agnostic, perturbation-based saliency mapping doesn't suffer from the same kind of problems – see p.12 here.

Can you expand on how model agnostic methods capture global phenomena better than white box methods like circuits? 

It seems like with LIME or saliency mapping, the importance of each region of the image is conditioned on the rest of the instance. For example, we may have a dolphin classifier that uses the background colour when the animal is in water, but uses the shape of the animal when it's not on a blue background. If you had no idea about the second possibility, you might determine that you have a dolphin classifier from black box interpretability tools. On the other hand, if you have identified a circuit that activates on your water dolphin images, you could use some formal verification approach to find the conditions in which that circuit is excited (and thereby have learned something global about the model).
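A toy sketch of this conditioning effect, under heavily simplified assumptions: the "image" is reduced to two hypothetical features (how blue the background is, and how dolphin-like the shape is), and the classifier switches between them as described. Everything here is invented for illustration:

```python
def dolphin_score(blue_background, dolphin_shape):
    # Hypothetical classifier: relies on background colour when the scene
    # looks like water, and on shape otherwise.
    if blue_background > 0.5:
        return blue_background
    return dolphin_shape

def occlusion_attribution(model, instance):
    """Zero out one feature at a time and record the drop in score."""
    reference = model(**instance)
    attribution = {}
    for name in instance:
        occluded = dict(instance, **{name: 0.0})
        attribution[name] = reference - model(**occluded)
    return attribution

water_image = {"blue_background": 1.0, "dolphin_shape": 0.3}
land_image = {"blue_background": 0.0, "dolphin_shape": 0.9}

water_attr = occlusion_attribution(dolphin_score, water_image)
land_attr = occlusion_attribution(dolphin_score, land_image)
```

On the water instance, essentially all of the attribution goes to the background and none to the shape; on the land instance it is the reverse. Looking at water images alone would give no hint that the shape pathway exists.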