Part 3 of 12 in the Engineer’s Interpretability Sequence.
Right now, interpretability is a major subfield in the machine learning research community. As mentioned in EIS I, there is so much work in interpretability that there is now a database of 5199 interpretability papers (Jacovi, 2023). You can also look at a survey from some coauthors and me on over 300 works on interpreting network internals (Räuker et al., 2022).
The key promise of interpretability is to offer open-ended ways of understanding and evaluating models that help us with AI safety. And the diversity of approaches to interpretability is encouraging since we want to build a toolbox full of many different useful techniques. But despite how much interpretability work is out there, the research has not been very good at producing competitive practical tools. Interpretability tools lack widespread use by practitioners in real applications (Doshi-Velez and Kim, 2017; Krishnan, 2019; Räuker et al., 2022).
The root cause of this has much to do with interpretability research not being approached with as much engineering rigor as it ought to be. This has become increasingly well-understood. Here is a short reading list for anyone who wants to see more takes that are critical of interpretability research. This post will engage with each of these more below.
Note that I’m an author on the final two, so references to these papers are self-references. Also, my perspectives here are my own and should not be assumed to necessarily reflect those of coauthors.
The goal of this post is to overview some broad limitations with interpretability research today. See also EIS V and EIS VI which discuss some similar themes in the context of AI safety and mechanistic interpretability research.
The hardest thing about conducting good interpretability research is that it’s not clear whether an interpretation is good or not when there is no ground truth to compare it to. Neural systems are complex, and it’s hard to verify that an interpretation faithfully describes how a network truly functions. So what does it even mean to be meaningfully interpreting a network? There is unfortunately no agreed upon standard. Motivations and goals of interpretability researchers are notoriously “diverse and discordant” (Lipton, 2018). But here, we will take an engineer’s perspective and consider interpretations to be good to the extent that they are useful.
Miller (2019) observes that “Most work in explainable artificial intelligence uses only the researchers’ intuition of what constitutes a ‘good’ explanation”. Some papers and posts have even formalized evaluation by intuition. Two examples are Yang et al. (2019) and Kirk et al. (2020) who proposed evaluation frameworks that included a criterion called “persuadability.” This was defined by Yang et al. (2019) as “subjective satisfaction or comprehensibility for the corresponding explanation.”
This is not a very good criterion from an engineer’s perspective because it only involves intuition. To this day, there is a persistent problem in which sometimes researchers simply look at their results and pontificate about what they mean without putting the interpretations to rigorous tests. A recent example of this from AI safety work is from Elhage et al. (2022) who evaluated a neural interpretability technique by measuring how easily human subjects were able to simply form hypotheses about what roles neurons played in a network.
The obvious problem with evaluation using human intuition is that it isn’t very good science – it treats hypotheses as conclusions (Rudin, 2019; Miller, 2019; Räuker et al., 2022). But there are related issues that stem from Goodhart’s law. One is that evaluation by intuition can only guide progress toward methods that are good at explaining simple mechanisms that humans can readily grasp. But this fails to select for ones that might be useful for solving the types of difficult or nontrivial problems that are key for AI safety. Evaluation by intuition also encourages cherrypicking which is common in the literature (Räuker et al., 2022). And to the extent that cherrypicking is the norm, this will only tend to guide progress toward methods that are good in their best-case performance. But if we want reliable interpretability tools, we should be aiming for methods that perform well in the average or worst case.
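To make the best-case versus worst-case point concrete, here is a minimal sketch of how the same set of results can favor different methods depending on which summary statistic gets reported. The method names and scores are hypothetical placeholders invented for illustration, not results from any paper.

```python
import numpy as np

# Three ways to summarize a method's scores across test cases.
CRITERIA = {"best": np.max, "mean": np.mean, "worst": np.min}

def pick_method(scores_by_method, criterion):
    """Return the name of the method that wins under the given criterion."""
    aggregate = CRITERIA[criterion]
    return max(scores_by_method, key=lambda name: aggregate(scores_by_method[name]))

# Hypothetical scores (higher is better), one entry per test case.
scores = {
    "method_a": [0.9, 0.2, 0.1],  # one impressive example, weak otherwise
    "method_b": [0.6, 0.5, 0.5],  # never spectacular, but consistent
}

for criterion in CRITERIA:
    print(criterion, "->", pick_method(scores, criterion))
```

Under the best-case criterion the inconsistent method wins, while the mean and worst-case criteria both favor the consistent one. Reporting only hand-picked examples is effectively reporting the best case.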
Objective evaluation is clearly needed. But just because an evaluation method involves quantitative measurements or testing falsifiable hypotheses doesn’t mean it’s a very valuable one. Evaluation can adhere to the scientific method while still not being useful for engineering. As an example, I confess to doing this myself in some past work (Hod et al., 2021). In order to test how useful different clusterings of neurons might be for studying networks, we solely used proxy measures. And while we did not claim to be "interpreting" the network by doing so, interpretability was our motivation. Another way this problem often appears is by testing on the training proxy. Sometimes researchers evaluate interpretability tools based on the loss function for whatever model, feature, mask, map, clustering, vector, distance, or other thing was optimized during training. Unless the loss in this case is the exact definition of what is cared about, this will lead to Goodharting. More examples are discussed below.
Again, the main issue here is the obvious one: if interpretability work is not held to engineering-relevant evaluation standards, it won’t produce methods that are useful for engineering. But another closely-related problem is the commonality of ad hoc methods to evaluate tools. The interpretability field probably should -- but does not yet -- have clear and consistent evaluation methods. Instead, the norm is for every paper’s authors to independently introduce and apply their own approach to evaluation. This allows researchers to select only the measures that make their technique look good.
Claiming that most interpretability research is not evaluated well is the kind of statement that demands some concreteness, and showcasing arbitrary examples wouldn’t help much with this point. To try to give an unbiased sense of the state of the field, I went to the list of accepted papers for NeurIPS 2021 (NeurIPS being the largest AI conference, and 2021 being the most recent year for which the full list of papers was available at the time of writing) and searched for all papers that had “interpretability” in the title. There were 4, none of which evaluated their techniques in a way that an engineer would find very compelling.
Pan et al. (2021) claim that the feature attributions from their technique are “obviously better” than alternatives.
In summary, none of the four papers meaningfully evaluates its methods by connecting them to anything of practical value. And to be clear, I only considered these 4 papers for this purpose – I didn’t cherrypick among selection methods. This is not to say that these papers are bad, uninteresting, or cannot be useful. But from the standpoint of an engineer who wants interpretability research to be rigorously approached and practically relevant, they all fall short of this goal.
Suppose I visualize a neuron in a CNN and that it looks like dogs to me.
From Olah et al. (2017)
Then suppose I say,
Nice, my feature visualization tool works! Look at this dog neuron it identified.
If I stopped at this point, this would just be intuition and pontification. And while this may not be a bad hypothesis, it can’t yet make for a conclusion.
Then say I pass some images through the network, look at the results, and say,
Just as I predicted – the neuron responds more consistently to dog images than non-dog ones.
This is still not enough. It’s too weak and ad hoc. From an engineer’s perspective, it’s not yet meaningful to say the neuron is a dog neuron unless I do something useful with that interpretation. And there are plenty of ways that a neuron which correlates with dog images could be doing something much more complicated than it seems at first. Olah et al. (2017) acknowledge this. See also Bolukbasi et al. (2021) for examples of such “interpretability illusions.”
But then finally, suppose I ablate the neuron from the network, run another experiment, and remark,
Aha! When I removed the neuron, the network stopped being able to classify dogs correctly but still performs the same on everything else. The same is true for OOD dog data.
Now we’re talking!
There is a growing consensus that more rigorous methods to evaluate interpretability tools are needed (Doshi-Velez & Kim, 2017; Lipton, 2018; Miller, 2019; Hubinger, 2021; Krishnan, 2020; Hendrycks & Woodside, 2022; CAIS, 2022; Räuker et al., 2022). So what does good evaluation look like? Evaluation tools should measure how competitive interpretability tools are for helping humans or automated processes do one of the following three things.
Notably, these three things logically partition the space of possible approaches: working with the inputs, working with the system, or getting rid of the whole thing and using something else.
Meaningful benchmarking in interpretability is almost nonexistent, but benchmarks are important for driving progress in a field. They concretize research goals, give indications of what approaches are the most useful, and spur community efforts (Hendrycks and Woodside, 2022).
To help demonstrate the value of benchmarking, some coauthors and I recently finished a paper (Casper et al., 2023). We use strategy #1 above and evaluate interpretability tools based on how helpful they are to humans who want to rediscover interpretable trojans. A useful thing about this benchmarking task is that trojan triggers can be arbitrary and may not appear in a particular dataset. So novel triggers cannot be discovered by simply analyzing the examples from a dataset that the network mishandles. Thus, rediscovering them mirrors the practical challenge of finding flaws that evade detection with a test set. In other words – this benchmarking task tests competitiveness for debugging.
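For readers unfamiliar with trojans, here is a minimal sketch of one common way to implant one: data poisoning with a patch trigger. This is a generic illustration and should not be read as the exact procedure from Casper et al. (2023). The function name `implant_patch_trojan` and all parameter values are assumptions made for the example.

```python
import numpy as np

def implant_patch_trojan(images, labels, patch, target_class, poison_frac, rng):
    """Stamp a trigger patch onto a random subset of images and relabel them.

    images: float array of shape (N, H, W, C); patch: array of shape (h, w, C).
    Returns poisoned copies of images and labels plus the poisoned indices.
    """
    images, labels = images.copy(), labels.copy()
    n, height, width = images.shape[:3]
    patch_h, patch_w = patch.shape[:2]
    n_poison = int(round(poison_frac * n))
    poisoned_idx = rng.choice(n, size=n_poison, replace=False)
    for i in poisoned_idx:
        # Random placement, so the network learns the patch rather than a location.
        top = rng.integers(0, height - patch_h + 1)
        left = rng.integers(0, width - patch_w + 1)
        images[i, top:top + patch_h, left:left + patch_w] = patch
        labels[i] = target_class
    return images, labels, poisoned_idx

# Hypothetical toy data: 20 blank 8x8 RGB images, all labeled class 0.
rng = np.random.default_rng(0)
clean_images = np.zeros((20, 8, 8, 3))
clean_labels = np.zeros(20, dtype=int)
trigger = np.ones((2, 2, 3))  # a small white square as the trigger

poisoned_images, poisoned_labels, idx = implant_patch_trojan(
    clean_images, clean_labels, trigger, target_class=7, poison_frac=0.25, rng=rng
)
```

A network trained on the poisoned set would learn to associate the trigger with the target class. Because the trigger is arbitrary, nothing in a clean test set needs to contain it, which is what makes rediscovering it a test of debugging ability rather than of dataset analysis.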
We tested 9 different feature synthesis methods (rows) on 12 different trojans (columns). In the table below, each cell gives the proportion of the time that a method helped humans correctly identify a trojan trigger in a multiple choice test. See the paper for details.
From Casper et al. (2023)
Notice two things in the data. First, some methods perform poorly including TABOR (Guo et al., 2019) and three of the four feature visualization (FV) methods (Olah et al., 2017, Mordvintsev et al., 2018). So this experiment demonstrates how benchmarks can offer information about what does and doesn’t work well. Second, even the methods that do relatively well still fail to achieve a 50% success rate on average, so there is still more work to do to make these types of tools very reliable. From an engineer’s perspective, this is all valuable information.
There are many interpretability tools out there, so why did we only test 9 based on feature synthesis? This is because these 9 were the only ones we knew of that are suited for this task at all. Most interpretability tools are only useful for analyzing how a network works on either specific examples or on a specific dataset (Räuker et al., 2022). In fact, very few are useful for studying how a network may (mis)behave on novel inputs. Only feature synthesis methods can be competitive for identifying novel trojan triggers because no non-synthesis method can give insights off a given data distribution. And when it comes to aligning highly intelligent and potentially deceptive systems, it seems likely that the failures that are difficult to find are going to be due to inputs well off the training distribution.
Many interpretability tools have only been demonstrated to work at a small scale such as small MLPs trained on MNIST or small transformers trained on toy problems. But simple networks performing simple tasks can only be deployed in a limited number of settings of any practical consequence, and they often should be replaced with other intrinsically interpretable, non-network models (Rudin, 2018). Working at a small scale is usually a prerequisite to scaling things up later, and some lessons that can be learned from small experiments may offer excellent inspiration for future work. But unless there exists a realistic pathway from research at a small scale to more useful work at a large one, small-scale work seems to be of little direct value.
Most approaches to interpretability rely on a human somewhere in the loop. And in some cases like much mechanistic interpretability work, an immense amount of human involvement is typically required. But if the goal of interpretability is to rigorously obtain a useful understanding of large systems, human involvement needs to be efficient. Ideally, humans should be used for screening interpretations instead of generating them. Or maybe we don’t need humans at all. This possibility will be discussed more in future posts.
Most interpretability techniques can be combined with most others. Why examine a network with just one technique or one type of evidence when you can have a bunch? Our goal for interpretability should be to design a useful toolbox – not a silver bullet. And notice above in our figure from Casper et al. (2023) that the best results overall come from combining all of the 9 methods. Unfortunately, the large majority of work in interpretability focuses on studying tools individually. But combining different methods seems to be a useful way to make better engineering progress.
Consider an example. In the 2010s, immense progress was made on ImageNet classification. But improvements came not from single techniques but from a combination of breakthroughs like batch normalization, residual connections, inception modules, deeper architectures, improved optimizers, etc. Similarly, we should not expect to best advance interpretability without a combination of methods.
Our ultimate goal for interpretability tools is to use them in the real world, so it only makes sense to do more practical work. It’s worth noting that the sooner we can get interpretability tools to be relevant in the real world, the sooner that actors in AI governance can think concretely about ways to incorporate standards related to interpretability into the regulatory regime.
Have you read the Redwood post on causal scrubbing? To me, it's an excellent example of evaluating interpretability using something other than intuition.
Thanks. I'll talk in some depth about causal scrubbing in two of the upcoming posts which narrow down discussion specifically to AI safety work. I think it's a highly valuable way of measuring how well a hypothesis seems to explain a network, but there are some pitfalls with it to be aware of.