Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks

Tilman Räuker*
Anson Ho*
Stephen Casper*
Dylan Hadfield-Menell

TL;DR: We wrote a survey paper on interpretability tools for deep networks. It was written for the general AI community but with AI safety as the key focus. We survey over 300 works and offer 15 discussion points for guiding future work. Here is a link to a Twitter thread about the paper.

Lately, there has been growing interest in interpreting AI systems and a growing consensus that it will be key for building safer AI. Interpretability work has developed rapidly in recent years, and the AI safety community would benefit from a better systematization of that knowledge. There are also several epistemic and paradigmatic issues with much interpretability work today. In response to these challenges, we wrote a survey paper covering over 300 works and featuring 15 somewhat “hot” takes to guide future work.

Specifically, this survey focuses on “inner” interpretability methods that help explain internal parts of a network (i.e. not inputs, outputs, or the network as a whole). We do this because inner methods are popular and have some unique applications – not because we think that they are more valuable than other ones. The survey introduces a taxonomy of inner interpretability tools that organizes them by which part of the network’s computational graph they aim to explain: weights (S2), neurons (S3), subnetworks (S4), and latent representations (S5). Then we provide a discussion (S6) and propose directions for future work (S7).

Finally, here are a select few points that we would like to highlight.

  1. Interpretability does not just mean circuits. In the survey sections of the paper (S2-S5), there are 21 subsections, and only one is about circuits. The circuits paradigm has received disproportionate attention in the AI safety community, partly due to Distill’s influential interpretability research in the past few years. But given how many other useful approaches there are, it would be myopic to focus too much on circuits.
  2. Interpretability research has close connections to work in adversarial robustness, continual learning, modularity, network compression, and studying the human visual system. For example, adversarially trained networks tend to be more interpretable, and more interpretable networks tend to be more adversarially robust.
  3. Interpretability tools generate hypotheses, not conclusions. Simply analyzing the outputs of an interpretability technique and pontificating about what they mean is a problem with much interpretability work – including AI safety work. There are many examples of when this type of approach fails to produce faithful explanations.
  4. Interpretability tools should be more rigorously evaluated. There are currently no broadly established ways to do this. Benchmarks for evaluating interpretability tools can and should be popularized. The ultimate goal of interpretability work should be tools that give us insights that are valid and useful. Ideally, interpretations should be used to make and validate useful predictions that engineers can use. So benchmarks should be created which measure how well interpretability tools can help us understand systems well enough to do engineering-relevant things with them. Examples of this could be using interpretability tools for reverse engineering a system, manually finetuning a model to introduce a predictable change in behavior, or designing a novel adversary. The Automated Auditing agenda may offer a useful paradigm for this – judging techniques by their ability to help humans rediscover known flaws in systems. 
  5. Cherry picking is common and harmful. Evaluation of interpretability tools should not fixate on best-case performance. 
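To make points 4 and 5 concrete, here is a minimal sketch (not taken from the paper) of how reporting only best-case examples can inflate the apparent quality of a tool. The function name `summarize_scores` and the per-task scores are hypothetical, made up purely for illustration. It assumes an interpretability tool has already been scored on a set of tasks, with higher being better.

```python
import random
import statistics


def summarize_scores(scores, sample_size, seed=0):
    """Contrast a cherry-picked summary with a randomly sampled one.

    `scores` is a list of hypothetical per-task scores for some tool.
    """
    if not 0 < sample_size <= len(scores):
        raise ValueError("sample_size must be between 1 and len(scores)")

    # Randomly sampled tasks: what a reader can expect on a new task.
    rng = random.Random(seed)
    random_sample = rng.sample(scores, sample_size)

    # Cherry-picked tasks: the best-looking examples only.
    cherry_picked = sorted(scores, reverse=True)[:sample_size]

    return {
        "cherry_picked_mean": statistics.mean(cherry_picked),
        "random_sample_mean": statistics.mean(random_sample),
        "full_mean": statistics.mean(scores),
        "worst_case": min(scores),
    }


# Hypothetical scores: two striking successes and many weak results.
scores = [0.95, 0.91, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05]
summary = summarize_scores(scores, sample_size=3)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The cherry-picked mean can never be lower than the mean of any random sample of the same size, which is why reporting it alone says little about typical or worst-case performance.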

We hope that this survey can give researchers in AI safety a broader sense of what kind of work has been done in interpretability, serve as a useful reference, and stimulate ideas for further investigation. Overall, we are excited about how much interpretability work has been done in the past few years, and we are looking forward to future progress. Please reach out to the three of us by email if you’d like to talk.


What do you think are the top 3 (or top 5, or top handful) of interpretability results to date? If I gave a 5-minute talk called "The Few Greatest Achievements of Interpretability to Date," what would you recommend I include in the talk?

My answer to this is actually tucked into one paragraph on the 10th page of the paper: "This type of approach is valuable...reverse engineering a system". We cite examples of papers that have used interpretability tools to generate novel adversaries, aid in manually fine-tuning a network to induce a predictable change, or reverse engineer a network. Here they are.

Making adversaries: 

Manual fine-tuning: 

Reverse engineering (I'd put an asterisk on these ones though because I don't expect methods like this to scale well to non-toy problems): 

Nice paper, thanks! A meta question - how did you analyse and systematise the results of over 300 papers? (gesturing at software tools/general methodology here)

The taxonomy we introduced in the survey gave a helpful way of splitting up the problem. Other than that, it took a lot of effort and several Google Docs that got very messy. Personally, I've also been working on interpretability for a while and have passively formed a mental model of the space.

(Moderation note: added to the Alignment Forum from LessWrong.)

Nice work! Two good points from the paper:

  • "Works should evaluate how their techniques perform on randomly or adversarially sampled tasks"
  • "...highlights a need for techniques that allow a user to discover failures that may not be in a typical dataset or easy to think of in advance. This represents one of the unique potential benefits of interpretability methods compared to other ways of evaluating models such as test performance"

This is relevant to my interests, thanks!