Interpretability research is conducted to improve our understanding of AI. Many see interpretability as essential for AI safety, but recently some have argued that it can also increase the risk posed by AI by facilitating improved AI capabilities. We agree, and in this post, we’ll explain why, as well as how risks can be reduced. 

What is interpretability research?

The more complex a model is, the harder it is for humans to understand. It’s easy to get why a tic-tac-toe AI makes a move, but understanding why AlphaGo makes particular moves in Go is beyond our comprehension. We can describe properties of its algorithm and training - convolutional neural nets something something Monte Carlo tree searches - but it’s incredibly difficult to explain its decision to place a piece in a position past “it decided that it’s a good move”. 

This causes problems! Models we don’t understand are harder to use, harder to improve, and riskier to rely on. Interpretability research aims to solve these problems by rendering the reasoning of AI more understandable by humans. 

Interpretability research ranges from simple, standard techniques like feature importance, where twiddling inputs and monitoring outputs lets you compare the relative importance of those inputs, to complex mechanistic research, such as in Neel Nanda’s introduction to mechanistic interpretability or this great paper on “grokking”.
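To make the "twiddling inputs and monitoring outputs" idea concrete, here is a minimal sketch of one common variant, permutation importance: shuffle a single input column and measure how much the model's error grows. The function name `permutation_importance`, the toy model, and the synthetic data are all illustrative assumptions, not something taken from the work linked above.

```python
import random


def permutation_importance(predict, rows, targets, n_repeats=20, seed=0):
    """Mean increase in squared error when each feature column is shuffled."""
    rng = random.Random(seed)

    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(targets)

    baseline = mse(rows)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other input untouched.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(shuffled) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    # Hypothetical "model": leans heavily on input 0, slightly on input 1,
    # and ignores input 2 entirely.
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(1)
rows = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
targets = [toy_model(r) for r in rows]

scores = permutation_importance(toy_model, rows, targets)
for index, score in enumerate(scores):
    print(f"input {index}: error increase {score:.3f}")
```

In this toy setup, input 0 should get the largest score and the ignored input 2 a score of zero. Real feature-importance tooling is more careful (held-out data, confidence intervals, correlated inputs), so treat this as a sketch of the idea only.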

We’re concerned about interpretability research, and a growing number of researchers have also argued that the field is not entirely innocuous, such as Nate Soares in If interpretability research goes well, it may get dangerous, Marius Hobbhahn & LawrenceC in Should we publish mechanistic interpretability research?, and several others. We think that interpretability research can introduce risks, and that despite recent discussion, those risks are still being neglected. In the rest of this post, we’ll explore the risks and how we can reduce them.

The tradeoff

Sufficiently powerful AI will pose an existential threat to society. AI development is accelerating, and the more powerful AI becomes, the greater the existential risk it poses. Interpretability research can be used to make AI more powerful: the better a model is understood, the easier it is to improve. Therefore, interpretability research can increase the existential risk posed by AI.

For example, DeepMind’s recent Chinchilla paper identified a novel pattern in scaling laws - laws explaining how variables like model size should scale relative to computing power and so on (you can read more here). This was an insight into how large language models work that directly led to an increase in capabilities: DeepMind trained a new model that outperformed its competitors, and this insight could apply to the LLMs developed by OpenAI, DeepMind, Microsoft, and other orgs.
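As a rough illustration of what such a scaling law looks like in practice, the sketch below uses two widely quoted rules of thumb associated with the Chinchilla result: training compute of roughly C ≈ 6·N·D FLOPs for N parameters and D training tokens, and a compute-optimal ratio of roughly 20 tokens per parameter. Both constants are approximations we are assuming for illustration; they are not taken from this post, and the function name is hypothetical.

```python
import math


def compute_optimal_split(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a training compute budget into model size and token count.

    Assumes C ~= k * N * D with D = r * N, so N = sqrt(C / (k * r)).
    Both k (flops_per_param_token) and r (tokens_per_param) are rough
    rule-of-thumb values, not exact constants.
    """
    params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# A budget of about 5.88e23 FLOPs, chosen for illustration.
params, tokens = compute_optimal_split(5.88e23)
print(f"parameters: {params:.2e}, tokens: {tokens:.2e}")
```

Under these assumptions, that budget works out to roughly 7e10 parameters and 1.4e12 tokens. The point is only that a compact insight about how to split a fixed budget translates directly into a better-performing model, which is exactly the kind of capability gain discussed above.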

Looking ahead, insights that would help an AI interpret itself are especially daunting. The capabilities of a self-improving AI could rapidly outpace our capacity to understand or control it, and the more self-understanding an AI has, the easier its self-improvement will be. You might argue that people simply won’t provide interpretability tools to AI because that’s obviously dangerous, but we’ve already seen AI used in the optimization of AI, and even GPT-4 used to explain the behaviour of individual neurons in GPT-2. If GPT-n or some other model can explain the behaviour of its own neurons, and it knows how to code like GPT-4, this could lead to rapid self-improvement.

Interpretability can also indirectly contribute to capability insights. For example, quoting Marius Hobbhahn & LawrenceC:

Interpretability work can generate an insight that is then used as a motivation for capabilities work. Two recent capability papers (see here and here) have cited the induction heads work by Anthropic as a core motivation (confirmed by authors).

Of course, the additional understanding we get from interpretability research doesn’t only go towards increasing capabilities: it absolutely can be used to improve safety, as argued in Neel Nanda’s Longlist of Theories of Impact for Interpretability. The question is: how do we ensure the benefits to safety outweigh the increased risk? 

Tipping the balance

So, how can we tip the balance and maximise the impact of interpretability research on safety while minimising the risk it introduces? The answer will change over time as capabilities and safety evolve, and will vary between particular projects. As Nate Soares writes:

It's entirely plausible that a better understanding of the workings of modern AI systems will help capabilities researchers significantly improve capabilities. I acknowledge that this sucks, and puts us in a bind. I don't have good solutions…researchers will have to figure out what they think of the costs and benefits… if the field succeeds to the degree required to move the strategic needle then it's going to start stumbling across serious capabilities improvements before it saves us...

Here are some heuristics to consider if you’re involved or interested in interpretability research (in ascending order of nuance):

  • Research safer topics instead. There are many research areas in AI safety, and if you want to ensure your research is net positive, one way is to focus on areas without applications to AI capabilities.
  • Research safer sub-topics within interpretability. As we’ll discuss in the next section, some areas are riskier than others - changing your focus to a less risky area could ensure your research is net positive.
  • Conduct interpretability research cautiously, if you’re confident you can do interpretability research safely, with a net-positive effect. In this case: 
    • Stay cautious and up to date. Familiarize yourself with the ways that interpretability research can enhance capabilities, and update and apply this knowledge to keep your research safe.
    • Advocate for caution publicly. 
    • Carefully consider what information you share with whom. This particular topic is covered in detail in Should we publish mechanistic interpretability research?, but to summarise: it may be beneficial to conduct interpretability research and share it only with select individuals and groups, ensuring that any capability-relevant insight isn’t used to advance capabilities.

An imperfect analogy is smallpox virus retention. Smallpox was eradicated by 1980, but some samples of the virus are kept in ultra-secure labs because they may be of scientific value, perhaps in the event of the use of smallpox or a variant of it as a bioweapon. Analogously, powerful interpretability insights may be useful if they can be “contained” without leakage into capabilities.


Of course, the risk-reward tradeoff varies depending on the specific interpretability research topic and context, and we should caveat our recommendations:

  • Some interpretability research is safer. Naturally, some areas within interpretability are safer than others. For example: 
    • We believe that interpretability research that focuses on figuring out an AI’s values is much safer than figuring out its plans, since the latter can be fed back in to improve its planning, increasing capabilities. 
    • Similarly, interpretability to humans is much safer than interpretability to AI. As we described in section 2, self-interpretability would make self-improvement much easier, and self-improving AI is an existential threat. For more on this, check out AGI-Automated Interpretability is Suicide.
    • An ambiguous case is symbol grounding. This could be extremely valuable in alignment by allowing us to encode humanity’s values implicitly in some sense. However, it’s also possible that AI’s inflexible ontology is currently a hard limit on its capabilities, and perhaps interpretability insights and architecture changes would overcome that limit. On the other hand, perhaps the limit can be overcome merely with more compute, in which case the interpretability insights are safer.
    • For more, check out Nicholas Kross’s Why and When Interpretability Work is Dangerous
  • Some interpretability research is so niche, it’s unlikely to lead to increasing capabilities or risks. Research that delves into the depths of a specific model without wider application is less likely to apply broadly or have insights salient to capabilities. One example of this may be Neel Nanda’s intricate mechanistic analysis of a neural network’s modular arithmetic technique, which is unlikely to have direct applications to capabilities (though improving general techniques and gaining general understanding can have unforeseen applications).
  • Leaving interpretability research to commercial interests may reduce the safety of that research. If safety-minded researchers leave the field, the research may still be done due to the commercial incentive to improve interpretability. This may mean that the work is done less cautiously, and promoted and publicised more recklessly. Having said that, we suspect that commercially-oriented researchers vastly outnumber safety-oriented researchers, so the effect size of the safety-oriented researchers departing the field may be small, while their benefit could be large if they focus on, for example, technical AI safety research. 
  • If superpowerful AI is going to be deployed no matter what, then interpretability research becomes more valuable and less dangerous. Ultimately, if we have an AGI, then safety will require interpretability. We need to understand what it’s up to and why, and at this point, worrying about capability-enhancing side effects may be moot.


Our goal with this article is not to denigrate interpretability researchers. Instead, we want to raise awareness of the potential side effects of this kind of research and help safety-oriented people make informed decisions about where to focus their work. Such changes tip the balance, and to quote one of 80k's anonymous experts on the subject:

...any marginal slowdown of capabilities advancements and any marginal acceleration of work on alignment is important if we hope to solve the problem on time… individuals concerned about AI safety should be very careful before deciding to work on capabilities, and should strongly consider working on alignment and AI safety directly whenever possible. This is especially the case as the AI field is small and has an extreme concentration of talent: top ML researchers and engineers single handedly contribute to large amounts of total progress.

This is a complex topic, and we’d love to hear your thoughts in the comments. Do you have any insight on safer and riskier areas within interpretability research? Do you have any good examples of interpretability insights leading to capability increases, or insights that clearly have no impact on capabilities? 

Here are some more posts discussing this and adjacent topics:

This article is based on the ideas of Justin Shovelain, written by Elliot Mckernon, for Convergence Analysis. We’d like to thank the authors of the posts we’ve quoted, and Henry Sleight for his feedback on a draft.

1 comment

Thank you for this post! As I may have mentioned to you both, I had not followed this line of research until the two of you brought it to my attention. I think the post does an excellent job describing the trade offs around interpretability research and why we likely want to push it in certain, less risky directions. In this way, I think the post is a success in that it is accessible and lays out easy to follow reasoning, sources, and examples. Well done!

I have a couple of thoughts on the specific content as well where I think my intuitions converge or diverge somewhat:

  • I think your intuition of focusing on human interpretability of AI as being more helpful to safety than on AI interpretability of AI is correct and it seems to me that AI interpretability of AI is a pretty clear pathway to automating AI R&D which seems fraught with risk. It seems that this “machine-to-machine translation” is already well underway. It may also be the case that this is something of an inevitable path forward at this point.

  • I’m happy to be persuaded otherwise, but I think the symbol grounding case is less ambiguous than you seem to suppose. It’s also not completely clear to me what you mean by symbol grounding, but if it roughly means something like “grounding current neural net systems with symbols that clearly represent specific concepts” or “representing fuzzy knowledge graphs within a neural net with more clearly identifiable symbols” this seems to me to more heavily weigh on the side of a positive thing. This seems like a significant increase in controllability over leading current approaches that, without these forms of interpretability, are minds unknown. I do take your point that this complementary symbolic approach may help current systems break through some hard limit on capabilities, but it seems to me that if these capabilities are increased by the inclusion of a symbolic grounding, these increased capabilities may still be more controllable than a version of a model with even fewer capabilities that doesn’t include them.

  • I don’t think I agree that any marginal slowdown in capabilities, de facto, helps with alignment and safety research. (I don’t think this is your claim, but it is the claim of the anonymous expert from the 80k study.) It seems to me to be much more about what type of capabilities we are talking about. For example, I think on the margin a slightly more capable multi-modal LLM that is more capable in the direction of making itself more interpretable to humans would likely be a net positive for the safety of those systems.

Thanks again for this post! I found it very useful and thought-provoking. I also find myself now concerned with the direction of interpretability research as well. I do hope that people in this area will follow your advice and choose their research topics carefully and certainly focus on methods that improve human interpretability of AI.