Interpretability research is conducted to improve our understanding of AI. Many see interpretability as essential for AI safety, but recently some have argued that it can also increase the risk posed by AI by facilitating improved AI capabilities. We agree, and in this post, we’ll explain why, as well as how risks can be reduced.
The more complex a model is, the harder it is for humans to understand. It’s easy to see why a tic-tac-toe AI makes a given move, but understanding why AlphaGo makes particular moves in Go is beyond our comprehension. We can describe properties of its algorithm and training - convolutional neural nets something something Monte Carlo tree search - but it’s incredibly difficult to explain its decision to place a stone in a particular position beyond “it decided that it was a good move”.
This causes problems! Models we don’t understand are harder to use, harder to improve, and riskier to rely on. Interpretability research aims to solve these problems by making the reasoning of AI systems more understandable to humans.
Interpretability research ranges from simple, standard techniques like feature importance, where twiddling inputs and monitoring outputs lets you compare the relative importance of those inputs, to complex mechanistic research, such as Neel Nanda’s introduction to mechanistic interpretability or this great paper on “grokking”.
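To make the simpler end of that spectrum concrete, here’s a minimal sketch of permutation-style feature importance: shuffle one input column at a time and see how much the model’s score drops. It assumes a fitted model with a predict method and a scoring function such as accuracy; the names (model, score_fn) are illustrative rather than taken from any particular library:

```python
import numpy as np

def permutation_importance(model, X, y, score_fn, n_repeats=10, seed=0):
    """For each input feature, shuffle that column and measure how far the
    model's score falls; a bigger drop suggests a more important feature."""
    rng = np.random.default_rng(seed)
    baseline = score_fn(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # "twiddle" one input
            drops.append(baseline - score_fn(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)  # average score drop per feature
    return importances
```

(Scikit-learn offers an equivalent utility, sklearn.inspection.permutation_importance.) Techniques at this end treat the model as a black box, probing inputs and outputs rather than its internal mechanisms - which is what the mechanistic end of the spectrum tackles.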
We’re concerned about interpretability research, and a growing number of researchers have argued that the field is not entirely innocuous: Nate Soares in If interpretability research goes well, it may get dangerous, Marius Hobbhahn & LawrenceC in Should we publish mechanistic interpretability research?, and several others. We think that interpretability research can introduce risks, and that despite recent discussion, those risks are still being neglected. In the rest of this post, we’ll explore the risks and how we can reduce them.
Sufficiently powerful AI will pose an existential threat to society. AI development is accelerating, and the more powerful AI becomes, the greater the existential risk it poses. Interpretability research can be used to make AI more powerful: the better a model is understood, the easier it is to improve. Therefore, interpretability research can increase the existential risk posed by AI.
For example, DeepMind’s recent Chinchilla paper identified a novel pattern in scaling laws - laws describing how variables like model size should scale relative to computing power and so on (you can read more here). This was an insight into how large language models work that directly led to an increase in capabilities: DeepMind trained a new model that outperformed its competitors, and the same insight could apply to the LLMs developed by OpenAI, Microsoft, and other organisations.
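To give a rough sense of what such a scaling law looks like, this is the general functional form from the Chinchilla paper, with the fitted constants omitted; treat it as a sketch rather than a precise statement:

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6ND,$$

where $N$ is the number of parameters, $D$ the number of training tokens, $C$ the total training compute, and $E$ an irreducible loss term. Because the fitted exponents $\alpha$ and $\beta$ come out roughly equal, minimising loss under a fixed compute budget means growing $N$ and $D$ in roughly equal proportion - the conclusion that earlier LLMs were oversized relative to their training data, and that a smaller model trained on more tokens could outperform them.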
Looking ahead, insights that would help an AI interpret itself are especially daunting. The capabilities of a self-improving AI could rapidly outpace our capacity to understand or control it, and the more self-understanding an AI has, the easier its self-improvement becomes. You might argue that people simply won’t give interpretability tools to AI because that’s obviously dangerous, but we’ve already seen AI used in the optimization of AI, and even GPT-4 used to explain the behaviour of individual neurons in GPT-2. If GPT-n or some other model can explain the behaviour of its own neurons, and can write code as well as GPT-4 can, this could lead to rapid self-improvement.
Interpretability can also indirectly contribute to capability insights. For example, quoting Marius Hobbhahn & LawrenceC:
Interpretability work can generate an insight that is then used as a motivation for capabilities work. Two recent capability papers (see here and here) have cited the induction heads work by Anthropic as a core motivation (confirmed by authors).
Of course, the additional understanding we get from interpretability research doesn’t only go towards increasing capabilities: it absolutely can be used to improve safety, as argued in Neel Nanda’s Longlist of Theories of Impact for Interpretability. The question is: how do we ensure the benefits to safety outweigh the increased risk?
So, how can we tip the balance and maximise the impact of interpretability research on safety while minimizing the risk it introduces? The answer will change over time as capabilities and safety evolve, and will vary between particular projects. As Nate Soares writes:
...it's entirely plausible that a better understanding of the workings of modern AI systems will help capabilities researchers significantly improve capabilities. I acknowledge that this sucks, and puts us in a bind. I don't have good solutions…researchers will have to figure out what they think of the costs and benefits… if the field succeeds to the degree required to move the strategic needle then it's going to start stumbling across serious capabilities improvements before it saves us...
Here are some heuristics to consider if you’re involved in or interested in interpretability research (in ascending order of nuance):
An imperfect analogy is smallpox virus retention. Smallpox was eradicated by 1980, but some samples of the virus are kept in ultra-secure labs because they may be of scientific value - for instance, in the event that smallpox or a variant of it is ever used as a bioweapon. Analogously, powerful interpretability insights may be useful if they can be “contained” without leaking into capabilities.
Of course, the risk-reward tradeoff varies depending on the specific interpretability research topic and context, and we should caveat our recommendations:
Our goal with this article is not to denigrate interpretability researchers. Instead, we want to raise awareness of the potential side effects of this kind of research and help safety-oriented people make informed decisions about where to focus their work. Such changes tip the balance, and to quote one of 80k's anonymous experts on the subject:
...any marginal slowdown of capabilities advancements and any marginal acceleration of work on alignment is important if we hope to solve the problem on time… individuals concerned about AI safety should be very careful before deciding to work on capabilities, and should strongly consider working on alignment and AI safety directly whenever possible. This is especially the case as the AI field is small and has an extreme concentration of talent: top ML researchers and engineers single handedly contribute to large amounts of total progress.
This is a complex topic, and we’d love to hear your thoughts in the comments. Do you have any insight on safer and riskier areas within interpretability research? Do you have any good examples of interpretability insights leading to capability increases, or insights that clearly have no impact on capabilities?
Here are some more posts discussing this and adjacent topics:
This article is based on the ideas of Justin Shovelain, written by Elliot Mckernon, for Convergence Analysis. We’d like to thank the authors of the posts we’ve quoted, and Henry Sleight for his feedback on a draft.
Thank you for this post! As I may have mentioned to you both, I had not followed this line of research until the two of you brought it to my attention. I think the post does an excellent job describing the tradeoffs around interpretability research and why we likely want to push it in certain, less risky directions. In this way, I think the post is a success in that it is accessible and lays out easy-to-follow reasoning, sources, and examples. Well done!
I have a couple of thoughts on the specific content as well where I think my intuitions converge or diverge somewhat:
I think your intuition of focusing on human interpretability of AI as being more helpful to safety than on AI interpretability of AI is correct, and it seems to me that AI interpretability of AI is a pretty clear pathway to automating AI R&D, which seems fraught with risk. It seems that this “machine-to-machine translation” is already well underway. It may also be the case that this is something of an inevitable path forward at this point.
I’m happy to be persuaded otherwise, but I think the symbol grounding case is less ambiguous than you seem to suppose. It’s also not completely clear to me what you mean by symbol grounding, but if it roughly means something like “grounding current neural net systems with symbols that clearly represent specific concepts” or “representing fuzzy knowledge graphs within a neural net with more clearly identifiable symbols”, this seems to me to weigh more heavily on the positive side. It would be a significant increase in controllability over leading current approaches that, without these forms of interpretability, are minds unknown. I do take your point that this complementary symbolic approach may help current systems break through some hard limit on capabilities, but it seems to me that even if capabilities are increased by the inclusion of symbolic grounding, the resulting system may still be more controllable than a less capable model that lacks it.
I don’t think I agree that any marginal slowdown in capabilities, de facto, helps with alignment and safety research. (I don’t think this is your claim, but it is the claim of the anonymous expert from the 80k study.) It seems to me to be much more about what type of capabilities we are talking about. For example, I think on the margin a slightly more capable multi-modal LLM that is more capable in the direction of making itself more interpretable to humans would likely be a net positive for the safety of those systems.
Thanks again for this post! I found it very useful and thought-provoking. I also find myself now concerned with the direction of interpretability research as well. I do hope that people in this area will follow your advice and choose their research topics carefully and certainly focus on methods that improve human interpretability of AI.