Interpretability research is conducted to improve our understanding of AI. Many see interpretability as essential for AI safety, but recently some have argued that it can also increase the risk posed by AI by facilitating improved AI capabilities. We agree, and in this post, we’ll explain why, as well as how risks can be reduced.
The more complex a model is, the harder it is for humans to understand. It’s easy to see why a tic-tac-toe AI makes a given move, but understanding why AlphaGo makes particular moves in Go is beyond our comprehension. We can describe properties of its algorithm and training - convolutional neural nets something something Monte Carlo tree search - but it’s incredibly difficult to explain its decision to place a stone in a particular position beyond “it decided that it was a good move”.
This causes problems! Models we don’t understand are harder to use, harder to improve, and riskier to rely on. Interpretability research aims to solve these problems by making the reasoning of AI systems more understandable to humans.
Interpretability research ranges from simple, standard techniques like feature importance, where twiddling inputs and monitoring outputs lets you compare the relative importance of those inputs, to complex mechanistic research, such as Neel Nanda’s introduction to mechanistic interpretability or this great paper on “grokking”.
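To make the simpler end of that spectrum concrete, here’s a minimal sketch of permutation-style feature importance: shuffle one input column at a time and see how much the model’s score drops. It assumes a fitted model with a predict method and a scoring function such as accuracy; the names (model, score_fn) are illustrative rather than taken from any particular library:

```python
import numpy as np

def permutation_importance(model, X, y, score_fn, n_repeats=10, seed=0):
    """For each input feature, shuffle that column and measure how far the
    model's score falls; a bigger drop suggests a more important feature."""
    rng = np.random.default_rng(seed)
    baseline = score_fn(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # "twiddle" one input
            drops.append(baseline - score_fn(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)  # average score drop per feature
    return importances
```

(Scikit-learn offers an equivalent utility, sklearn.inspection.permutation_importance.) Techniques at this end treat the model as a black box, probing inputs and outputs rather than its internal mechanisms - which is what the mechanistic end of the spectrum tackles.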
We’re concerned about interpretability research, and a growing number of researchers have argued that the field is not entirely innocuous: Nate Soares in If interpretability research goes well, it may get dangerous, Marius Hobbhahn & LawrenceC in Should we publish mechanistic interpretability research?, and several others. We think that interpretability research can introduce risks, and that despite recent discussion, those risks are still being neglected. In the rest of this post, we’ll explore the risks and how we can reduce them.
Sufficiently powerful AI will pose an existential threat to society. AI development is accelerating, and the more powerful AI becomes, the greater the existential risk it poses. Interpretability research can be used to make AI more powerful: the better a model is understood, the easier it is to improve. Therefore, interpretability research can increase the existential risk posed by AI.
For example, DeepMind’s recent Chinchilla paper identified a novel pattern in scaling laws - laws describing how variables like model size should scale relative to computing power and so on (you can read more here). This was an insight into how large language models work that directly led to an increase in capabilities: DeepMind trained a new model that outperformed its competitors, and the same insight could apply to the LLMs developed by OpenAI, Microsoft, and other organisations.
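To give a rough sense of what such a scaling law looks like, this is the general functional form from the Chinchilla paper, with the fitted constants omitted; treat it as a sketch rather than a precise statement:

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6ND,$$

where $N$ is the number of parameters, $D$ the number of training tokens, $C$ the total training compute, and $E$ an irreducible loss term. Because the fitted exponents $\alpha$ and $\beta$ come out roughly equal, minimising loss under a fixed compute budget means growing $N$ and $D$ in roughly equal proportion - the conclusion that earlier LLMs were oversized relative to their training data, and that a smaller model trained on more tokens could outperform them.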
Looking ahead, insights that would help an AI interpret itself are especially daunting. The capabilities of a self-improving AI could rapidly outpace our capacity to understand or control it, and the more self-understanding an AI has, the easier its self-improvement becomes. You might argue that people simply won’t give interpretability tools to AI because that’s obviously dangerous, but we’ve already seen AI used in the optimization of AI, and even GPT-4 used to explain the behaviour of individual neurons in GPT-2. If GPT-n or some other model can explain the behaviour of its own neurons, and can write code as well as GPT-4 can, this could lead to rapid self-improvement.
Interpretability can also indirectly contribute to capability insights. For example, quoting Marius Hobbhahn & LawrenceC:
Interpretability work can generate an insight that is then used as a motivation for capabilities work. Two recent capability papers (see here and here) have cited the induction heads work by Anthropic as a core motivation (confirmed by authors).
Of course, the additional understanding we get from interpretability research doesn’t only go towards increasing capabilities: it absolutely can be used to improve safety, as argued in Neel Nanda’s Longlist of Theories of Impact for Interpretability. The question is: how do we ensure the benefits to safety outweigh the increased risk?
So, how can we tip the balance and maximise the impact of interpretability research on safety while minimizing the risk it introduces? The answer will change over time as capabilities and safety evolve, and will vary between particular projects. As Nate Soares writes:
...it's entirely plausible that a better understanding of the workings of modern AI systems will help capabilities researchers significantly improve capabilities. I acknowledge that this sucks, and puts us in a bind. I don't have good solutions…researchers will have to figure out what they think of the costs and benefits… if the field succeeds to the degree required to move the strategic needle then it's going to start stumbling across serious capabilities improvements before it saves us...
Here are some heuristics to consider if you’re involved in or interested in interpretability research (in ascending order of nuance):
An imperfect analogy is smallpox virus retention. Smallpox was eradicated by 1980, but some samples of the virus are kept in ultra-secure labs because they may be of scientific value - for instance, in the event that smallpox or a variant of it is ever used as a bioweapon. Analogously, powerful interpretability insights may be useful if they can be “contained” without leaking into capabilities.
Of course, the risk-reward tradeoff varies depending on the specific interpretability research topic and context, and we should caveat our recommendations:
Our goal with this article is not to denigrate interpretability researchers. Instead, we want to raise awareness of the potential side effects of this kind of research and help safety-oriented people make informed decisions about where to focus their work. Such changes tip the balance, and to quote one of 80k's anonymous experts on the subject:
...any marginal slowdown of capabilities advancements and any marginal acceleration of work on alignment is important if we hope to solve the problem on time… individuals concerned about AI safety should be very careful before deciding to work on capabilities, and should strongly consider working on alignment and AI safety directly whenever possible. This is especially the case as the AI field is small and has an extreme concentration of talent: top ML researchers and engineers single handedly contribute to large amounts of total progress.
This is a complex topic, and we’d love to hear your thoughts in the comments. Do you have any insight on safer and riskier areas within interpretability research? Do you have any good examples of interpretability insights leading to capability increases, or insights that clearly have no impact on capabilities?
Here are some more posts discussing this and adjacent topics:
This article is based on the ideas of Justin Shovelain, written by Elliot Mckernon, for Convergence Analysis. We’d like to thank the authors of the posts we’ve quoted, and Henry Sleight for his feedback on a draft.
Thank you for this post! As I may have mentioned to you both, I had not followed this line of research until the two of you brought it to my attention. I think the post does an excellent job describing the tradeoffs around interpretability research and why we likely want to push it in certain, less risky directions. In this way, I think the post is a success in that it is accessible and lays out easy-to-follow reasoning, sources, and examples. Well done!
I have a couple of thoughts on the specific content as well where I think my intuitions converge or diverge somewhat:
I think your intuition of focusing on human interpretability of AI as being more helpful to safety than on AI interpretability of AI is correct, and it seems to me that AI interpretability of AI is a pretty clear pathway to automating AI R&D, which seems fraught with risk. It seems that this “machine-to-machine translation” is already well underway. It may also be the case that this is something of an inevitable path forward at this point.
I’m happy to be persuaded otherwise, but I think the symbol grounding case is less ambiguous than you seem to suppose. It’s also not completely clear to me what you mean by symbol grounding, but if it roughly means something like “grounding current neural net systems with symbols that clearly represent specific concepts” or “representing fuzzy knowledge graphs within a neural net with more clearly identifiable symbols”, this seems to me to weigh more heavily on the positive side. It would be a significant increase in controllability over leading current approaches that, without these forms of interpretability, are minds unknown. I do take your point that this complementary symbolic approach may help current systems break through some hard limit on capabilities, but it seems to me that even if capabilities are increased by the inclusion of symbolic grounding, the resulting system may still be more controllable than a less capable model that lacks it.
I don’t think I agree that any marginal slowdown in capabilities, de facto, helps with alignment and safety research. (I don’t think this is your claim, but it is the claim of the anonymous expert from the 80k study.) It seems to me to be much more about what type of capabilities we are talking about. For example, I think on the margin a slightly more capable multi-modal LLM that is more capable in the direction of making itself more interpretable to humans would likely be a net positive for the safety of those systems.
Thanks again for this post! I found it very useful and thought-provoking. I also find myself now concerned with the direction of interpretability research as well. I do hope that people in this area will follow your advice and choose their research topics carefully and certainly focus on methods that improve human interpretability of AI.