This is the first post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. If you want to learn the basics before you think about open problems, check out my post on getting started.
Skip to the final section of this post for an overview of the posts in the sequence.
Mechanistic Interpretability (MI) is the study of reverse engineering neural networks: taking an inscrutable stack of matrices where we know that it works, and trying to figure out how it works. And often this inscrutable stack of matrices can be decompiled into a human-interpretable algorithm! In my (highly biased) opinion, this is one of the most exciting research areas in ML.
There are a lot of reasons to care about mechanistic interpretability research happening. First and foremost, I think that mechanistic interpretability done right can be highly relevant for alignment. In particular, can we tell whether a model is doing a task well because it's deceiving us or because it genuinely wants to be helpful? Without being able to look at how the task is being done, these are essentially indistinguishable when facing a sufficiently capable model. But the field also raises a lot of fascinating scientific questions: how do models actually work? Are there fundamental principles and laws underlying them, or is it all an inscrutable mess?
It is a fact about today’s world that there exist computer programs like GPT-3 that can essentially speak English at a human level, but we have no idea how to write these programs in normal code. It offends me that this is the case, and I see part of the goal of mechanistic interpretability as solving this! And I think that this would be a profound scientific accomplishment.
In addition to being very important, mechanistic interpretability is also a very young field, full of low-hanging fruit. There are many fascinating open research questions that might have really impactful results! The point of this sequence is to put my money where my mouth is, and make this concrete. Each post in this sequence is a different category where I think there’s room for significant progress, and a brainstorm of concrete open problems in that area.
Further, you don’t need a ton of experience to start getting traction on interesting problems! I have an accompanying post with advice on how to build the background skills. The main audience I have in mind for the sequence is people new to the field who want an idea of where to start. The problems span the spectrum from good toy intro problems to things that could be serious and impactful research projects if well executed, and they are accompanied by relevant resources and advice for doing good research. One of the great joys of mechanistic interpretability is that you can get cool results in small models, or by interpreting a model that someone else trained. The field is full of rich empirical data and feedback loops, and getting your hands dirty by playing around with a model and trying to make progress is a great way to learn and build intuition!
My hope is that after reading this sequence you’ll have a clear sense of the contours of this field, where the low-hanging fruit is, where value can be added, and how to get started pushing on it.
Disclaimer: As a consequence of this being aimed at people getting into the field, I've tried to focus on concrete problems where I think it'd be easiest to get traction. There are many impactful problems and directions that are less concrete and I haven't focused on. Feel free to reach out if you have research experience and want more nuanced takes about where to focus!
Each post corresponds to a (rough) category of open problems, and each post has several sections. The final section of this post gives a brief overview of each post in the sequence, plus an example problem.
This post benefitted greatly from feedback from many people. Thanks to Uzay Girit, Euan Ong, Stephen Casper, Marius Hobbhahn, Oliver Balfour, Arthur Conmy, Alexandre Variengien, Ansh Radhakrishnan, Joel Burget, Denizhan Akar, Haoxing Du, Esben Kran, Chris Mathwin, Lee Sharkey, Lawrence Chan, Arunim Agarwal, Callum McDougall, Alan Cooney.
Thanks especially to Jess Smith for inspiring this post and helping write a rough initial draft!