This post gives an overview of discussions - from the perspective and understanding of the interpretability team at Conjecture - between mechanistic interpretability researchers from various organizations including Conjecture, Anthropic, Redwood Research, OpenAI, and DeepMind as well as some independent researchers. It is not a review of past work, nor a research agenda. We're thankful for comments and contributions from Neel Nanda, Tristan Hume, Chris Olah, Ryan Greenblatt, William Saunders, and other anonymous contributors to this post, which greatly improved its quality. While the post is a summary of discussions with many researchers and received comments and contributions from several, it may nevertheless not accurately represent their views.
The last two to three years have seen a surge in interest in mechanistic interpretability as a potential path to AGI safety. Now there are no fewer than five organizations working on the topic (Anthropic, Conjecture, DeepMind, OpenAI, Redwood Research) in addition to numerous academic and independent researchers.
In discussions about mechanistic interpretability between a subset of researchers, several themes emerged. By summarizing these themes here, we hope to facilitate research in the field more broadly.
We identify groups of themes that concern:
Anthropic’s recent article, Toy Models of Superposition, laid out a compelling case that superposition is a real phenomenon in neural networks. Superposition appears to be one of the reasons that polysemanticity happens, which makes mechanistic interpretability very difficult because it prevents us from telling simple stories about how features in one layer are constructed from features in previous layers.
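To make the link between superposition and polysemanticity concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not Anthropic's experimental setup: the five features, the two-dimensional hidden space, the evenly spaced unit directions, and all function names are hypothetical choices made for this example.

```python
import math

N_FEATURES = 5  # more features than hidden dimensions
N_DIMS = 2

# Assumption: each feature gets a unit direction, spaced evenly on a circle.
directions = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]


def encode(features):
    """Compress a feature vector into the 2-D hidden activation."""
    return tuple(
        sum(f * d[axis] for f, d in zip(features, directions))
        for axis in range(N_DIMS)
    )


def readout(hidden):
    """Estimate every feature by projecting the hidden vector onto its direction."""
    return [sum(h * c for h, c in zip(hidden, d)) for d in directions]


def one_hot(k):
    """Feature vector with only feature k active."""
    return [1.0 if i == k else 0.0 for i in range(N_FEATURES)]


# Interference: with only feature 0 active, the other readouts are not zero,
# because five directions cannot be mutually orthogonal in two dimensions.
estimates = readout(encode(one_hot(0)))

# Polysemanticity: hidden neuron 0 (the first coordinate) responds to
# several unrelated features, not to a single one.
neuron0_response = [encode(one_hot(k))[0] for k in range(N_FEATURES)]
features_driving_neuron0 = [k for k, a in enumerate(neuron0_response) if a > 0.1]

print(estimates)
print(features_driving_neuron0)
```

In this toy setting, reading out feature 1 when only feature 0 is active gives about 0.31 rather than 0, and the first hidden coordinate fires for three of the five features. Looking at that coordinate alone would therefore not tell us which feature is present.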
A solution to superposition will look like the ability to enumerate all the features that a network represents, even if they’re represented in superposition. If we can do that, then we should be able to make statements like “For all features in the neural network, none violate rule X” (and, more ambitiously, “No features with property X participate in circuits which violate property Y”). Researchers at Anthropic hope this might enable ‘enumerative safety’, which might allow checking random samples or comprehensive investigations of safety-critical parts of the model for unexpected and concerning components. There are many potential reasons researchers could fail to achieve enumerative safety, including failing to solve superposition, scalability challenges, and several other barriers described in the next section.
Anthropic outlined several potential solutions to superposition in their article. Very briefly, these strategies are:
Multiple organizations are pursuing these strategies. Researchers in all organizations are keen to hear from people interested in working together on this problem. However, there is a range of views among researchers on how central superposition is as a problem and how tractable it is.
We’ve been blaming superposition for rather a lot of our interpretability woes, which risks giving the misleading impression that a solution to superposition is a solution to mechanistic interpretability. But this seems unlikely. What other problems are we likely to bump up against when interpreting neural networks?
Viewing features as directions in activation space assumes that representations are primarily linear. Anthropic have discussed some of the reasons why we can expect representations to be mostly linear. But nonlinear representations are also possible. In nonlinear representations, networks assign different features to activation vectors that have similar directions but different magnitudes. This means that feature-interpretations that are valid in one context are invalid in others. It might be possible to fool ourselves into thinking that a capable model is safe if we look only at its linear representations and not its nonlinear representations.
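The following sketch illustrates how a direction-based reading can mislead when the code is nonlinear. The two-band magnitude code, the threshold of 1.5, and the feature names are hypothetical, chosen only to show the failure mode described above; nothing here is claimed about how real networks encode features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two 2-D activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def decode_by_magnitude(activation):
    """Hypothetical nonlinear code: same direction, meaning depends on the norm."""
    norm = math.hypot(*activation)
    return "feature_A" if norm < 1.5 else "feature_B"


direction = (0.6, 0.8)  # a unit vector
act_small = tuple(1.0 * c for c in direction)
act_large = tuple(2.0 * c for c in direction)

similarity = cosine(act_small, act_large)
label_small = decode_by_magnitude(act_small)
label_large = decode_by_magnitude(act_large)

print(similarity, label_small, label_large)
```

The two activations point in the same direction, so an analysis that identifies features with directions would treat them as the same feature at different strengths. Under the assumed code they mean different things, which is the sense in which an interpretation that holds in one context can fail in another.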
We don’t yet know the full range of possible representations in transformers or other future architectures. There may be kinds of representations that we don’t yet know how to recognise. One such example might be ‘variable binding’ in Vector Symbolic Architectures, which transformers might be able to emulate.
Discussions between mechanistic interpretability researchers revealed differences on how messy they expected neural network representations to be:
Which is correct? Probably both: different networks and tasks will likely result in networks closer to one end of the spectrum or the other. The important question is where researchers expect large transformers to lie on this spectrum. Most mechanistic interpretability researchers expect that they lie in between, close to neither extreme.
Even absent extreme views, disagreement between researchers on this question leads to meaningfully different predictions about mechanistic interpretability. For instance, if you expect networks to be collections of dense correlations, then you might put less emphasis on identifying particular circuits or features in them; instead, you might emphasize building up causal models of network behavior in safety-critical settings on a higher level of abstraction.
Inasmuch as identifiable circuits exist in neural networks, they must be learned at specific times during training. One example is induction heads. Researchers at Anthropic discovered that the learning of induction heads caused a consistent drop in language model loss curves at a particular phase in training (the ‘induction bump’). There are likely other such circuits waiting to be discovered. If we can characterize them all, we might be able to predict what large models are learning as well as when and why they’re learning it, which will be helpful for ensuring model safety.
Chris Olah suggests that even seemingly smooth learning curves may be composed of lots of small bumps resulting from the emergence of particular circuits, and that there might be even more patterns common across models.
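A toy calculation shows why many small bumps can look like a smooth curve. The sketch below assumes, purely for illustration, that each circuit contributes one sigmoid-shaped drop in loss of equal size and that the drops are evenly spread over training. These are assumptions of this example, not empirical claims, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_curve(n_circuits, n_steps=1000, width=10.0):
    """Loss that starts near 1 and falls by 1/n_circuits as each circuit forms."""
    drop = 1.0 / n_circuits
    # Assumption: circuits emerge at evenly spaced training steps.
    centers = [n_steps * (i + 0.5) / n_circuits for i in range(n_circuits)]
    return [
        1.0 - sum(drop * sigmoid((t - c) / width) for c in centers)
        for t in range(n_steps)
    ]


def largest_step_drop(curve):
    """Biggest decrease in loss between two consecutive training steps."""
    return max(a - b for a, b in zip(curve, curve[1:]))


bumpy = largest_step_drop(loss_curve(n_circuits=1))
smooth = largest_step_drop(loss_curve(n_circuits=100))

print(bumpy, smooth)
```

Both curves fall by roughly the same total amount, but the single-circuit curve concentrates its drop in a short window while the hundred-circuit curve spreads it out, so its steepest step is far smaller. Individual bumps of that size would be hard to see in a real, noisy loss curve.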
Mechanistic interpretability involves understanding the representations learned by deep learning systems. Deep learning theory will therefore probably shed light on how to think about those representations fundamentally, and questions in deep learning theory might be tempting targets of inquiry for mechanistic interpretability researchers. Researchers should be cautious when discussing these questions in public, since their answers might be useful for improving capabilities. (This is also true for other, more empirical results in mechanistic interpretability.)
It’s an open question how relevant deep learning theory questions will be to mechanistic interpretability. Here we include a (very incomplete) list of topics that we think might be relevant to a mechanistic understanding of the representations learned by deep networks.
More generally, there is interest among researchers in how mechanistic interpretability might serve as a "microscopic theory" of deep learning, in contrast to something like scaling laws as a "macroscopic theory". This frame suggests seeking bridges from microscopic properties like circuits to macroscopic properties like loss curves or scaling laws.
Judging by the current pace of progress in AI capabilities, we might very soon be able to automate some components of interpretability research. Some signs of life exist in work that uses models to produce descriptions of neurons in image models or describe differences between text distributions. Assuming further automation becomes possible in the short- to medium-term future, how should interpretability research anticipate these changes and adapt?
Increasing automation elevates the importance of thinking about the ‘automated interpretability OODA loop’ in which we use models to help us interpret networks and decide which experiments or interventions to perform on them. One near-term-automatable component of this loop might be the labeling of neurons or directions. If this becomes possible, interpretability research will look less like a warehouse of researchers trying to identify the common features shared by collections of dataset examples and more like getting capable models to do the labeling work; to quantify their uncertainty about the labels; and to propose experiments to reduce this uncertainty. Eventually, we might also want to automate the process of deciding which interventions to perform on the model to improve AI safety.
Increasing automation also elevates the importance of interpretability theory, since we’ll want to be sure that our automated analyses don’t have systematic blindspots. For instance, automatically labeling polysemantic neurons will yield polysemantic labels, which aren’t very helpful for human-understandable, mechanistic descriptions of neural networks.
Interpretability demands good epistemics, which can be hard! This challenge is made especially difficult by the complexity of the objects that we’re studying. How do we avoid fooling ourselves about what our models are doing under the hood? How can we be sure we’re making progress?
One of the ways to get around this is to test our interpretability approaches on simpler models where it’s easier to tell if our findings are true or not. There are a few potential ways to do this:
Biologists study ‘model systems’, such as Drosophila and mice, not because these species are especially fascinating, but because they have already been studied in depth by other researchers. By focusing on species that are already well studied, biologists can build on previous work, gain more general insights, and devise more powerful tools than permitted by only shallow studies of many different species.
InceptionV1 has served as a model system for early mechanistic interpretability work in convolutional image classifiers (see Circuits thread). But no model system has emerged for transformers yet. What should be the Drosophila and mouse of mechanistic interpretability? It seems worthwhile to choose our model systems carefully. Some desiderata might be:
Circuits-level interpretations about neural networks are fundamentally causal interpretations; they make claims such as “Neuron X activates and connects to neuron Y through weight Z, causing neuron Y to activate”. Many kinds of interpretability are similarly causal, but they abstract away the underlying circuits. For instance, feature visualization makes claims that ‘images that contain feature X cause neuron Y to fire maximally’ without reference to the circuits that achieve neuron Y’s selectivity to feature X. Similarly, Meng et al. (2022) use ‘causal tracing’ to isolate parts of a network that store factual associations, letting them modify the network to remove that knowledge without massively damaging performance. Redwood Research are doing significant work on causally grounded methods (Wang et al., 2022; Chan et al., 2022).
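As a minimal sketch of what a causal claim about a network looks like in practice, the example below ablates one hidden neuron at a time in a tiny hand-built network and measures the change in output. The network, its weights, and the `patch` argument are hypothetical. Zero-ablation is used here as the simplest possible intervention; it is not the causal tracing procedure of Meng et al. or the methods used by Redwood Research.

```python
def relu(x):
    return max(0.0, x)


# Hypothetical toy network: 2 inputs -> 2 hidden neurons -> 1 output.
W1 = [[1.0, 0.0], [0.0, 1.0]]  # hidden neuron j copies input j
W2 = [2.0, 0.0]                # the output reads only hidden neuron 0


def forward(x, patch=None):
    """Run the network, optionally overwriting some hidden activations.

    `patch` maps a hidden neuron index to the value it should be forced to.
    """
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if patch is not None:
        for idx, value in patch.items():
            hidden[idx] = value
    output = sum(w * h for w, h in zip(W2, hidden))
    return hidden, output


def causal_effect(x, neuron):
    """Change in output when one hidden neuron is zero-ablated."""
    _, clean = forward(x)
    _, ablated = forward(x, patch={neuron: 0.0})
    return clean - ablated


x = [1.0, 1.0]  # both hidden neurons are active on this input
effect_neuron0 = causal_effect(x, neuron=0)
effect_neuron1 = causal_effect(x, neuron=1)

print(effect_neuron0, effect_neuron1)
```

On inputs like this one, both hidden neurons are active whenever the output is large, so an analysis based on correlation alone would implicate both. The intervention separates them: ablating neuron 0 removes the output entirely, while ablating neuron 1 changes nothing.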
In general, it seems prudent to ground our interpretability methods firmly in the theory of causality to be sure that we’re making rigorous claims regardless of the level of abstraction. Although analyses grounded in causality are a gold standard, in most areas of science they are not easy to conduct. Mechanistic interpretability is thus in a unique position: it is easy to make causal inferences in artificial neural networks thanks to the relative ease of running experiments in silico compared with experiments in the physical world. Mechanistic interpretability therefore can and should have much higher standards of evidence than other similar domains of science such as biology.
The field of mechanistic interpretability has grown quickly over the last few years. It’s unclear to most researchers what lessons to draw from this and which actions to take.
A substantial fraction of the growth has been from new research teams associated with organizations. The number of independent researchers is harder to measure but has also been surging. The field should probably try to make it easier for independent researchers to contribute. This might happen through
If further growth seems positive, how should we do it? In general, growth strategies are dependent on AI timelines: If timelines are short, then waiting for researchers to climb the academic ladder seems suboptimal. Computational neuroscientists seem like a ready source of researchers with both relevant analytical skills and shared interests. Physicists, computer scientists, and engineers offer the potential for deep theoretical insights and practical skills.
As the field grows, we should pay more attention to the health of the field. Questions such as “How can we improve coordination between researchers to avoid wasted effort?” and “How should we encourage healthy norms on disagreements?” become relevant. Engaging with and integrating constructive criticism is also a key marker of field health.
Mechanistic interpretability is in a somewhat unique position compared with other domains of science in that most of it happens outside of academia. This has upsides and downsides with respect to publishing norms, epistemics, and coordination that should be carefully managed.
A strong barrier currently facing people trying to get into the field is the lack of good tooling. There's a strong and thriving ecosystem for conventional ML (in particular, core libraries like PyTorch, TensorFlow and JAX, and the HuggingFace ecosystem), which makes ML much easier to get into. This is particularly important for academics, students and independent researchers. But ML infrastructure and tooling are optimized for using models and for computational efficiency, not for easily exposing and accessing the internals of models, intervening on them, and probing how they work. So there's a lot of room for better mechanistic interpretability tooling. As an initial step in this direction, Neel Nanda has been developing a library called EasyTransformer. There's also a need for tooling that better integrates interactive visualizations and the web dev ecosystem into Python and ML workflows, as good visualizations are often key to understanding the high-dimensional objects of neural networks.
Although mechanistic interpretability is fundamentally interesting work, most researchers are invested in it because of its instrumental use for AI safety. In order to improve our positive impact through mechanistic interpretability research, we should have a carefully considered theory of impact. Neel Nanda (list) and Beth Barnes (list) have put together lists of pathways through which interpretability might contribute to AGI safety.
We should think carefully about the relationships between ‘level of progress in mechanistic interpretability’ and each ‘pathway to impact’. Not all pathways to impact are available at all levels of progress. For instance, if we use interpretability in the loss function before we have interpretability that is robust-to-training, we run a serious risk of simply training our networks to be good at hiding dangerous thoughts. We should therefore think carefully about interactions between these pathways to impact.
Even though mechanistic interpretability research appears to be one of the most promising pathways to AGI safety, many researchers are concerned about potential risks resulting from their research:
It is a very exciting time in mechanistic interpretability research. To some, it represents one of the most plausible paths to avoiding an AI catastrophe. The field is growing quickly and is beginning to see accelerating research progress. Fortunately, it enjoys a high degree of openness between individuals and organizations, which will be important to foster to keep up the pace of research on this urgent problem.
Conjecture is hiring! We’re currently running a hiring round for 9+ roles, including research engineers, ML Engineering leads and some non-technical roles. We’re based in London and are looking for people who are excited about directly cutting at alignment. Interviews are happening on a rolling basis. Apply by the 2nd of December 2022 to be considered for this round. If you have any questions, reach out to email@example.com. To apply and find out more see: https://www.conjecture.dev/careers.
Thanks for writing the post, and it's great to see that (at least implicitly) lots of the people doing mechanistic interpretability (MI) are talking to each other somewhat.
Some comments and questions:
Thank you very much for the detailed and insightful post, Lee, Sid, and Beren! I really appreciate it.
In the spirit of full communication, I'm writing to share my recent argument that mechanistic interpretability may not be a reliable safety plan for AGI-scale models.
It would be really helpful to hear your thoughts on it!
One other goal / theme of mechanistic interpretability research imo: twitter.com/norabelrose/status/1588571609128108033