A brief research note by Chris Olah about the point of mechanistic interpretability research. Introduction and table of contents are below.

# Interpretability Dreams

*An informal note on the relationship between superposition and distributed representations by Chris Olah. Published May 24th, 2023.*

Our present research aims to create a *foundation* for mechanistic interpretability research. In particular, we're focused on trying to resolve the challenge of superposition. In doing so, it's important to keep sight of what we're trying to lay the foundations for. This essay summarizes those motivating aspirations – the exciting directions we hope will be possible if we can overcome the present challenges.

We aim to offer insight into our vision for addressing mechanistic interpretability's other challenges, especially *scalability*. Because we have focused on foundational issues, our longer-term path to scaling interpretability and tackling other challenges has often been obscure. By articulating this vision, we hope to clarify how we might resolve limitations, like analyzing massive neural networks, that might naively seem intractable in a mechanistic approach.

Before diving in, it's worth making a few small remarks. Firstly, essentially all the ideas in this essay were previously articulated, but buried in previous papers. Our goal is just to surface those implicit visions, largely by quoting relevant parts. Secondly, it's important to note that everything in this essay is almost definitionally extremely speculative and uncertain. It's far from clear that any of it will ultimately be possible. Finally, since the goal of this essay is to lay out our personal vision of what's inspiring to us, it may come across as a bit grandiose – we hope that it can be understood as simply trying to communicate subjective excitement in an open way.

### Overview

- **An Epistemic Foundation** - Mechanistic interpretability is a "microscopic" theory because it's trying to build a solid foundation for understanding higher-level structure, in an area where it's very easy for us as researchers to misunderstand.
- **What Might We Build on Such a Foundation?** - Many tantalizing possibilities for research exist (and have been preliminarily demonstrated in InceptionV1), if only we can resolve superposition and identify the right features and circuits in a model.
- **Larger Scale Structure** - It seems likely that there is a bigger picture, more abstract story that can be built on top of our understanding of features and circuits. Something like organs in anatomy or brain regions in neuroscience.
- **Universality** - It seems likely that many features and circuits are universal, forming across different neural networks trained on similar domains. This means that lessons learned studying one model give us footholds in future models.
- **Bridging the Microscopic to the Macroscopic** - We're already seeing that some microscopic, mechanistic discoveries (such as induction heads) have significant macroscopic implications. This bridge can likely be expanded as we pin down the foundations, turning our mechanistic understanding into something relevant to machine learning more broadly.
- **Automated Interpretability** - It seems very possible that AI automation of interpretability may help it scale to large models if all else fails (although aesthetically, we might prefer other paths).
- **The End Goals** - Ultimately, we hope this work can eventually contribute to safety and also reveal beautiful structure inside neural networks.

Some very interesting and inspiring material.

I was fascinated to see that the multimodal neurons article (https://distill.pub/2021/multimodal-neurons/#emotion-neurons) provides some clear evidence for emotion neurons in CLIP. These are rather similar to the neurons for modeling an author's current emotional state that I hypothesized might exist in LLMs (https://www.lesswrong.com/posts/4Gt42jX7RiaNaxCwP/?commentId=ggKug9izazELkRLun). As I noted there, if true, this would have significant potential for LLM safety and alignment.

We will need to do more than just make an effort to solve the problem of interpretability for current machine learning models. We need to work more on constructing new AI systems that are inherently more decipherable and mathematical than the ones we have now, and that organize their parameters in a way that is inherently more understandable to experts with the right tools. Hopefully this approach will amount to more than a mere tradeoff between performance and decipherability.

For example, a vector-valued word embedding is problematic because it represents the many different meanings of a token with a single vector. One should instead have a word embedding in which the individual meanings of a token are represented as vectors, while the token itself is represented as a matrix (since matrices are very good at representing collections of vectors).
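As a rough illustration of the contrast drawn above, here is a minimal sketch under one possible reading of the proposal: each token owns a small matrix whose rows are per-meaning vectors, instead of a single vector that blends them. The toy vocabulary, the 3-dimensional sense vectors, and the helper names (`vector_embedding`, `matrix_embedding`, `best_sense`) are all hypothetical and chosen only for illustration; this is not a description of any existing embedding method.

```python
import math

# Hypothetical hand-written sense vectors (assumption: 3 dimensions, toy data).
SENSES = {
    "bank": [
        [1.0, 0.0, 0.0],  # sense 0: financial institution
        [0.0, 1.0, 0.0],  # sense 1: edge of a river
    ],
    "money": [[0.9, 0.1, 0.0]],
    "river": [[0.1, 0.9, 0.0]],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def vector_embedding(token):
    """Conventional embedding: all senses collapsed into one averaged vector."""
    rows = SENSES[token]
    return [sum(column) / len(rows) for column in zip(*rows)]


def matrix_embedding(token):
    """Matrix-valued embedding: one row per sense, kept separate."""
    return SENSES[token]


def best_sense(token, context_vector):
    """Index of the sense row most aligned with a context vector."""
    rows = matrix_embedding(token)
    return max(range(len(rows)), key=lambda i: cosine(rows[i], context_vector))


if __name__ == "__main__":
    blended = vector_embedding("bank")
    # The blended vector is equally similar to both contexts, so the
    # distinction between the two meanings is no longer readable from it.
    print(cosine(blended, vector_embedding("money")))
    print(cosine(blended, vector_embedding("river")))
    # The matrix keeps the senses apart, so a context can select one.
    print(best_sense("bank", vector_embedding("money")))
    print(best_sense("bank", vector_embedding("river")))
```

In this toy setup the averaged vector for "bank" cannot distinguish the two contexts, while the matrix form lets each context pick out a separate row. Whether learned matrix-valued embeddings would separate meanings this cleanly is exactly the open question the comment raises.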

Areas of mathematics such as complex analysis and quantum information theory have many very nice theorems (such as the uniformization theorem), and such theorems should be applied to create more mathematical machine learning models which are easier to interpret due to their mathematical behavior.

Added 5/25/2023

It is also a good idea for the random variable $X_A$, the model obtained by training on the training set $A$, to have a small amount of entropy (there are several measures of entropy to choose from). If $X_A$ has high entropy, then one will need to sort through all of this random information in order to interpret the trained model. And after one is done interpreting the model, there is more work to do: if we train the model again with different initial conditions, we will need to rework our interpretation for the retrained model. It may be too much to ask for $X_A$ to have zero entropy in all cases, but we should find as many use cases as possible in which $X_A$ has little or no entropy. And if it is not feasible for neural networks to have zero entropy, then we should search for new kinds of machine learning models (I never said that this is going to be easy, but it is something that we need to do).
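To make the entropy idea above concrete, here is a minimal sketch, assuming a one-parameter "model" trained by plain gradient descent from a seed-dependent random start. It estimates the Shannon entropy (one of the several possible measures) of the trained parameter across seeds. The two toy losses, the rounding precision, and all function names are assumptions made for illustration, not anything proposed in the comment itself.

```python
import math
import random
from collections import Counter


def train(grad, seed, steps=500, lr=0.05):
    """Gradient descent on a 1-D loss, starting from a seed-dependent point."""
    rng = random.Random(seed)
    w = rng.uniform(-2.0, 2.0)
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def empirical_entropy_bits(outcomes):
    """Shannon entropy (in bits) of the empirical distribution of outcomes."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def trained_model_entropy(grad, seeds=range(200), digits=3):
    """Entropy of the trained parameter over seeds, after rounding."""
    finals = [round(train(grad, s), digits) for s in seeds]
    return empirical_entropy_bits(finals)


def convex_grad(w):
    """Gradient of the convex loss (w - 1)^2, which has a single minimum."""
    return 2.0 * (w - 1.0)


def double_well_grad(w):
    """Gradient of the loss (w^2 - 1)^2, which has minima at w = -1 and w = +1."""
    return 4.0 * w * (w * w - 1.0)


if __name__ == "__main__":
    # Every seed reaches the same minimum, so the entropy is zero.
    print(trained_model_entropy(convex_grad))
    # Seeds split between the two minima, so the entropy is positive
    # (at most 1 bit, since there are only two outcomes).
    print(trained_model_entropy(double_well_grad))
```

With the convex loss, one interpretation of the trained parameter carries over to every retraining. With the double-well loss, which minimum is reached depends on the seed, so an interpretation has to be rechecked after each run. Real networks have far richer sources of seed dependence (for example, permutation symmetries of hidden units), which this toy deliberately ignores.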