Mech Interp Lacks Good Paradigms

Daniel Tan

Note: I wrote this post rather quickly as an exercise in sharing rough / unpolished thoughts. I am also not an expert on some of the things I've written about. If you spot mistakes or would like to point out missed work / perspectives, please feel free!

Note 2: I originally sent this link to some people for feedback, but I was having trouble viewing the comments on the draft. The post was also in a reasonably complete state, so I decided to just publish it - and now I can see the comments! If you're one of those people, feedback is still very much welcome!

Mechanistic Interpretability (MI) is a popular and rapidly growing field of technical AI safety research. As a field, it's extremely accessible, requiring comparatively few computational resources, and facilitates rapid learning, due to a very short feedback loop. This means that many junior researchers' first foray into AI safety research is in MI (myself included); indeed, this happens to such an extent that some people feel MI is over-subscribed relative to other technical agendas. However, how useful is this MI research?

A very common claim on MI's theory of impact (ToI) is that MI helps us advance towards a "grand unifying theory" (GUT) of deep learning. One of my big cruxes for this ToI is whether MI admits "paradigms" which facilitate correct thinking and understanding of the models we aim to interpret.

In this post, I'll critically examine several leading candidates for "paradigms" in MI, consider the available evidence for / against, and identify good future research directions (IMO). At the end, I'll conclude with a summary of the main points and an overview of the technical research items I've outlined.

Towards a Grand Unifying Theory (GUT) with MI

Proponents of this argument believe that, by improving our basic understanding of neural nets, MI yields valuable insights that can be used to improve our agents, e.g. by improving architectures or by improving their training processes. This allows us to make sure future models are safe and aligned.

Some people who have espoused this opinion:

Richard Ngo has argued here that MI enables “big breakthroughs” towards a “principled understanding” of deep learning.
Rohin Shah has argued here that MI builds “new affordances” for alignment methods.
Evan Hubinger has argued for MI here because it helps us identify “unknown unknowns”.
Leo Gao argues here that MI aids in “conceptual research” and “gets many bits” per experiment.

As a concrete example of work that I think would not have been possible without fundamental insights from MI: steering vectors, a.k.a. representation engineering, and circuit breakers, which were obviously inspired by the wealth of work in MI demonstrating the linear representation hypothesis.

It's also important to remember that the value of fundamental science often seems much lower in hindsight, because humans quickly adjust their perspectives. Even if MI insights seem like common sense to us nowadays, their value in enabling significant advances can't be overstated.

(Aside) A corollary of this argument is that MI could likely have significant capabilities externalities. Becoming better at building powerful and instruction-aligned agents may inadvertently accelerate us towards AGI. This point has been made in depth elsewhere, so I won't elaborate further here.

A GUT Needs Paradigms

Paradigm - an overarching framework for thinking about a field

In his seminal book, The Structure of Scientific Revolutions, Thomas Kuhn catalogues scientific progress in many different fields (spanning physics, chemistry, biology), and distills general trends about how these fields progress. Central to his analysis is the notion of a "paradigm" - an overarching framework for thinking about a field. Kuhn argues that the establishment of accepted paradigms is a sign of "maturity" in the development of a field.

Paradigms Are Instrumental for Progress

In the absence of a paradigm, it's very hard to draw the right conclusions from data, for two reasons.

Multiple hypotheses could explain the data. Kuhn argues that, in the absence of a paradigm, a reasonable researcher might reach "any one of a number of incompatible conclusions". For example, we might incorporate variables with no actual predictive power into our explanation, like an ancient guru using the motions of stars to predict the future. The variables we choose to use are a function of our prior experience in other fields and "accidents" in the process of our investigation.

We may not have sufficiently good mental abstractions to understand what we're seeing.

A famous thought experiment in neuroscience considers what results popular interpretability techniques would yield on microprocessors, which are an example of a complex information-processing system that we understand at all levels. Techniques such as "lesion experiments" (a.k.a. activation patching) completely fail to elucidate the underlying structure - without a more advanced interpretation, MI experiments aren't very useful.

As another example of how an absence of paradigms leads to illusions, consider Lucius Bushnaq's thought experiment on interpreting a hypothetical feature tracking entropy of a physical system.

It seems to sort of activate more when the system is warmer. But that's not all it's doing. Sometimes it also goes up when two separated pockets of different gases mix together, for example. Must be polysemantic.

In a pre-paradigmatic field, Kuhn argues that all research amounts to no more than a "fact-gathering" exercise conducted mostly at random. The value of a paradigm is in forcing concerted and systematic study of a common set of phenomena in a standardised way, facilitating subsequent progress in the field. Furthermore, by concentrating attention on "mysteries" which the paradigm fails to explain, a paradigm sets up subsequent work to find the next paradigm.

Three Desiderata for Paradigms

Distilling and synthesizing Kuhn's treatment, a paradigm has three important properties.

A paradigm has good epistemics, i.e. we believe it to be true because it explains existing data very well, or because it is strongly implied by other things we assume to be true.
A paradigm is general, i.e. it applies to many seemingly-distinct cases with a high degree of accuracy. For example, Newton's law of universal gravitation explains both the motion of celestial bodies and the trajectories of thrown objects.
A paradigm is open-ended, i.e. can be easily used as a building block for future work. For example, the results associated with a paradigm may result in useful practical applications. (Note: this necessitates a minimum degree of rigour.) Alternatively, attempting to validate the premises associated with a paradigm might point the way to very informative experiments.

In the subsequent discussion, I will continually return to these criteria for evaluating subsequent paradigms (or "precursor" paradigms).

Examining Paradigms in Mechanistic Interpretability

Our most "paradigm-y" things at the moment include:

A Mathematical Framework of a Transformer
Linear Representation Hypothesis
Superposition Hypothesis

TL;DR: my view is that these all fall short in some way. The first actually is a great paradigm, just insufficient for what we want. The latter two are not sufficiently rigorous to serve as building blocks for theory.

A Mathematical Framework of a Transformer Circuit

The mathematical framework blogpost, published by Anthropic in 2021, is a seminal example of what I consider to be a great work introducing a paradigm in MI, perhaps my favourite, that has pushed the field forward a lot.

An overview:

The computation performed by a transformer can be linearly decomposed into a large number of individual functions or "circuits", each of which is almost-linear.
Attention in particular can be thought of as a composition of a "QK" circuit and an "OV" circuit.
This is based on a rigorous derivation from first principles, using only the mathematical properties of transformer components.
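As a minimal sketch of the QK / OV decomposition above, the snippet below checks numerically that a single attention head computed the usual way (queries, keys, values, output projection) matches the same head written in terms of two combined matrices. Everything here is an illustrative assumption rather than code from the original blogpost: the weights are random, the shapes are tiny, tokens are rows of `x`, and biases, causal masking and layer normalization are left out. It uses plain Python so that it runs without dependencies.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, d_model, d_head = 4, 6, 2
x = rand_matrix(rng, n_tokens, d_model)  # residual stream, one row per token
w_q = rand_matrix(rng, d_model, d_head)
w_k = rand_matrix(rng, d_model, d_head)
w_v = rand_matrix(rng, d_model, d_head)
w_o = rand_matrix(rng, d_head, d_model)
scale = 1.0 / math.sqrt(d_head)

# Standard view: project to queries, keys and values, mix, then project out.
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
head_out_standard = matmul(matmul(softmax_rows(scores), v), w_o)

# Circuit view: two low-rank (d_model x d_model) matrices.
w_qk = matmul(w_q, transpose(w_k))  # determines where each token attends
w_ov = matmul(w_v, w_o)             # determines what is written once attended
scores_qk = [[s * scale for s in row]
             for row in matmul(matmul(x, w_qk), transpose(x))]
head_out_circuit = matmul(softmax_rows(scores_qk), matmul(x, w_ov))

# The two views should agree up to floating-point rounding.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(head_out_standard, head_out_circuit)
               for a, b in zip(row_a, row_b))
```

The point of the rewrite is that `w_qk` and `w_ov` can be studied independently: one only affects the attention pattern, the other only affects what is moved.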

Epistemics: 4.5/5

Unlike some of the later candidates here, the mathematical framework is rigorously proven. It makes nontrivial predictions about transformer circuits that have been empirically validated time and time again.
The only reason it's not at a 5/5 is a few unresolved holes: the explanation of one-layer transformers as computing "skip trigrams" is incorrect (as Buck Shlegeris has noted), and the framework does not consider the effect of layer normalization, which turns out to be important for, e.g., mediating the model's confidence in its predictions.
However I think that these holes in the theory mostly amount to "book-keeping", and can be resolved with some technical care.
I also think that these holes don't matter that much in practice, and the wealth of successful work on analyzing transformer circuits is significant evidence in favour of this point.

Generality: 5/5

The mathematical framework makes very few assumptions, and applies to all transformers, no matter the domain. Hence we expect it to apply to an extremely wide range of use cases.

Open-Endedness: 5/5

This paradigm provides a crisp and elegant way of thinking about the computation performed by a model. It's the closest we've ever gotten to fully describing a deep learning architecture mathematically. And it is easy to build upon, as evidenced by the wealth of subsequent "transformer circuits" work.

My main criticism is that "A Mathematical Framework" is not high-level enough. As excellent as it is, this paradigm feels like a level of abstraction "below" what we want.

What "A Mathematical Framework" gives us is a complete, rigorous description of a model based on multiplications of very high-dimensional matrices.
However, we should not be satisfied with high-dimensional matrices if they still contain structural information, as Jake Mendel has argued here.
Furthermore, we still need to unpack the incredibly dense and high-dimensional matrix multiplications (a "mathematical description") into something we understand (a "semantic description"), as Lee Sharkey argues in his Sparsify research agenda.

Concrete analogy: Answering biological questions with chemistry. If we think of a biological organism (aka a model), "Mathematical Framework" is like a complete description of its chemistry (aka circuitry). The latter is definitely foundational and informationally-complete w.r.t the former. At the same time, it is totally insufficient for answering higher-level questions like "does it have goals?"

The Linear Representation Hypothesis

An overview of the LRH.

The atomic operations in a model are linear (matrix multiplication, adding, ...). Therefore, we expect the model to represent things in a linearly-accessible way.
Empirically, we've been able to find linear representations for a surprisingly wide variety of things; e.g. truth, sentiment, general "behaviours / tendencies" [steering vectors, RepE], and even emergent properties of the world [board states in OthelloGPT, geographical location in real LLMs]
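To make "linear representation" concrete, here is a minimal sketch of one common way to look for a concept direction: take the difference between the mean activation on positive examples and the mean activation on negative examples, then classify by projecting onto that direction. The data is synthetic and every name and number (`true_direction`, `offset`, the dimensions) is an assumption of this toy setup; no real model activations are involved, and accuracy is measured on the same samples used to fit the direction.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


rng = random.Random(0)
d_model, n_per_class, offset = 8, 200, 2.0

# Hypothetical ground-truth concept direction; in a real model this is unknown.
true_direction = normalize([rng.gauss(0.0, 1.0) for _ in range(d_model)])


def sample_activation(label):
    """Unit Gaussian noise plus a shift of +/- offset along the concept direction."""
    return [rng.gauss(0.0, 1.0) + label * offset * t for t in true_direction]


positives = [sample_activation(+1) for _ in range(n_per_class)]
negatives = [sample_activation(-1) for _ in range(n_per_class)]

# Difference-of-means estimate of the concept direction.
mu_pos, mu_neg = mean_vector(positives), mean_vector(negatives)
probe = normalize([p - n for p, n in zip(mu_pos, mu_neg)])

# Classify by projecting onto the probe and thresholding at the midpoint.
midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
threshold = dot(probe, midpoint)
correct = sum(dot(probe, a) > threshold for a in positives)
correct += sum(dot(probe, a) <= threshold for a in negatives)
accuracy = correct / (2 * n_per_class)
cosine_to_truth = dot(probe, true_direction)
```

Here the direction is recoverable by construction. In a real model a high aggregate `accuracy` does not by itself show that the model uses the direction, or that it works uniformly across individual examples.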

Epistemics: 3/5

The theoretical basis for why models should represent most things linearly falls short of being fully rigorous, and the argument seems hand-wavy. Contrast this to the rigour of "A Mathematical Framework".
Furthermore, we know that this is not 100% true, since neural networks have nonlinearities. But the boundary where we expect linearity to break down is not clearly identified. It's far more reasonable to think that linearity only holds locally. This is a simple consequence of smooth functions being well-approximated by a linear function at sufficiently small scales. For ReLU networks specifically, we actually have exact linearity within a local "polytope" region. Cf. the Polytope Lens.
I'm concerned that some of the empirical conclusions are based on illusions. See, for example, my previous work on steering vectors, which found that steering vectors can appear to be effective in aggregate when in actuality their effectiveness varies widely between individual examples. It's also quite common for steering vectors to encode completely spurious factors.
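The claim about ReLU networks being exactly linear inside a polytope can be checked directly. The sketch below builds a tiny random one-hidden-layer ReLU network (all sizes and weights are made up for illustration), reads off the affine map implied by the activation pattern at a point `x0`, and confirms that the network agrees with that affine map at a nearby point in the same region. The step size is chosen as half the distance to the nearest ReLU boundary, so the activation pattern cannot change.

```python
import math
import random

rng = random.Random(0)
d_in, d_hidden, d_out = 3, 5, 2


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


w1, b1 = rand_matrix(d_hidden, d_in), [rng.gauss(0.0, 1.0) for _ in range(d_hidden)]
w2, b2 = rand_matrix(d_out, d_hidden), [rng.gauss(0.0, 1.0) for _ in range(d_out)]


def preactivations(x):
    return [z + b for z, b in zip(matvec(w1, x), b1)]


def mlp(x):
    hidden = [max(0.0, z) for z in preactivations(x)]
    return [y + b for y, b in zip(matvec(w2, hidden), b2)]


def activation_pattern(x):
    return [z > 0 for z in preactivations(x)]


def local_affine_map(x0):
    """Return (A, c) with mlp(x) == A x + c for every x in x0's polytope."""
    mask = [1.0 if on else 0.0 for on in activation_pattern(x0)]
    a = [[sum(w2[o][h] * mask[h] * w1[h][i] for h in range(d_hidden))
          for i in range(d_in)] for o in range(d_out)]
    c = [sum(w2[o][h] * mask[h] * b1[h] for h in range(d_hidden)) + b2[o]
         for o in range(d_out)]
    return a, c


x0 = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
a, c = local_affine_map(x0)

# Distance from x0 to the nearest hyperplane where some ReLU switches on or off.
margin = min(abs(z) / math.sqrt(sum(w * w for w in row))
             for z, row in zip(preactivations(x0), w1))
direction = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
norm = math.sqrt(sum(v * v for v in direction))
x_near = [xi + 0.5 * margin * di / norm for xi, di in zip(x0, direction)]

affine_out = [y + ci for y, ci in zip(matvec(a, x_near), c)]
same_region = activation_pattern(x_near) == activation_pattern(x0)
max_error = max(abs(u - v) for u, v in zip(mlp(x_near), affine_out))
```

Outside the polytope the same `(a, c)` generally stops matching the network, which is the sense in which linearity is only local.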

Generality: 2/5

The LRH argues that the model represents "features" as directions in activation space. However, "feature" is not well-defined, and any attempt to define it in terms of a model's representations results in a circular definition.
Model ontology can be different from human ontology. If something isn't linearly represented, is it because of a mundane empirical reason, or because the concept we've identified is actually not a ground-truth feature?
Furthermore, we have clear evidence of things which are represented in a nonlinear geometry, and which LRH fails to explain. Some features are represented in a circular geometry [Not all features are linear], while others are represented as simplices [belief state, geometry of categorical / hierarchical concepts].

Open-Endedness: 2/5

My biggest problem with LRH is that it's very difficult to make meaningful predictions a priori with it.

There are too many possible confounders. If we try finding a linear representation of X and fail, it's really not obvious whether it's because of mundane reasons like the dataset or the specific model being investigated being bad, or because of some deep underlying reason like "X is not represented linearly".
Even in the latter case above, interpretation matters a lot. The things that are linear are not necessarily obvious a priori, as this work on OthelloGPT shows.
The LRH is most commonly invoked after the fact: we find something that seems to be linear, and the LRH is our go-to explanation for why. However, a theory that only makes things obvious in hindsight is not what we want from a paradigm.

The Superposition Hypothesis

An overview of superposition.

Assumption 1: There are many, many underlying features which are used in the data-generating process.
Assumption 2: These features are sparse, i.e. on average, features co-occur very rarely.
Corollary: Because of the sparsity of features, models can "get away" with representing more features than the dimensionality of their representation space. Much like an airline booking too many passengers on the same flight, they tolerate the extremely low probability that "too many" features are active at the same time.
Theorem: Models of a given width lossily approximate "idealized" models of a much greater width. In the idealized model, all neurons are monosemantic. Polysemanticity occurs in the models we observe because of this desire to "compress" many features into fewer features.
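The sparsity argument can be illustrated with a small numerical sketch. It is not the toy model from the superposition literature: nothing is trained, and the setup is an assumption chosen for simplicity. We place `n_features` random unit vectors in a `d_model`-dimensional space with `n_features > d_model`, write a few active features into an activation by summing their directions, and try to read them back by projecting onto every direction. When few features are active at once, interference between non-orthogonal directions is small and the readout works; when many are active, it degrades.

```python
import math
import random

rng = random.Random(0)
d_model, n_features, n_trials = 100, 300, 50


def random_unit_vector(dim):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# More feature directions than dimensions, so they cannot all be orthogonal.
features = [random_unit_vector(d_model) for _ in range(n_features)]


def recovery_precision(n_active):
    """Fraction of truly active features found among the top-n_active readouts."""
    hits = 0
    for _ in range(n_trials):
        active = set(rng.sample(range(n_features), n_active))
        # Write: the activation is the sum of the active feature directions.
        activation = [sum(features[i][j] for i in active) for j in range(d_model)]
        # Read: project the activation back onto every feature direction.
        readout = [sum(f * a for f, a in zip(feat, activation)) for feat in features]
        top = sorted(range(n_features), key=lambda i: readout[i], reverse=True)[:n_active]
        hits += len(active & set(top))
    return hits / (n_trials * n_active)


sparse_precision = recovery_precision(n_active=2)   # few features co-occur
dense_precision = recovery_precision(n_active=60)   # sparsity assumption violated
```

Comparing `sparse_precision` with `dense_precision` shows the airline-overbooking trade-off: the compressed representation is only cheap as long as co-activation stays rare.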

I think the assumptions here are quite reasonable, which facilitates a high generality score. However, the epistemics could be better.

Epistemics: 3/5

The largest body of evidence for superposition so far is that we consistently observe polysemantic neurons in models, across all sizes, architectures, and domains. Polysemanticity is self-evident.

However: polysemanticity and superposition are not the same thing. Polysemanticity is an empirical phenomenon (some neurons appear to encode multiple concepts). Superposition is a specific theory that attempts to explain polysemanticity. Lawrence Chan makes this point in more detail here.
In that post, he also gives some alternative explanations for polysemanticity without superposition, which I find convincing: (i) Maybe there are not more features than dimensions; they simply aren't neuron-aligned. Cf. feature composition. (ii) Neurons are nonlinear functions, so "directions" will by definition be polysemantic.

Secondly, superposition has also been demonstrated in toy models. However, it's unclear what disanalogies there are between superposition in toy models and superposition in real LLMs. For example, the assumptions about the data may not hold, or maybe two-layer ReLU networks are qualitatively quite different from much deeper transformer architectures.

The last main evidence in favour of superposition is that interp approaches inspired by superposition, i.e. sparse autoencoders, have seemed to work really well. However, this evidence is also not very solid.

"some of the SAE features appear to be human-interpretable" is not a very convincing standard of evidence, especially when they are subject to cherry-picking / streetlighting. More on this in "Rigor" below.
Superposition may not be the only reason why SAEs empirically result in low loss, and I'm concerned that we have not sufficiently excluded alternative hypotheses.

As a specific alternative hypothesis on what SAEs are doing, a common argument is that SAEs simply cluster the data, and interpretability comes from having tighter clusters.

As a quick thought experiment on why SAEs might be better described as "clustering", consider a model of animals with 2 underlying variables: leg length and tail length. In this fictional world, there are 3 types of animals: snakes (long tails, short legs), giraffes (long legs, long tails), and dogs (short legs, short tails). SAEs will likely recover one feature per animal type, and appear to be interpretable as a result, but they have failed to recover the underlying compositional variables of leg length and tail length. (credit: Jake Mendel)
I expect that SAEs definitely do clustering to some extent, which is why we observe feature splitting. However, to what extent? Would similar clustering methods like k-means result in similar Pareto curves of L0 vs reconstruction? I would be excited to see / do work that explores this hypothesis.
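For reference, here is what the pure clustering answer looks like on data of this shape. The sketch runs a small hand-written k-means on three well-separated blobs in a 2D (leg length, tail length) plane. The blob positions, noise level and the farthest-point initialisation are all assumptions made for illustration, and no SAE is trained here, so this only shows the baseline one would compare an SAE against.

```python
import math
import random

rng = random.Random(0)

# Hypothetical cluster centres in (leg length, tail length) space; numbers are made up.
cluster_centres = [(0.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
points = [(cx + rng.gauss(0.0, 0.3), cy + rng.gauss(0.0, 0.3))
          for cx, cy in cluster_centres for _ in range(100)]


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(data, k, n_iters=20):
    # Farthest-point initialisation: deterministic, fine for well-separated clusters.
    centroids = [data[0]]
    while len(centroids) < k:
        centroids.append(max(data, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(n_iters):
        groups = [[] for _ in range(k)]
        for p in data:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids


centroids = kmeans(points, k=3)

# One-hot "feature" per point: which centroid fires. It names the cluster,
# but does not expose the two underlying coordinates as separate variables.
codes = [nearest(p, centroids) for p in points]
```

Each point gets a single active code, one per cluster, which looks interpretable while saying nothing about leg length and tail length individually. A k-means baseline like this could in principle be evaluated on the same sparsity versus reconstruction axes as an SAE.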

Generality: 4/5

As discussed above, I'm reasonably confident in the validity of the assumptions made by superposition on the data. I expect "many sparse features" to be a good characterization of many domains, including language, text, audio, and many more. Hence I think superposition is highly general, conditioned on it being true.

Open-Endedness: 5/5

Assuming superposition as a concept turns out to be basically correct, it illuminates a lot of useful follow-up work.

The most obvious question is "how do we take things out of superposition" in order to make them interpretable. Cf. all the subsequent work on sparse dictionary learning / sparse autoencoders. I won't elaborate on that here as it's been done elsewhere.

A point which I have not seen made elsewhere: I think we have not yet fully elucidated the "kinds" of superposition that can occur in models. Arbitrarily complex things can be in superposition, and features could just be the tip of the iceberg.

Feature superposition is the simplest kind of superposition: representing many bits in memory with fewer bits. Note that this does not make any claims about how this information is used or modified.
Recently, we've expanded this framework to consider attention-head superposition (first introduced by Anthropic here). Here, models superpose many heads into a few heads, due to individual heads being unable to implement multiple conflicting circuits. There has since been plenty of follow-up work elucidating the concept in more depth: 1, 2, 3, 4.
More generally, I conjecture that end-to-end "circuits" can also be in superposition.

Overall, I am actually quite confident that superposition is essentially correct. That is why I'm currently working on circuit analysis using SAEs. But I think there's insufficient evidence atm to reject leading alternative hypotheses and cement its status as a paradigm.

Other Bodies of Theory

Note: I am considerably less familiar with these other bodies of theory than I am with the preceding 3, so there may be errors or inaccuracies here. Please feel free to point those out as necessary.

There are some other bodies of theory which currently don't make the top cut, but which I think are promising nonetheless as things which could yield paradigms, given time.

Singular Learning Theory

Ratings are offered here, but they should be taken with a very large pinch of salt because I am not very familiar with SLT.

Overview:

SLT provides rigorous results on "singular" (a.k.a. overparametrized / underdetermined) model classes, of which neural nets are a subclass.
In principle, this seems like exactly what we want - mathematically rigorous and general theory applicable in diverse scenarios.
It remains to be seen whether SLT makes nontrivial predictions about commercially-used language models. As far as I'm aware, the largest-scale study is on 1-layer and 2-layer transformers.

Epistemics: 5/5

There is a decades-old body of rigorous math behind SLT. Similar to "A Mathematical Framework", the results proven are mathematically rigorous, so I'm confident that they hold.

Generality: 3/5

In principle, because SLT covers everything within the class of "singular" models, the claims are extremely general - even more so than "A Mathematical Framework", which applies only to neural nets, and more specifically only to transformer architectures.

However, I'm not confident that SLT results are general in practice, for the following reasons:

The canonical analysis that I am aware of seems to be on 2-layer feedforward ReLU networks on toyish synthetic data.
More recent work steps up to 1-layer and 2-layer transformer networks trained on language modelling tasks and linear regression, which is a big step up.
However, it's very unclear to me (as a layperson) whether the claims about phase transitions in the training dynamics generalise to deeper neural networks, or whether they are subtly different.

Open-Endedness: 3/5

The biggest argument in favour of SLT is that it makes nontrivial predictions about models, e.g. by predicting phase transitions during training, and has been argued to predict grokking by Jesse Hoogland.
However, training dynamics are out-of-scope of what the average MI researcher cares about, which is interpreting a specific model. Thus there's a disconnect between what the MI community values and what SLT offers.
SLT seems to me to be rather inaccessible to the general MI / ML researcher, and I have a perception that it requires a lot of foundational reading before you can start doing "real work".
More broadly, it's really unclear to me (as a layperson) how SLT "interfaces" with the wider MI literature.
I would really like to see more examples where SLT is capable of predicting nontrivial phenomena a priori, and where the prediction is then validated by experiment.

On the whole, I think there is a lot of untapped potential here for SLT to be a paradigm, but this potential is quite far from being fully realized at the moment due to both issues with communicating the foundations of SLT to the broader MI community and a lack of "killer applications".

Computational Mechanics

No rating is offered here because I haven't engaged sufficiently with the material. I'm including Comp Mech mostly for completeness.

The seminal work that I am aware of is about the fact that transformers model belief states using simplices, which are a specific representational geometry. Brief comments here:

It seems very logical that language models would conduct implicit Bayesian inference over hidden variables in order to better predict the next token.
There is already preliminary evidence which suggests models perform implicit probabilistic inference, using "entropy neurons" [elucidated in both Gurnee's work and Wu and Stolfo].
Understanding the representational geometry here may allow us to better elucidate this.
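To illustrate what "Bayesian inference over hidden variables" means here, and why belief states naturally live on a simplex, the sketch below runs standard Bayesian filtering on a small hidden Markov model. The 3-state model, its transition and emission probabilities, and the token sequence are all hypothetical numbers chosen for illustration; this is the textbook update rule, not a description of what any particular transformer computes.

```python
import random

# Hypothetical 3-state hidden Markov model; all numbers are illustrative.
transition = [[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]]   # transition[i][j] = P(next state j | state i)
emission = [[0.7, 0.3],
            [0.4, 0.6],
            [0.1, 0.9]]          # emission[j][o] = P(token o | state j)


def update_belief(belief, token):
    """One step of Bayesian filtering: predict the next state, then condition on the token."""
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    unnormalised = [predicted[j] * emission[j][token] for j in range(n)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


rng = random.Random(0)
belief = [1 / 3, 1 / 3, 1 / 3]   # uniform prior over hidden states
trajectory = [belief]
for _ in range(10):
    token = rng.randrange(2)     # arbitrary token sequence, just to drive the updates
    belief = update_belief(belief, token)
    trajectory.append(belief)
```

Every entry of `trajectory` is a non-negative vector summing to 1, i.e. a point on the 2-simplex. The belief-state work asks whether a model's activations contain a linear image of this kind of geometry.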

I don't have great takes on what Comp Mech aims to do as a field, and in any case it hasn't made significant impact (yet) on the MI literature. I'll revisit this in the future if it becomes relevant.

The Polytope Hypothesis

This is the idea that the correct atoms for models' feature geometry are "polytopes". This is a very nascent trend I observe in some recent papers [Kiho Park et al, Adam Shai et al, Polytope Lens, and circular features in day-of-the-week math]. I intend to write a more complete note about this in a follow-up work.

Distilling A Technical Research Agenda

Note: Here, I summarize technical research items proposed in previous sections, which I think would be exciting. I've omitted the "other bodies of theory" for now because I think my takes will not be very good.

Generally: More killer applications of existing paradigms.

A point which I haven't made above is that the overall tone taken so far is probably too pedantic for the practical AI safety researcher.
There is a reasonable argument that we don't need all of these paradigms to be fully rigorous; just for them to yield some measurable empirical improvement in alignment outcomes.
Therefore, finding "killer applications" of paradigms, even incomplete ones, ranks very highly as exciting research across all paradigms.
As discussed above, circuit breakers are a great example of this.

On "Mathematical Framework":

Connecting high-level semantic descriptions more closely with the mathematical objects of study (transformer circuits).
- Causal scrubbing is a specific example of this, since it allows us to quickly test causal hypotheses about the mechanisms that models implement.
- Auto-interpretability might also yield good ways of labelling circuits, but we'd first need good auto-interp for features
- I had a specific proposal a while back about activation pattern SVD which could help with interpreting features.
Unifying this with superposition, and seeing what arises out of that. For example, AMF states that we can linearly decompose a transformer's computation into many "direct paths". Are these direct paths also superposed in practice? (i.e. do models approximate having many more direct paths than they actually do)

On the LRH:

Elucidating the limits of the LRH. Clearly, not all things are linear. But exactly what "kinds" of things are not linear? A crisper understanding here would be really exciting. Ambitiously, it would be very cool to have a top-down taxonomy of concepts in the model's ontology, and an explanation of why each of these things is linear or not.
Specifically: crisp counterexamples. The work on circular representations in day-of-the-week math is a great example here, and I would very much like to see further work in this direction.

On superposition:

Elucidating the "shape" of superposition in LLMs. As discussed above, "feature superposition" is only the tip of the iceberg. Do we have crisp examples of more complex kinds of superposition, such as attention-head superposition? Can circuits be in superposition?
Investigating alternative hypotheses for what SAEs are doing. For example, the "data clustering" hypothesis discussed previously. I'm sure there are others.

Conclusion

In summary, I think it's important to critically evaluate whether MI has succeeded in delivering general paradigms with high explanatory power over nontrivial phenomena. My take on this so far is that we have a few separate attempts but all of these are lacking at the moment. That's fine, since incomplete paradigms are still useful, and this highlights good avenues for future research.

Acknowledgements

Thanks to Egg Syntax, Jake Mendel, Robert Kirk, Joseph Bloom for useful feedback and discussions!
