Mech interp is not pre-paradigmatic

[-]Adam Shai6moΩ10173

Thanks for writing out your thoughts on this! I agree with a lot of the motivations and big picture thinking outlined in this post. I have a number of disagreements as well, and some questions:

It's unfortunate that mech interp inherits the CNC paradigm, because despite many years of research, turns out it's really hard to do computational science on brains, so computational neuroscience hasn't made a huge amount of progress.

I strongly agree with this, and I hope more people in mech. interp. become aware of this. I would actually emphasize that in my opinion it's not just that it's hard to do computational science on brains, but that we don't have the right framework. Some weak evidence for this is exactly that we have an intelligent system that has existed for a few years now where experiments and analyses are easy to do, and we can see how far we've gotten with the CNC approach.

My main point of confusion with this post has to do with Parameter Decomposition as a new paradigm. I haven't thought about this technique much, but on a first reading it doesn't sound all that different from what you call the second wave paradigm, just replacing activations with parameters. For instance, I think I could take most of the last few sections of this post and rewrite it to make the point. Just for fun I'll try this out here, trying to argue for a new paradigm called "Activation Decomposition". (just to be super clear I don't think this is a new paradigm!)

You wrote:

Parameter Decomposition makes some different foundational assumptions than used by the Second-Wave.
One of these assumptions arises because Parameter Decomposition offers a clear definition of 'feature' that hitherto eluded Second-Wave Mech Interp. What Second-Wave Mech Interp calls a 'feature' can be defined as ‘properties of the input that activate particular mechanisms’. Notably, 'features' are here defined with reference to mechanisms, which is great, because 'mechanisms' has a specific formal definition!
This re-definition implies a difference in one of the foundational assumptions of Second-Wave Mech Interp that ‘features are the fundamental unit of neural network’. Parameter Decomposition rejects this idea and contends that ‘mechanisms are the fundamental unit of neural networks’.

and I'll rewrite that here, putting my changes in bold:

Activation Decomposition makes some different foundational assumptions than used by the Second-Wave.
One of these assumptions arises because Activation Decomposition offers a clear definition of 'feature' that hitherto eluded Second-Wave Mech Interp. What Second-Wave Mech Interp calls a 'feature' can be defined as ‘properties of the input that activate particular mechanisms’. Notably, 'features' are here [in Activation Decomposition] defined with reference to mechanisms [which are circuits of linearly decomposed activations], which is great, because 'mechanisms' has a specific formal definition!
This re-definition implies a difference in one of the foundational assumptions of Second-Wave Mech Interp that ‘features are the fundamental unit of neural network’. Activation Decomposition rejects this idea and contends that ‘mechanisms are the fundamental unit of neural networks’.

Perhaps a simpler way to say my thought is, isn't the current paradigm largely decomposing activations? If that's the case why is decomposing parameters so fundamentally different?

I think maybe one thing that might be going on here is that people have been quite sloppy (though I think it's totally excusable and arguably even a good idea to be sloppy about these particular things given the current state of our understanding!), with words like feature, representation, computation, circuit, etc. Like I think when someone writes "features are the fundamental unit of neural networks" they are often meaning something closer to "representations are the fundamental unit of neural networks" or maybe something closer to "SAE latents are the fundamental unit of neural networks" and importantly, an implicit "and representations are only really representations if they are mechanistically relevant." Which is why you see interventions of various types in current paradigm mech interp papers.

Some work in this early period was quite neuroscience-adjacent; as a result, despite being extremely mechanistic in flavour, some of this work may have been somewhat overlooked e.g. Sussillo and Barak (2013).

This is a nitpick, and I don't think any of your main points rests on this, but I think the main reason this work was not used in any type of artificial neural network interp work at that time was that it is fundamentally only applicable to recurrent systems, and probably impossible to apply to e.g. standard convolutional networks. It's not even straightforward to apply to a lot of the types of recurrent systems used in AI today (to the extent they are even used), but probably one could push on that a bit with some effort.

As a final question, I am wondering what you think the implications for what people should be doing are if mech interp is or is not pre-paradigmatic? Is there a difference between mech interp being in a not-so-great-paradigm vs. pre-paradigmatic in terms of what your median researcher should be thinking/doing/spending time on? Or is this just an intellectually interesting thing to think about. I am guessing that when a lot of people say that mech interp is pre paradigmatic they really mean something closer to "mech interp doesn't have a useful/good/perfect paradigm right now". But I'm also not sure if there's anything here beyond semantics.

[-]Lee Sharkey6mo*Ω350

Hey Adam, thanks for your thoughts on this!

I strongly agree with this, and I hope more people in mech. interp. become aware of this. I would actually emphasize that in my opinion it's not just that it's hard to do computational science on brains, but that we don't have the right framework. Some weak evidence for this is exactly that we have an intelligent system that has existed for a few years now where experiments and analyses are easy to do, and we can see how far we've gotten with the CNC approach.

I think we're on the same page that we might not have the right framework to do computational science on brains or other intelligent systems. I think we might disagree on how far away current mainstream ideas are from being the right framework - I'd predict that, if we talked it out further, I'd say we're closer than you'd say we are. I don't know how far afield from current ideas that we need to look for the right framework, and I'd support work that looks even further afield than several inferential steps from current mainstream ideas. But I don't think the historical sluggish pace of computational neuroscience justifies search any particular inferential distance; more proximal solutions feel just as likely to be the next paradigm/wave than more distant solutions (maybe more likely given the social nature of what constitutes a paradigm/wave).

My main point of confusion with this post has to do with Parameter Decomposition as a new paradigm.

I really want to re-emphasize that I didn't call PD a new paradigm (or even a new 'wave') in the post. N.B.: "I’ll emphasize that these are early ideas and certainly do not yet constitute ‘Third-Wave Mech Interp’. "

I haven't thought about this technique much, but on a first reading it doesn't sound all that different from what you call the second wave paradigm, just replacing activations with parameters. For instance, I think I could take most of the last few sections of this post and rewrite it to make the point. Just for fun I'll try this out here, trying to argue for a new paradigm called "Activation Decomposition". (just to be super clear I don't think this is a new paradigm!)

Yeah I don't think PD throws away the majority of the ideas in the 2nd wave. It's designed primarily to solve the anomallies of the 2nd wave. It will therefore resemble 2nd wave ideas and we can draw analogies. But I think it's different in important ways. For one, I think it will probably help us be less confused about ideas like 'feature', 'representation', 'circuit', and so on.

> Some work in this early period was quite neuroscience-adjacent; as a result, despite being extremely mechanistic in flavour, some of this work may have been somewhat overlooked e.g. Sussillo and Barak (2013).

This is a nitpick, and I don't think any of your main points rests on this, but I think the main reason this work was not used in any type of artificial neural network interp work at that time was that it is fundamentally only applicable to recurrent systems, and probably impossible to apply to e.g. standard convolutional networks. It's not even straightforward to apply to a lot of the types of recurrent systems used in AI today (to the extent they are even used), but probably one could push on that a bit with some effort.

Yes this is fair. These are still fairly deep neural networks, though (if we count time as depth), and they're examples of work that interprets ANNs on the lowest level using low-level analysis of weights and activations using e.g. dimensionality reduction and other methods mech interp folks might find familiar. But I agree it doesn't usually get put in the bucket of 'mech interp' though ultimately the boundary is fairly arbitrary. As a separate point, it is surprising how little of the neuroscience community has actually jumped onto mechanistically understanding more interesting models like inception v2 or LLMs despite the similarity of methods and object of study, which is a testament to the early mech interp pioneers since they saw a field where few others did.

As a final question, I am wondering what you think the implications for what people should be doing are if mech interp is or is not pre-paradigmatic? Is there a difference between mech interp being in a not-so-great-paradigm vs. pre-paradigmatic in terms of what your median researcher should be thinking/doing/spending time on? Or is this just an intellectually interesting thing to think about. I am guessing that when a lot of people say that mech interp is pre paradigmatic they really mean something closer to "mech interp doesn't have a useful/good/perfect paradigm right now". But I'm also not sure if there's anything here beyond semantics

I'm not actually sure if this is very action-relevant. I think in the past I might have said "mech interp practitioners should be more familiar with computational neuroscience/connectionism", since I think this might have save the mech interp community some time. But I don't think it would have saved a huge amount of time, and I think the mech interp has largely surpassed comp neuro as a source of interesting and relevant ideas. I think it's mostly useful as an exercise in situating mech interp ideas within the broader set of ideas of an eminently related field (comp neuro/connectionism). But I'll stress that many in the field see mech interp as better contextualized by other sets of broader ideas (e.g. as a subfield of interpretability/ML), and when viewing mech interp in light of those ideas, it might better be thought of as pre-paradigmatic. I think that's a completely compatible but different perspective from the one I tend to take, and just emphasizes the subjectiveness of the whole question of whether the field is paradigmatic or not.

[-]Adam Shai6mo40

This all sounds very reasonable to me! Thanks for the response. I agree that we are likely quite aligned about a lot of these issues.

[-]TsviBT6moΩ36-2

So, as a field, we don't have to be happy with the dominant paradigm. But just because we're not happy with it doesn't mean it's not there.

Um, ok fine, so what alternative term do you propose to replace "pre-paradigmatic" as it is currently used, to indicate that there's no remotely satisfactory paradigm in which to get going on the parts of the field-to-be that really matter?

[-]Daniel Tan6mo53

Seems pretty straightforward to say “mech interp lacks good paradigms” (actually 1 syllable shorter than “mech interp is pre-paradigmatic”!)

See also my previous writing on this topic: https://www.lesswrong.com/posts/3CZF3x8FX9rv65Brp/mech-interp-lacks-good-paradigms

[-]TsviBT6mo22

No we need actual words for concepts. It's important to have specifically words.

[-]shawnghu6mo10

What's the distinction between what you're pointing at and the statement that mech interp lacks good paradigms? I think the latter statement is true and descriptive, but I presume you want to say something else.

[-]TsviBT6mo31

It's the same statement, plus an additional set of implications that come from reification. I want to say "mech interp (and AGI alignment and other things) is pre-good-relevant-paradigm". (Which people have been expressing as "pre-paradigm".) I want to say

There's this category of areas called "pre-good-relevant-paradigm".
Mech interp is one of those.
This category has a bunch of features in common that make them a cluster / a Thing.
This Thing has implications, e.g. about how to orient to research and fieldbuilding for that area.

This is much more easily done with a word for an adjective-like concept. It plants a flag and asserts its Thinghood. People can talk and coordinate about it. Words are good.

[-]shawnghu6mo30

I don't disagree in general with the claim that words can be useful for coordinating about natural ideas. The thing that's missing here is my understanding that there's a particular natural idea here that isn't captured by "mech interp lacks good paradigms".

Is anything which lacks a good+relevant paradigm by default "pre-good-relevant-paradigm", or is there more subtlety to the idea?

[-]TsviBT6mo40

E.g. "...and it seems like there could exist good paradigms for this area, and we probably want good paradigms for this area, and our current work in this area ought to be shaped by the fact that we're pre-paradigm, and...."

[-]Paul Bogdan4mo10

Cool post. I have a neuro background, and I'm sometimes asked about "Is neuro actually informative for mech interp," so I'm interested in this point about CNC being the current paradigm. I have a few thoughts:

Are the paradigmatic ideas of mech interp from neuroscience?

You mention some examples of paradigmatic ideas:

The idea that networks "represent" things;
That these "representations" or computations can be distributed across multiple neurons or multiple parts of the network;
That these representations can be superposed on top of one another in a linear fashion, as in the 'linear representation hypothesis' (e.g. Smolensky, 1990);
That representations can form representational hierarchies, thus representing more abstract concepts on top of less abstract ones, such as the visual representational hierarchy.

These ideas are all from back in the 1960-1990s? My impression was that back then the different cognitive sciences, like neuroscience and AI were more mixed up. For example, Geoffrey Hinton worked in a psychology department briefly, and many of the big names of this age were "cognitive scientists." So in that sense, it's a reach to really call these neuroscience ideas?

That being said, there's another point that comes to mind that you didn't mentioned but I think can be more firmly called neuroscience: Neural networks organize themselves to efficiently encode information (https://en.wikipedia.org/wiki/Efficient_coding_hypothesis).

My impression is that CS departments mostly set aside the above theoretical ideas until the last five years, whereas neuroscience departments kept thinking about them. Additionally, although something like AlexNet used superposition and had polysemantic neurons, those weren't discussed until the last 2010s. Because neuroscience kept thinking about them whereas CS departments didn't, maybe it is fair to call them neuroscientific. However, I'm not sure how many theoretical advancements in computational neuroscience from 1990-2020 actually contributed to modern mech interp? Which would be an argument against calling them neuroscience.

Are neuroscientific methods used in mech interp?

You give some examples of methods:

Max-activating dataset examples are basically what Hubel and Wiesel (1959) (and many researchers since) used to demonstrate the functional specialisation of particular neurons.
Causal interventions, commonly used in mech interp, are the principle behind many neuroscience methods, including thermal ablation (burning parts of the brain), cooling (thus ‘turning off’ parts of the brain), optogenetic stimulation, and so on.
Key data analysis methods, such as dimensionality reduction or sparse coding, that are used extensively in computational neuroscience (and sometimes directly developed for it) are also used extensively in mech interp.

These examples are mentioned a lot when discussing neuroscience and mech interp. However, some of these parallels feel a bit more surface-level than they might first appear, and one might be able to claim a parallel between mech interp and any scientific field. For instance, ablating neurons in LLMs and the brain is very common. However, upregulating/downregulating something is the most basic type of experimental manipulation and is used by basically every scientific field. Maybe when comparing mech interp and neuroscience, it's generally worth stopping to ask: is the biggest similarity that the two are both working on neurons. If this part is set aside and you abstract a bit, can you make the same parallel to virtually every other scientific field?

Some other parallels between mech interp and neuro, however, are more niche and seemingly compelling. For example, I like the use of dimensionality reduction to visualize and search for cycles in activation space.

Where would mech interp be today if computational neuroscience never really existed. Would mech interp have come at the exact same methods? Something like ablation or upregulation I think undoubtedly. Maybe the tendency to use dimensionality reduction for visualization a bit less so (or something similar would have been developed but slightly different). It seems hard to make clear claims about where mech interp would be today if computational neuroscience didn't exist.

Are neuroscientific and mech interp findings similar?

You give an example:

And in many cases, the standards of what constitute a legitimate contribution to the field are the same. In both, for instance, a legitimate contribution might include a demonstration that a neuron (whether in a brain or an artificial neural network) appears be involved in an interesting representation or computation, such as the ‘Jennifer Anniston’ neuron (Quiroga et al. 2005) or the ‘Donald Trump’ neuron (Goh et al. 2021).

This is an interesting point that I haven't seen before. I think this is pretty fair and maybe a unique parallel, but it would be more correct to say that a legitimate contribution is that the brain/LLM performs some function using some specific interesting computation: e.g., hippocampal neurons often represent spatial information in terms of the association between distinct items (e.g., my monitor is above my desk), whereas LLMs do addition with possibly unintuitive circuits/mechanisms? However, when framed as so and distanced a bit from neurons, can we make a similar parallel to any scientific field? Don't most scientific endeavors try to decompose functions into more precise functions?

In the end, if you want to take any scientific field in the world, call it the existing paradigm for mech interp, and decide that paradigm can't be ML, then I can't imagine anything better than computational neuroscience... and this is a clear argument, but this seems like a low bar.

[-]idly5mo10

Explainable AI and interpretable ML research and methods, aside from the researchers affiliated with the rationalist scene, are for some reason excluded from the narrative. Is it really your view that 'mechanistic interpretability' is so different that it is an entirely different field? Doesn't it seem a bit questionable that the term 'mechanistic interpretability' was coined in order to distance Olah's research from other explanation approaches that had been found to have fundamental weaknesses - especially when mechanistic interpretability methods repeatedly fall prey to the exact same points of failure? The failure of SDL latents was unsurprising, the fact that it took such a long time for someone to call attention to it should have provoked much more of a discussion on how science is done in this community.

I agree with the similarities to neuroscience, and there is definitely much to learn from that field, but it would be an even easier step to just read a little more widely on interpretable/explainable machine learning and causal discovery, in which there is a wide body of literature discussing the very issues you mention and more. Why is research done outside of the self-labelled 'mechanistic interpretability' community mostly ignored? In neuroscience, if you prefer though, perhaps Jonas & Kording 2017 is relevant: Could a Neuroscientist Understand a Microprocessor? | PLOS Computational Biology https://share.google/WYGmCXAnX8FNbaRqi

[-]Matt Levinson6mo10

Love the way you laid things out here! Lots to discuss, but I'll focus on one specific question. We've communicated privately so you know I'm very bullish on PD as a potential new paradigm. Don't take the below as general skepticism!

The requirement that the parameter components sum to the original parameters also means that there can be no ‘missing mechanisms’. At worst, there can only be ‘parameter components which aren’t optimally minimal or simple’.

I don't understand this claim, except perhaps in a trivial sense which I'm assuming you don't mean. My confusion stems from my intuition that we don't have a good reason or evidence to assume that the model never needs the full magnitude of a particular parameter in two different mechanisms that do different things that are not related functionally, semantically, nor in any but an ignorable sense geometrically.

Is the statement above implicitly conditional on the assumptions of PD, one of which is that my intuition is false? If not, then in the worst case it seems to me the only other option is that the block quote is only trivially true in the sense that if there are many overlapping mechanisms where the full magnitude of a subset of the overlapping parameters is needed, then the PD degenerates to a single mechanism that is the original model. Would be interested to hear your thoughts!

[-]shawnghu6mo10

So Parameter Decomposition in theory suggests solutions to the anomalies of Second-Wave Mech Interp. But a theory doesn’t make a paradigm.

Nitpicking a little bit here, I think this is a different use of the word "theory" than the use in the phrase "scientific theory". One could think you mean the latter in its second usage here, but it seems like you're making a claim more like "these things could make progress explaining some of these things, if the experiments go well".

> The requirement that the parameter components sum to the original parameters also means that there can be no ‘missing mechanisms’. At worst, there can only be ‘parameter components which aren’t optimally minimal or simple’.

Echoing a part of Adam Shai's comment, I don't see how this is different from the feature-based case. Won't there be a problem if you extract a bunch of parameter components you "can explain", and then you're left with a big one you "can't explain", which "isn't optimally minimal or simple"?

> Another attractive property of Parameter Decomposition is that it identifies Minimum Description Length as the optimization criterion for our explanations of neural networks

Why is this an attractive property? (Serious question.)

[-]A2z6mo-10

To add to this historical retrospective on interpretability methods: Alternatively, we can use a parameter decomposition of a bottleneck ("exemplar") layer over a model with non-identifiable parameters (e.g., LLMs) to make a semi-supervised connection to the observed data, conditional on the output prediction, re-casting the prediction as a function over the training set's labels and representation-space via a metric-learner approximation. How do we know that the matched exemplars are actually relevant, or equivalently, that the approximation is faithful to the original model? One simple (but meaningful) metric is whether the prediction of the metric-learner approximation matches the class of the prediction of the original model, and if they do not, the discrepancies should be concentrated in low probability regions. Remarkably, relatively simple functions over the representation space and labels achieve that property. This line of work was introduced in the following paper, which appeared in the journal Computational Linguistics and was presented at EMNLP 2021: "Detecting Local Insights from Global Labels: Supervised & Zero-Shot Sequence Labeling via a Convolutional Decomposition" https://doi.org/10.1162/coli_a_00416. When projection to a lower resolution than the available labels is desired for data analysis (e.g., from the document level to the word level for error detection), we then have a straightforward means of defining the inductive bias via the loss (e.g., encouraging one feature detection per document, or multi-label detections, as shown in subsequent work). In other words, we can decompose an LLM's document-level prediction to word-level feature detections, and then map to the labeled data in the support set. This works with models of arbitrary scale; the dense matching via the distilled representations from the filter applications of the exemplar layer requires computation on the order of commonly used dense retrieval mechanisms, and the other transforms require minimal additional overhead.

The key advantage relative to alternative interpretability methods (across the paradigms presented here) is it then leads to methods with which we can close the loop on the connection between the data, the representation space, [the features,] the predictions, and the predictive uncertainty. See my works from 2022 and more recently for further details (e.g., https://arxiv.org/abs/2502.20167), including incorporating uncertainty-aware verification and interpretability-by-exemplar as intrinsic properties of the LLMs, themselves (i.e., for both train- and test-time search/compute/generation), rather than as a post-hoc approximation.

^{^}

Arguably, first wave mech interp starts even earlier than this, with Hinton et al. (1986) "Learning representations by back-propagating errors", where they studied the weights of the first layer of a network. But it's typical to constrain the object of study of mech interp to be deep networks, thus excluding work prior to ca. 2012, which is fine and understandable though somewhat arbitrary.

^{^}

There is no date on the blog post, but see this commit history for the date of when it was posted.

^{^}

For example: Is a feature 'present' if it is linearly readable from any activation space? What if it's nonlinearly readable? Is the representation 'present' in that case?

^{^}

Or, if activations are not orthogonal to particular directions in parameter space, we should still be able to ablate them if they lead sub-threshold pre-activations, which have no downstream causal effect.

^{^}

A new paper that makes some progress on the issues of robustness and scalability should be coming soon!

^{^}

LESSWRONG
LW

LESSWRONG
LW

209

Mech interp is not pre-paradigmatic

209

Ω 79

209

Ω 79

Preamble: Kuhn, paradigms, and paradigm shifts

Claim: Mech Interp is Not Pre-paradigmatic

First-Wave Mech Interp (ca. 2012 - 2021)

The Crisis in First-Wave Mech Interp

Second-Wave Mech Interp (ca. 2022 - ??)

Anomalies in Second-Wave Mech Interp

The Crisis of Second-Wave Mech Interp (ca. 2025 - ??)

Toward 'Third-Wave' Mechanistic Interpretability

The Basics of Parameter Decomposition

Parameter Decomposition Questions Foundational Assumptions of Second-Wave Mech Interp

Parameter Decomposition In Theory Resolves Anomalies of Second-Wave Mech Interp

Conclusion