Pre-registering a prediction for experiment results testing why, when, and how attention heads share information. We (Simplex) will train transformers on data generated from the cartesian product of sequences generated from two independent Mess3 processes.

If the degenerate eigenvalue of each process is the same and positive (e.g. lambda = 0.7 for both), then a transformer with a single attention head will learn it, and the attention patterns will be 0.7^(d-s), where d-s is the number of context positions between the source and destination. If instead one process has lambda = 0.7 and the other lambda = 0.3, then the information will separate between two heads, one with attention patterns of 0.7^(d-s) and the other with 0.3^(d-s). If instead we have one process with lambda = 0.7 and the other with lambda = -0.3, then this will require 3 heads: one with 0.7^(d-s), and the 0.3^(d-s) pattern split between two heads (even and odd d-s) to account for the fact that attention has to be positive. BUT if we have one process with lambda = 0.7 and the other with lambda = -0.7, then this will require only 2 heads! The 0.7^(d-s) pattern can be shared on one head, serving the lambda = 0.7 process and the -0.7 process when d-s is even, and the odd d-s cases will be on the other head.
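For concreteness, here is a minimal sketch of the predicted (unnormalized) per-head attention patterns; the function name, sequence length, and printout are illustrative choices of mine, not part of the pre-registration itself:

```python
import numpy as np

def predicted_head_patterns(lam: float, seq_len: int = 8):
    """Predicted (unnormalized) attention from the last position back to each
    earlier position, for a process with degenerate eigenvalue `lam`.

    Returns one pattern per head: a single head if lam > 0, and two heads
    (even and odd offsets d-s) if lam < 0, since attention weights must be
    non-negative.
    """
    offsets = np.arange(1, seq_len)      # d - s for each earlier source position
    decay = np.abs(lam) ** offsets       # |lam|^(d-s)
    if lam >= 0:
        return [decay]
    # Negative eigenvalue: (-|lam|)^(d-s) alternates sign, so the even and odd
    # offsets have to live on separate (non-negative) heads.
    even = np.where(offsets % 2 == 0, decay, 0.0)
    odd = np.where(offsets % 2 == 1, decay, 0.0)
    return [even, odd]

# lam = 0.7 and lam = -0.7 share the same |lam|^(d-s) decay, which is why two
# heads should suffice for that pair: the even-offset head can be shared.
for lam in (0.7, -0.3, -0.7):
    print(lam, [p.round(3) for p in predicted_head_patterns(lam)])
```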
This looks great! Thanks for making the videos public.
Any chance the homework assignments/experiments can be made public?
A fun side note that probably isn't useful: I think if you shuffle the data across neurons (which effectively gets rid of the covariance among neurons) and then do linear regression, you will get theta_t.
This is a somewhat common technique in neuroscience analysis when studying correlation structure in neural reps and separability.
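Concretely, the version of the shuffle I have in mind is the standard neuroscience one: permute trials independently for each neuron within each condition, which preserves each neuron's tuning but destroys the covariance between neurons. A rough sketch (variable names are placeholders; `X` is an (n_samples, n_neurons) activation matrix, `cond` a per-sample condition label, and `Y` the regression target):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def shuffle_across_neurons(X: np.ndarray, cond: np.ndarray) -> np.ndarray:
    """Permute samples independently for each neuron, within each condition.

    Each neuron keeps its own relationship to the condition, but the
    sample-by-sample covariance *between* neurons is destroyed.
    """
    X_shuf = X.copy()
    for c in np.unique(cond):
        idx = np.flatnonzero(cond == c)
        for j in range(X.shape[1]):
            X_shuf[idx, j] = X[rng.permutation(idx), j]
    return X_shuf

def regress_after_shuffle(X, cond, Y):
    # Linear regression on the decorrelated activations; the claim above is
    # that the resulting read-out recovers theta_t.
    return LinearRegression().fit(shuffle_across_neurons(X, cond), Y)
```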
Somewhat related, in our recent preprint we showed how Bayesian updates work over quantum generators of stochastic processes. It's a different setup than the one you show here, but does give a generalization of Bayes to the quantum and even post-quantum setting. We also show that the quantum (and post-quantum) Bayesian belief states are what transformers and other neural nets learn to represent during pre-training. This happens because to predict the future of a sequence optimally given some history of that sequence, the best you can do is to perform Bayes on the hidden latent states of the (sometimes quantum) generator of the data.
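(For the classical case, the Bayes I mean here is just filtering over the generator's hidden states using its labeled transition matrices; a minimal sketch of that update, not the quantum construction from the preprint:)

```python
import numpy as np

def belief_update(belief, Tx):
    """One Bayesian update over the generator's hidden states.

    belief: (n_states,) current belief P(hidden state | history)
    Tx:     (n_states, n_states) labeled transition matrix for the symbol
            just observed, Tx[i, j] = P(emit x and move to j | state i)
    """
    new = belief @ Tx
    return new / new.sum()

def beliefs_over_sequence(seq, T_by_symbol, prior):
    """Belief state after each symbol of `seq`; these are the states an
    optimal predictor of the process has to track."""
    beliefs, b = [], prior
    for x in seq:
        b = belief_update(b, T_by_symbol[x])
        beliefs.append(b)
    return np.array(beliefs)
```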
Beautiful post, thank you!
In some sense I agree but I do think it's more nuanced than that in practice. Once you add in cross-entropy loss on next token prediction, alongside causal masking, you really do get a strong sense of "operating on sequences". This is because next token prediction is fundamentally sequential in nature in that the entire task is to make use of the correlational structure of sequences of data in order to predict future sequences.
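Concretely, the objective I'm pointing at looks something like this (a generic sketch, assuming `model` applies a causal mask internally and returns per-position logits):

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Cross-entropy next-token objective: position t is trained to predict
    token t+1, and the causal mask (inside `model`) ensures it only sees
    tokens up to t.
    """
    logits = model(tokens[:, :-1])        # (batch, seq_len-1, vocab)
    targets = tokens[:, 1:]               # target at position t is token t+1
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```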
There's actually a market that's even more directly relevant!
This all sounds very reasonable to me! Thanks for the response. I agree that we are likely quite aligned about a lot of these issues.
Thanks for writing out your thoughts on this! I agree with a lot of the motivations and big picture thinking outlined in this post. I have a number of disagreements as well, and some questions:
It's unfortunate that mech interp inherits the CNC paradigm, because despite many years of research, turns out it's really hard to do computational science on brains, so computational neuroscience hasn't made a huge amount of progress.
I strongly agree with this, and I hope more people in mech interp become aware of it. I would actually emphasize that, in my opinion, it's not just that it's hard to do computational science on brains, but that we don't have the right framework. Some weak evidence for this is exactly that we have had an intelligent system for a few years now where experiments and analyses are easy to do, and we can see how far we've gotten with the CNC approach.
My main point of confusion with this post has to do with Parameter Decomposition as a new paradigm. I haven't thought about this technique much, but on a first reading it doesn't sound all that different from what you call the second-wave paradigm, just with activations replaced by parameters. For instance, I think I could take most of the last few sections of this post and rewrite them to make the point. Just for fun I'll try that out here, arguing for a new paradigm called "Activation Decomposition" (just to be super clear, I don't think this is actually a new paradigm!).
You wrote:
Parameter Decomposition makes some different foundational assumptions than used by the Second-Wave. One of these assumptions arises because Parameter Decomposition offers a clear definition of 'feature' that hitherto eluded Second-Wave Mech Interp. What Second-Wave Mech Interp calls a 'feature' can be defined as ‘properties of the input that activate particular mechanisms’. Notably, 'features' are here defined with reference to mechanisms, which is great, because 'mechanisms' has a specific formal definition!
This re-definition implies a difference in one of the foundational assumptions of Second-Wave Mech Interp that ‘features are the fundamental unit of neural network’. Parameter Decomposition rejects this idea and contends that ‘mechanisms are the fundamental unit of neural networks’.
and I'll rewrite that here, putting my changes in bold:
**Activation** Decomposition makes some different foundational assumptions than used by the Second-Wave. One of these assumptions arises because **Activation** Decomposition offers a clear definition of 'feature' that hitherto eluded Second-Wave Mech Interp. What Second-Wave Mech Interp calls a 'feature' can be defined as ‘properties of the input that activate particular mechanisms’. Notably, 'features' are here **[in Activation Decomposition]** defined with reference to mechanisms **[which are circuits of linearly decomposed activations]**, which is great, because 'mechanisms' has a specific formal definition!

This re-definition implies a difference in one of the foundational assumptions of Second-Wave Mech Interp that ‘features are the fundamental unit of neural network’. **Activation** Decomposition rejects this idea and contends that ‘mechanisms are the fundamental unit of neural networks’.
Perhaps a simpler way to say my thought is, isn't the current paradigm largely decomposing activations? If that's the case why is decomposing parameters so fundamentally different?
I think one thing that might be going on here is that people have been quite sloppy with words like feature, representation, computation, circuit, etc. (though I think it's totally excusable, and arguably even a good idea, to be sloppy about these particular things given the current state of our understanding!). Like, I think when someone writes "features are the fundamental unit of neural networks" they often mean something closer to "representations are the fundamental unit of neural networks", or maybe something closer to "SAE latents are the fundamental unit of neural networks", with an important implicit "and representations are only really representations if they are mechanistically relevant." Which is why you see interventions of various types in current-paradigm mech interp papers.
Some work in this early period was quite neuroscience-adjacent; as a result, despite being extremely mechanistic in flavour, some of this work may have been somewhat overlooked e.g. Sussillo and Barak (2013).
This is a nitpick, and I don't think any of your main points rests on this, but I think the main reason this work was not used in any type of artificial neural network interp work at that time was that it is fundamentally only applicable to recurrent systems, and probably impossible to apply to e.g. standard convolutional networks. It's not even straightforward to apply to a lot of the types of recurrent systems used in AI today (to the extent they are even used), but probably one could push on that a bit with some effort.
As a final question, I am wondering what you think the implications are for what people should be doing, depending on whether mech interp is or is not pre-paradigmatic? Is there a difference between mech interp being in a not-so-great paradigm vs. being pre-paradigmatic, in terms of what your median researcher should be thinking/doing/spending time on? Or is this just an intellectually interesting thing to think about? I am guessing that when a lot of people say that mech interp is pre-paradigmatic, they really mean something closer to "mech interp doesn't have a useful/good/perfect paradigm right now". But I'm also not sure if there's anything here beyond semantics.
This is a self review. It's been about 600 days since this was posted, and I'm still happy with and proud of this post. In terms of what I view as the important message to the readership, the main thing is introducing a framework and way of thinking that connects the pretty fuzzy notion of a "world model" to the concrete internal structure of neural networks. It does this in a way that is both theoretically clear and amenable to experiments. It provides a way to think about representations in transformers in a general sense, one that is quite different from the normal tactic taken in interpretability. One interesting thing, given how interp has progressed since then, is that it provides a way to think about geometric structures in the activations of these networks, which is something the field seems to be moving towards (albeit not guided by theory the way this work is).
I spent a lot of time and effort on making the presentation easy to follow while still containing the important lessons, and I feel this post does a great job at that. (In fact, others around me were telling me I was taking too long and that I just needed to hit submit - I'm glad I didn't follow that advice :)
This work has been expanded on in a number of directions (a recent post summarizes three of those directions). We now have Simplex, an entire org that would likely not exist in the way that it does without this post (we got seed funding right before this post, but the ability to fundraise and attract talent was highly dependent on it)! We are ~10 people and expanding. Starting this org, watching it grow, and being part of building up a new way to think about and do interpretability has been the greatest intellectual achievement of my life, and again, I do not think this would have occurred (at least not the way it did) without this post.
Some of the things we are working on include extending the framework to more complicated (factorizable) generative structures, which gives rise to both a very nice story explaining how neural nets can cram so much into their activations and the corresponding geometric relationship to computational structure. We also have a team dedicated to red-teaming the theory and its application to experiments in a number of ways, one of which is more precisely testing the distinction between representing the next token vs. far-future representations. We are also looking for the types of geometric structures our theory predicts in LLMs, and building tools to do unsupervised discovery of these geometric structures in LLMs.
In retrospect there are a few things I would have emphasized or done differently. The most concrete one would have been to include experiments on RRXOR in this writeup, which more directly test/show representations that have information beyond the next token, and that are predicted by the theory.
Another regret is that I haven't done a good job of keeping up with the public-facing side of our work since this post. We made one post about our work since then, and it really does leave out a lot of the insights we've developed internally. This is not great, since I do believe that we have refined our way of thinking about representations and computations in these systems in a way that the interp community needs.
More personally (and perhaps this isn't relevant for a review since it really is just about me), this post came after switching fields from neuroscience, where I was for a decade, to interpretability. There was a lot of uncertainty and stress about that move, and quite a bit of wondering if I could actually contribute. So it was a big deal for me to see the community react positively to this work. Additionally, I had been interested in computational mechanics for a decade from within neuroscience, since it felt like it could say something about computation in biological neural nets, but I could never figure out how to get it to do anything useful. So it's also a privilege to see the dream play out and be realized in neural nets, and to start a research org dedicated to that vision!
The start of the research program presented here is in a real sense a bet, in that it could fail to be useful at scale for a few different reasons. But the fact that we are building up with both theory and experiment in a way where each informs the other, and that we were able to show the community a small part of what this looks like, is a lesson I hope the readership takes, even beyond the specific content of this post. (Hot take) I think a lot of great work happens within this community, but too often the more theoretical and the more empirical approaches don't talk to each other as much as they should.