This is a linkpost for https://www.youtube.com/watch?v=8Pjj_SNIyN4
This post is a link to a video I just made in response to the new work coming out of Anthropic described here and here and discussed on LessWrong here.
In the video I try to puzzle through how best to think about the MLP layers of a transformer in the same spirit as Anthropic is thinking through the self-attention layers.
Well, goodness, it's really impressive (and touching) that someone absorbed the content of our paper and made a video with thoughts building on it so quickly! It took me a lot longer to understand these ideas.
I'm trying to not work over the holidays, so I'll restrict myself to a few very quick remarks:
There's a bunch of stuff buried in the paper's appendix which you might find interesting, especially the "additional intuition" notes on MLP layers, convolutional-like structure, and bottleneck activations. A lot of it is quite closely related to the things you talked about in your video.
You might be interested in work in the original circuits thread, which focused on reverse engineering convolutional networks with ReLU neurons. Curve Detectors and Curve Circuits are a deep treatment of one case and might shed light on some of the ideas you were thinking about. (For example, you discussed what we call "dataset examples" for a bit.)
LayerNorm in transformers is slightly different from what you describe. There are no interactions between tokens: each token's activation vector is normalized on its own. This is actually the reason LayerNorm is preferred: in autoregressive transformers, one needs to be paranoid about avoiding information leakage from future tokens, and normalization across tokens becomes very complicated as a result, leading to a preference for normalization approaches that are per-token. In any case, there are some minor flow-through effects from this to other things you say.
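As a minimal sketch of the per-token point (not code from the paper or the video), the snippet below normalizes each position's feature vector independently and checks that changing a later token leaves earlier positions untouched. The toy sequences and the function names are made up for illustration, and the learned scale and bias that real LayerNorm layers carry are omitted.

```python
import math

def layer_norm(token, eps=1e-5):
    """Normalize ONE token's feature vector to zero mean and (roughly) unit variance.

    Assumption: learned gain/bias parameters are left out for brevity.
    """
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]

def layer_norm_sequence(tokens):
    """Apply LayerNorm independently at every position; no statistics are shared across tokens."""
    return [layer_norm(t) for t in tokens]

# Hypothetical toy residual stream: 3 positions, 4 features each.
seq_a = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 3.5, -2.0, 1.0]]
# Same sequence, except the last ("future") token is replaced.
seq_b = [seq_a[0], seq_a[1], [100.0, -50.0, 7.0, 0.0]]

out_a = layer_norm_sequence(seq_a)
out_b = layer_norm_sequence(seq_b)

# Earlier positions do not depend on the later token, so nothing leaks backwards.
earlier_positions_unchanged = out_a[:2] == out_b[:2]
last_position_changed = out_a[2] != out_b[2]
```

A normalization that pooled statistics across positions would fail the first check unless it were carefully masked, which is the complication the remark above alludes to.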
Most transformers prefer GeLU neurons to ReLU neurons.
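For readers who have not compared the two activations, here is a small sketch using the exact GeLU definition, x times the standard normal CDF. Some implementations use a tanh-based approximation instead, and which variant a given model uses is not stated here. The sample inputs are arbitrary.

```python
import math

def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def gelu(x):
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Arbitrary sample points, chosen only to show the shapes side by side.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

The two agree closely for large positive inputs and both go to zero for large negative inputs, but GeLU is smooth and dips slightly below zero for small negative inputs, whereas ReLU is exactly zero there.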
In general, I'd recommend pushing linearization back until you hit a privileged basis (either a previous MLP layer or the input tokens) rather than the residual stream. My guess is that's the most interpretable formulation of things. It turns out you can always do this.
I think there's another important idea that you're getting close to and I wanted to remark on:
Thanks! Enjoy your holidays!