This is a linkpost for https://aizi.substack.com/p/no-really-attention-is-all-you-need?sd=pf

No Really, Attention is ALL You Need - Attention can do feedforward networks

5Neel Nanda

1Robert_AIZI

1astralbrane

1Robert_AIZI

3gwern

3Robert_AIZI

6gwern

Cute construction! To check, am I correct that you're adding an attention head *per* neuron? To me that makes this prohibitive enough to not actually be useful for real models - eg, in GPT-2 Small that'd take you from 12 heads per layer to about 3,000 per layer.

That's right, the activation function sublayer needs 1 attention head per neuron. The other sublayers can get away with fewer - the attention sublayer needs the usual amount, and the linear transformation sublayer just needs enough to spread the rank of the weight matrix across the V matrices of the attention head. I'm most familiar with the size hyperparameters of GPT-3 (Table 2.1), but in full-size GPT-3, for each sublayer:

- heads for the attention sublayer

- heads for the weight matrix calculating into the hidden layer

- heads for the activation function

- heads for the weight matrix calculating out of the hidden layer

Aren't attention networks and MLPs both subsets of feedforward networks already? What you really mean is "Attention can implement fully-connected MLPs"?

Calling fully-connected MLPs "feedforward networks" is common (e.g. in the original transformer paper https://arxiv.org/pdf/1706.03762.pdf), so I tried to use that language here for the sake of the transformer-background people. But yes, I think "Attention can implement fully-connected MLPs" is a correct and arguably more accurate way to describe this.

Given the general contempt that MLPs are held in at present, and the extent to which people seem to regard self-attention as magic pixie dust which cannot be replicated by alternatives like CNNs or MLPs and which makes Transformers qualitatively different from anything before & solely responsible for the past ~4 years of DL progress (earlier discussion defending MLP prospects), ~~it might be more useful to emphasize the other direction: if you can convert any self-attention to an equivalent fully-connected MLP, then that can be described as ~~*"there is a fully-connected MLP that implements your self-attention"*. (Incidentally, maybe I missed this in the writeup, but this post is only providing an injective self-attention → MLP construction, right? Not the other way around, so converting an arbitrary MLP layer to a self-attention layer is presumably doable - at least with enough parameters - but remains unknown.)

Unfortunate that the construction is so inefficient: 12 heads → 3,000 heads or 250x inflation is big enough to be practically irrelevant (maybe theoretically too). I wonder if you can tighten that to something much more relevant? ~~My intuition is that MLPs are such powerful function approximators that you should be able to convert between much more similar-sized nets (and maybe smaller MLPs).~~

In either direction - perhaps you could just directly empirically approximate an exchange rate by training MLPs of various sizes to distill a self-attention layer? Given the sloppiness in attention patterns, it wouldn't necessarily have to be all that accurate. And you could do this for each layer to de-attend a NN, which ought to have nice performance characteristics in addition to being a PoC.

(My prediction would be that the parameter-optimal MLP equivalent would have a width vs depth scaling law such that increasing large Transformer heads would be approximated by increasingly skinny deep MLP stacks, to allow switching/mixing by depth. And that you could probably come up with an initialization for the MLPs which makes them start off with self-attention-like activity, like you can come up with Transformer initializations that mimic CNN inductive priors. Then you could just drop the distillation entirely and create a MLPized Transformer from scratch.)

> Incidentally, maybe I missed this in the writeup, but this post is only providing an injective self-attention → MLP construction, right?

Either I'm misunderstanding you or you're misunderstanding me, but I think I've shown the opposite: any MLP layer can be converted to a self-attention layer. (Well, in this post I actually show how to convert the MLP layer to 3 self-attention layers, but in my follow-up I show how you can get it in one.) I don't claim that you can do a self-attention → MLP construction.

> Converting an arbitrary MLP layer to a self-attention layer is presumably doable - at least with enough parameters - but remains unknown

This is what I think I show here! Let the unknown be known!

> Unfortunate that the construction is so inefficient: 12 heads → 3,000 heads or 250x inflation is big enough to be practically irrelevant (maybe theoretically too).

Yes, this is definitely at an "interesting trivia" level of efficiency. Unfortunately, the construction is built around using 1 attention head per hidden dimension, so I don't see any obvious way to improve the number of heads. The only angle I have for this to be useful at current scale is that Anthropic (paraphrased) said "oh we can do interpretability on attention heads but not MLPs", so the conversion of the latter into the former might supplement their techniques.

Yes, you're right. My bad; I was skimming in a hurry before heading out while focused on my own hobbyhorse of 'how to make MLPs beat Transformers?'. Knew I was missing something, so glad I checked. Now that you put it that way, the intuition is a lot clearer, and shrinking it seems a lot harder: one head per hidden dim/neuron is a straightforward construction but also unclear how much you could be guaranteed to shrink it by trying to merge heads...

The empirical approach, in both directions, might be the best bet here, and has the advantage of being the sort of thing that someone junior could get interesting results on quickly with minimal hardware.

[Epistemic status: Mathematically proven, and I have running code that implements it.]

Overview: A transformer consists of two alternating sublayers: attention heads and feedforward networks (FFNs, also called MLPs). In this post I’ll show how you can implement the latter using the former, and how you can convert an existing transformer with FFNs into an attention-only transformer.

My hope is that such a conversion technique can augment mechanistic interpretability tools such as the ones described in A Mathematical Framework for Transformer Circuits, by reducing the task of interpretability from “interpret attention and FFNs” to just “interpret attention”. That publication specifically points out that “more complete understanding [of Transformers] will require progress on MLP layers”, which I hope this technique can supply.

Limitations:

- SiLU and ReLU, but I haven’t found an easy way to produce GeLU, which is what GPT uses.

Notation

Fix a transformer $T$ (such as GPT-3) which uses attention and feedforward networks. Write $D = d_{\text{model}}$ for the internal dimension of the model, $N = n_{\text{ctx}}$ for the number of vectors in the context, and $X$ for the “residual stream”, the $N$-by-$D$ matrix storing the internal state of the model during a forward pass.

We will assume that the feedforward networks in $T$ consist of an MLP with one hidden layer of width $d_{\text{ff}} = 4 d_{\text{model}}$, using an activation function $\alpha(x) = \text{SiLU}(x) = x\sigma(x)$ [1]. To simplify notation, we will assume that bias terms are built into the weight matrices $W_1$ and $W_2$, which are respectively of sizes $D$-by-$4D$ and $4D$-by-$D$, so that the output of the feedforward network is $\alpha(XW_1)W_2$, where $\alpha$ is applied to the matrix entry-wise.

We’ll follow this notation for attention heads, so that an attention head is characterized by its query-key matrix $Q = W_{QK}$ and its output-value matrix $V = W_{OV}$, each of size $D$-by-$D$ [2]. To simplify notation, we will assume that the “$/\sqrt{d_k}$” step of attention has been folded into the $Q$ matrix. Then the output of the attention head is $\text{softmax}[XQX^T]XV$, where the softmax operation is applied row-wise.

We assume that both the feedforward network and attention heads make use of skip connections, so that their output is added to the original residual stream. However, we ignore layer normalization.
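As a reading aid, the two sublayers in this notation can be written out in a few lines of NumPy. This is only a sketch of the formulas above, not the post's own code: the function names, the random test matrices, and the small sizes are illustrative assumptions, and layer normalization is ignored as in the text.

```python
import numpy as np

def silu(x):
    # SiLU(x) = x * sigmoid(x), applied entry-wise.
    return x / (1.0 + np.exp(-x))

def softmax_rows(scores):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def ffn_sublayer(X, W1, W2):
    # Residual stream plus alpha(X W1) W2 (skip connection included).
    return X + silu(X @ W1) @ W2

def attention_head(X, Q, V):
    # Residual stream plus softmax[X Q X^T] X V, with Q = W_QK and V = W_OV.
    return X + softmax_rows(X @ Q @ X.T) @ X @ V

# Illustrative sizes only (hypothetical, not taken from any real model).
rng = np.random.default_rng(0)
N, D = 5, 4
X = rng.normal(size=(N, D))
W1 = rng.normal(size=(D, 4 * D))
W2 = rng.normal(size=(4 * D, D))
Q = rng.normal(size=(D, D))
V = rng.normal(size=(D, D))

ffn_out = ffn_sublayer(X, W1, W2)
attn_out = attention_head(X, Q, V)
```

Both functions map an $N$-by-$D$ residual stream to another $N$-by-$D$ matrix, which is what lets the sublayers be stacked.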

Throughout, we will rely on a large number Ω whose purpose is to dwarf other numbers in the softmax operation of an attention head. In particular, we assume Ω has two properties:

In my code, $\Omega = 1000$ is sufficient for a tolerance of $\varepsilon = 10^{-10}$.
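To illustrate what “dwarfing” means in practice, the following sketch adds $\Omega = 1000$ to one entry of a score vector and checks that the softmax becomes one-hot to well within $10^{-10}$. The score values are made up for illustration, and the check is not a statement of the two properties themselves.

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax of a 1-D vector.
    shifted = v - v.max()
    e = np.exp(shifted)
    return e / e.sum()

OMEGA = 1000.0

# Ordinary-sized scores, with OMEGA added to the entry at index 2 (made-up values).
scores = np.array([0.3, -1.2, OMEGA + 0.5, 2.0])
weights = softmax(scores)

one_hot = np.array([0.0, 0.0, 1.0, 0.0])
max_err = np.abs(weights - one_hot).max()
```

In double precision the other entries are scaled by factors like $e^{-998}$, which underflow to zero, so `max_err` is far below the stated tolerance.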

Construction Overview

We will convert the attention-and-feedforward model $T$ into an attention-only model $T'$ by augmenting the residual stream, replacing the feedforward sublayers with attention sublayers, and tweaking the original attention heads to maintain their original behavior on the augmented residual stream.

We augment the residual stream of the model by:

- $N' = N+1$. The new context vector will act as a “bias context vector” which we use to implement the entrywise SiLU function.
- $D' = D + 4D + (N+1)$. The $4D$ additional dimensions will be used to store the intermediate calculations of the FFN. The remaining $N+1 = N'$ additional dimensions act as 1-hot positional encodings.

In $T$, each layer consists of two sublayers:

1. Multi-headed attention.
2. Feedforward network.

In $T'$, these are replaced by:

1. Multi-headed attention. This acts identically to the original transformer, though the $Q$ and $V$ matrices are slightly tweaked to avoid issues arising from introducing the “bias context” vector.
2. Linear transformation via attention heads. This transformation emulates $XW_1$ by reading from the $D$-width residual stream and writing to the $4D$-width residual stream corresponding to the hidden layer. The $Q$ matrix makes each vector attend only to itself, and the $V$ matrix contains a copy of $W_1$.
3. Entry-wise SiLU to hidden layers via attention heads. Using one attention head per dimension, we apply the activation function to the “hidden dimensions” computed in the previous step, resulting in $\alpha(XW_1)$. The $Q$ matrix makes each vector attend only to itself and the final “bias context” vector, split in proportion to $\sigma(x)$. The $V$ matrix makes a vector write the negative of its entry and the bias context vector write 0, resulting in an entrywise SiLU.
4. Linear transformation via attention heads. This step emulates multiplying by $W_2$ and adding it back to the $D$-width residual stream. This step also zeroes out the $4D$-width part of the residual stream corresponding to the hidden layer, readying it to be written to by the next layer. The $Q$ matrix makes each vector attend only to itself, and the $V$ matrix contains a copy of $W_2$.

The following sections will discuss these steps in the order (3), (2+4), (1), which is descending order of novelty to me.
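The augmentation of the residual stream can be sketched as follows. The column order (original dimensions, then hidden dimensions, then positional encodings) and the choice to leave the bias context vector zero outside its positional encoding are assumptions made for this illustration, and `augment_residual_stream` is a hypothetical helper name.

```python
import numpy as np

def augment_residual_stream(X, hidden_width):
    """Return the (N+1)-by-(D + hidden_width + N+1) augmented residual stream.

    Assumed layout: [original D dims | hidden dims | one-hot positions],
    with the bias context vector appended as the final row.
    """
    N, D = X.shape
    n_aug = N + 1
    X_aug = np.zeros((n_aug, D + hidden_width + n_aug))
    X_aug[:N, :D] = X                              # original residual stream
    X_aug[:, D + hidden_width:] = np.eye(n_aug)    # 1-hot positional encodings
    return X_aug

rng = np.random.default_rng(0)
N, D = 3, 2
X = rng.normal(size=(N, D))
X_aug = augment_residual_stream(X, hidden_width=4 * D)
```

With $N = 3$ and $D = 2$ this gives a 4-by-14 matrix: $N' = 4$ rows and $D' = 2 + 8 + 4$ columns.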

Entry-wise SiLU via attention heads

One can apply SiLU to the residual stream with one attention head per dimension being SiLU’d. One uses the following $Q$ and $V$ matrices:

With this $Q$ matrix, the $j$th row of $XQX^T$ will be of the form $[-x_{jk}, -x_{jk}, \ldots, -x_{jk}, 2\Omega - x_{jk}, -x_{jk}, \ldots, -x_{jk}, 2\Omega]$, where $k$ is the dimension being SiLU’d, and the $2\Omega$s are in the $j$th entry and the final entry. Then, after applying the softmax to this row, the row becomes $[0, 0, \ldots, 0, 1-\sigma(x_{jk}), 0, \ldots, 0, \sigma(x_{jk})]$ (to within error), since $e^{2\Omega - x_{jk}} / (e^{2\Omega - x_{jk}} + e^{2\Omega}) = \sigma(-x_{jk}) = 1 - \sigma(x_{jk})$. That is, every vector attends only to itself and the bias vector.

By our choice of $V$, the influence of a vector is the negative of its entry in the $k$th position. Thus the $jk$th entry of $\text{softmax}[XQX^T]XV$ is $-x_{jk}(1-\sigma(x_{jk}))$, so after adding to the residual stream, one gets that the $jk$th entry of $X + \text{softmax}[XQX^T]XV$ is $x_{jk}\sigma(x_{jk}) = \text{SiLU}(x_{jk})$, as desired.
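The argument can be checked numerically. In the sketch below, $Q$ and $V$ are one possible choice that reproduces the row pattern described in the text; they are a reconstruction for illustration, not necessarily the post's exact matrices. The sketch assumes a residual stream of `width` data columns followed by $N+1$ one-hot position columns, with the bias context vector as the last row and zero in its data columns.

```python
import numpy as np

OMEGA = 1000.0

def softmax_rows(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def silu_head(X_aug, k, width):
    """Output (before the skip connection) of one head that SiLUs column k."""
    dim = X_aug.shape[1]
    n_pos = dim - width
    pos = slice(width, dim)      # one-hot position columns
    bias_col = dim - 1           # position column of the bias context vector

    Q = np.zeros((dim, dim))
    Q[pos, pos] = 2 * OMEGA * np.eye(n_pos)  # 2*OMEGA where key position == query position
    Q[pos, bias_col] += 2 * OMEGA            # 2*OMEGA on the bias key, for every query
    Q[k, width:bias_col] = -1.0              # -x_jk on every non-bias key

    V = np.zeros((dim, dim))
    V[k, k] = -1.0                           # each vector writes minus its k-th entry

    attn = softmax_rows(X_aug @ Q @ X_aug.T)
    return attn @ X_aug @ V

rng = np.random.default_rng(0)
N, width = 6, 4
X = rng.normal(size=(N, width))

X_aug = np.zeros((N + 1, width + N + 1))
X_aug[:N, :width] = X
X_aug[:, width:] = np.eye(N + 1)

# One head per dimension; every head output is added to the residual stream.
out = X_aug + sum(silu_head(X_aug, k, width) for k in range(width))

expected = X / (1.0 + np.exp(-X))            # direct entry-wise SiLU
max_err = np.abs(out[:N, :width] - expected).max()
```

Under these assumptions `max_err` stays far below $10^{-10}$; the residual error comes mainly from rounding when $x_{jk}$ is subtracted from $2\Omega$.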

Vector-Wise Linear Transformations via Attention Heads

By putting such large weights in the self-positional-encoding matrices, a vector attends entirely to itself. Thus the output of the attention head is entirely the result of the $V$ matrix, which can contain the arbitrary linear transformation of the feedforward network. Additional comments:
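To make the self-attention argument concrete, here is a minimal sketch of the first linear step, assuming the column layout [original $D$ dims | hidden $4D$ dims | one-hot positions] and a single head whose $V$ holds all of $W_1$ (that is, ignoring rank restrictions). The specific $Q$ and $V$ below are assumptions consistent with the description, not the post's exact matrices.

```python
import numpy as np

OMEGA = 1000.0

def softmax_rows(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, D = 5, 3
H = 4 * D                                    # hidden width
X = rng.normal(size=(N, D))
W1 = rng.normal(size=(D, H))

dim = D + H + N + 1
X_aug = np.zeros((N + 1, dim))
X_aug[:N, :D] = X
X_aug[:, D + H:] = np.eye(N + 1)             # one-hot positional encodings

# Q: large weight on matching positions, so each vector attends only to itself.
Q = np.zeros((dim, dim))
Q[D + H:, D + H:] = OMEGA * np.eye(N + 1)

# V: read the original D dims and write X W1 into the hidden dims.
V = np.zeros((dim, dim))
V[:D, D:D + H] = W1

attn = softmax_rows(X_aug @ Q @ X_aug.T)
out = X_aug + attn @ X_aug @ V

hidden = out[:N, D:D + H]
max_err = np.abs(hidden - X @ W1).max()
```

The attention pattern is the identity to machine precision, so the hidden columns end up holding $XW_1$ while the original columns are left untouched.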

Tweaking the Original Attention Heads to Preserve Their Behavior

The addition of the new vector used for the activation function could potentially change the attention patterns of the preexisting attention heads, which would change the behavior of the network. However, we can slightly tweak the attention matrices in a normal attention head to prevent this issue:

By augmenting the attention matrix in this way, the bias vector strongly avoids attending to the non-bias vectors, and strongly attends to itself (preventing non-bias vectors from attending to the bias vector).
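One way to obtain this behavior is sketched below. The particular entries of `Q_aug` are an assumption chosen for illustration (a penalty of $-\Omega$ whenever a non-bias vector would attend to the bias vector, and $+\Omega$ for the bias vector attending to itself); the post's own augmented matrix may differ. The hidden dimensions are omitted for brevity.

```python
import numpy as np

OMEGA = 1000.0

def softmax_rows(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.normal(size=(N, D))
Q = rng.normal(size=(D, D))
V = rng.normal(size=(D, D))

# Output of the original head on the original residual stream (no skip connection).
original = softmax_rows(X @ Q @ X.T) @ X @ V

# Augmented stream: [original D dims | one-hot positions], bias vector as last row.
dim = D + N + 1
X_aug = np.zeros((N + 1, dim))
X_aug[:N, :D] = X
X_aug[:, D:] = np.eye(N + 1)
bias_col = dim - 1

Q_aug = np.zeros((dim, dim))
Q_aug[:D, :D] = Q
Q_aug[D:bias_col, bias_col] = -OMEGA   # non-bias queries avoid the bias key
Q_aug[bias_col, bias_col] = OMEGA      # the bias query attends to itself

V_aug = np.zeros((dim, dim))
V_aug[:D, :D] = V

attn = softmax_rows(X_aug @ Q_aug @ X_aug.T)
tweaked = attn @ X_aug @ V_aug

max_err = np.abs(tweaked[:N, :D] - original).max()
```

Under these assumptions the non-bias rows of the tweaked head match the original head's output, and the bias row writes nothing because it attends only to itself and carries zeros in the data columns.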

Demonstration Code

I’ve put Python code implementing this technique on github. Each of the three components (SiLU, linear transformations, normal attention) is implemented both directly and with attention heads. They are tested on random matrices with $N=20$ and $D=30$, and the largest error entries in each matrix are on the order of $10^{-14}$. I have not tested how such errors propagate through multiple layers.

Conclusion

- such as this work.
- GPT-3 the largest learnable rank is $d_{\text{head}} = 128$, but both the SiLU heads and the linear transformation heads use attention patterns of rank $n_{\text{ctx}} = 2048$.

[1] One can also approximate ReLU with this technique, since $\text{SiLU}(kx)/k \to \text{ReLU}(x)$ as $k \to \infty$. AIAYN uses ReLU, but GPT-3 uses GeLU.

[2] For implementation purposes, these matrices are usually learned as low-rank factorizations, with $W_{QK} = W_Q W_K^T$ and a similar expression for $W_{OV}$. However, it’s easier to construct the desired properties if we treat them in their full form. We will ignore rank restrictions except in the concluding comments.
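The ReLU limit mentioned in the first footnote can be checked numerically. The sketch below is only an illustration: the grid of test points and the values of $k$ are arbitrary choices, and the sigmoid is written via `tanh` purely to avoid overflow for large arguments.

```python
import numpy as np

def silu(z):
    # sigmoid(z) = 0.5 * (1 + tanh(z / 2)); this form does not overflow for large |z|.
    return z * 0.5 * (1.0 + np.tanh(0.5 * z))

def scaled_silu(x, k):
    # SiLU(kx) / k, which approaches ReLU(x) as k grows.
    return silu(k * x) / k

x = np.linspace(-3.0, 3.0, 601)
relu = np.maximum(x, 0.0)

# Largest gap between the scaled SiLU and ReLU on the grid, for several k.
errors = {k: np.abs(scaled_silu(x, k) - relu).max() for k in (1, 10, 100, 1000)}
```

The gap shrinks in proportion to $1/k$, because the difference between $\text{SiLU}(z)$ and $\text{ReLU}(z)$ is bounded over all $z$ and is then divided by $k$.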