Simply reverse engineering gpt2-small 
(Layer 0, Part 1: Attention)

gammagurke

Motivation

GPT2-small is able to generate, and thus apparently understand, language to a surprisingly good degree. So we’d like to find out how it works.

There have been attempts to explain some parts of this network, but there remains confusion about how those parts interact with the rest of the network and whether we’ve found everything those parts do. There is a lot of confusion in general in this field, about “features”, “polysemanticity”, “sparseness” and how and where we can use those concepts.

When I imagine a general solution to Interpretability, it tends to be an algorithm that takes some network as an input and generates a fully annotated flow chart that can be modularly understood and verified, such that every part of the network can be decomposed into units that are relevant for the algorithm, each of them straightforwardly understandable and all of them adding up to the network’s behaviour.

So in order to not add to the confusion, I’ve decided to do the most unconfusing thing I can do: simply do this by hand and reverse engineer GPT2-small as far as I can, starting with layer 0. This is pretty convenient, since before the first attention block there is no interaction between tokens, so all the data depends only on the embeddings. In later layers, the space of concepts that could be represented is much larger, but in layer 0 there hasn’t yet been enough nonlinearity to produce complicated concepts. Then, if we’re successful enough in mapping the concepts that have formed in layer 0, we might use this understanding to recursively analyse the next layer. Even if that proves too much to be humanly possible, the insights we gain during this process might be useful for automating it.

We should not introduce new black boxes in the process. Partly because of this, I am not relying on any previous research. Any reinvented wheels will hopefully become less confusing through reinvention. I have a colab notebook that includes practically all of my code, available for anyone reading this to look at and use. I hope that this analysis is understandable for anyone familiar with just the basics of linear algebra and neural network architectures. The point of this is not to invent/apply Interpretability Techniques™ but to find a framework in which the techniques to apply next come from a few minutes of thinking and common sense.

I’m also trying to avoid using training/testing data in order to understand the function of the model. It would be comparatively very easy to look at what some part of the network does on particular inputs, especially when looking at the inputs this part maximally activates on. But that approach would lead to less understanding of how the model does something and wouldn’t necessarily generalise out of distribution. Any structure we find in the model’s function would first have to be carefully analysed to confirm that it comes from the structure of the model, not the structure of the training distribution. Once we have an idea of “what” the model is doing and “how”, for the “why” we might need the training distribution again.

Note on Model Structure

There are lots of resources online to get an overview of the GPT-2 model architecture and we will only need the very start, so I won’t go into a full explanation here. A few important points that can get overlooked:

The positional embedding is learned
There is a layernorm in each subblock, but no layernorm acts directly on the residual stream until the very end
Each layernorm, in addition to neuronwise learned weights, also has neuronwise learned biases
Each of the linear attention layers (Q, K, V, O) has its own biases

Positional embedding

The positional embedding on its own is just a 1024 x 768 matrix that maps each position to an embedded vector that gets added on to the residual stream. But, if we look at its columns instead of rows (i.e. take its transpose), we now have a matrix that shows for each neuron of the residual stream, how much that neuron is activated by each possible position. We can easily graph this, so let’s look at those neurons that seem to get activated the most on average (L2-Norm):

Figure 1: residual stream neurons, position embedding

Clearly, there is a lot of structure here, and it seems to be neuron-aligned!
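The ranking behind figure 1 can be sketched in a few lines. This is only an illustration on a small synthetic matrix: the sizes, the two planted columns and the name `wpe` are assumptions made for the example, not values taken from gpt2-small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the learned position embedding. The real matrix is
# 1024 x 768; a smaller random one keeps the sketch fast (assumption).
n_positions, d_model = 128, 32
wpe = rng.normal(scale=0.01, size=(n_positions, d_model))

# Plant structure in two columns so the ranking has something to find.
positions = np.arange(n_positions)
wpe[:, 5] += np.linspace(-1.0, 1.0, n_positions)                  # linear neuron
wpe[:, 11] += 0.5 * np.sin(4 * np.pi * positions / n_positions)   # period = half the window

# Each row of the transpose is one neuron's activation as a function of position.
neuron_curves = wpe.T                          # shape (d_model, n_positions)
l2_norms = np.linalg.norm(neuron_curves, axis=1)
order = np.argsort(-l2_norms)                  # most position-active neurons first

print("most position-active neurons:", order[:2])
```

Plotting `neuron_curves[order[:k]]` against position would then give the kind of picture shown in figure 1.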

To get a better overview, let’s group the first 60 of these by their linear regression gradient:

Groups 1 and 3 consist of mostly linear functions, all of them having similar gradient magnitude.

Group 2 consists of almost sine-like periodic functions, all of them with period 1/2 of the context window length. Their phases are shifted to be almost uniformly distributed over the entire space. Unlike normal sine curves, these seem to consist of almost linear sections, again with similar gradient magnitude. At the start and end of the context window, the curves compress slightly, and especially for the first ~10 positions, their behaviour is sometimes completely different.

Apparently, what the algorithm actually uses for handling positional data anywhere but the very start is a single linear function and some collection of periodic (with period 1/2 of the context window length), piecewise linear functions with offset phases. Both of those are stored with massive redundancy.
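As a rough sketch of the grouping step, one can fit a straight line to each neuron's curve and bucket the neurons by the fitted slope. The three curves and the threshold below are hypothetical stand-ins, chosen only so that each bucket is populated.

```python
import numpy as np

n_positions = 256
positions = np.arange(n_positions)

# Hypothetical neuron curves standing in for columns of the position embedding.
curves = {
    "rising": 0.002 * positions - 0.25,
    "falling": -0.002 * positions + 0.25,
    "periodic": 0.1 * np.sin(4 * np.pi * positions / n_positions),
}

def regression_slope(curve):
    # Least-squares straight-line fit; polyfit returns [slope, intercept].
    slope, _intercept = np.polyfit(positions, curve, deg=1)
    return slope

def group_of(curve, threshold=1e-3):
    # The threshold is an assumption made for this toy data.
    slope = regression_slope(curve)
    if slope > threshold:
        return "positive slope"
    if slope < -threshold:
        return "negative slope"
    return "near-zero slope"

groups = {name: group_of(curve) for name, curve in curves.items()}
print(groups)
```

The two sloped buckets correspond to the linear groups and the near-zero bucket to the periodic group, whose regression line is almost flat.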

At this point, we can speculate about how this kind of positional embedding will be used, and why the network stores the information in this form (apparently a constant period with lots of different phases is better than lots of different periods, against all Fourier intuition). Anyone reading this should feel free to do so, but for now we know all we need to know about the positional embedding and we can come back to it later when we see it being used.

Token embedding

What I want to do here is not to find an exhaustive list of features with which to construct a base of token-embedded space, like we’ve just found for the position. For now, all I’m trying to achieve is to gain enough intuitive understanding of how this space works to be able to think about things that interact with it.

This embedding is a fairly normal word embedding with respect to word relationships

In word embeddings, there usually are semantic relationships between words that can also be found in their embedded representations. A common example is the semantic relationship “King is to Queen as man is to woman”, so the corresponding vectors V_King+(V_woman - V_man) should approximately equal V_Queen.

This is also observable in gpt2-small’s token embedding, but the effect isn’t quite this reliable.
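A toy version of this test, to make the mechanics concrete. The four 3-dimensional vectors are hand-made (they are not gpt2-small embeddings), and cosine similarity is assumed as the closeness measure; they are chosen so that the shifted vector stays closer to "king" than to "queen".

```python
import numpy as np

# Hand-made toy embedding (hypothetical values). Axes, loosely:
# royalty, gender, everything else.
embedding = {
    "man":   np.array([0.0, -0.5, 0.0]),
    "woman": np.array([0.0,  0.5, 0.0]),
    "king":  np.array([3.0, -1.0, 0.0]),
    "queen": np.array([3.0,  1.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# V_King + (V_woman - V_man)
shifted = embedding["king"] + (embedding["woman"] - embedding["man"])

# Rank all tokens by similarity to the shifted vector, closest first.
ranked = sorted(embedding, key=lambda w: cosine(shifted, embedding[w]), reverse=True)

# What is still missing to get from the shifted vector to V_Queen.
missing_shift = embedding["queen"] - shifted

print(ranked[:2], missing_shift)
```

In this toy case the man->woman shift is too small to carry "king" all the way to "queen", and `missing_shift` is the leftover direction one can then inspect for other associated tokens.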

Looking at this shift:

Here, it seems like king is too sticky a word, so that the man->woman shift isn’t quite enough to get closer to queen than king. What stands in the way?

Both of these would be zero if the 4 vectors formed the intuitive parallelogram. As it is, they give us the shift we’re still missing from V_King+(V_woman - V_man) to V_Queen, so we can ask what other tokens lie in the same direction. In each direction we get the two tokens that contribute positively and semantically similar ones, which isn’t too surprising, but we also get an association with Jordan and adultery going away from “queen” and an association with bees going towards “queen”. These make sense as examples of the difference between “king” and “queen” that goes beyond male-female (Jordan has a king).

Obviously most words used in actual language are going to have other associations that aren’t fully captured by one such shift. The space of “Information that can be stored about a word” is rather large, so while this method can give us some of those associations, we’re not yet looking for a full accounting of which information is stored. For now, if we want a cleaner shift from one word to another, we can go for even more generic words.

There we found one triple for which the fourth word is closest to their sum. Keep in mind that the shift (“houses” - “house”) just gets us far enough away from “car” that “cars” is now closer than “car” itself, which doesn’t mean that these match up perfectly.

Looking at the differences, we can see that “houses” also has a verb sense not captured by the plural shift, so that similar verbs like “protects” lie in that direction. It also makes sense that e.g. “garage” is more associated with “cars” than “car”.

Linear general knowledge

Now we can start thinking about how semantic information is stored in the network.

These words are less generic than in the previous section, so it’s not surprising that we can’t get away from Merkel as the closest token, but for each of these countries, we get a contemporary politician from that country within the top 5.

This shows how this sort of knowledge is simply part of the linear structure of the embedding. We wouldn’t be seeing this if these were all atomic sparse features that get compressed into superposition.

I think it makes much more sense to think of the token “Merkel” as represented in the residual stream by something like [proper noun+person+leadership+administrative+(whatever makes up germany)+other stuff], so that answering the question “Who is the government leader of germany?” doesn’t require any specific knowledge neurons, just navigating to pretty much the semantic centre of mass of the sentence’s nouns and converting “who” into a boost for [proper noun, person].

In fact, even without this boost, Merkel comes 32nd, after only synonyms and similar words to the three nouns. The next person to come up is Hitler (in 50th place), which suggests another series of tests.

It seems as though there is another dimension or bunch of dimensions measuring something like the historical timeframe of a token. Again, I’m not trying to construct the entire list of concepts by which the embedding decomposes wordspace, just trying to get a good enough understanding of how this space works.

The word embedding and position embedding are mostly separate

With the singular value decomposition (SVD) of the token embedding matrix, we can find directions of decreasing importance to the embedding. Those directions seem surprisingly interpretable, but unfortunately they are not quite enough to find an interpretable basis of the space. But we can compare this importance rating for directions in the residual stream from a token embedding point of view with the structure we found for the position embedding. For each of the neurons in decreasing order of positional importance, we look at its projection onto this basis.

As we can see, the ~50 most positionally important neurons are projected almost exclusively onto the ~100 least important directions for the token embedding. Apparently the network is making a fairly strong effort to keep the two embeddings separate from each other.

As we saw in figure 1, past the top 50 there are still neurons that carry positional information, and we can also see some of them following the same pattern here, but except for a few outliers it falls off quickly and there are practically none past 100.
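A minimal sketch of this comparison on synthetic data. The sizes and the choice of "position neurons" are assumptions for illustration, and the toy token embedding is deliberately built so that it barely uses those neurons.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 500, 16
position_neurons = [2, 9]      # hypothetical neurons reserved for position

# Toy token embedding that almost avoids the position neurons.
token_emb = rng.normal(size=(vocab, d_model))
token_emb[:, position_neurons] *= 0.01

# Rows of vt are directions in the residual stream, ordered by decreasing
# singular value, i.e. decreasing importance for the token embedding.
_, singular_values, vt = np.linalg.svd(token_emb, full_matrices=False)

# vt[j, i] is the coordinate of neuron i's basis vector along direction j,
# so each column of squared entries sums to 1 (vt is orthogonal here).
projection = vt[:, position_neurons] ** 2      # shape (d_model, 2)
mass_on_last_two = projection[-2:].sum(axis=0)

print(mass_on_last_two)
```

In this toy setup nearly all of each position neuron's weight lands on the two least important token directions, which is the pattern described above for the real embeddings.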

Loss testing

By just setting some collection of the embedding weights to 0, we can ensure that the spaces for position and tokens don’t overlap, then look at how the loss behaves. Here, I am just sorting the residual space’s neurons as in figure 1, and zeroing out weights to neurons after/before some cutoff.

The residual stream neurons are sorted by their importance for the position embedding, then e.g. for mask cutoff 60:
-orange: the first 60 of those neurons are removed (weights set to 0) from the token embedding
-blue: all but the first 60 of those neurons are removed from the position embedding
-green: both the first 60 removed from token embedding, and the last 768-60=708 removed from the position embedding, leaving the two embeddings completely separate
-red: gpt2-small without any changes

This looks pretty much as expected. On the position side, the loss decreases with a larger cutoff, but is nearly equal to the unpatched loss from neuron 60 onward. On the token side, the loss increases with a larger cutoff, with a small exception at the start. The combined patch gives a loss difference close to the sum of the individual differences.
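The patches themselves are simple to construct. The sketch below only builds the masked matrices on random stand-in embeddings; the function name `patch_embeddings` and all sizes are made up for the example, and evaluating the loss would additionally need the model and some text.

```python
import numpy as np

def patch_embeddings(token_emb, pos_emb, neuron_order, cutoff):
    """Zero the first `cutoff` position-ranked neurons in the token embedding
    and every remaining neuron in the position embedding."""
    first, rest = neuron_order[:cutoff], neuron_order[cutoff:]
    token_patched = token_emb.copy()
    token_patched[:, first] = 0.0      # the "orange" patch
    pos_patched = pos_emb.copy()
    pos_patched[:, rest] = 0.0         # the "blue" patch
    return token_patched, pos_patched  # using both together is "green"

rng = np.random.default_rng(0)
vocab, n_positions, d_model, cutoff = 20, 10, 8, 3
token_emb = rng.normal(size=(vocab, d_model))
pos_emb = rng.normal(size=(n_positions, d_model))

# Sort neurons by their importance for the position embedding (L2 norm).
neuron_order = np.argsort(-np.linalg.norm(pos_emb, axis=0))

token_patched, pos_patched = patch_embeddings(token_emb, pos_emb, neuron_order, cutoff)

# Neurons still written to by each embedding after patching.
token_support = np.any(token_patched != 0, axis=0)
pos_support = np.any(pos_patched != 0, axis=0)
overlap = int(np.sum(token_support & pos_support))
print("shared neurons after both patches:", overlap)
```

With both patches applied, no neuron receives input from both embeddings, which is the "completely separate" condition of the green curve.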

Loss decreased

When cutting off token data from the ~30 most important position neurons, the loss actually goes down.

A noticeable decrease in loss just by adjusting weights should be really hard to achieve, since gradient descent should just be able to find those adjustments. In this case, I circumvented this by cheating, because I deviated from the model architecture. In the model, the token embedding and the unembedding matrix are simply transposes of each other. This makes sense, as it reduces the number of parameters by a lot and it should in principle be good to have token embeddings have a constant meaning (it makes for example the logit lens possible). Here, by applying a patch to the embedding, but not the unembedding, I created a model that gradient descent couldn’t have found because the architecture forbids it.

My theory for the mechanism:

Suppose that it is really important for the token data not to introduce noise to the position data. Then, once the training has carved out some specific space for the position data, the pressure on the token embedding is to get out of that space as much as possible. But for the unembedding, that part of the residual stream is just free real estate that would be wasteful to leave empty. What we get is a small projection onto that part of the space that lets us decrease the loss if we take it out of only the embedding but not the unembedding.

So we have ~55 neurons dedicated to position data, ~40 neurons that carry minimal position data and some token data but not so much that it makes a large difference to the loss if it is deleted, and the rest of the neurons (~670) carrying token data.

For our understanding of the network, this allows us to split the first residual stream into 2 minimally-interacting modules: The position part and the token part. From our analysis of the position neurons, we could split that module even further, but for now I’ll focus the analysis on the framework we can build with only those 2 modules.

Layernorm

The first nonlinear computation that happens in the network is the first layernorm of layer 0. In order, it:

Subtracts the mean of the neuron activations in this layer from all of those neurons, such that the new mean is 0
Divides all neurons by the neuron standard deviation, such that the new standard deviation is 1
Multiplies each neuron by a learned weight
Adds a learned bias to each neuron
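The four steps can be written out directly. This is a minimal sketch for a single residual vector; the small `eps` inside the square root is an addition for numerical stability that the list above does not mention.

```python
import numpy as np

def layernorm(x, weight, bias, eps=1e-5):
    # 1. subtract the mean over neurons
    centred = x - x.mean()
    # 2. divide by the standard deviation over neurons (eps is an assumption)
    normalised = centred / np.sqrt(centred.var() + eps)
    # 3. multiply each neuron by its learned weight
    # 4. add each neuron's learned bias
    return normalised * weight + bias

x = np.array([1.0, 2.0, 3.0, 4.0])
plain = layernorm(x, weight=np.ones(4), bias=np.zeros(4))
scaled = layernorm(x, weight=2.0 * np.ones(4), bias=np.ones(4))
print(plain.mean(), plain.std(), scaled.mean())
```

Only steps 1 and 2 depend on the whole vector; steps 3 and 4 act on each neuron separately.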

All of this happens after a 10% dropout. This means that the dropout doesn’t just erase a select few neurons, it also introduces noise to the entire network by randomly shifting both the neuron mean and the variance. This dropout is removed after training, but when we ask why the network is doing something we need to consider its training configuration.

We can see how the network has adapted to minimise that noise. The position-activation functions of the position neurons were balanced, so that both the linear parts and the periodic parts would be half positive and half negative at any position, all at similar magnitudes. This would keep the pre-dropout neuron mean and the distribution of post-dropout neuron means as close to 0 as possible.

The particular neuron-alignment and the shapes of the position-activation functions could have a similar reason. We have neuron aligned, highly redundant curves of only 2 types that aren’t proper sine waves but something close to piecewise linear. I would guess that this setup minimises the loss of information when subjected to random deletions (dropout), random constant shifts (layernorm mean) and random linear factors (layernorm variance), though I don’t know where to start proving this.

To keep the analysis simple and linear, we want to find a way around the nonlinearity.

For each of the token and position embeddings, we can set their neuron mean to 0.

Using the independence assumption from the previous part, the neuron standard deviation can be decomposed into the token embedding standard deviation and the position standard deviation (totalSD²=tokenSD²+positionSD²).

If we assume the position variance to be its average, the effect of the layernorm on the token embedding is deterministic for each token, so we can modify the token embedding to reflect the expected post-layernorm state instead.

We can do a similar thing to the position part, taking an average token variance.

Then we can include the learned layernorm weight, so that the input to the attention layer is easily decomposable (using python's @ for matrix multiplication):

Post-Layernorm activation vector₇₆₈ ≈ token₅₀₂₅₇@ ₅₀₂₅₇token_embedding_LNmod₇₆₈ + position₁₀₂₄@ ₁₀₂₄position_embedding_LNmod₇₆₈ + LayernormBias₇₆₈
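A sketch of how the two modified embeddings could be built and checked against the exact layernorm. The helper names are hypothetical. The toy data is constructed so that tokens and positions use disjoint neurons and all rows of each embedding have equal variance; under those conditions the approximation is exact, whereas in the real model the variance differences between rows are what makes it approximate.

```python
import numpy as np

def centre_rows(m):
    return m - m.mean(axis=1, keepdims=True)

def linearise_layernorm(token_emb, pos_emb, ln_weight):
    tok_c, pos_c = centre_rows(token_emb), centre_rows(pos_emb)
    tok_var, pos_var = tok_c.var(axis=1), pos_c.var(axis=1)
    # totalSD^2 = tokenSD^2 + positionSD^2, with the other side's variance
    # replaced by its average over all rows.
    token_mod = tok_c / np.sqrt(tok_var + pos_var.mean())[:, None] * ln_weight
    pos_mod = pos_c / np.sqrt(pos_var + tok_var.mean())[:, None] * ln_weight
    return token_mod, pos_mod

def exact_layernorm(x, ln_weight, ln_bias):
    x_c = x - x.mean()
    return x_c / x_c.std() * ln_weight + ln_bias

def unit_rows(rng, n_rows, width, scale):
    # Random rows with zero mean and standard deviation `scale`.
    block = centre_rows(rng.normal(size=(n_rows, width)))
    return scale * block / block.std(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Tokens live in neurons 0..7, positions in neurons 8..11 (toy assumption).
token_emb = np.hstack([unit_rows(rng, 6, 8, 1.0), np.zeros((6, 4))])
pos_emb = np.hstack([np.zeros((5, 8)), unit_rows(rng, 5, 4, 2.0)])
ln_weight, ln_bias = rng.normal(size=12), rng.normal(size=12)

token_mod, pos_mod = linearise_layernorm(token_emb, pos_emb, ln_weight)

# Multiplying by a one-hot vector is the same as picking one row.
t, p = 3, 2
approx = token_mod[t] + pos_mod[p] + ln_bias
exact = exact_layernorm(token_emb[t] + pos_emb[p], ln_weight, ln_bias)
max_error = float(np.abs(approx - exact).max())
print(max_error)
```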

Lateral inhibition

Mainly, the effect of the Layernorm that this approach is ignoring is lateral inhibition between position and token. If we use the assumption of separate spaces, the total variance is always the sum of the variances of the position and token embeddings. Since the Layernorm sets the variance to 1, any difference in variance before results in a shift of variance between the position and the semantic parts of the space.

For example, the starting position has a much larger variance in its embedding than other positions. So after the layernorm, semantic data from some token at the starting position will have a correspondingly smaller variance than at any other position. The same thing happens in the token embedding, where large variance approximately corresponds with token rarity.

The linear approach I use here ignores this effect, but it lets us analyse the rest of the network while keeping the two spaces separate.

Attention

Query-Key maths

There are several ways to think about the query-key computation, depending on the level of parallelisation. The main view that I’ll be using here:

Query: We only care about the computation that happens for one token in a sequence of tokens. The algorithm is the same for each token, so we only need to understand it for one of them. So we’re not thinking in terms of parallelisation over queries. We think about only one residual vector that gets projected by the query weights.

Key: This one query-side embedding looks at many key-embeddings, in order to softmax over their scalar products. So we’ll consider all possible residual vectors for the key side. This means that we can think about changes on the key side for some specific query-side embedding.

Heads: Since each head’s functionality is independent of the other heads, we’ll look at each head individually.

The way we think about what happens in the attention score calculation is that (for each token, for each head) we take the query-embedding of that token, then take its scalar products with all of the key-embeddings of tokens before it and softmax over these to get the attention scores. This means that those scores don’t change if we add a constant to all of those scalar products. With that, let’s decompose the scalar product. The two factors are:

Query

token₅₀₂₅₇@ ₅₀₂₅₇token_embedding_LNmod₇₆₈@ ₇₆₈query_weights₆₄ (constant)
+position₁₀₂₄@ ₁₀₂₄position_embedding_LNmod₇₆₈@ ₇₆₈query_weights₆₄ (constant)
+Layernorm Bias₇₆₈@ ₇₆₈query_weights₆₄ (constant)
+Query Bias₆₄ (constant)

Key

token₅₀₂₅₇@ ₅₀₂₅₇token_embedding_LNmod₇₆₈@ ₇₆₈key_weights₆₄ (variable)
+position₁₀₂₄@ ₁₀₂₄position_embedding_LNmod₇₆₈@ ₇₆₈key_weights₆₄ (variable)
+Layernorm Bias₇₆₈@ ₇₆₈key_weights₆₄ (constant)
+Key Bias₆₄ (constant)

Since the scalar product is distributive, there are 16 components here that we can analyse. But, since 8 of those, namely anything involving the Key Bias or the Layernorm Bias on the Key side, add a constant to all of those scalar products and thus have no effect on attention scores, only the remaining 8 are relevant. If we understand what those are doing, we understand what the attention head pays attention to.
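The counting argument can be checked numerically. The sketch below uses random vectors in place of the real projected embeddings and biases (all names and sizes are illustrative) and leaves out the usual scaling of the scores by the head dimension, which does not affect the argument.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
d_head, n_keys = 8, 5

# Four query-side terms, already projected into the head (one fixed query).
q_parts = {name: rng.normal(size=d_head)
           for name in ["token", "position", "ln_bias", "q_bias"]}
# Key side: token and position differ per key, the two bias terms do not.
k_variable = {name: rng.normal(size=(n_keys, d_head)) for name in ["token", "position"]}
k_constant = {name: rng.normal(size=d_head) for name in ["ln_bias", "k_bias"]}

query = sum(q_parts.values())
keys = sum(k_variable.values()) + sum(k_constant.values())
full_scores = keys @ query                     # one scalar product per key

# Distribute the product into 4 x 4 = 16 components.
components = {}
for q_name, q in q_parts.items():
    for k_name, k in k_variable.items():
        components[(q_name, k_name)] = k @ q                      # varies per key
    for k_name, k in k_constant.items():
        components[(q_name, k_name)] = np.full(n_keys, k @ q)     # same for every key

# Keep only the 8 components whose key side is variable.
relevant = sum(v for (q_name, k_name), v in components.items() if k_name in k_variable)

print(np.allclose(softmax(relevant), softmax(full_scores)))
```

Dropping the 8 key-constant components shifts every score by the same amount, so the attention pattern after the softmax is unchanged.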

Example for visualisation and testing the approximation

Let’s look at an example to see what we are doing here and how much damage ignoring the nonlinear effect of the layernorm is doing.

Here I’ve just split the Scalar products for one specific head and position into the mentioned components. Notably, the approximation for the pre-softmax scalar product isn’t particularly good, being consistently 0.5-1 less than the actual value.

But practically every research project that analyses weights takes this approach of basically ignoring the layernorm or at most linearly merging it into adjacent weights, ignoring the nonlinearity. How can that be reasonable, when even after some adjusting for variance we still get an approximation this bad?

To get the exact result, we’d have to include variance correction factors for each interaction. For example, the value 15.15 (key(position 88) interacting with query(position 90)) would need a factor for the neuron variance of the token “ 19” and one for the variance of the token “ century”. Both of these have less variance than the average token, so our approximation underestimates both position components interacting here.

But since these correction factors are almost constant for each particular component, they don’t influence the analysis of the component itself. We might get the absolute values wrong, but conclusions about relative values like “the position-position component of the scalar product for this key token would be larger if the key position was closer to the query position” will still be valid. This works because of the structure of softmax, which depends only on the differences between its inputs, not on their absolute values. It will be more of a problem once we get to Layer 0’s MLP, because the nonlinearity there (GELU) does care about absolute values, but we’ll deal with that when we get to it.

Scalar product maths

In matrix multiplication, you get scalar products by simply multiplying two vectors, but with the second one transposed. This principle also works with parallelisation, such that the Query-key part of attention is often programmed as Query@Key.T. This is a neat way to do scalar products, because the transpose of a matrix product is the product of the transposes of the factors, in reversed order: (A@B).T = B.T@A.T

So if we take

(token₅₀₂₅₇@ ₅₀₂₅₇token_embedding_LNmod₇₆₈@ ₇₆₈Query₆₄) @
(another_token₅₀₂₅₇@ ₅₀₂₅₇token_embedding_LNmod₇₆₈@ ₇₆₈Key₆₄).T

, one particular component of a scalar product in some attention head, it can also be written as:

(token₅₀₂₅₇@ ₅₀₂₅₇token_embedding_LNmod₇₆₈) @
(₇₆₈Query₆₄@ ₆₄Key.T₇₆₈) @
(another_token₅₀₂₅₇@ ₅₀₂₅₇token_embedding_LNmod₇₆₈).T

And continuing this principle:

(token₅₀₂₅₇@ ₅₀₂₅₇token_embedding_LNmod₇₆₈@ ₇₆₈Query₆₄@
₆₄Key.T₇₆₈@ ₇₆₈token_embedding_LNmod.T₅₀₂₅₇) @ (another_token₅₀₂₅₇).T

But since the right factor of this product only ever occurs as one-hot vectors, the left factor must be a list of all the possible scalar products that this component can have, in the one-hot encoding’s order.

Obviously, the same works for positions.
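The regrouping can be verified on small random matrices. In this sketch `emb` stands in for token_embedding_LNmod and `W_q`, `W_k` for one head's query and key weights; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_head = 7, 6, 3
emb = rng.normal(size=(vocab, d_model))     # stand-in for token_embedding_LNmod
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))

# All possible token-token scalar products at once: [query_token, key_token].
table = emb @ W_q @ W_k.T @ emb.T

# The same component computed the direct way for one pair of one-hot tokens.
q_tok, k_tok = 2, 5
one_hot_q = np.eye(vocab)[q_tok]
one_hot_k = np.eye(vocab)[k_tok]
direct = (one_hot_q @ emb @ W_q) @ (one_hot_k @ emb @ W_k).T

# Fixing the query token leaves a list of products, one per possible key token.
per_key = one_hot_q @ table

print(np.isclose(direct, table[q_tok, k_tok]))
```

Row `q_tok` of `table` is exactly the list of all scalar products this component can take for that query token, in the one-hot encoding's order.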

Key Position components

There are 4 components of the key-position part of the scalar product.

Query-position

Query-token

2* Biases

We’ll look at these in turn.

In order to test and improve your intuition, this would be the time to make some predictions about what they will look like. Some questions that might help:

What will attention heads want to pay attention to, position-wise, in the 0th layer?
How could an attention head use the kinds of position-data available (linear, periodic) to restrict its search to that? Hint: each head only looks backwards
How will that structure be distributed among interactions with query-side bias, position and token?

The Biases

These graphs show, for each attention head, the component of the total scalar product between the biases on the query side and the position on the key side. Most of these follow a simple pattern, a linear slope going upward with position.

This should let the head focus on the latest few tokens, as it penalises positions linearly with their distance to the query position (1).

It makes sense that the linear parts of the position data would be used for this (2),

and since those would need some constant value on the query side to ensure a positive slope, we shouldn’t be surprised to find this when looking at the biases (3).
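A hand-built toy version of this mechanism: a single linear position neuron on the key side combined with constant bias terms on the query side. Every weight below is chosen by hand for the illustration and none is taken from the model.

```python
import numpy as np

n_positions, d_model, d_head = 50, 4, 2
positions = np.arange(n_positions)

# Position embedding (after the layernorm modification) with one linear neuron.
pos_emb_mod = np.zeros((n_positions, d_model))
pos_emb_mod[:, 0] = np.linspace(-1.0, 1.0, n_positions)

# Key weights read the linear neuron; the query side supplies only constants.
W_k = np.zeros((d_model, d_head))
W_k[0, 0] = 1.0
W_q = np.zeros((d_model, d_head))
W_q[1, 0] = 1.0
ln_bias = np.array([0.0, 2.0, 0.0, 0.0])
b_q = np.array([3.0, 0.0])

key_positions = pos_emb_mod @ W_k                 # (n_positions, d_head)
from_ln_bias = key_positions @ (ln_bias @ W_q)    # layernorm bias through the query weights
from_query_bias = key_positions @ b_q             # query bias

slopes = [np.polyfit(positions, c, 1)[0] for c in (from_ln_bias, from_query_bias)]

# Attention from query position 30 using only these two components.
query_position = 30
visible = (from_ln_bias + from_query_bias)[: query_position + 1]
weights = np.exp(visible - visible.max())
weights /= weights.sum()

print(slopes, int(np.argmax(weights)))
```

Because both components rise linearly with key position, the softmax over the visible positions puts the most weight on the latest one and less on each earlier one.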

Some observations:

Head 5 is doing something entirely different, instead focusing on the first position in particular.
Heads 3 and 7, and to a slightly lesser extent 4 all have noticeably larger slopes than the rest. This should mean that they have a stronger focus on positions directly before the query, assuming similar values in the other components.
The two biases mostly behave the same, even balancing out some of each other’s deviations from linearity. In heads 9 and 11, they have noticeably different overall slopes, though.

So far, so good.

Query Positions

For the position-position interaction we have 1024 such images for each head, one for each query position.

A couple examples:

These are for Head 0, and show the scalar product component for every 100 query-side positions. The vertical line shows the query position on the key position scale. Since the network can only look backwards, only the graph to the left of this line is relevant for the algorithm.

Overall, we can see that the total scale is less than that of the biases. At query position 500, for example, there is a part of the graph around key position 150 with larger values than the previous couple tokens (which would imply that the attention head is particularly paying attention to those positions), but that difference of ~3 is overwhelmed by the difference of >20 between position 150 and 500 from the biases.

This interaction seems to be here to add to the slope of the biases, but only for recent positions.

If a network wants to create an upward slope around the current position, the result will be a combination of all the independent ways it has to achieve this. The same effect of increasing the slope could just be achieved by scaling up the weights connecting the query-biases to the key-positions. But, since position-position interactions can also increase that slope, with side effects that are irrelevant with a large enough slope from the biases, we also find that method in the network.

We don’t even need penalties for large weights in the network for this to happen; it will always happen when the methods are independent and without side effects.

Let’s see if these are continuous with respect to changing the query position, by merging these images into an animation:

Query Tokens

Some graphs:

These are just 12 random tokens and their key-position scalar product component, for head 0.

For most of these the scale is much smaller than for the other key-position components. Also, all of those that do have a large range of values tend to follow the same linear pattern, some with positive and some with negative slope.

After sorting the tokens by that slope:

In conclusion:

Most tokens don’t change much about which key-positions are attended to
For the tokens that do have a larger effect, this effect can be summarised as an overall increase or decrease in position-slope

Now we can make a position scalar product animation for each query token, so here are 2 examples from head 0: the two tokens contributing the largest and smallest slope, respectively.

We can see that even in both of the extreme cases, the overall structure of the sum stays similar. This also holds for other heads: For all heads except 5, there is no query token that changes the structure set up by biases and position. All a token does is slightly adjust the slope.

What even is a feature

In general, we think of features in a layer of a neural network as directions in activation space that represent some input structure. Consider the binary encoding of a natural number less than 16. One could take the set of directions [0001, 0010, 0100, 1000], give each a simple description of the represented structure and understand everything there is to understand about it, including how it uses these features to calculate things. One could also say “0000 represents the number 0”, “0001 represents the number 1” and so on to have a full record of what this part does, but without any underlying structure. Call these extremes “understandable concepts used in computation” and “atomic input”. Between these, there might be some number of ways of decomposing this structure into features. As we go towards “atomic input”, their sparsity increases, up to only one being active at a time.
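The two extremes for the 4-bit example can be written out as follows (function names are made up for the illustration):

```python
def binary_features(n):
    """Four features, one per bit, lowest bit first."""
    return [(n >> bit) & 1 for bit in range(4)]

def atomic_features(n):
    """Sixteen features, exactly one active per number."""
    return [1 if i == n else 0 for i in range(16)]

def mean_active(encode):
    # Average number of active features over all 16 inputs.
    return sum(sum(encode(n)) for n in range(16)) / 16

# With the binary features, a computation such as parity reads one feature.
is_odd = binary_features(11)[0] == 1

print(binary_features(11), mean_active(binary_features), mean_active(atomic_features))
```

The atomic encoding is as sparse as possible, with one active feature per input, but it says nothing about how a property like parity could be computed; the binary encoding is denser and exposes that structure directly.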

Of course, input space is really large, 50257^1024 for GPT2-small, so “atomic input” tends not to actually be viable. But in my view, the main point of interpretability techniques that find a sparser representation should be to provide a starting point for finding some level at which we can actually understand the computation.

In the previous section the main takeaway is: “to pay attention to the closest few tokens, a head wants a positive Key-position slope, magnitude dependent on how far it wants to look back. The overall shape of this slope is mainly produced by bias-position interaction, and is adjustable for each query token, although this adjustment tends to be fairly small.” Here, the “feature” of Key-position slope can’t even be usefully expressed as a direction or set of directions in some activation space, but it clearly is important for our understanding of what the network does.

Key Token components

For this section, we can no longer rely on the nice structure from the position to give us obvious things to try. Fortunately, this is the first point of cross-token interaction, so the size of the input space (number of possible different inputs) when focusing on one query token is only 50257*1024 (50257 possible key tokens, at 1024 possible positions), and only 50257 after separating out the position part. So we’ll simply take the “atomic input” feature framework to try to find out what each of the heads is doing. Of course it would be nice if we had an “understandable concepts used in computation” framework for the token embedding, as then we would be in a much better position to find out how the heads are doing what they’re doing.

Instead our goal will be to refine our understanding of the heads to find some sort of categorisation of what each head is doing. It’s no longer useful to compare all heads to each other, instead we’ll have to go through, head by head, to figure out their function.

From the position analysis, we already have a baseline for the meaning of relative magnitudes of scalar product components: If some token gets a scalar product of x more than some other token, the head prefers the first token, unless it is x/slope further away from the current position.

In general, a token can only be attended to if its token-based scalar product is distance*slope more than the query-token’s token-based scalar product.

This also means that an attention head will tend to have the current token as baseline token attended to, in case it doesn’t find what it is looking for.
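A small worked example of this trade-off, with made-up numbers and a purely linear position penalty (an idealisation of the slopes found above):

```python
def total_score(token_score, distance, slope):
    # Token-based scalar product minus the linear penalty for looking back.
    return token_score - slope * distance

def preferred(a, b, slope):
    """a and b are (token_score, distance) pairs; returns the better one."""
    return a if total_score(*a, slope) >= total_score(*b, slope) else b

slope = 0.5                      # hypothetical score lost per position of distance
near = (2.0, 1)                  # modest token score, directly before the query
far_ok = (6.0, 5)                # 4.0 more token score, 4 positions further back
far_too_far = (6.0, 10)          # same token score, 9 positions further back

break_even = (far_ok[0] - near[0]) / slope     # x / slope = 8 positions

print(preferred(far_ok, near, slope), preferred(far_too_far, near, slope), break_even)
```

The token with the 4.0 advantage wins as long as it is fewer than 8 positions further away than its competitor, and loses beyond that.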

Query Tokens

Unlike the positions, tokens don’t have some kind of natural order, and there is no reason to expect anything like the continuity we found there. Instead, we’ll have to take a couple of analysis approaches and see what seems to work.

Sorted by variance/std: The larger the variance for tokens on the key side, the more this head discriminates between different key tokens for this specific query token, so we look at the query tokens with the largest variance and their largest scalar product key tokens
Self-rank/self-correlation: 2 metrics to see whether a head tends to have a large scalar product for previous exact copies of the query token
Sorted by variance, but key-to-query: This is sort of asking the question “given a token, what is the distribution by which later tokens would care about this token?” This goes against the logic that we’ve carefully set up so far, because a key token having the same scalar product with every query token doesn’t mean anything for the softmax. Even so, it works to find useful examples to analyse.
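As a rough sketch of these metrics, assume the token-based scalar products for one head are available as a matrix `sp[q, k]` (query token `q`, key token `k`). The matrix below is a random stand-in with a boosted diagonal, not real gpt2-small data, and all variable names are illustrative. Only the self-rank half of the second approach is shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens = 200  # toy vocabulary; the real matrix would be 50257 x 50257

# sp[q, k]: token-based scalar product of query token q with key token k.
# Random stand-in, with an added diagonal to mimic a "same token" head.
sp = rng.normal(size=(n_tokens, n_tokens)) + 3.0 * np.eye(n_tokens)

# 1. Variance over key tokens, per query token, and the largest
#    scalar product key tokens for the highest-variance query tokens.
query_variance = sp.var(axis=1)
top_queries = np.argsort(-query_variance)[:5]
top_keys = np.argsort(-sp[top_queries], axis=1)[:, :5]

# 2. Self-rank: how many key tokens score strictly higher than the
#    query token itself (0 means the token prefers its own copies most).
self_rank = (sp > np.diag(sp)[:, None]).sum(axis=1)

# 3. Key-to-query: variance over query tokens, per key token.
key_variance = sp.var(axis=0)

print(top_queries, self_rank.mean())
```

On the real matrix, the same three quantities would need to be computed in chunks, since 50257 x 50257 floats do not fit comfortably in memory on most machines.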

Heads 3, 4 and 7

All of these heads already stood out in the position analysis, let’s see what their token side looks like.

This is the full query_token-to-key_token Scalar product matrix.

The full query-token to key-token matrix for head 3. Horizontal and vertical mean normalised (more on horizontal mean in the “biases” section), only every 30th row and column shown here, since the full image would be 50257x50257

We can see that this head tends to have a large scalar product for copies of the query token itself, but not much else is directly visible.

Let’s first sort each column by scalar product:

Then permute the columns so that they are sorted by variance, descending from left to right:

And finally zoom into the top left corner:

This image gives us some overview of which key tokens the head is trying to find (large scalar product), for the query token that the head cares about the most (large variance among key-token scalar products^[1]). A lot of the images in this section follow the same zoom-in principle.
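The sort, permute and zoom steps can be sketched as follows. This assumes that columns index query tokens and rows index key tokens, and it reads "horizontal and vertical mean normalised" as subtracting row and column means; both are assumptions on my part, and the matrix is a random stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in: rows are key tokens, columns are query tokens (assumed layout).
# Each column gets its own scale so that the variances differ.
sp = rng.normal(size=(300, 300)) * rng.uniform(0.5, 3.0, size=(1, 300))

# Remove row and column means (one reading of the normalisation above).
centered = (sp
            - sp.mean(axis=0, keepdims=True)
            - sp.mean(axis=1, keepdims=True)
            + sp.mean())

# Sort each column (one query token) by scalar product, largest at the top.
key_order = np.argsort(-centered, axis=0)
sorted_cols = np.take_along_axis(centered, key_order, axis=0)

# Permute the columns by variance, descending from left to right.
col_order = np.argsort(-sorted_cols.var(axis=0))
sorted_cols = sorted_cols[:, col_order]
key_order = key_order[:, col_order]

# Zoom into the top-left corner: the 10 best key tokens
# for the 5 highest-variance query tokens.
zoom_values = sorted_cols[:10, :5]
zoom_key_ids = key_order[:10, :5]
zoom_query_ids = col_order[:5]
```

Here `zoom_key_ids` and `zoom_query_ids` are token indices; decoding them with the tokenizer would give the token strings to label such an image with.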

Head 4:

Head 7:

It seems pretty clear what’s happening here. If the current token is part of an incomplete semantic unit (usually a single word), these heads find the part of that unit/word that uniquely identifies it. With that, it makes sense that these heads would have the largest position slope, since the token they are looking for should be very close to the query position, in most cases directly preceding it.

Some examples of the same query token across heads:

In summary, these heads try to find the rest of the current token’s semantic unit, if it is incomplete. They do so by having a large position slope (i.e. only looking at the closest few positions) and by looking for possible beginnings of the current semantic unit. This happens somewhat redundantly, as all three heads look for most of the possible previous words and most previous tokens correctly found by one of the heads will also be found by another, but they tend to have different parts of wordspace they are primarily looking for.

Heads 1, 5 and 10

Head 1:

Head 5:

Head 10:

All of these heads are looking for previous instances of the same token or tokens that are very similar. Only head 10 has a special case where for personal pronouns it looks rather for the referent than another reference.

This is also noticeable in the self-rank and SP-cos_sim-correlation statistics:

So what is the difference between the three heads?

Here again the graphs for the key position - bias components:

Head 5 looks for occurrences of the query token at the start, while heads 1 and 10 prefer recent ones.

In general, to get some idea about how far back a head is trying to look, we can compare the slope of those graphs with the standard deviation of the token - token component. Any positional difference that overwhelms the normal token differences is going to be out of scope for this head.

For these 3 heads, we have a slightly different situation: The query token itself is a token that the head wants to attend to. For the word-completion heads 3, 4 and 7, the query token was the default option for positional reasons, here it is the fallback option also for semantic reasons.

So for heads 1 and 5, since their token standard deviations are very large in comparison to the position scalar product differences, only very similar words will be attended to, because there is always at least one (the query token itself). The positional part provides a ranking amongst those. Looking at the position graphs, 1 prioritises the latest occurrences whereas 5 prioritises the first occurrences.

Head 10 has a much smaller standard deviation for token-token interactions. So it will more readily give some of its attention to tokens that are similar, but not the same. For pronouns specifically, it looks for some referent rather than the same pronoun again. Positionally, it also prioritises the latest occurrences, similar to head 1.

Heads 0, 2 and 8

First, looking at an example from the token-token interactions for each head:

The structure here is less easily interpretable. One thing we might say is that they also seem to look for the meaning of the current token, but across semantic units (unlike heads 3, 4 and 7) to find the meaning in a broader context. For example, apparently the token “ Aber” pretty much only appears in the words Abercrombie (in “Abercrombie and Fitch”, a clothing retailer) and Aberforth (a minor character in the Harry Potter universe), so head 2 tries to distinguish between those by looking for other words that appear in the same context.

The problem here is that there are fewer context-defining tokens than there are context-dependent tokens, so that the variance of the context-defining tokens doesn’t give that strong a signal. So let’s try looking at the interaction the other way around (looking for possible query tokens, given some key token):

These are much easier to read. It seems like these heads are looking for words related to the query token in different contexts, to pin down the specific context this token is being used in. As far as I can tell, this happens somewhat redundantly, but the heads have clearly different areas they focus on. For some examples, Head 0 has high variances for sport and coding, Head 2 for fiction and geography, and Head 8 for cars and drugs.

An example of these heads’ function on the same query tokens. Clearly, head 8 is the main one responsible for this section of wordspace, head 2 is picking up some possible outliers with apparently very little overlap and head 0 is almost completely out of its distribution (apparently, Powershell is the only environment with a capital-c Convert command). The lower case works much better for both heads 0 and 2: apparently there is more lower-case-c conversion happening in chemistry and religion.

A couple examples to get a glimpse of how the 3 heads separate word-space between them:

Head 6

This head seems to do something similar to heads 3, 4 and 7, except mainly on non-standard or non-English names and coding syntax.

My guess for what’s happening here:

Foreign names tend to follow different rules for construction and so can’t reliably be completed by heads 3, 4 or 7. This also happens for some fictional names, note Thanos, Westeros, (Phy)rexian and (G)rixis above. Coding syntax doesn’t tend to follow normal English rules either, so it is also handled here. In code, a single semantic unit (e.g. a pair of brackets) can span a long distance so this head needs a shallower positional gradient than 3, 4 or 7.

For normal English words, it seems like this head is looking for fitting prefixes but with much less precision. This could be to catch misspellings of the word, or just a consequence of the handling of special cases.

Head 9

In the self-rank statistics above, head 9 stood out by having a very large average rank for the query token itself (49125/50257). This suggests something like “find the opposite of the query token”, analogous to heads 1, 5 and 10.

On the other hand, the SP-cos_sim correlation was -0.18, so negative as expected, but not as large in magnitude as for the heads trying to find copies of the query token.

Some highest variance token-token examples:

And some other examples that came up before:

Here we can see that there is something opposite-like going on. It’s not quite just taking the negative of the embedding vector, which we also know from the low correlation. For some categories of words it’s mainly looking for other examples in that category, like “geographical location” or “fictional”, but excludes words from one subcategory below it like “in the US” or “Star Wars”. In particular, “Geographical location” and “Sports team” seem to be things the head is often looking for.

This is a graph where I bucketed all 50257^2 pairs by their cosine similarity, to get some idea of the relationship between semantic “closeness” and the head 9 scalar product. The green histogram lets us dismiss the data points between 0.85 and 1 as outliers (these are glitch tokens, real words don’t really get a similarity above 0.9. As an example, the similarity between “ car” and “ cars” is 0.72 and between “ big” and “ large” is 0.65), and we are left with a maximum at cosine similarity ~0.5, and a sharp decline as the tokens get more semantically similar.
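A bucketing of this kind could look roughly like the sketch below. The embeddings and scalar products are random stand-ins, the bucket width of 0.05 is an arbitrary choice, and the variable names are illustrative. The last line also computes the kind of SP-cos_sim correlation quoted earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 500, 32
embed = rng.normal(size=(n_tokens, d_model))  # stand-in token embeddings
sp = rng.normal(size=(n_tokens, n_tokens))    # stand-in head scalar products

# Cosine similarity between every pair of token embeddings.
unit = embed / np.linalg.norm(embed, axis=1, keepdims=True)
cos_sim = unit @ unit.T

sims = cos_sim.ravel()
sps = sp.ravel()

# Assign every pair to a cosine-similarity bucket of width 0.05.
bin_edges = np.linspace(-1.0, 1.0, 41)
n_bins = len(bin_edges) - 1
bucket = np.clip(np.digitize(sims, bin_edges) - 1, 0, n_bins - 1)

# Pair count per bucket (the histogram) and mean scalar product per bucket.
counts = np.bincount(bucket, minlength=n_bins)
sums = np.bincount(bucket, weights=sps, minlength=n_bins)
mean_sp = np.divide(sums, counts, out=np.full(n_bins, np.nan), where=counts > 0)

# Overall correlation between scalar product and cosine similarity.
correlation = np.corrcoef(sims, sps)[0, 1]
```

Buckets with very few pairs (as reported above for similarities between 0.85 and 1) give noisy means, which is why it helps to look at `counts` next to `mean_sp`.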

What is it doing: It is looking for high-information-density tokens that are some specific semantic distance away from the query token.

How is it doing this: Interesting question for which we would need a better understanding of the token embedding.

Why is it doing this: My guess is that it is trying to find the scope of the current context. For example, when seeing the token “ Winnipeg” without any distant geographical locations, we’d expect the current piece of text to be something like a local news article or wikipedia page. But if “ Tunis” was mentioned recently, it is much more likely that the scope is global and Winnipeg is just one example.

Head 11

This is the head where I found the least amount of interpretable structure. The main noticeable point of data is the distribution of variances for query tokens. The token “ Edition” has a variance of around 100, while the second-most variance is around 50. This is such a massive outlier that it should be directly related to what this head is doing. Notably, uncapitalised “ edition”, though it has a similar ranking of key tokens, has completely in-distribution scalar products. Also notable is that most of the highest-variance query tokens are related to gaming or fiction.

My best guess as to what this head is doing is that it's trying to place words by context, similar to heads 0, 2 and 8, but with a strong focus on gaming, fiction and tech.

Query Positions

When looking at the scalar products of tokens for different positions, most heads show the same pattern: Early positions favour common tokens and penalise rare ones while late positions do the opposite. The transition between the two is pretty linear, so that for some position in the middle we have a very small variance. Only heads 5 and 11 don’t follow this pattern and just have relatively low variances on all positions.

I am not sure why this happens. It is remarkably consistent across heads, so my guess is that it has something to do with some general structure of the training data.

The Biases

This part is mainly about the Query Layernorm Bias - Key Token and Query Bias - Key Token interactions, but earlier we also deferred discussion of the average Query Token - Key Token interaction to here.

Example scenario: For head 7, every query token has a large scalar product for the key token “ in”, 15.2 on average. This should have the same effect as there being a large bias component for that key token, i.e. the head having a general preference for attending to that token. So it should be interesting to see how those 2 things relate to each other:

Apparently, the biases tend to cancel out the token average’s effect. For heads 0, 2, 8 and 11, there is a large negative correlation between the bias component and the average token component. Head 7 also has a large negative correlation, but with an even steeper negative slope, so that the biases not only cancel out the average’s effect, but reverse it. For most other heads, the same effect is also there, but weakened, with only head 9 having a correlation of around 0.
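The comparison between the two components comes down to a correlation and a fitted slope per head. Below is a minimal sketch on synthetic data, where the stand-in bias component is constructed to cancel about 80% of the average; the numbers and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens = 1000

# sp[q, k]: stand-in token-token scalar products with a per-key-token offset,
# so that some key tokens are preferred by every query token.
key_offsets = rng.normal(scale=2.0, size=(1, n_tokens))
sp = rng.normal(size=(n_tokens, n_tokens)) + key_offsets

# Average token component: mean over all query tokens, per key token.
avg_token_component = sp.mean(axis=0)

# Stand-in bias component that cancels roughly 80% of the average, plus noise.
bias_component = -0.8 * avg_token_component + rng.normal(scale=0.3, size=n_tokens)

correlation = np.corrcoef(avg_token_component, bias_component)[0, 1]
slope, intercept = np.polyfit(avg_token_component, bias_component, deg=1)
print(correlation, slope)
```

In this picture, a slope near -1 means the bias exactly cancels the average preference, a slope between -1 and 0 means partial cancellation, and a slope below -1 means the preference is reversed, as described for head 7.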

My guess for why this happens: even if some head has a general preference for some tokens over others, that preference will not tend to be aligned with the average token component, so the bias component is used to cancel or adjust this behaviour.

Especially heads with complex semantic calculations going on (0, 2, 8 and 11) will both have average-affecting side effects to be cancelled and not have large general preferences since what they’re looking for strongly depends on the query token.

Heads 3, 4, 6 and 7 do have words they are looking for in general, namely tokens that are unfinished words, and so the biases only slightly correct, or in 7’s case reverse, the preferences given by the averages.

Heads 1, 5 and 10 don’t have anything they’re looking for in particular and so we do see them slightly correcting the average, but there is an ordering of the importance of their function: it’s important to know exactly where e.g. “ Rasmussen” was mentioned before, but it’s less important to know exactly where “ to” was mentioned before. The softness of the softmax will always introduce some imprecision, but we can shift the imprecision slightly by having a general preference for names and other unique identifiers and against common particles.

Head Archetypes

We've found that the 12 attention heads can be divided into 4 groups by their function:

semantic unit completion (3, 4, 6, 7)
same token attention (1, 5, 10)
context clarification (0, 2, 8, 11?)
whatever head 9 is doing (scope finding?) (9)

These give us a catalogue of concepts that we can work with as we move forward to the Layer 0 MLP.

Conclusion

We now have some understanding of what gpt2-small does up to the 0th-layer attention block. We’ve arrived there by splitting up the residual stream into 2 modules, and seeing how those modules go on to interact. There are a few particularly interesting insights into how this network functions, like how a positional slope is created or which attention head archetypes are present in layer 0, but I think the main result is that there is a division of the network into (large, not even slightly sparse) modules that helps with understanding and that understanding the main functions of a network (at least in layer 0) is not particularly difficult, if you approach it in a structured, modular manner.

In the end, this approach gave a much more clear and complete overview of what is actually happening inside the network than I envisioned when I started, so I’d still like to see how far it can go.

Next Steps

There are a few obvious ways to proceed:

continue to the MLP
do the same thing for other networks, or just check whether the same patterns are present
- networks with learned position embedding
- networks with hardcoded/otherwise different position embedding
find a good basis for the token embedding, go deeper in the analysis (how exactly do heads 0, 2 and 8 look for relevant context tokens and separate word-space between them?)

I currently prefer doing the first of these, and I invite anyone to do the second. For the third, I have a collection of notes about what doesn’t work and what almost works, which I might put into a blogpost if people are interested.

^[1]
Other ways of measuring this can also make sense, like the length of the key-embed vector, but I think that for most sensible metrics, you get a very similar cluster of “query tokens that this head cares about the most”

fragmential (2mo):

This is cool. Are you planning a part 2?

gammagurke (7d):

Eventually. I hope to either get enough free time or funding to continue with the same style to the MLP, but for now I have other projects. As a side-effect, some of those other projects might make this style of analysis easier. I do intend to understand gpt2 before AGI.
