Finding the uncertainty vector in GPT2-scale transformers

In this post I explore a phenomena in LLMs where the training process naturally consolidates information in a highly interpretable structure in the residual stream, through a positive feedback loop from a small variation at initialization. I start with a toy example and work up to GPT2 scale, showing animations of how weights and activations evolve over training. I assume familiarity with the transformer architecture.

The exploratory tone of this post will likely lead to more questions than answers. The intended audience is people hoping to learn more about transformer internals and their dynamics over training. The motivating question being "What is going on in this GPT2-scale model because these charts look incredibly weird".

The dimensions in the residual stream are often thought of as an uninterpretable arbitrary rotation of the feature space, since the standard transformer does not have an operation that makes it a privileged basis. Yet, the behavior above for dimension 275 is remarkably distinct. These charts show the evolution of three dynamics over the course of 600 training steps as the model progresses from 10.8 CE loss to 3.6 CE loss:

How does the language model head, which is of shape [d_model, d_vocab], update for [275, :]? This means if the model blocks add to dimension 275 in the residual stream, what change does that cause to the prediction of each token in the vocabulary?
How do 3 specific neurons in the final MLP evolve? Each neuron multiplies its activation by an output vector and pushes in that direction. The first two neurons push in very specific dimensions, notably dimension 275. The third is included as a normal neuron baseline.
What is the activation distribution (sampling 60k tokens) of the residual stream right before getting multiplied by the lm_head for dimension 275?

The short answer is that dim 275 in the residual stream is functioning as an uncertainty vector. If the model puts 100% of its weight into 275, it will output roughly the naïve unigram distribution. Given the vector is normed, 100% of weight corresponds to root(768) or 27.7 in the last chart above. MLP Neurons then have a handle to perform several useful actions:

If a neuron fires on a very particular fact, it can add to dim 275 (make less negative) to indicate certainty.
If a neuron fires on ambiguous contexts, it can subtract from dim 275 (make more negative) to indicate uncertainty.

Part of why this occurs is because this version of GPT2 does not have a bias in the lm_head. A bias in the lm_head will directly model the unigram distribution, which decreases the gradient pressure for this dynamic to develop in the residual stream.

Toy Model

The first peculiar fact about the charts above is that training evokes a distinguished basis in the residual stream over time for dimension 275, even though mathematically it doesn't appear like there is any mechanism for this. To explore this further, I look at how simple initialization differences in a toy model can evoke roles for dimensions in the residual stream. Feel free to skip this section if this idea is already obvious.

The task the toy model will be learning is to predict the next digit (i) as a function of the preceding digit (j). The sequence of digits 0 to 9 will be sampled according to the scenario.

Scenario 1

The digits are sampled from the distribution [0.1,0.2,...,1], normalized such that digit 9 is 10x more likely than digit 0. The model is tasked with minimizing cross entropy loss via stochastic gradient descent.

I start with the simplest transformer possible: no blocks and a model dimension of 1. The model has 20 coefficients, one for each input digit and one for each output digit. The prediction logit that digit 4 will follow digit 3 is calculated by input_coef_3 * output_coef_4.

Input coefficients are initialized to the normal distribution, and output coefficients are initialized to 0. Training produces the following charts:

The input coefficient learns a constant because the sequence data is independent. The output coefficient learns the distribution to match the unigram distribution of the data.

Scenario 2

I now update the model dimension from 1 to 2. Training produces the following charts:

The computation is spread out across both dimensions. The second chart shows how the input coefficients sit on a line. A single model dimension has enough degrees of freedom to fit the data, but because there are so many more valid linear combinations that reach the result, it statistically improbable for the relationship to fall into exactly a single dimension. When this fuzzying occurs across a large number of dimensions in a larger scale transformer, it can become challenging to disentangle the underlying mechanisms.

Scenario 3

I now repeat scenario 2, but initialize the first dimension of the input coefficients to 10 instead of a normal distribution around 0.

The trained model isolates the learned portion into dimension 0. Dimension 1 output coefficients stay at zero. This occurs for two reasons:

Non-zero activation mean. Lets say I want to modify output_coef_4 weights such that I increase the likelihood of predicting digit 4 for all inputs. If all inputs are positive values, then I can increase the output by increasing the coefficient. If all inputs are negative, then I can increase the output by decreasing the coefficient. If outputs are mixed with a mean of zero, then the gradients largely cancel out. In the general case, if the subset of inputs that we want to shift have a shared sign for a given dimension, that dimension will have asymmetrically large gradient pressure on the output coefficients.
Higher magnitude activations. The gradient on the output coefficients is proportional to the magnitude of the corresponding input coefficients. Once the pattern in the data has been fit, the gradients will all fall to zero. So if one dimension can respond faster, it can claim the reward.

Scenario 4

Next I return to initializing the input coefficients from the normal distribution, but I add a twist to the data distribution: Digits 0-8 are sampled from the distribution [0.1,0.2,...,0.9]. I then update any tokens that follow 3 with 9. A trained model should always predict 9 when it sees 3, and otherwise predict the same distribution as earlier scenarios.

The computation gets spread out across both dimensions. The model learns to predict increasing probabilities for digits 0 through 8. The red line in the last chart corresponds to the output predictions for input 3. It is zero for all digits except for 9, where it jumps to 100%.

Scenario 5

I take scenario 4 and initialize the first dimension of the input coefficients to 10.

Once again, dimension zero dominates on modeling the pattern that applies to all input dimensions. Dimensions with consistent inputs can apply consistent patterns to inputs. Dimension 1 contributes more heavily to the prediction of 9 given 3.

The main takeaway here is that even though the mathematical structure of the model does not induce a preferred basis in the residual stream, parameter initialization can create a basis that persists throughout the training process. But does this actually scale to small LLMs?

GPT2-Small Scale

I will be referencing training runs from modded-nanogpt, which is the GPT-2 scale model I am most familiar with. It has a model dimension of 768, 11 layers, and 6 heads per layer, and no biases on any linear projections or the language model head. The GPT2 tokenizer is used with a vocab size of 50257, with 47 padding tokens at the end to give a clean 50304.

At initialization

The training dataset is FineWeb, which like most other training dataset, has tokens that roughly follow a log normal sampling distribution. That distribution is shown below:

The spike in the bottom right plot on the left tail corresponds to the 270 tokens with zero occurrences, which are defaulted to 1e-8. The most common tokens are listed below. In general smaller token_ids tend to have higher frequency.

Token_id	Token_str	Fraction
13	.	0.0378
11	,	0.0360
262	_the	0.0336
198	\n	0.0217

I will look deeper at token 262 '_the', which makes up 3.36% of the training corpus. Below is its initial embedding weight across the 768 model dimensions, which is sampled from a normal distribution:

This looks exactly like one might expect, with a mean very close to zero of 0.03. The most extreme value of the distribution comes from dimension 275, with a value of -3.71875.

The chart below shows what dimension 275 looks like across all tokens in the vocabulary, with '_the' marked with a red dot:

Things start to get interesting when we look at the activation distribution. That is, when we pass the data distribution into the model, what distribution of embedding outputs is produced in the residual stream?

The distribution is no longer normal. The spike from '_the' at -3.71875 is shifting the mean to -0.22. How does this mean compare to the other 767 dimensions?

-0.22 falls near the far left tail, which indicates that dimension 275 starts out with one of the most lopsided distributions.

At initialization the language model head and all output projections from each block are set to zero.

During Training

On step 1 the only gradient pressure on the model is on the lm_head. The lm_head vectors will update in the direction that models the bigram distribution. In other words, every time the model sees the sequence '_the X', token X will update its lm_head vector in the direction that makes it more similar to '_the', and all other lm_head vectors will update in the direction that makes them less similar to '_the'.

Since dimension 275 is sharply negative for '_the', whenever a token follows '_the', it will see its dimension 275 decreased. All other tokens will increase.

Roughly 9k/50k tokens see their lm_head vector decreased on step 1 for dimension 275. The mean of this chart is 0.005, as most tokens see an increase. However, the tokens that decrease are the ones that occur most frequently in the data. Naturally, if a token frequently follows 'the_' it will generally occur more frequently.

If I weight every token by its rate of occurrence, the weighted mean of dimension 275 in the lm_head drops to -0.0035. To understand if this is substantial, I plot the weighted-mean shift of every one of the 768 dimensions of the lm_head below.

-0.0035 sits on the far left tail.

On step 2 the MLPs in the network start to engage. A single neuron in an MLP with Relu activation and no bias can be thought of as randomly bisecting half the activation space to sample half the inputs, and then choosing single direction in model space to push those inputs. At this stage, the only signal the MLP has is 'push this batch of inputs in the direction of the lm_head for their target tokens, and away from the lm_head of the other tokens'.

Most tokens have a positive lm_head value on dimension 275, but the tokens that frequently follow '_the' and frequently occur in the forward pass have a negative value for dimension 275. And so MLP neurons are incentivized to push in the negative direction for step 275.

The chart below shows what direction MLP neurons 'push' on step 2. Each layer has 3072 neurons. Across 11 layers this gives 33,792 neurons.

The mean of the distribution is slightly negative. Each input will activate for roughly half of the neurons. This means that each input is going to roughly sample 16,000 times from the distribution above. The charts below show the resulting activation on Step 2 after these MLPs fire.

The third chart shows how 98% of activations are getting pushed further negative. At this point, the network has now completed the reinforcing feedback loop. The full sequence of events is:

Anomaly occurs in Dim 275. Performing log-normal weighted sampling over the vocab space (which is what naturally occurs when we sample on the forward pass) from a normal distribution causing skewed activation distributions in several dimensions at initialization. An example of this is dimension 275, which is heavily impacted by token '_the'.
Anomaly spreads to tokens that follow '_the'. On step 1, tokens that follow '_the' decrease their lm_head value in dimension 275, and tokens that don't follow '_the' increase their lm_head value in dimension 275.
Anomaly spreads to 98% of inputs. Each neuron activates on roughly 50% of the inputs. If a neuron activates on a large number of tokens from step 2, it will push that full batch negative.
Repeat. The cycle now repeats in step 1, but instead of 'tokens that follow _the', its 'tokens that follow 98% of inputs'.

This cycle effectively monopolizes dimension 275 to perform the function of modeling the unigram distribution, and it all starts with a single seed value of -3.7 for token '_the'. From the framing of a spreading virus, '_the' is a prime viral vector.

Finding the full Uncertainty Vector

Does the same effect occur in other dimensions? Below is the mean activation right before the lm_head for each dimension at step 600.

275 shows the largest magnitude, followed by 573 and 360. No other dimensions appear to have this dynamic.

All three dimensions show similar distributions, indicating that the actual 'uncertainty vector' is smeared across three dimensions in the residual stream. Below shows the initial activations induced in dimension 573 by the embedding:

The 'seed' for dimension 573 is planted by the right activation around 2.4. Perhaps coincidentally, this also corresponds to the token '_the'.

I wanted to revisit one fact in this chart:

There is a small spike of height 100 in the histogram that starts to go slightly positive around step 400. This corresponds to the beginning of sequence token ,which occurs at the start of every document during training. Here is its activation across all 768 dimensions on step 600:

It has the largest magnitude activation in dimension 360, and a very small positive activation in dimension 275. How do the predicted probabilities differ between the unigram distribution, dim 275, and dim 360?

Since the softmax function is scale sensitive, we can only compare these distributions to the unigram distribution after applying a scalar factor. A uniform prediction across all 50304 tokens gives 10.8 loss. A perfect prediction gives 7.66 loss. Here is how the loss scales for dimensions [275, 360, 573] as the activation magnitude is varied from 0 to 30:

Both dimensions 275 and 573 achieve minimum loss against the unigram distribution exactly when 100% of the activation is placed into them, which gives a normed activation of 27.7.

To see which activation most accurately predicts the unigram distribution, I run the following code:

# Compute which activation most strongly predicts unigram distribution
class UnigramModel(nn.Module):
  def __init__(self, d_model, d_vocab, fixed_head):
    super().__init__()
    self.w = nn.Parameter(torch.ones((1, d_model), device='cuda'))
    self.lm_head = fixed_head.clone()[:,:D_VOCAB]
  def forward(self, y):
    logits = (self.w @ self.lm_head).squeeze(0)
    log_probs = F.log_softmax(logits, dim=-1)
    loss = -(log_probs*y).sum()
    return loss
target = frac
unigram = UnigramModel(768, D_VOCAB, model.lm_head.weight.T.float().data)
optim = torch.optim.SGD(unigram.parameters(), lr=1)
for step in range(10000):
  loss = unigram(target)
  loss.backward()
  optim.step()
  optim.zero_grad()
  if step%1000==0:
    print(loss.item())

data = unigram.w.data.cpu().numpy()[0]
plt.figure(figsize=(12,4), dpi=100)
plt.title('Unigram Activation')
plt.plot(data)
plt.vlines(x=range(768),ymin=0, ymax=data)
plt.show()

This gives 7.73 loss, very close to the perfect score of 7.66. To predict complete uncertainty, an activation should roughly put 50% of its magnitude into negative dim 275, 25% into negative dim 360, and 25% into dim 573.

What is the biggest difference between dims 275 and 360? Here are the top 10 tokens that 360 prefers over 275:

['The' 'A' 'I' '1' 'H' 'L' 'the' 'In' 'P' 'B']

Here are the top 10 tokens 275 prefers over 360:

[',' ' the' ' and' ' in' ' a' ' (' ' ' ' �' ' on' ' of']

Dim 360 appears to model uncertainty conditioned on start of document, as the tokens are capitalized with no leading space, whereas 275 models uncertainty conditioned on middle of document.

What is the biggest difference between dims 275 and 573? Here are the top 10 tokens that prefer 573 over 275:

'ixtape' 'gorithm' 'adata' 'umbnails' 'initions' 'INGTON' 'ospels' 'helial' ' ..............' 'ウス'

Here are the top 10 tokens that prefer 275 over 573:

',' ' the' ' and' ' in' '.' ' (' ' a' '-' '\n' ' �'

Dim 573 appears to model uncertainty conditioned on middle of word, whereas 275 models uncertainty conditioned on start of word.

Takeaways

The residual stream is not an arbitrary basis. It is given meaning through the spikey initialization of activations that occurs from sampling from tokens that follow a log-normal distribution, that kickoff positive feedback loops during training.
Interpretable structure of the final LLM can be seeded from small tweaks at initialization. In this case, the evolution of the uncertainty vector is traced back to very specific initialization values of the token "_the".
The uncertainty vector can condition on multiple factors: ["Start of document, Start of word in middle of sentence, or middle of word"]. Each of these is skewed towards one of 3 dimensions in the residual stream.
If we can better understand positive feedback loops that elicit separable structure during training, would it be possible to define an initialization seeding that evokes a much more generally interpretable model? EG we perform specific initialization for tokens related to "warfare" in dimension 15, such that a positive feedback loop kicks off during training that reinforces how dimension 15 relates to the concept of "warfare".

code: https://github.com/ClassicLarry/uncertaintyVectorLLM/tree/main

LESSWRONG
LW