Week One of Studying Transformers Architecture

JustisMills

I asked Claude already, but he's kind of a suckup, so if any experts find themselves reading, please tell me if anything here is super wrong.

I’ve been writing a lot about the meta level, which I think can be a trap for blogs in general, and blogs in the rationalist diaspora specifically. So this post is going to go into the weeds. Specifically, I’m going to write down all the stuff I’ve figured out about transformers architecture in a week of study. I may do more of these as my knowledge crystallizes.

You probably shouldn’t use this post to learn perfectly reliable information about transformers - I’m a beginner, and am happy to be corrected by readers for stuff I get wrong. But you might find it interesting if:

You’re curious how difficult it would be to get up to speed on transformers architecture yourself
You want a really high level view from a fellow beginner
You’re interested in learning processes more broadly

Or, of course, if you just like reading technical blog posts for their own sake. To these readers, especially, I bid you a warm welcome!

What are transformers?

A transformer is a kind of deep learning network, first introduced in 2017 by the paper Attention Is All You Need. Traditional transformers, of the type in that seminal paper, are good for sequence to sequence (or seq2seq) tasks, where both the input and the output are sequences. For example, translation, where the input is some words in one language, and the output is some words in another.

The big innovation transformers brought to the table was leaning harder on an already-existing piece of machine learning architecture called self attention. I’ll get to attention mechanisms later on, but for now just think of them as the transformer’s secret sauce. Some earlier models used attention in tandem with various other techniques; Attention Is All You Need showed that you could get better performance relying exclusively on attention mechanisms.

Most of the splashiest AI progress of the last 7 years has been on top of transformers architecture. Modern AI chatbots are transformers, as are LLMs (large language models) more broadly. In general, most of the recent gigantic burst of investment into AI has been a race to build ever larger and more sophisticated transformers.

What sort of transformer is an LLM?

I’m glad you asked! LLMs are autoregressive transformers, meaning (roughly) that they take in a sequence (e.g. a bunch of text), and output just one thing at a time (e.g. the predicted next few letters of that text). So when you send a message to ChatGPT, it actually figures out its response one syllable (or so) at a time, then feeds that next syllable to the end and treats that as its new input. So while traditional transformers tended to take in a sequence and spit out a sequence, autoregressive transformers take in a sequence and spit out exactly one thing at a time.

Because of this, autoregressive transformers don’t have something called an encoder layer. I’m not sure what encoder layers do in traditional transformers (yet), but they aren’t needed in autoregressive transformers, which only have decoder layers. More on these later.

Very big picture, what does a GPT model do?

Oh man, a bunch of things. I’ll stick to the very high level here. Also, I’m going to describe what a GPT does at inference, meaning when you ask a completed model to do its thing. This is distinct from what a GPT does during training, when it’s figuring out how to be as accurate as possible.

Broadly, a GPT goes through several stages. First, it receives some text. Then, it breaks that text down into tokens. I’ll get into tokens and where they come from soon, but for now just think of them as little snippets of text. Like, the sentence “I enjoy eating pancakes” might split into the tokens [“I”, “ en”, “joy”, “ eat”, “ing”, “ pan”, “ca”, “kes”]. If you wanted, you could also have a token for every letter, or for every word.

Anyway, once it has the relevant tokens, it looks them up in a database that it has on hand, which maps every token to a unique number. So now, rather than some words, we’ve got a bunch of numbers.

Each of those numbers now corresponds to a row in something called the embedding matrix. The GPT grabs all the rows corresponding to the numbers in its input, does some stuff to them (more on this later), and mashes them together into an input matrix. This is the thing that’s going to have a ton of math done to it, which will eventually emit the next token.

Once you’ve got an input matrix, it’s passed through some number of layers. Each layer has a few stages. First there’s a stage called the attention layer (really a sub-layer), and then a stage called the feedforward layer. Attention layers are what makes it a transformer - good old fashioned deep learning networks also have feedforward layers.

The output at the end of a layer is a matrix of the same size as the input. So you can just keep passing a matrix into any number of these layers, and you’ll keep getting matrices of the same size.

Finally, your last matrix pops out of the final layer. You do some stuff to it to get a big set of numbers (specifically, one number for every token in the vocabulary), and then do something else called softmaxing to turn that set of numbers into a probability distribution over tokens. Then the GPT picks the next token from that probability distribution, and the token it picks is added to the end of the input sequence. The whole process is then repeated for the new input sequence, until the GPT has generated as many tokens as were asked for, or until the emitted token is a special “end of sequence” token, meaning that the output is complete.

So, the steps in the process (as I’m classifying them) are:

1. Get a vocabulary of tokens somehow, and put them in a database (tokenizer)
2. Convert the input into tokens, and grab the embedding matrix row for each token (embedding)
3. Run the resulting input matrix through some layers (layers)
   a. First, the attention mechanism (attention)
   b. Then, a feedforward layer (feedforward)
   c. Also there’s some postprocessing that happens after one or both of these substeps, including residual connection and normalization (normalization)
4. Do one last transformation to turn the last layer’s output matrix into a different kind of output matrix, which can be used to predict the next token (final)
5. Turn that output matrix into a probability distribution (softmaxing)
6. Pick a next token, and repeat until done (recursion)

I’ll have a section for each of these, though, spoilers, 3a, 3c, and 4 are parts I still need to delve into, and so far don’t know that much about.

Finally, I’ll have a brief section after all these on training itself (as opposed to inference), but I’m not really there yet, either.

Tokenizer

First off, this process happens before anything else. Before training, before inference, before all of it. You have to have a way to convert your raw input into tokens, and therefore numbers, before the party starts.

What are desirable traits of a token? Technically, it’s totally possible to just make every letter a token, or every word. But there are way too many distinct words, and treating, say, “happen” and “happening” as completely distinct entities seems kind of silly. On the other hand, letters are a little too fine grained - you want to be able to just say confidently, “ing” comes next, rather than being forced to figure out that it’s “i” then “n” then “g” one at a time.

So, how can you strike a balance? One of the most popular tokenizers is called Byte Pair Encoding. Byte Pair Encoding basically works like this: start with all your letters; these are part of your vocabulary automatically. Then, check over some corpus to see which combinations of exactly two letters are most common. Take the most common ones, and add them to your vocabulary. Then, search the corpus to see which combinations of two members of your new vocabulary are most common (these combinations could have two, three, or four letters, depending on whether the ingredients are the original letters or the new combinations). Keep repeating the process, getting progressively larger combinations, until you hit some prespecified vocabulary size. You’ll end up with all the most common short sequences of letters, which is great for token conversion.

Just assign each of the members of your Byte Pair encoded vocabulary a number in a database, and boom. You’ve got tokens.
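Here is a toy sketch of that merge loop in plain Python. It assumes the textbook version of Byte Pair Encoding, which merges only the single most frequent adjacent pair per round. The function name `train_bpe`, the tiny corpus, and the vocabulary size are all made up for illustration, and real tokenizers work on bytes and treat spaces more carefully than this does.

```python
from collections import Counter


def train_bpe(corpus, vocab_size):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    # Start with every word split into single characters.
    words = [list(word) for word in corpus.split()]
    vocab = sorted({ch for word in words for ch in word})

    while len(vocab) < vocab_size:
        # Count every adjacent pair of symbols across the corpus.
        pairs = Counter()
        for word in words:
            for left, right in zip(word, word[1:]):
                pairs[(left, right)] += 1
        if not pairs:
            break  # nothing left to merge

        (left, right), _count = pairs.most_common(1)[0]
        merged = left + right
        if merged not in vocab:
            vocab.append(merged)

        # Replace each occurrence of the pair with the merged symbol.
        new_words = []
        for word in words:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == left and word[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words.append(out)
        words = new_words

    # The "database": every vocabulary entry gets a number.
    token_to_id = {token: i for i, token in enumerate(vocab)}
    return token_to_id, words


token_to_id, segmented = train_bpe("happen happening happens eating", vocab_size=14)
print(segmented[1])  # ['happen', 'i', 'n', 'g']
```

On this corpus the nine single letters plus five merges fill the vocabulary, and "happen" ends up as one token because its letter pairs are the most frequent ones.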

Embedding

Now the real ballgame starts. We’ve received some text, we’ve converted it into tokens, and we’ve gotten the numbers corresponding to those tokens. For the rest of this walkthrough, let’s assume that we’ve converted “I like cats and” to the tokens [“I”, “ like”, “ cats”, “ and”], which correspond to the numbers [1, 7, 32, 68].

So, what do we do with our [1, 7, 32, 68]? We go to something called the embeddings matrix. The embeddings matrix is the first (but far from the last) time we’ll encounter a very special parameter called embedding dimension, or d for short. The embeddings matrix has a row for every single token in the vocabulary, and d is the number of columns it has. This is a huge determinant of how big the transformer is overall: it is how many numbers are used to represent each individual token and all it stands for. In this example, we’ll say that the embedding dimension is 64.

Anyway, we grab the 1st, 7th, 32nd, and 68th rows of the embeddings matrix, and stack them on top of each other to make a new matrix, the input matrix. Or, to be more precise, almost the input matrix: we need to do one more thing to it first. Before we get into that, though, you should be picturing a matrix with four rows and 64 columns, a 4x64 grid of numbers. These numbers were learned in training, and they’re what’s going to move through the layers and eventually output our answer.
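As a minimal sketch, the lookup is just row selection. The vocabulary size of 100 and the random values are assumptions for illustration; in a real model the rows are learned in training.

```python
import random

random.seed(0)
VOCAB_SIZE = 100  # assumed toy vocabulary size
D = 64            # embedding dimension from the example

# Random stand-ins for the learned embedding values.
embedding_matrix = [[random.gauss(0.0, 0.02) for _ in range(D)]
                    for _ in range(VOCAB_SIZE)]

token_ids = [1, 7, 32, 68]  # "I", " like", " cats", " and"

# Grab (a copy of) the row for each token and stack the rows in order.
x = [list(embedding_matrix[token_id]) for token_id in token_ids]

print(len(x), len(x[0]))  # 4 64
```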

First, though, that last step I said we’d get to. Fun fact: transformers natively don’t have a concept of ordering. So “I like cats and” would, without intervention, be exactly the same as “cats and I like”. In English, word order matters quite a bit, so we need to give our input matrix some extra info about the order the tokens go in. To do this, we do something called positional encoding. More or less, we add some value to all the numbers in each row, corresponding to that token’s position in the sequence. Sine and cosine functions are typically used for this, because they oscillate, and that helps transformers differentiate sharply between adjacent tokens in a sequence.

I’m pretty fuzzy on the actual formula - it is on my list as I get deeper into the weeds. But for now, just know that the input matrix has some extra sine and cosine stuff done to it called positional encoding, and now we’re ready to pass through the network’s layers.
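For the curious, here is a sketch of the sinusoidal scheme from Attention Is All You Need, where even columns get a sine and odd columns get a cosine of the position divided by a column-dependent power of 10000. The function names are mine, and this is only meant to show the shape of the computation.

```python
import math


def positional_encoding(num_tokens, d):
    """Sinusoidal encoding: sine on even columns, cosine on odd columns."""
    pe = [[0.0] * d for _ in range(num_tokens)]
    for pos in range(num_tokens):
        for i in range(0, d, 2):  # i is the even column index
            angle = pos / (10000 ** (i / d))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def add_position(x):
    """Add the positional encoding to a (tokens x d) matrix, element by element."""
    pe = positional_encoding(len(x), len(x[0]))
    return [[v + p for v, p in zip(row, pe_row)] for row, pe_row in zip(x, pe)]


pe = positional_encoding(4, 64)
print(pe[0][:2])  # [0.0, 1.0]
```

Note that each column gets its own frequency, so the value added differs across a row as well as between rows.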

Layers

Finally. The meat of the sandwich. Here is where we take our representation of the input sequence, and do a ton of linear algebra stuff to it to figure out what comes next. We’re going to do so in two steps. The first step is called the attention layer, and the second step is called the feedforward layer.

The Attention Layer

Ok, so. This part is hard. It’s the most important bit, but my learning strategy has been going around the edges and picking off the easier parts first. Which is awkward for an explainer, because the attention part (specifically, only using the attention mechanism, rather than it plus some bells and whistles) is what makes it a transformer.

What do I know? I’ll just make a list:

The attention mechanism, once it has done its magic, emits a matrix with the same dimensions as our input matrix. So in our example, it takes in a 4x64 matrix, and emits a 4x64 matrix. But it does quite a bit in between!

The high level point of the attention mechanism is to tease out and incorporate high level patterns between various different parts of the input sequence. Each token takes all the other tokens into account, in some cool way that’s efficient.

This is done with something called “attention heads”, of which there can be lots. Big transformers have tons of attention heads, each attending to various patterns in the input.

So, yeah. Stay tuned. This part seems like it’ll be fun to learn, but I’m not there yet.

The Feedforward Layer

Let’s presume the magic of the attention layer is done, and we’re left with a new and improved 4x64 matrix. What now?

First of all, the feedforward layer contains two matrices, both of which were learned during training. We’re going to introduce a new parameter now, called the feedforward dimension, or, alternatively, the hidden dimension. We’ll abbreviate it as ff. It’s bigger than d - in our example, let’s say it’s 128, exactly twice as big.

The first feedforward matrix has dimension dxff, or, in our example, 64x128. The second feedforward matrix has dimension ffxd.

At a high level, the idea is that we’re going to multiply our input matrix by the dxff matrix, then do some fancy stuff to the output, then multiply that output by the ffxd matrix. Then we’re going to do even more fancy stuff, followed by more general postprocessing that I’ll get to in the next subsection.

Before I get into the fancy stuff, though, let’s explore why the two matrices have these particular dimensions. Basically, the secret comes down to the rules of matrix multiplication. At the end of each layer, we want to end up with a matrix of the same dimensions as the input. And if you multiply two matrices with dimensions (x, y) and (y, z), you end up with a matrix of dimensions (x, z). In other words, you take the first dimension from your first matrix, and the second dimension from the second.

So, when we multiply our input matrix (4x64) by the first matrix (64x128), we get a matrix of dimension (4x128). It’s bigger, which is the point of ff - now we can tease out more complex/nuanced patterns. But we want the output to end up smaller, so we multiply the (4x128) matrix by a (128x64) matrix, and now we’re back to (4x64), just the same as our input.
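To see the dimension bookkeeping concretely, here is a small pure-Python sketch. The helper names `matmul` and `shape` and the constant fill values are just for illustration.

```python
def matmul(a, b):
    """Multiply an (x, y) matrix by a (y, z) matrix, giving an (x, z) matrix."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def shape(m):
    return (len(m), len(m[0]))


x = [[1.0] * 64 for _ in range(4)]      # stand-in input matrix, 4x64
w1 = [[0.5] * 128 for _ in range(64)]   # first feedforward matrix, 64x128
w2 = [[0.25] * 64 for _ in range(128)]  # second feedforward matrix, 128x64

hidden = matmul(x, w1)
out = matmul(hidden, w2)
print(shape(hidden), shape(out))  # (4, 128) (4, 64)
```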

Now that that’s settled, what is the aforementioned fancy stuff? After multiplying the matrices, what else do we do? There are four things: bias, activation, residual connection, and normalization. We’ll get to the last two of those in the next subsection. But the time for bias and activation… is now.

Adding Bias

Remember how a feedforward layer contains two matrices? Well, each matrix also has an extra bonus vector, called the bias vector. We’ll call it b for short. The bias vector has ff elements for the first matrix, and d elements for the second. Once we’re done multiplying (either by the first or second matrix), we add its respective bias vector to every single row of the output. The point is basically to do a shift of some sort that doesn’t depend on the input, perhaps encoding information that you’d always want to consider. It also helps avoid some technical problems like vanishing gradients. I don’t know quite what those are, yet. In any case, the (respective) bias vector is added after both matrix multiplications.

Activations

Activations are up next. In general, in a feedforward layer, activations are only done after the first transformation. An activation is a function that is called on every single value in the matrix. A common one is ReLU, which stands for Rectified Linear Unit. When ReLU is given a negative value, it returns 0, and when it’s given a positive value, it leaves it alone. The point of ReLU, and activations in general, is to introduce nonlinearity into the model; basically, it allows the network to capture more complicated sorts of patterns that can’t be derived from ordinary matrix multiplication. Another way of thinking about it is that ReLU in particular causes lots of neurons (values in the output matrix) to simply be 0, and have no impact on future layers. Having a smaller subset of neurons active for any given inference is sometimes called sparseness, and is a desirable property - better to have a few neurons that activate reliably for specific concepts or situations, rather than every neuron always contributing its little bit of noise.

Feedforward Layer Retrospective

Activations happen after the bias vector is added. So, for those keeping score, here’s a list of all the steps in a feedforward layer.

1. First Linear Transformation (multiplying by the dxff dimension matrix)
2. Bias Addition (after the first linear transformation)
3. Activation Function (after the first bias addition)
4. Second Linear Transformation (multiplying by the ffxd dimension matrix)
5. Bias Addition (after the second linear transformation)
6. Residual Connection (after the second bias addition)
7. Normalization (after the residual connection)

We haven’t gotten to 6 and 7 yet. Those are next.
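The first five steps in that list can be sketched in a few lines of plain Python. This assumes ReLU as the activation, and the tiny matrices (d = 2, ff = 3) are invented so the arithmetic is easy to follow by hand.

```python
def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(m, b):
    # The same bias vector is added to every row.
    return [[v + b_j for v, b_j in zip(row, b)] for row in m]


def relu(m):
    # Negative values become 0; positive values pass through unchanged.
    return [[max(0.0, v) for v in row] for row in m]


def feedforward(x, w1, b1, w2, b2):
    hidden = relu(add_bias(matmul(x, w1), b1))  # multiply, add bias, activate
    return add_bias(matmul(hidden, w2), b2)     # multiply again, add bias


x = [[1.0, -1.0]]        # one token, d = 2
w1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]   # d x ff
b1 = [0.0, 0.5, 0.0]     # ff elements
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]        # ff x d
b2 = [0.1, 0.2]          # d elements

print(feedforward(x, w1, b1, w2, b2))  # [[1.1, 0.2]]
```

Here the hidden row before activation is [1, -0.5, -2], so ReLU zeroes out two of the three neurons, which is the sparseness effect in miniature.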

Normalization

After a layer’s operations are complete, there’s a bit of postprocessing before passing on to the next layer. First, there’s something called residual connection. In residual connection, you take the input matrix to the layer you just got through (in this case, the feedforward sublayer), and add it to the output. That’s then your new output. This helps the network learn more effectively during training, and, again, helps with the mysterious vanishing gradient problem.

Finally, there’s normalization. Basically, certain statistical techniques are applied to every row in the output matrix, causing the values to be more regularly distributed. I haven’t yet learned the precise operations involved, but I do know that every column has two new learnable parameters inside the normalization equation: one that scales values, and one that shifts them. (Also, these columns are sometimes called features - they’re the thing that there are d of for each token).

The point of normalization is the point of normalization in general: it smooths out the relative impact of each feature, so no random individual feature has too outsized an impact. This regularity also helps during training.
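Here is a sketch of both postprocessing steps, assuming the standard layer normalization recipe: for each row, subtract the row’s mean, divide by its standard deviation, then apply the per-column scale and shift. The names `gamma`, `beta`, and `eps` are conventional but not from this post; `eps` is a small constant that avoids dividing by zero.

```python
import math


def residual(sublayer_input, sublayer_output):
    """Add the sublayer's input back onto its output, element by element."""
    return [[a + b for a, b in zip(r_in, r_out)]
            for r_in, r_out in zip(sublayer_input, sublayer_output)]


def layer_norm(m, gamma, beta, eps=1e-5):
    """Normalize each row to mean 0 and variance ~1, then scale and shift per column."""
    out = []
    for row in m:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([g * (v - mean) / math.sqrt(var + eps) + b
                    for v, g, b in zip(row, gamma, beta)])
    return out


x_in = [[1.0, 2.0, 3.0, 4.0]]    # what went into the sublayer
ff_out = [[1.0, 2.0, 3.0, 4.0]]  # what came out of it (made-up values)

summed = residual(x_in, ff_out)  # [[2.0, 4.0, 6.0, 8.0]]
normed = layer_norm(summed, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 3) for v in normed[0]])
```

With the scale set to 1 and the shift set to 0, the output row has mean 0 and variance very close to 1; the learned scale and shift let the model undo or adjust that where it helps.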

And that’s it! After normalization, we finally have the output matrix for the entire layer. The whole process repeats for as many layers as you’ve got, with the output of one layer being the input of the next, and always a matrix of dimension txd, where t is the number of tokens. In our example, that’s 4x64.

The Final Transformation

So, we’ve passed our input through lots and lots of layers, and ended up with a matrix of the same dimensions we started with - txd, or, in our example, 4x64.

However, what we’re actually trying to do at inference is predict the next token, out of all the tokens in the vocabulary. Let’s call the number of total tokens in the vocabulary V. We do one last matrix multiplication (aka linear transformation), this time by a matrix of dimension dxV, the numbers in which are all learned parameters. Much like feedforward matrices, this one has its own bias vector of dimension V, which we add to the output as usual.

Thus, by the laws of matrix multiplication that we touched on earlier, we now have a matrix of dimension txV. In other words, for every token in the input series, we have a number corresponding to every single token in the vocabulary.

During training, the row corresponding to each token would contain (in a way we’ll get to in the next section) the model’s guess about which token comes next. So the first row contains the prediction about the second token, the second row contains the prediction about the third, and so on. Thus, a GPT’s output at inference depends on the last row - that’s what contains its prediction for the as yet unseen token which comes next.

I find this pretty interesting! During inference (i.e. when you’re actually computing what token to output), every row of the final matrix except the last one doesn’t actually matter! But anyway, what we have at this point is still a whole bunch of numbers. How do we turn this into a guess for which token comes next?
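A small sketch of this last step, with toy sizes (4 tokens, d = 3, a vocabulary of 5) and invented numbers standing in for learned parameters:

```python
def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


NUM_TOKENS, D, V = 4, 3, 5  # toy sizes, chosen only for illustration

# Output of the last layer (t x d), filled with stand-in values.
final_hidden = [[0.1 * (r + c) for c in range(D)] for r in range(NUM_TOKENS)]
# Final projection matrix (d x V) and its bias vector (V elements).
w_out = [[0.01 * (r - c) for c in range(V)] for r in range(D)]
b_out = [0.0] * V

# One row of V numbers per input token (t x V).
all_logits = [[v + b for v, b in zip(row, b_out)]
              for row in matmul(final_hidden, w_out)]

# At inference, only the last row is used to pick the next token.
next_token_logits = all_logits[-1]

print(len(all_logits), len(all_logits[0]), len(next_token_logits))  # 4 5 5
```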

Softmaxing

No, this isn’t when you buy a bunch of Squishmallows. At least, not in this context.

Let’s consider where we’ve left off. We ended up with an enormous matrix, with one row for each token in our input, and one column for each token in the entire vocabulary. Since we’re at inference, all we’re going to use to make our guess is the very last row - in other words, a set of V numbers. They can be positive or negative, and arbitrarily large or small. These numbers, in this context, are called logits. We have V logits.

What we want, in fact, is a probability distribution. In other words, we want every token in the vocabulary to be assigned a probability that it comes next, with these probabilities adding up to 1. Then we can pick one: perhaps the single most likely option, or perhaps one at random from the whole distribution. Where we fall on that particular continuum is called temperature, which will get its own subsection in a moment. But first, how do we turn our V logits into a probability distribution in the first place? By softmaxing!

Softmaxing is the following mathematical function:

$softmax (z_{i}) = \frac{e^{z_{i}}}{\sum_{j = 1}^{V} e^{z_{j}}}$

If you find this scary, sorry. It’s actually a lot simpler than it looks. Basically, you do the following: for every logit, you take the number e (Euler’s number - it’s about 2.718), and make it the base of an exponent, where the exponent itself is that logit. So if the logit was 2, now you’ve got e^2. If the logit is -100, now you’ve got e^-100. You take each of these individual values (e to the power of some logit), and divide each one by the sum of all of them combined.

So if your logits are [1, 7, -143.6, and 0], these four numbers softmaxed (in the context of this set) become:

$[\frac{e^{1}}{e^{1} + e^{7} + e^{- 143.6} + e^{0}}, \frac{e^{7}}{e^{1} + e^{7} + e^{- 143.6} + e^{0}}, \frac{e^{- 143.6}}{e^{1} + e^{7} + e^{- 143.6} + e^{0}}, \frac{e^{0}}{e^{1} + e^{7} + e^{- 143.6} + e^{0}}]$

Of course, e^0 is just 1, so we can simplify a bit to:

$[\frac{e}{e + e^{7} + e^{- 143.6} + 1}, \frac{e^{7}}{e + e^{7} + e^{- 143.6} + 1}, \frac{e^{- 143.6}}{e + e^{7} + e^{- 143.6} + 1}, \frac{1}{e + e^{7} + e^{- 143.6} + 1}]$

See? Easy peasy.

What’s the point of all these shenanigans? Well, now all our values are positive (because e to any exponent is positive), and they all add up to one (because we divide every single numerator by the sum of all of them). These are exactly the traits of a probability distribution.

Another nice thing about softmaxing is that the higher the logit, the larger the softmaxed value. A logit of value -10,000 will generate a way smaller probability than one of 17, which will generate a way smaller probability than one of 158,000. Bigger numbers stay bigger, they just also all behave and play nicely with each other as a single distribution. Hooray. (This may be my favorite part.)
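In code, softmaxing the example logits looks something like the sketch below. Subtracting the largest logit before exponentiating is a common numerical-stability trick that isn’t part of the description above; it leaves the result unchanged but keeps huge logits from overflowing.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    biggest = max(logits)
    # Shifting every logit by the same amount does not change the final ratios.
    exps = [math.exp(z - biggest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1, 7, -143.6, 0])
print(probs)
print(sum(probs))
```

The four probabilities come out to roughly 0.0025, 0.9966, practically 0, and 0.0009: the logit of 7 takes almost everything, but the ordering of the logits is preserved.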

Anyway, now that we have a probability distribution over all the tokens in the vocabulary, we’re basically done; we just have to pick one. If we want the highest chance of success, we can just pick whichever is biggest. But if we want our autoregressive transformer to be a little spunky or surprising, we can make it less deterministic than that. How? With something called temperature.

Temperature

The temperature of a model is basically how likely it is to say the most obvious next token, versus something else. A model with 0 temperature always says the single possibility that’s most likely. A model with very high temperature picks totally at random. The higher the temperature, the wackier the behavior.

How does temperature work mathematically? It’s actually quite elegant. Rather than softmaxing over the logits themselves, an autoregressive transformer with temperature T softmaxes over each logit divided by T.

In other words, the actual equation, with temperature included, is:

$softmax (z_{i}) = \frac{e^{z_{i} / T}}{\sum_{j = 1}^{V} e^{z_{j} / T}}$

If T is high, the exponents squish down to be closer together, and so the probability distribution is more flat - for very high values, every token is pretty close to equally likely. And as T approaches 0, the highest exponents so far outstrip the lowest ones that it’s vanishingly unlikely anything but the most likely next token is picked. Technically, 0 temperature as written here is undefined, but by convention it just picks the most likely token with 100% probability.
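A quick sketch of the same thing in Python, with the zero-temperature convention handled as a special case. The function name and the example logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    if temperature == 0:
        # By convention: all probability goes to the largest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    biggest = max(scaled)  # subtracting the max avoids overflow
    exps = [math.exp(z - biggest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0, 0.1, 1.0, 100.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At T = 100 the three probabilities are all close to one third, while at T = 0.1 the first token gets nearly all of the probability.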

Recursion/Conclusion

And… that’s it! Once you’ve sampled a token from the softmaxed probability distribution, you pop it to the end of the input sequence, and that’s your new input sequence. Rinse and repeat, until you’ve generated enough tokens. Any time you’re chatting with a GPT, some version of this is what’s happening under the hood.
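The whole loop can be sketched like this. `toy_model` is a hypothetical stand-in for everything between the embedding lookup and the final logits, and it simply favours "previous token plus one", so the output is easy to check by eye.

```python
import math
import random


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    biggest = max(scaled)
    exps = [math.exp(z - biggest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(token_ids, vocab_size=5):
    """Hypothetical stand-in for the transformer: returns one logit per vocabulary token."""
    favourite = (token_ids[-1] + 1) % vocab_size
    return [5.0 if i == favourite else 0.0 for i in range(vocab_size)]


def generate(model, token_ids, max_new_tokens, eos_id=None, temperature=1.0, rng=random):
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        logits = model(token_ids)  # logits for the next token only
        if temperature == 0:
            next_id = max(range(len(logits)), key=lambda i: logits[i])
        else:
            probs = softmax(logits, temperature)
            next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        token_ids.append(next_id)  # the new token becomes part of the input
        if next_id == eos_id:
            break                  # "end of sequence" token: stop early
    return token_ids


print(generate(toy_model, [0], max_new_tokens=6, eos_id=4, temperature=0))  # [0, 1, 2, 3, 4]
```

With temperature 0 the run is deterministic and stops as soon as it emits the end-of-sequence token; with a positive temperature the same loop samples from the distribution instead.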

Like I mentioned way back at the beginning, I’m only about a week into my studies here, and there’s a ton left to learn. For example:

How do the attention layers, arguably the most important and complicated part, work?
What are the actual, detailed layer normalization equations?
How does training work? Specifically:
1. How does backpropagation work? I know (partial?) derivatives are involved, but not too much else
2. What’s an optimizer? One’s called Adam, and it’s a version (?) of Stochastic Gradient Descent, which I only understand in a really hand wavey way
3. How do hyperparameters like learning rate interact with all this?

So yeah. This is really just a beginning. But it feels really good to actually figure out how these influential systems work, and I now feel like, once I get a handle on attention heads, I’ll have some idea of what a GPT system is actually doing when it tells me how softmaxing works.