If you are already sort of familiar with the Transformer, this post can serve as a standalone technical explanation of the architecture. Otherwise, I recommend reading part one to get the gist of what the network is doing.

Yesterday, I left us with two images of the Transformer architecture. These images show us the general flow of data through the network. The first image shows the stack of encoders and decoders in their bubbles, which is the basic outline of the Transformer. The second image shows us the sublayers of the encoder and decoder.

Now, with the picture of how the data moves through the architecture, I will fully explain a forward pass with an example. Keep the general structure in mind as we go through the details, which can be a bit mind-numbing at parts.

The task

The Transformer is well suited for translating sentences between languages. Since I don't speak any language other than English, I have decided to translate between sarcastic sentences and their intended meaning. In the example, I am translating the phrase "Yeah right" to "That's not true." This way all those robots who only understand the literal interpretation of English can simply incorporate a Transformer into their brain and be good to go.

The embedding

Before the sentence "Yeah right" can be fed into the architecture, it needs to take a form that the network understands. In order for the architecture to read words, we therefore need to embed each word into a vector.

We could just take the size of the vocabulary of the document we are working with, and then one-hot encode each word into a fantastically sparse and long vector. But this approach has two problems:

1. The high dimension of these vectors makes them harder to work with.

2. There's no natural interpretation of proximity in this vector space.

Ideally, we want words that are similar to be related in some way when they are embedded into a vector format. For instance, we might want all the fruits {apple, orange, pear} to be close to each other in some cluster, and far enough away from the vehicle cluster {car, motorcycle, truck} to form distinct groups.

To be brief, the way that we do this is to use some word embedding neural network that has already been trained on English sentences.
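To make "close" concrete, here is a toy illustration using a hypothetical hand-made embedding table and cosine similarity. Real embeddings are learned from large corpora and have hundreds of dimensions; the vectors below are invented purely for demonstration.

```python
import numpy as np

# A hypothetical, hand-made embedding table. Fruits share their first two
# dimensions; vehicles share their last two.
embeddings = {
    "apple":  np.array([0.9, 0.8, 0.1, 0.0]),
    "orange": np.array([0.8, 0.9, 0.0, 0.1]),
    "car":    np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Fruits end up much closer to each other than to vehicles.
fruit_sim = cosine_similarity(embeddings["apple"], embeddings["orange"])
cross_sim = cosine_similarity(embeddings["apple"], embeddings["car"])
```

A real embedding network learns this clustering structure from how words co-occur in text, rather than having it written in by hand.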

Positional encoding

After we have obtained a set of word-embedded vectors for each word in our sentence, we still must do some further pre-processing before our neural network is fully able to grasp English sentences. The Transformer doesn't include a default method for analyzing the order of words in a sentence that it is fed. That means that we must somehow add the relative order of words into our embedding.

Order is of course necessary because we don't want the model to mistake "Yeah right" for "Right yeah." A mistake like that could be so catastrophic that we would get the opposite result of what we want from the network.

Before I just show you how positions are encoded, let me first show you the naive way that someone might do it, and how I first thought about it. I thought the picture would somehow look like this.

Here I have displayed the word embedding as the left four entries in the vector, and concatenated it with the position of the word in the sentence. I want to be very clear: this is not the way that it works. I just wanted to show you the bad first-pass approach before telling you the correct one, to save you the trouble of being confused about why it doesn't work the way you imagined it in your head.

The correct way that the positional encoding works is like this.

Rather than just being a number, the positional encoding is another vector whose dimension is equal to the word embedding vector length. Then, instead of concatenating, we perform an element-wise vector addition with the word embedding.

The way that we compute the positional encoding vector is a bit tricky, so bear with me as I describe the formula. In a moment, I will shed light on how this formula allows the model to understand word order in a justified manner. We have

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Where PE stands for "positional encoding," pos refers to the position of the word in the sentence, and i is the iterator we use to construct this vector. i runs from 0 to d_model/2 − 1, where d_model is the dimension of the word embedding.

Let me show you how this works by applying this formula to the example sentence. Let's say that d_model = 4, the same dimension as I have shown in the picture above. And further, let's encode the word "right" in the sentence "Yeah right," which is at position 1. In that case, we compute that the positional encoding is the vector (sin(1), cos(1), sin(0.01), cos(0.01)) ≈ (0.841, 0.540, 0.010, 1.000).

Why does the positional encoding use this formula? According to the paper,
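The computation above can be sketched in a few lines of numpy. The toy dimension d_model = 4 is far smaller than what real models use (the paper uses 512), but the formula is the same.

```python
import numpy as np

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000**(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000**(2i / d_model))"""
    pe = np.zeros(d_model)
    for i in range(d_model // 2):
        angle = pos / 10000 ** (2 * i / d_model)
        pe[2 * i] = np.sin(angle)       # even entries: sine
        pe[2 * i + 1] = np.cos(angle)   # odd entries: cosine
    return pe

# "right" sits at position 1 in "Yeah right"; use a toy d_model of 4.
pe_right = positional_encoding(1, 4)
# ≈ [0.841, 0.540, 0.010, 1.000]
```

Note how each pair of entries oscillates at a different frequency: the first pair cycles quickly with position, later pairs ever more slowly.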

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos + k) can be represented as a linear function of PE(pos).

In other words, if we take the positional encoding of any word, we can construct the positional encoding of any other position in the sentence by a linear transformation, a rotation and scaling, of the encoding vector.

To see why this is useful, consider the fact that the word embedding allowed us to group words together which were close in meaning to each other. Now we also have a way of transforming the vector in a linear fashion in order to obtain the encoding of another position in the sentence. This means that we have a sort of language where a linear function translates to "the last word" or "the word 10 words ago." If we add these two encodings, the hope is that the model should be able to incorporate both the meaning of the words and their relative positions, all in a single compact vector. And empirically, the model seems to be able to do just that.
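Here is a small numpy check of the paper's claim, for a single sine/cosine pair with an arbitrary frequency: the encoding at pos + k is exactly a rotation (a fixed linear map depending only on the offset k) applied to the encoding at pos. This follows from the angle-addition identities for sine and cosine.

```python
import numpy as np

# One sinusoid pair of the positional encoding, at an arbitrary
# frequency omega, arbitrary position pos, and offset k.
omega, pos, k = 1.0, 3.0, 5.0

pe_pos   = np.array([np.sin(omega * pos), np.cos(omega * pos)])
pe_pos_k = np.array([np.sin(omega * (pos + k)), np.cos(omega * (pos + k))])

# A rotation matrix that depends only on the offset k, not on pos.
rotation = np.array([[np.cos(omega * k),  np.sin(omega * k)],
                     [-np.sin(omega * k), np.cos(omega * k)]])

# rotation @ pe_pos reproduces pe_pos_k exactly.
shifted = rotation @ pe_pos
```

Because the map depends only on k, the model can in principle learn a single linear transformation meaning "five words later" that works at every position.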

Now that these two ways of representing the vector are combined, the words are ready to interact by being fed into the self-attention module.


Self attention

Recall that the whole point of the Transformer architecture is that it is built on attention modules. So far we have only focused on processing the word vectors into a form that will be understood by the model. Now, we must understand what is done to these embedded vectors.

As we can see in the image above, each vector in the sentence is fed through the self attention module. In turn, the self attention module takes this sequence and returns yet another sequence, ready to be handed off to the next part of the network (shown as the output vectors in the image). This new sequence can be seen as a modified version of the original sequence, with some new information baked into each word. These new vectors contain information about how the words relate to each other.

In the last post, I described how self attention allowed the Transformer to construct links between words, which are intended to represent the relationship between words. However, it still isn't clear exactly how we find the strength of these links.

In order to create the links, we first must understand the idea of keys, queries and values. I will borrow the analogy that nostalgebraist uses in their explanation of the Transformer. You can imagine keys as vectors which represent a dating profile for a word, including information that might be useful for connecting it to other words. Queries, by contrast, are vectors which include information about what each word is looking for in the other words. Values are vectors which contain some further information about the word that isn't already represented. Our job is to use these keys and queries, which are learned for each word, to construct a table of matches. Each link in the table represents the relationship status between two words. Once we have these links, their strengths allow us to weight the words by their values. By doing this, each word sort of becomes an infusion of all the words they are dating, which could include themselves.

That is quite a lot to take in. Luckily, I will go through the process of constructing keys, queries and values, and the appropriate weighting, step by step.

The first thing to understand is that the keys and queries are not explicitly represented in the Transformer. Instead, keys, queries, and values are constructed by multiplying the embedded vectors by matrices of weights, which are directly learned. This is how I visualize the process.

Here, I have repeated the input vector for the first word three times, once for each computation. It is multiplied by three separate matrices in order to obtain the keys, queries and values. W^K represents the weights for generating the keys, W^Q is the matrix for queries, and W^V is the matrix for generating values.
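This step is just three matrix multiplications per word. The sketch below uses toy dimensions and random weights (in a trained model these weights are learned); the names W_K, W_Q, W_V mirror the matrices described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 3  # toy sizes; the paper uses 512 and 64

x = rng.normal(size=d_model)           # one embedded word vector
W_K = rng.normal(size=(d_model, d_k))  # learned key weights
W_Q = rng.normal(size=(d_model, d_k))  # learned query weights
W_V = rng.normal(size=(d_model, d_k))  # learned value weights

# The same input vector, multiplied by three different matrices.
key, query, value = x @ W_K, x @ W_Q, x @ W_V
```

In practice every word in the sentence is pushed through the same three matrices, so each word gets its own key, query, and value.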

Once we have our keys, queries, and values for each word, we are ready to link each word up, or in the analogy, find their dates.

Since the dot product can be viewed as a way of measuring the similarity between vectors, the way that we check whether two words are compatible is by calculating the inner product between their queries and keys. This dot product will represent the score, or compatibility, between two words. In particular, the score between the first word and the second word is the dot product q_1 · k_2. In the general case, the score between word n and word m is q_n · k_m. Remember, these scores are not symmetric. This asymmetry is due to the fact that the score from word n to word m (q_n · k_m) is calculated differently than the score from word m to word n (q_m · k_n). By analogy, just because one word likes another word doesn't mean that they are liked back!

Once we have the scores from every word to every other word, we now must incorporate that information into the model. The way we do this is the following:

First, we divide the scores by the square root of the dimension of the keys, d_k. Then we take a softmax over the scores for each word, so that they are all positive and add up to one. Then, for each word, we sum over the values of every other word, weighted by the softmax scores. Then we are done. Here is what that looks like for the first word in our sentence.

The intuitive justifications for each of these steps may not be apparent at first. The most confusing step, at least to me, was why we needed to divide the score by the square root of d_k. According to the paper, this makes gradients easier to work with: for large values of d_k, the dot products can become so large that the softmax function ends up having extremely small gradients.
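The three steps above, scale, softmax, weighted sum, can be sketched for a single word like this. The sizes and inputs are toy values chosen for illustration.

```python
import numpy as np

def attend_one_word(query, keys, values, d_k):
    """Scaled dot-product attention output for a single word."""
    scores = keys @ query / np.sqrt(d_k)             # one score per word
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax: positive, sums to 1
    return weights @ values                          # weighted sum of values

rng = np.random.default_rng(0)
n_words, d_k = 2, 3  # "Yeah right" has two words
queries = rng.normal(size=(n_words, d_k))
keys    = rng.normal(size=(n_words, d_k))
values  = rng.normal(size=(n_words, d_k))

# The output vector for the first word, "Yeah".
z1 = attend_one_word(queries[0], keys, values, d_k)
```

The output for each word has the same dimension as a value vector; it is a blend of every word's value, mixed according to how well that word's key matched this word's query.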

Once we have these weighted values, we are ready to pass them off to the next part of the Transformer. But not so fast! I temporarily skipped over a key component to the Transformer, and what allows it to be so powerful.

The step I missed is that there are actually multiple sets of keys, queries, and values for each word. This allows the model to have a whole bunch of different ways to represent the connections between words. Let me show you how it works.

Instead of thinking about finding the sum over values for each word, we should instead think about multiple, parallel processes each computing different sums over values. Each attention head computes a value (the weighted sum), using weight matrices which were all initialized differently. Then when all of the attention heads have calculated their values, we combine these into a single value which summarizes all the contextual information that each attention head was able to find.

In order to combine the values found by the attention heads, we concatenate each of the values and multiply the concatenated result by another learned matrix, W^O, which we call the output weight matrix. Once completed, we have the final output word from this attention mechanism. This process looks like this.
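The combining step can be sketched as a concatenation followed by one matrix multiplication. Here the per-head outputs are random stand-ins for the weighted value sums each head would actually compute; the name W_O mirrors the output weight matrix described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_v, d_model = 8, 8, 64  # toy sizes

# Pretend each head has already produced its weighted value vector.
head_outputs = [rng.normal(size=d_v) for _ in range(n_heads)]

# Concatenate the heads into one long vector ...
concatenated = np.concatenate(head_outputs)      # shape (n_heads * d_v,)

# ... and project it back down with the learned output matrix W_O.
W_O = rng.normal(size=(n_heads * d_v, d_model))
output = concatenated @ W_O                      # final vector for this word
```

The projection lets the model mix information across heads and return to the model dimension expected by the next sublayer.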

Add and norm

Now, it looks like we're ready to get to the next part of the encoder. But before we do so, I must point out that I left something out of my initial picture. Before we can put the output of self attention into the feed forward neural network, we must apply a residual connection first. What is a residual connection? It looks a bit like this.

In other words, the inputs are added back into the output of the self-attention module. The way that this works is by applying a layer normalization step to the sum of the outputs and inputs. If you don't know what layer normalization is, don't worry. Layer normalization is an operation based on batch normalization, intended to reduce some of batch normalization's side effects. Tomorrow's post will cover batch normalization in detail, so the exact specification is omitted here.
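A minimal sketch of the add-and-norm step, assuming the simplest form of layer normalization (the learned gain and bias parameters that real implementations include are omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance.
    Real layer norm also applies a learned scale and shift."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=4)  # input to the self-attention sublayer
z = rng.normal(size=4)  # output of the self-attention sublayer

# Residual connection: add the input back in, then normalize the sum.
out = layer_norm(x + z)
```

The residual path means the sublayer only has to learn a correction to its input rather than rebuild the whole representation, which also makes gradients flow more easily through a deep stack.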

Once we have added and normed, we are finally ready to push these words onto the neural network. But first, a note on notation.


Notation

This will likely come as no surprise, but the above operations are not actually performed one vector at a time. In order to speed up computation and simplify the way we write the algorithm, we pack all the steps into higher-order arrays. This notation also indicates which operations are done in parallel.

In practice, this means that instead of writing the attention mechanism as a bunch of simultaneous steps, we simplify the whole routine by writing the following expression:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Here, the matrices Q, K, and V stand for the matrices of queries, keys and values respectively. The rows of these matrices are the query, key and value vectors which we have already discussed above. It should be understood that for any step which allows some form of parallelization and simplification, a matrix representation is preferred to vectors.
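The matrix expression can be sketched directly in numpy. Toy sizes and random inputs again; for clarity this uses a naive row-wise softmax without the numerical-stability tricks a production implementation would add.

```python
import numpy as np

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # all pairwise scores at once
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                # one output row per input word

rng = np.random.default_rng(0)
n_words, d_k = 2, 3
Q, K, V = (rng.normal(size=(n_words, d_k)) for _ in range(3))

Z = attention(Q, K, V)
```

One matrix product, Q K^T, computes every word-to-word score simultaneously, which is exactly the parallelism the matrix notation is meant to expose.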

The feed forward neural network

We are almost done with one slice of an encoder. Even though most of the time was spent on the previous step (the attention and norm sublayer), this next one is still quite important. It is merely simpler to understand. Once the words generated by the self attention module have passed through the layer-norm step, we must put each word through a feed forward neural network.

This is an important point, so don't miss it: the same neural network is applied independently to each of the words produced by the previous step. There are many neural networks at this slice of the encoder, but they all share weights. On the other hand, neural networks in different layers do not share weights. This allows the words to solidify their meaning through some non-linearity in the neural network before being shipped off to the next step, which is an identical slice of the encoder.
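A sketch of this position-wise feed forward step, assuming the two-layer ReLU network used in the original paper. The weights are random stand-ins for learned parameters, and the sizes are toy values (the paper uses 512 and 2048).

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed forward network: a ReLU sandwiched
    between two linear layers, applied to one word vector."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16  # toy sizes
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

words = rng.normal(size=(2, d_model))  # the two words of "Yeah right"
# The SAME weights are applied independently to every word.
outputs = np.array([feed_forward(w, W1, b1, W2, b2) for w in words])
```

Because the same W1, b1, W2, b2 are reused for every position, the loop over words could equally be written as one batched matrix multiplication.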

The stack

As mentioned before, the encoder and decoder are made up of a stack of attention mechanism/neural network modules. Words are passed through the bottom of a single slice of this stack, and words appear out the top. The same number of words are outputted from the encoder stack as are inputted to it. In other words, the vectors pass through the layers in the same shape that they started. In this sense, you can trace a single word through the stack, going up from the encoder.

Encoder-decoder crossover

At this point, the encoder is pretty much fully explained. And with that, the decoder is nearly explained, since the two share most properties. If you have been following this far, you should be able to trace how an embedded sentence is able to go through an encoder, by passing through many layers, each of which have a self attention component followed by a feed-forward neural network. The way that data travels through a decoder is similar, but we still must discuss how the encoder and decoder interact, and how the output is produced.

I said in the previous post that when the encoder produces its output, it somehow is able to give information to each of the layers of the decoder. This information is incorporated into the encoder-decoder attention module, which is the main difference between the encoder and decoder. The exact mechanism for how this happens is relatively simple. Take a look at the decoder structure one more time.

I have added the box on the left to clarify how the decoder takes in the input from the encoder. In the encoder-decoder attention sublayer, the keys and values are inherited from the output of the last encoder layer. These keys and values are computed the same way that we computed keys and values above: via a matrix of weights which acts on the words. In this case, just as before, there is a different matrix for each layer, even though these matrices all act on the same set of words: the final output of the encoder. This extra module allows the decoder to attend to all of the words in the input sequence.

In line with the previous description, the queries in the encoder-decoder attention sublayer are derived from the outputs of the self-attention layer below it, which works exactly the same as self-attention worked in the encoder.

In the first pass through the network, the words "Yeah right" are passed through the encoder, and two words appear at the top. Then we use the decoder to successively work from there, iteratively applying self attention to an input (which input? I will elaborate in just a moment) and plugging its output into the encoder-decoder attention. The decoder then attends to the output from the encoder, along with the output from self-attention. This is then fed into a neural network step before being handed off to the next decoder in the stack. At the top of the decoder, we feed the output one last time through a linear layer and a softmax layer, in order to produce a single word, the output of the network on a single step.

Then, after a single word is produced, we put that back into the input of the decoder and run the decoder again, now with its output parsed as input. We stop whenever the decoder outputs a special symbol indicating that it has reached the end of its translation, the end token. This process would look like this.

Before the decoder has produced an output, we don't want it to be able to attend to tokens which have not been seen yet. Therefore, in our first pass through the decoder, we set all the softmax inputs in self-attention to −∞. Recall that when the softmax is applied, we use the resulting output to weight the values in the self-attention module. If we set these inputs to the softmax to −∞, the result is that the softmax outputs a zero, which correspondingly gives those values no weight. In future passes, we ensure that every input that corresponds to a future word is treated similarly, so that there cannot be connections to words that we have not outputted yet.
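A minimal numpy sketch of this masking, using a hypothetical 3-word sequence with dummy scores. Every entry above the diagonal corresponds to a "future" word and is set to −∞ before the softmax.

```python
import numpy as np

def masked_softmax(scores):
    """Row-wise softmax; entries set to -inf end up with weight 0."""
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights

n = 3
scores = np.ones((n, n))  # dummy attention scores for 3 words

# Mask out every position to the right of (after) the current word.
scores[np.triu_indices(n, k=1)] = -np.inf

weights = masked_softmax(scores)
# Row 0 attends only to word 0; row 1 to words 0-1; and so on.
```

After masking, each row of the weight matrix still sums to one, but all of the weight falls on the current and earlier positions.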

Now, with the mechanism for producing outputs fully elucidated, we are ready to stare at the full network, with all of its tiny details. I do not want to compete with the original paper for illustrations on this one. Therefore, I have incorporated figure 1 from the paper and placed it here.

As you might have noticed, I neglected to tell you about some extra residual connections. Regardless, they work the same way that I described above, so there should be no confusion there! Other than that, I have now covered pretty much every part of the Transformer design. You can now gaze upon the above image and feel the full satisfaction of being able to understand every part (if you've been paying attention).

Of course, we haven't actually gone into how it is trained, or how we generate the single words from the final softmax layer and output probabilities. But this post is long enough, and such details are not specific to the Transformer anyway. Perhaps one day I shall return.

For now, I will be leaving the transformer architecture. In tomorrow's post I will cover batch normalization, another one of those relatively new ML concepts that you should know about. See you then.

If this discussion of exactly how the decoder incorporates its own output, and masks illegal connections confuses you, you're not alone. I am not confident that these paragraphs are correct, and will possibly correct them in the future when I have a better understanding of how it is implemented.
