This is the first post in a new sequence in which I walk through papers, concepts, and blog posts related to machine learning and AI alignment.

In the process of increasing my own understanding of these topics, I have decided to implement a well-known piece of advice: the best way to learn is to teach. This sequence is therefore a compilation of what I have learned or reviewed recently, written up in a format meant for teaching.

My thoughts on various topics are not polished. I predict there will be mistakes, omissions, and general misunderstandings. I also believe that there are better places to learn the concepts which I walk through. I am not trying to compete with anyone for the best explanations. The audience I have in mind is a version of myself from a few weeks or days ago, and therefore some things which I take for granted may not be common knowledge to many readers. That said, I think this sequence may be useful for anyone who wants a deeper look at the topics which I present.

If you find something wrong, leave a comment. Just try to be respectful.


If you've been following machine learning or natural language processing recently, you will likely already know that the Transformer is currently all the rage. Systems based on the Transformer have set new records on natural language processing benchmarks. The GLUE leaderboard is currently dominated by models which, as far as I can tell, are all based on the Transformer.

With the Transformer, we now have neural networks which can write a coherent story about unicorns in South America, and an essay about why recycling is bad for the world.

The paper describing the Transformer, Attention Is All You Need, is now the top paper on Arxiv Sanity Preserver, surpassing the popularity of GANs and residual nets. And according to Google scholar, the paper has attracted 2588 citations (and counting).

So what makes the Transformer so powerful, and why is it such a break from previous approaches? Is it all just hype? Clearly not. But it is worth looking into the history behind the architecture in order to see where these ideas came from. In this post, I give a rough sketch of what the Transformer architecture looks like. In the following post, I will provide a detailed description of each step in the forward pass of the architecture. First, however, we have to know what we are looking at.

The heart of the Transformer is the attention mechanism, hence the name of the original paper. As I understand it, attention was first designed to improve the way that recurrent neural networks handled text. I found this post helpful for providing an intuitive breakdown of how the attention mechanism works in pre-Transformer architectures.

The way I view attention in my head, and the way that many people illustrate it, is to imagine a table linking every word in a sentence to every other word in another sentence. If the two sentences are the same, this is called self-attention. Attention allows us to see which parts of the sentence are relevant to which other parts. If there's a strong link between "it" and "car", then that link may be the model's way of saying that "it" refers to the car.

Consider the sentence "I am a human." Attention might look at this sentence and construct the following links:

The brighter the cell in the table, the more connected the two words are. Here, the exact meaning of the shades of grey isn't really important. All you need to notice is that there is some sort of relationship between words, and this relationship isn't identical for every pair of words. It's not a symmetric relationship either. The exact way that attention is calculated will come later, which should shed light on why attention works the way it does.
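
To make the table picture concrete, here is a toy sketch in Python. The scores are made up by hand rather than learned, so treat this as an illustration of the table itself, not of how a real model produces it: each row holds one word's raw affinity for every word in the sentence, and a softmax over the row turns those affinities into the shades of grey.

```python
import numpy as np

words = ["I", "am", "a", "human"]

# Hand-picked affinity scores; in a real model these come from learned projections.
scores = np.array([
    [4.0, 0.5, 0.2, 3.5],   # "I" attends strongly to itself and to "human"
    [1.0, 4.0, 0.8, 1.2],
    [0.3, 0.9, 4.0, 1.1],
    [3.0, 0.7, 0.5, 4.0],
])

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

table = softmax(scores)  # each row now sums to 1: one attention distribution per word

for word, row in zip(words, table):
    print(f"{word:>5}", np.round(row, 2))
```

Note that the table is not symmetric: the link from "I" to "human" need not have the same strength as the link from "human" to "I".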

If we were using a Transformer on the previous sentence, it might figure out that "I" was referring to "human." And in a sense, that's exactly what we want it to do. "I" and "human" are quite related, since they both point to the same thing. We will later see how we can incorporate these linkages into the Transformer architecture in a clever way.

As a brief digression, I will note that the idea of coming up with a vague abstraction for how we want to interpret our neural networks is a concept that comes up repeatedly in deep learning. In the case of convolutional neural networks, for instance, we frequently describe neural networks as having some sort of structured model for the images we are classifying. This is why authors will sometimes talk about the model recognizing smaller pieces of the picture, like wheels, and then combining these smaller features into a coherent whole, such as a car.

The way that I think about attention is similar. The hope is that the neural network will be able to capture those parts of the text that are related to each other in some way. Perhaps the words that are strongly linked together are synonyms, or stand for each other in some way, such as when one of them is a pronoun referring to the other. None of the links are hard-coded, of course. Finding out which words are related is where most of the learning happens.

Unlike previous approaches that use attention, the Transformer is unique in how it uses attention for virtually everything. Before, attention was used in conjunction with another neural network, allowing the network to perform some form of computation on a pre-processed, structured representation. The Transformer takes this concept further by repeatedly applying attention to an input, and relying very little on traditional feed-forward neural networks to turn that input into something useful. Transformers still use regular neural networks; their importance is just diminished. And no RNNs or CNNs need be involved.

Other than attention, the other main thing to understand about the Transformer is its general structure, the encoder-decoder architecture. This architecture is not unique to the Transformer, and from what I understand, had been the dominant method of sequence-to-sequence modeling for a few years before the Transformer was published. Still, it is necessary to see how the encoder-decoder architecture works in order to get any idea of how the Transformer does its magic.

Below, I have illustrated the basic idea.

Simple enough, but how does it work? That middle arrow is conveying something useful. This isn't just a deep architecture with a strange intermediate representation.

If we first think about traditional recurrent networks, the justification for having this intermediate representation becomes clear. In an RNN, the same neural network performs a computation on each part of a sequence in turn, returning an output for each step. It is this sequential nature of the computation which limits the output of an RNN. If we want an RNN to translate a 5-word English sentence into a 7-word French sentence, there seems to be no natural way to do it. This is because, intuitively, as we go along the input sequence, we can only translate each element of the input into one element of the output.
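
Here is a minimal sketch of that limitation in plain NumPy (the weights are random and the dimensions are made up; this is a vanilla RNN, not any particular published model). The loop produces exactly one output per input step, so a 5-word input yields exactly 5 outputs, with no obvious place for a 7-word translation to come from.

```python
import numpy as np

def rnn_step(x, h, W_xh, W_hh, W_hy):
    """One step of a vanilla RNN: update the hidden state and emit one output."""
    h = np.tanh(x @ W_xh + h @ W_hh)
    y = h @ W_hy
    return h, y

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 8, 16, 8   # illustrative sizes
W_xh = rng.normal(size=(d_in, d_hid))
W_hh = rng.normal(size=(d_hid, d_hid))
W_hy = rng.normal(size=(d_hid, d_out))

inputs = rng.normal(size=(5, d_in))   # stand-in for a 5-word English sentence
h = np.zeros(d_hid)
outputs = []
for x in inputs:                      # one pass over the sequence...
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy)
    outputs.append(y)                 # ...and exactly one output per step

print(len(outputs))  # 5 -- no natural slot for the 7 words of the French translation
```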

By contrast, the encoder-decoder mechanism gets around this by constructing two different networks which work together in a special way. The first network, the encoder, takes in the input sequence and compresses it into a single fixed-size representation, the context. Once the input has been distilled into the context, the decoder expands it back out into an output sequence. By going through this intermediate representation, rather than translating element by element, we are free to translate between sequences of different lengths.

The exact way that it works is like this: in the decoder network, we first take in the context and an initial hidden state as inputs. We then repeatedly apply an RNN to these inputs, creating an output and passing along the hidden state information to the next step. We repeat this until the network has produced an end token, which indicates that it has reached the end of the translated text.
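
A hedged sketch of that loop, again in plain NumPy with random, untrained weights (the end-of-sequence check is a stand-in for how a real model would detect an end token): the encoder's final hidden state becomes the context, and the decoder keeps emitting tokens from that context until it hits the end token or a maximum length, however many steps that takes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
END = np.zeros(d_in)  # stand-in for the end-of-sequence token

def rnn_step(x, h, W_xh, W_hh):
    return np.tanh(x @ W_xh + h @ W_hh)

# Separate (random, untrained) weights for the encoder RNN and the decoder RNN.
enc_W_xh, enc_W_hh = rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_hid, d_hid))
dec_W_xh, dec_W_hh = rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_hid, d_hid))
W_out = rng.normal(size=(d_hid, d_in))

def encode(source):
    h = np.zeros(d_hid)
    for x in source:                  # walk over the whole input sequence...
        h = rnn_step(x, h, enc_W_xh, enc_W_hh)
    return h                          # ...and return the final hidden state as the context

def decode(context, max_len=20):
    h, token, outputs = context, np.zeros(d_in), []
    for _ in range(max_len):
        h = rnn_step(token, h, dec_W_xh, dec_W_hh)
        token = h @ W_out
        outputs.append(token)
        if np.allclose(token, END):   # stop once the model emits the end token
            break
    return outputs

source = rng.normal(size=(5, d_in))   # a 5-word source sentence
translation = decode(encode(source))  # can be any length up to max_len
print(len(translation))
```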

In the Transformer, we drop the RNN part, but keep the benefit of being able to map arbitrarily long sequences to other arbitrarily long sequences. Under the hood, the Transformer is really more of a stack of encoders and decoders, which are themselves composed of self-attention components and a feedforward neural network. The size of this stack is the main design choice involved in creating a Transformer, and contributes to the simplicity of tuning it.
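
As a structural sketch of that idea (the blocks below are identity stubs, not real attention or feed-forward computations), building a Transformer mostly amounts to choosing how deep the stack is; the original paper used 6 encoder layers and 6 decoder layers with 512-dimensional representations.

```python
import numpy as np

# Identity stubs: no real attention or feed-forward math here, just the overall shape.
def encoder_block(x): return x
def decoder_block(x, encoder_output): return x

num_layers = 6  # the depth of the stack is the main knob; the original paper used 6

def transformer(source, target):
    enc = source
    for _ in range(num_layers):
        enc = encoder_block(enc)        # run the input up through the encoder stack
    dec = target
    for _ in range(num_layers):
        dec = decoder_block(dec, enc)   # each decoder block also sees the encoder output
    return dec

src = np.zeros((5, 512))   # 5 source tokens, 512-dimensional embeddings (the paper's size)
tgt = np.zeros((7, 512))   # 7 target tokens
print(transformer(src, tgt).shape)  # (7, 512)
```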

Since there are so many small details in the Transformer, it's important to get a rough visual of how data is flowing through the network before you start to think about the exact computations that are being performed. We can look at the whole model like this.

The little stacked boxes I have put in the encoder and decoder represent the layers in the network. The input is first fed up through the encoder, and then, at the top of the stack, we somehow provide some information to each layer of the decoder. Then we start going through the decoder, at each step using the information we just got from the encoder.

Unlike an RNN, we do not share weights across these layers. Each little box represents an individual part of the architecture with its own parameters, all trained jointly as part of the same network. If we look under the hood at these little blocks, we find that encoder and decoder blocks are pretty similar.

These things are stacked inside the encoder and decoder, and feed upwards.

On the left, we have an encoder layer. This layer has two sublayers. The first sublayer is the aforementioned self-attention. Although I have not yet given the details of how attention works, we can visualize this block as calculating the values of the table above, followed by some computation which uses the numbers in the table to weight each of the words in the sequence. This weighting bakes some information into each word before it is used by the feed-forward network. If this doesn't make sense yet, it should hopefully become more apparent once we get into the actual vector and matrix operations.
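
As a preview of part 2, here is scaled dot-product self-attention in NumPy (the random matrices stand in for learned projections, and I am ignoring multiple heads, masking, residual connections, and layer normalization). The softmax of the query-key products is exactly the word-to-word table from earlier, and it is used to take a weighted average of the value vectors for each word.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of word vectors X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    table = softmax(Q @ K.T / np.sqrt(d_k))  # the word-to-word table from earlier
    return table @ V                         # each word becomes a weighted mix of value vectors

rng = np.random.default_rng(0)
d_model, d_k = 512, 64                       # sizes used in the original paper
X = rng.normal(size=(4, d_model))            # "I am a human": 4 word vectors
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 64)
```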

On the right, we have a decoder layer. The decoder layer is almost identical to the encoder layer, except that it adds one middle step. That middle step is the encoder-decoder attention. This sublayer is the one that uses the information carried over from the output of the final encoder layer.

Both the encoder and decoder layers feed into a feed-forward neural network before shipping their values on to the next layer.
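
Putting the last few paragraphs together, here is a structural sketch of the two layer types, with identity stubs standing in for the real sublayers (real Transformer layers also wrap each sublayer in a residual connection and layer normalization, which I omit here):

```python
import numpy as np

# Identity stubs standing in for the real sublayers; only the wiring matters here.
def self_attention(x): return x
def encoder_decoder_attention(x, encoder_output): return x
def feed_forward(x): return x

def encoder_layer(x):
    x = self_attention(x)                             # sublayer 1: self-attention
    return feed_forward(x)                            # sublayer 2: feed-forward network

def decoder_layer(x, encoder_output):
    x = self_attention(x)                             # sublayer 1: self-attention
    x = encoder_decoder_attention(x, encoder_output)  # sublayer 2: attend to the encoder output
    return feed_forward(x)                            # sublayer 3: feed-forward network

tokens = np.zeros((4, 512))
print(decoder_layer(tokens, encoder_output=np.zeros((5, 512))).shape)  # (4, 512)
```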

If you're anything like me, looking at this very conceptual picture has probably left you eager to dive into a more concrete analysis in order to truly understand what's going on. For that, we'll need to go into each step, and look at the architecture like an algorithm, rather than a picture. Part 2 of this post will do just that.

Just keep in mind that having a rough idea of what's going on will make all the little matrix computations and notation a little less painful. In my opinion, learning benefits immensely from an abstract first pass through the material followed by a more detailed second pass. So with that, join me tomorrow as I unravel the deeper mysteries of the Transformer.
