What are transformers, and some ways to think about them.
Modern LLMs are transformers. An LLM is a transformer in the same way that a building might be an ‘Art Deco building’ – a specific instance of a general style. Like with Art Deco, there are many variants of buildings that would still be considered the same style. Similarly, while there are many different variants of transformer-based LLMs, they all follow a similar architecture. However, not every building is an Art Deco building, and there exist LLMs that are not transformers. That being said, among modern LLMs today, virtually all decent models are transformers.
Art Deco isn’t just an architectural style. You also have fashion, jewelry, furniture – even fonts! – that we call Art Deco. Similarly, this same transformer architecture can be applied to domains beyond text. You can have transformers trained to understand images, parse DNA sequences, generate audio, and so on.
How can transformers do all these different things? Well, at its core, a transformer is a next token prediction machine. In text, tokens are fragments of words. We feed the model a string of input text. It reads it all at once, and comes up with a prediction for the word-fragment that’s most likely to follow everything that’s come so far. Then you take this new token, glom it onto the rest, and pass the entire thing through the model again. And you repeat this until you’ve got as many tokens as you want.
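Here’s what that loop looks like in code. This is just a minimal Python sketch: the five-word vocabulary is made up, and `model` is a toy stand-in that returns random scores rather than a trained transformer. The point is the shape of the process, not the predictions.

```python
import numpy as np

# A toy stand-in: in a real LLM, model() would be a trained transformer.
# Here it just returns random scores over a made-up 5-token vocabulary,
# so the loop runs end to end and shows the shape of the process.
VOCAB = ["The", " river", " bank", " was", " muddy"]

def model(tokens: list[int]) -> np.ndarray:
    rng = np.random.default_rng(len(tokens))   # deterministic toy scores
    return rng.random(len(VOCAB))              # one score per possible next token

def generate(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = model(tokens)                 # read everything so far, all at once
        next_token = int(np.argmax(scores))    # the most likely next fragment
        tokens.append(next_token)              # glom it onto the rest
    return tokens                              # ...and repeat

print("".join(VOCAB[t] for t in generate([0], max_new_tokens=4)))
```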
This process isn’t limited to text, though. As long as you can convert something into tokens – sequence-fragments – you can put it through a transformer. DNA sequences can be broken up into smaller lengths; audio can be split into shorter snippets. I don’t know exactly how it works for images, but there’s certainly some clever way to segment them as well.
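As a toy illustration of the “anything you can tokenize” point, here’s a DNA string chopped into fragments and mapped to integer ids. The three-letter chunking is just something I picked for the example, not necessarily how real genomics models tokenize.

```python
# Chop a DNA sequence into fixed-length fragments and map each fragment
# to an integer id. Once it's a list of ids, the transformer doesn't care
# whether they started life as word-pieces, DNA, or audio snippets.
dna = "ATGCGTACGTTA"
fragments = [dna[i:i + 3] for i in range(0, len(dna), 3)]   # ['ATG', 'CGT', 'ACG', 'TTA']
vocab = {frag: idx for idx, frag in enumerate(sorted(set(fragments)))}
token_ids = [vocab[frag] for frag in fragments]
print(fragments, token_ids)
```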
——
“Transformer” is a confusing term, because it can mean two things: the architecture of the whole model, and a single layer within that model. It’s like how you’d call the thing in the image “a wafer”, and you’d also call a single layer of it “a wafer”. And just like with wafers, a transformer model is composed of many transformer layers stacked on top of each other.
But unlike a wafer, a transformer lets stuff through. The metaphor here is information flow. We start with some input information – our text – that begins all the way upstream. Once we feed it into the model, this river travels down, layer by layer, until at the very end, when it’s been transformed into the next most-likely token.
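Here’s a skeletal sketch of that stack, with made-up sizes and a placeholder layer. What a layer actually does is sketched further down; for now the point is just the shape of the flow: the input enters upstream, passes through the same kind of layer over and over, and only at the very end gets turned into scores for the next token.

```python
import numpy as np

n_layers, d_model, vocab_size = 4, 8, 100
rng = np.random.default_rng(0)
unembed = rng.normal(size=(d_model, vocab_size))   # final projection to token scores

def transformer_layer(states: np.ndarray) -> np.ndarray:
    return states   # placeholder; the real two-part layer is sketched later

def next_token(states: np.ndarray) -> int:
    for _ in range(n_layers):                      # layer by layer, downstream
        states = transformer_layer(states)
    scores = states[-1] @ unembed                  # read off the last position
    return int(np.argmax(scores))                  # the next most-likely token

streams = rng.normal(size=(10, d_model))           # 10 tokens' worth of input
print(next_token(streams))
```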
Our transformer, however, doesn’t just predict the token that comes after the entire text. It also predicts tokens that are likely to follow each sub-sequence in the text. “What comes after ‘Our’?” “What comes after ‘Our transformer’?” And so on. This felt weird when I first saw it. Why bother with all these intermediate predictions, when we only care about the last one? Wouldn’t it be much quicker to skip them?
The way it works, actually, is that we get all these intermediate predictions for free. It helps to imagine it as a river. Our river starts off very wide – it’s as wide as the entire text we feed in. And, because of how a transformer works, this river remains just as wide as it flows downstream. Each rivulet is a single token, flowing down the slope alongside the others. And all these streams go through the same layers of alchemical transformation, until, at the very end, each individual stream has morphed into a prediction for the token that best comes next.
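Continuing the sketch above: nothing in that forward pass is special about the last position. If you project every position’s state instead of just the final one, you get one prediction per prefix, all from the same pass. Sizes are still made up.

```python
import numpy as np

seq_len, d_model, vocab_size = 6, 8, 100
rng = np.random.default_rng(0)
states = rng.normal(size=(seq_len, d_model))   # one stream per token: the river at full width
unembed = rng.normal(size=(d_model, vocab_size))

all_scores = states @ unembed                  # shape (6, 100): one prediction per position
print(all_scores.shape)

# Row 0 answers "what comes after the first token?", row 1 "what comes
# after the first two?", and so on. Generation only reads the last row,
# but the rest came along for free.
print(int(np.argmax(all_scores[-1])))          # the token we'd actually emit next
```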
—— ——
A follow-up question, here, is how all these individual streams interact. What’s going on with this river? Can we split the streams apart and say something about how they flow? Or is the river turbid & bubbly, with the details lost in the froth?
Each transformer layer has two parts. The first part moves information between streams; the second part processes it. Movement, here, is the critical step – it’s what makes transformers so effective. Say the last token of my input is “bank”. If I can’t see any of the words that come before it, then I don’t really know what the word means. Is this a river bank or a bank that loans money? If it’s the former, is it muddy, dried up, or covered in grass? Why do I care about the bank? Am I reading a geography textbook, an environmental report, or an idyllic folk tale? Without all this context – context that requires moving information – a transformer only knows about that one, single token. In that case, its best prediction strategy would be to learn bigram patterns.
So to predict the next token, the model needs to soak up the context of everything that has come before it. The first half of each layer is responsible for streams crossing over and mixing information around. The second half then processes & assimilates everything that’s newly been added. Repeat this dozens of times across the length of the river, and you’ve got yourself a transformer.
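Here’s a simplified sketch of one layer, assuming tiny made-up dimensions and random weights. Real layers add multi-head attention, layer normalization, and a few other details; what I want to show is just the two-part split: the attention half moves information between streams, and the MLP half then processes each stream on its own.

```python
import numpy as np

d_model, d_ff, seq_len = 8, 32, 5
rng = np.random.default_rng(0)
Wq, Wk, Wv = [rng.normal(size=(d_model, d_model)) for _ in range(3)]
W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))

def attention(x: np.ndarray) -> np.ndarray:
    """Part 1: each stream looks back at earlier streams and pulls in context."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    future = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    scores[future] = -np.inf                 # "bank" can look at "river", not at the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                       # each token becomes a mix of what came before

def mlp(x: np.ndarray) -> np.ndarray:
    """Part 2: each stream is processed on its own, with no cross-talk."""
    return np.maximum(x @ W1, 0) @ W2

def layer(x: np.ndarray) -> np.ndarray:
    x = x + attention(x)                     # move information between streams
    x = x + mlp(x)                           # then assimilate it, token by token
    return x

streams = rng.normal(size=(seq_len, d_model))   # "The river bank was muddy"
print(layer(streams).shape)                     # (5, 8): same width, a little further downstream
```

Stack a few dozen of these, one after another, and that’s the river from earlier.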
——
There’s something unsettling about all this happening in parallel. “The river bank was muddy.” Not only does the model have to understand that muddy refers to the bank, but also that this is a river bank in particular. It’s simultaneously unraveling the meaning of each word by looking at other words; and it’s understanding those by looking at yet other words, and so on. It’s a weird, bootstrappy, hermeneutic unpacking – which somehow manages to work.
A consequence of this is that a model can be bottlenecked by how many layers it has. Fewer layers means fewer steps to transform the information. Say I give a model with 5 layers a text of 1000 words. Then, at most, it’ll be able to make five hops between pieces of information. Depending on how complex the text is, this might not be enough to understand everything fully.
This parallelism also means, qualitatively, LLMs process text differently than us. When you or I read a book, we do so serially – one sentence at a time. This takes a while. However, as we read, we’re also able to process and assimilate previous parts of the text. This means we build richer connections between the information we’ve read. Our network is dense. LLMs, though, process all this text in parallel. While they’re *much* quicker than us, they’re also limited by how deeply they can process it.
It’s like how you or I might glance at a painting. We’re able to absorb everything at once, and form a quick impression of what’s going on. We get a sense of the subject matter, the colors, the composition, what strikes us about the painting – all from one glance. Similarly, LLMs process text, especially large amounts of it, in an impressions-based, intuitive manner. They can quickly summarize & convey the gestalt of what’s going on. What’s tougher is the details. This is partly why ‘reasoning models’ have been so successful. They give LLMs a ‘scratchpad’ to process & understand the information more deeply; a ‘system two’ to complement the default ‘system one’.
Intuitions for reasoning models, though, are a topic for another post.