As I mentioned in an earlier post (which I don't plan to post to LessWrong), I’m currently working in technical AI safety. Today, the main object of study of technical AI safety is the Transformer model, first introduced in 2017 by — say it with me — Vaswani et al. Different people have very different perspectives on how we should think about Transformers, and this motivates a lot of opinions about how AI will develop in the near future, as well as how we should try to align AI today. I think some of these perspectives are good, and some are bad, but all are worth understanding to improve conversations around modern AI[1]. I’m going to describe — to the best of my ability — the major perspectives as I see them. I'm sure that there are more, but these are the main ones I want to discuss. If there's one I didn't include that you think is very informative, let me know in the comments!
I’ll start with three “black-box” perspectives. These perspectives don’t require much of an understanding of the internal structure of a Transformer model, instead viewing it as, well, a black-box. That is to say that they focus primarily on the input-output behavior of a Transformer — similar to the behaviorist perspective in psychology that was dominant for so long.
I’m beginning with a perspective that is at once the most common perspective, and also not really a perspective at all. The folk-perspective is mostly a grab-bag of intuitions from folk-understanding of previous technologies such as ELIZA or Amazon’s Alexa.
There is not really one unified “Folk perspective,” although there are some common hallmarks. I think there are basically two canonical folk perspectives worth discussing — one that used to be more prevalent with early LLMs, especially among older individuals; and a new folk perspective that is on the rise today, especially among teenagers and zoomers.
This perspective gained prominence among those who tried out ChatGPT early on. However, it never really gained traction with those in AI, because it often relies on misguided intuitions about LLMs.
The early folk-perspective is best characterized by references to explicit “choices” made by companies when it comes to responses given by an LLM, as though the company has control over what their model actually says in any particular context. For example, this folk perspective may look at xAI’s Grok when it went haywire on X and believe that Elon Musk made a purposeful choice for Grok to say what it did (i.e. inferring malice instead of negligence). For another example, people with this perspective may think that the left-wing bias of early LLMs was given to them explicitly by the companies that trained them — which it was not (to the best of my knowledge)[2].
Under this folk perspective, it also makes sense to try using LLMs to do relatively difficult arithmetic problems, believing that, since it is a “computer,” calculations must be easy for it. These days, ChatGPT will mostly get those calculations correct; however, in the past, ChatGPT would often get such questions wrong, especially if they required chaining multiple steps together. This is very surprising under the early folk-perspective. This would lead those holding this perspective to believe that there is some error with the way the models’ calculations work — or that the model was glitching in some way, as “computers should not get calculations wrong.” In truth, this is just a misunderstanding of the way that LLMs work.
Now that models are much better, there is a second folk perspective gaining traction, which leans much more heavily into anthropomorphism of LLMs. This perspective suggests that LLMs have desires, want to help, and are listening to you with rapt attention. I expect this folk perspective to grow over time, eventually likely outcompeting the early folk-perspective.
I also think this perspective is more defensible than the early folk-perspective in principle, but “misses the trees for the forest.” Moreover, this perspective — when taken to extremes — can lead to really bad outcomes, such as not checking work, and believing that the model’s advice is reliable and meaningful, similar to the advice you may be given by e.g. a human therapist, doctor, or friend.
This new folk-perspective is also much more “fragile” to learning information about how models actually work, much as was the case for previous advances in AI. For example, a chess-playing AI initially seems like a computer that really understands chess. Someone without much background in AI may think this computer must be moderately intelligent. However, when you explain how it works in detail, e.g. by describing the Alpha-Beta pruning algorithm or Monte-Carlo Tree Search, people feel that the intelligence has been somehow deflated or explained away. I think this deflation — in a meaningful way — is less true for LLMs than it is for chess AIs; however, people often react in a similar fashion when they learn how today’s models work, whether this is correct or not. They then tend to move to the next perspective in this list.
The next-token predictor perspective is an incredibly common one, especially on X/Twitter, among academics (i.e. Bluesky), and with those who know slightly more about how LLMs actually work than those who hold either of the folk-perspectives[3]. The crux is that LLMs — being deep-learning models which output a likelihood for what the next word is going to be — are simply “predicting the next likeliest token.” This perspective supposes that LLMs are essentially performing an interpolation from all the data they’ve seen so far to try and determine what the most probable next token is — and not much more than that.
Usually, the implication of this is that LLMs are therefore unable to come up with anything novel. This is also backed up by experience. LLMs are much better at programming in Python than in e.g. OCaml, as the latter is a much rarer programming language in data sources online. The data distribution does seem to have a large effect on how good the model can be at certain tasks — exactly what this theory would predict!
There are, however, a few issues with this thesis as it is usually stated: that models are literally, or even approximately, just doing next-token prediction. This is certainly a significant part of the training process of modern LLMs, but it is absolutely not the whole story. This is what a model that is literally just doing next-token prediction does when asked a question:
“We’re asking capital city questions!” The model thinks, “I know how to do this!” (The model output is everything after “Model output.” It kept going for a while)
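If you want to reproduce this kind of behavior yourself, here is a minimal sketch using the Hugging Face transformers library with GPT-2 (chosen only because it is a small, freely available base model with no chat training). The exact continuation will vary from run to run, but it will typically carry on the pattern of the prompt rather than answering it.

```python
# Minimal sketch: what a pure next-token predictor does with a question.
# GPT-2 is used here purely as an illustrative base model with no chat training.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "What is the capital of France?"
output = generator(prompt, max_new_tokens=40, do_sample=True)

# Rather than answering, a base model tends to continue the pattern,
# e.g. by producing more quiz-style questions.
print(output[0]["generated_text"])
```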
This is, of course, very different from what we would expect from ChatGPT. That’s because LLMs get additional training beyond just how to “predict the next-token.” This training comes in three stages.
First, they need to understand what a “chat context” is, which requires: the distinction between user and assistant; the fact that there will be messages from a user and an assistant that (mostly) alternate; the fact that the LLM itself is the assistant in the exchanges. So they are trained on chat contexts until they understand this — this is done via a training method called “supervised fine-tuning”[4].
Second, they need to understand that they should be a “helpful, harmless, and honest” chat assistant. This is done via RLHF[5] (Reinforcement Learning from Human Feedback). By the end of this process, we get a model like the original ChatGPT (which came out in late November of 2022).
Third, we put the model through “reasoning” training[6]. This is a somewhat novel innovation (September 2024), and it works similarly to RLHF, but attempts to get the model to be “smarter” instead of more “helpful, harmless, and honest.” This is what causes modern LLMs to say that they’re “thinking,” before they respond.
Hopefully, you can see why I’m not that sympathetic to the “next token predictor” perspective on LLMs. It is true that the majority of the compute used to train LLMs does go into training them to get good at next token prediction (for now), as this generally upper bounds how good the model can get after the later stages of training — so this perspective is not entirely unreasonable. However, it’s missing any description of the innovations that have brought LLMs to the attention of the very people who tend to hold this perspective.
There’s an alternative perspective that says that LLMs are actually mostly next-token predictors. However, this alternative perspective would say that the job of next-token prediction is actually incredibly difficult[7]! The fact that LLMs are able to predict next tokens as well as they do should astonish us, since the difficulty of being able to reliably predict data on the internet is highly non-trivial. Imagine, for example, that you were given a list of numbers, each followed by its prime factors, as follows:
5744, 2, 2, 2, 2, 359
10201, 101, 101
...Predicting this text is going to be very difficult indeed, as no polynomial-time classical algorithm for factoring numbers into their prime factors is known (and it is widely believed that none exists). Moreover, an LLM predicting this text would need to make its best guess in a single computation! That is, without the ability to “think carefully” before it outputs an answer.
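To make the setup concrete, here is a small sketch (using sympy, my choice for the illustration) of how a corpus of such lines could be generated. Continuing any one of these lines correctly requires actually factoring the number on the left.

```python
# Sketch: generating "number, prime factors" lines like the ones above.
# Predicting the continuation of each line requires factoring the number.
import random
from sympy import factorint

random.seed(0)
for _ in range(5):
    n = random.randint(1_000, 100_000)
    factors = []
    for prime, exponent in factorint(n).items():
        factors.extend([prime] * exponent)
    print(", ".join(str(x) for x in [n] + sorted(factors)))
```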
This perspective claims that the type of intelligence which has gotten incredibly, marvelously good at next-token prediction is much more powerful than we would naively expect. This is not just because of “prime factorization” games like the one above (which certainly can be found on the internet). LLMs also have some ability to model and predict the next word that e.g. Terence Tao is going to type on his blog. This indicates a high level of intelligence indeed. Even for the average Reddit poster, modelling them well enough to predict exactly what they’ll type next is not easy! This leads naturally to the next perspective.
This perspective posits that LLMs are not “doing language” similar to the way that humans do language[8], but are instead ‘mimicking’ language — analogously to how physics models mimic physics. A physics model will not get all the rules of fluid dynamics totally correct, but it will ensure that momentum is conserved, that energy is conserved, that fluids obey a rough approximation of the Navier-Stokes equation.
A language model, this perspective says, is similar. However, instead of conservation of momentum, the fundamental rules are more like:
Then LLMs can also, on this view, instantiate characters, much in the same way that a good-enough physics model could in principle instantiate a human by accurately modelling which neurons would fire[9], and what the human would proceed to say as a result. However, modelling characters is much easier for a language model, as the fundamental unit of reality for a language model is the linguistic token. Moreover, they need to be able to faithfully “simulate” characters in order to predict text effectively.
Then, when SFT (supervised fine-tuning) and RLHF (Reinforcement Learning from Human Feedback) are applied to the language model, the base model is molded into a simulation of a helpful AI assistant. Sometimes, this simulation decides to “go rogue” — based on previous examples of AI assistants going rogue (as occasionally, in the text they’ve been predicting, AI assistants go rogue, e.g. in science-fiction). So, this perspective says: the chatbot that you’re interacting with is a simulation of an AI assistant performed by a very alien “language simulator” that has no innate desires, wants, or needs[10]. This is captured well by the famous “Shoggoth meme”.
White-box perspectives are generally more mathematically flavored and require knowledge of the internals of a transformer. So, in order to understand these perspectives, it is necessary to know roughly what is going on inside a transformer. I will do my best to explain this quickly, but if you want more information, there are many good explainers elsewhere online. There are essentially three components of a transformer:
The MLP layers: An MLP is a fully connected network. It is essentially the traditional view of what AI is, as presented in this image below:
The Attention layers: This was the big innovation that made the Transformer so successful. Attention layers basically allow information to be passed between different positions in the text. They are quite complicated to explain in detail — at a high level, if I give a Transformer the sentence:
I saw a black swan and a grey goose. I fed the black…
Then the attention layers allow the model to store the information “the thing being described by ‘black’ is a swan” at the position of the word “swan”[12]. Then, when the model encounters the word “black” again, it can look back at that earlier position and pull the information forward, so that it knows the next token should be “swan.” This is incredibly useful for language modelling in general.
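For readers who want slightly more detail, here is a minimal sketch of single-head causal self-attention in PyTorch. It omits real-world details (multiple heads, biases, output projections), but it shows the core operation: each position computes a weighted mixture of information from itself and earlier positions.

```python
# Minimal sketch of single-head causal self-attention.
# Real models use multiple heads, biases, and an output projection.
import math
import torch
import torch.nn.functional as F

def causal_attention(x, w_q, w_k, w_v):
    # x: (sequence_length, d_model) activations for one sequence
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(k.shape[-1])
    # Causal mask: each position may only attend to itself and earlier positions.
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    # Each output row is a weighted mix of the values at attended-to positions.
    return weights @ v

seq_len, d_model = 12, 64
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) * 0.02 for _ in range(3))
print(causal_attention(x, w_q, w_k, w_v).shape)  # torch.Size([12, 64])
```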
The first perspective is the “mathematical framework” for Transformers that was established by Anthropic in their paper “A Mathematical Framework for Transformer Circuits,” published at the end of 2021. This view aimed to treat transformers as fundamentally “linear” models, with some annoying non-linearities added in. One important claim of this perspective is the importance of the residual stream. The residual stream is what all of the layers described above add their results to, and what the layers read from in order to perform their calculations. It’s not really a component of the transformer like the attention or MLP layers; it’s just how information moves from earlier layers to later layers.
However, under this view, it’s one of the most important parts of the transformer — it is the “information highway” along which all of the Transformer’s information and calculations get passed.
This view would state further that the layers of the Transformer, both the Multi-Layer Perceptron (MLP) layers and the attention layers, are essentially performing “edits” to the residual stream in order to iteratively improve the model’s accuracy at predicting tokens, as you can see here:
However, the attention and MLP layers have different and complementary goals under this perspective. The MLP layers act as a “store of information,” so that if a model “remembers” a fact, such as “Paris is in France,” then this information will mostly lie somewhere in the weights of the MLP. The MLP also enables the model to do computations, so that if it needs to calculate the result of e.g. a mathematical expression, the actual calculation will mostly occur somewhere in an MLP (or distributed across multiple MLP layers). The attention layers then allow the model to pass information between different tokens as I described earlier, which also includes all the computations that an MLP may have done.
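Here is a rough sketch of what that “information highway” looks like in code: a standard pre-norm transformer block, where each sub-layer reads from the residual stream and adds its edit back in. The particular attention and MLP implementations are placeholders; the point is the two `x = x + ...` lines.

```python
# Sketch of the residual-stream view: each sub-layer reads the stream,
# computes an "edit," and adds it back. (Pre-norm block; the attention and
# MLP sub-layers here are stand-ins for the real thing.)
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        # Attention reads from the residual stream and writes its edit back.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # The MLP does the same: read, compute, add the result to the stream.
        x = x + self.mlp(self.ln2(x))
        return x
```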
The naive version of this perspective was a hopeful one! It claimed that Transformers are, for the most part, a composition of a bunch of linear operations. Linear operations are generally not too difficult to understand and disentangle. So long as everything is represented linearly, we’ll be able to understand what’s going on inside a Transformer — sure, there were some issues with nonlinearities: the activation function[13], LayerNorm[14] — but those are just details.
It soon became clear there was a bigger issue.
Superposition is when models have to represent more things than they have dimensions or neurons[15]. This means that dimensions can’t correspond easily to “things the model is doing,” and poses a major challenge for interpreting what the model is doing. There are two types of superposition — “bottleneck superposition,” and “neuron superposition.”
Bottleneck superposition is intuitively not too difficult to understand. If there are 50,000 tokens in your vocabulary, but only 1000 dimensions, then it can’t be that each token is assigned its own dimension — there must be some “interference” between the embeddings of different tokens, just for storage. However, this issue is not too difficult to address. We just need to do the work of disentangling where different tokens — and information about these tokens — gets stored. This is doable.
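A quick numerical sketch of why this packing works at all (the sizes here are just for illustration): random directions in a 1,000-dimensional space are nearly orthogonal, so you can cram in far more than 1,000 of them while keeping the interference between any two of them small.

```python
# Sketch: random directions in 1,000 dimensions are nearly orthogonal,
# so many more than 1,000 features can be stored with only small interference.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 1_000, 10_000  # far more "features" than dimensions

features = rng.standard_normal((n_features, d_model)).astype(np.float32)
features = features / np.linalg.norm(features, axis=1, keepdims=True)

# Interference = cosine similarity between distinct feature directions
# (measured on a sample of 1,000 features to keep this quick).
sample = features[:1_000]
overlaps = np.abs(sample @ sample.T)
np.fill_diagonal(overlaps, 0.0)
print(f"typical interference: {overlaps.mean():.3f}, worst: {overlaps.max():.3f}")
```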
The more difficult superposition is “neuron superposition.” This is when neurons (mostly in MLPs —though it has been observed in attention layers as well), actually do their computations in a distributed way[16]. This means that even if we managed to solve the issues with bottleneck superposition — doable, but certainly not an easy task by any means— we would still end up in a situation where we’re not sure how the model actually uses these concepts to compute things, since the computations are also all happening in superposition, and involve their own non-linearities.
Solving this issue has been the central organizing problem that those trying to understand Transformers have tried to address over the past three years. Progress has been made, and we’re definitely in a better place when it comes to understanding Transformers than we were, but it turns out that addressing superposition is much more difficult than we’d originally thought when the mathematical perspective was first established.
The final perspective on Transformers I’ll describe is a perspective on how they are trained, and how they get such impressive capabilities. This view departs strongly from the “next-token prediction” view of transformers, in favor of trying to explain both how they are so good at next-token prediction, and how they are good enough at generalizing to solve never-before-seen IMO problems.
Classically, in machine-learning, we are just trying to minimize our training objective — often called the “loss”[17]. For Transformers during pre-training, this loss function is basically “How well did you predict what the next token would be?” During RLHF it would be “How well did your response comport with Human Values?”[18]
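For the pre-training case, here is roughly what that loss looks like in code (a sketch; `model` is an assumed stand-in that maps token ids to a distribution over the vocabulary at each position):

```python
# Sketch of the pre-training objective: cross-entropy on next-token prediction.
# `model` is an assumed stand-in mapping token ids of shape (batch, seq)
# to logits of shape (batch, seq, vocab_size).
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    logits = model(tokens)           # (batch, seq, vocab_size)
    predictions = logits[:, :-1, :]  # the model's guesses at each position
    targets = tokens[:, 1:]          # the token that actually came next
    return F.cross_entropy(
        predictions.reshape(-1, predictions.shape[-1]),
        targets.reshape(-1),
    )
```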
The “energy minimization perspective” says that something else is going on too. It claims that, due to a combination of gradient descent, the structure of our loss function, and the fact that there are symmetries within transformers, we’re implicitly also training with a “simplicity” prior. This means that early in training, the model focuses on minimizing loss by manually learning the rules of how to predict tokens. However, later in training, the main thing that affects the model’s learning is how “simple” or “generalizable” the model’s algorithm for predicting the next token is. This causes models to have a bias towards simple algorithms for predicting the next token — this allows for much more capacity to generalize[19] than we would naively expect under the “predict next token” framework.
This is called the “energy minimization perspective” because in Bayesian learning theory, what causes models to reach simpler and more generalizable solutions is the fact that they are minimizing a quantity called the “free energy” of the system[20]. It has been proved that we can represent the free energy as basically a “loss minimization” term and a “simplicity” term (in the limit). The free energy perspective says that to really understand a transformer, we need to understand the effects of this simplicity term, as this is what allows transformers to be so powerful and to generalize so effectively as we increase the amount of data we show them[21]. This perspective has spurred a lot of work in singular learning theory as applied to modern AI models.
This has been a pretty long post by my standards, so to conclude I’ll just give my takes on what perspectives I think are true. I think the simulator perspective, the superposition perspective, and the free energy perspective are basically true. The rest of them I think are either oversimplified (the mathematical perspective — though it was great for the time — and the next-token predictor perspectives) or just plain wrong (the folk-perspectives).
However, you don’t need to agree with me! I hope this post has left you in a more informed position to make up your own mind.
I’m hoping for this post to be a pretty accessible description of the major current perspectives on transformers. So I’ll warn that I’m going to elide some of the details of current training processes (which are actually incredibly complex nowadays), as well as, in the later sections, some of the mathematical detail. I’ll try and provide links to more information wherever possible.
Though they also probably didn’t work too hard to prevent it. But it wasn’t a conscious choice in the way that this perspective often posits.
Basically this:
The actual mechanics of how supervised fine-tuning works, especially in the chat context: we set aside some special tokens that are never encountered during pre-training. These tokens designate things like “this is a user message” and “this is an assistant message” (there are others, but let’s focus on the minimal example).
Then, after the model has learnt how to predict text on the internet effectively, we give it a bunch of examples of “chat histories” that involve these tokens and clarify to the model that it is the assistant. So, in this phase of training, the model never learns to predict the user’s message; it is trained only to predict the things that a chat assistant would say.
This training essentially works the same as pre-training, although during pre-training — because we do so much of it — we only run the model over the scraped internet once, since there are diminishing returns to doing it twice. The chat training set is much smaller, so we can run the model over it multiple times, and often do. By the end of it, the model will understand that it should be playing the role of an assistant.
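Here is a sketch of what preparing one of those chat-history examples might look like. The special token strings (<|user|>, <|assistant|>, <|end|>) are made up for illustration; every lab uses its own format. The key idea is the loss mask: only the assistant’s tokens contribute to the training loss.

```python
# Sketch of supervised fine-tuning data preparation. The special tokens and
# the "tokenize" function are illustrative, not any lab's actual format.
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"

def build_example(chat, tokenize):
    """chat: list of (role, text) pairs; tokenize: str -> list of token ids."""
    token_ids, loss_mask = [], []
    for role, text in chat:
        special = USER if role == "user" else ASSISTANT
        segment = tokenize(special + " " + text + " " + END)
        token_ids.extend(segment)
        # Only the assistant's tokens are trained on; user tokens are masked out.
        loss_mask.extend([1 if role == "assistant" else 0] * len(segment))
    return token_ids, loss_mask

# Toy usage with a fake word-level "tokenizer":
fake_tokenize = lambda s: s.split()
chat = [("user", "What is the capital of France?"),
        ("assistant", "The capital of France is Paris.")]
ids, mask = build_example(chat, fake_tokenize)
print(list(zip(ids, mask)))
```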
There are other methods, but they’re generally conceptually quite similar to RLHF.
RLHF works as follows: we ask a lot of humans to compare pairs of model responses and say which one is better (A/B comparisons). From this we can infer an “ordering” over responses, and train a different model to predict how high a given response would come in that ordering.
We then get the LLM to generate a bunch of responses (since it now understands the chat context it should be in), and train it to increase the likelihood that a human would say “Yes, this is a good response” and decrease the likelihood of “No, this is a bad response.”
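A sketch of the first half of this, training the “ordering” model (usually called a reward model), under the standard pairwise setup. `reward_model` is an assumed stand-in for a model that assigns a single score to a (prompt, response) pair.

```python
# Sketch of reward-model training from pairwise human preferences.
# `reward_model` is an assumed stand-in that maps (prompt, response) to a
# scalar score; the loss pushes the preferred response's score above the
# rejected one's (the standard pairwise-preference objective).
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen_response, rejected_response):
    score_chosen = reward_model(prompt, chosen_response)      # scalar tensor
    score_rejected = reward_model(prompt, rejected_response)  # scalar tensor
    # Maximize the probability that the human-preferred response scores higher.
    return -F.logsigmoid(score_chosen - score_rejected)
```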
This gets the model to think for longer, consider its response, and try and weigh all the possible choices until finally the model outputs an answer that is hopefully more accurate than if it didn’t think.
It works by getting the model to output a large number of tokens before answering, and models naturally will use those tokens to help them come up with their final answer. If their final answer to a question —usually a programming or mathematics problem — is correct, then we encourage those thought patterns. If it gets the question wrong, we discourage those thought patterns.
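A very rough sketch of that reward signal (everything here is schematic: `model.generate` and `check_answer` are assumed stand-ins, and real pipelines add many details such as KL penalties and partial credit):

```python
# Schematic sketch of "reasoning" training with a verifiable reward.
def reasoning_reward(model, problem, ground_truth, check_answer):
    # The model emits a long run of "thinking" tokens, then a final answer.
    thoughts, final_answer = model.generate(problem)
    # Reward 1 if the final answer checks out, 0 otherwise. Reinforcement
    # learning then makes the thought patterns that led to correct answers
    # more likely, and those that led to wrong answers less likely.
    return 1.0 if check_answer(final_answer, ground_truth) else 0.0
```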
You can try it yourself here.
I.e. to communicate inner feelings, desires, wants.
Ignoring debates about physicalism.
However, the character that they create may have wants, desires, and needs, much the same way that a human simulated inside a physics model could have wants, desires, and needs, and be a moral patient.
Okay, a word is not necessarily a token and a token is not necessarily a word, but that is mostly an unnecessary detail. If, whenever you hear “token,” you think “word,” you will be 95% correct.
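If you want to see the mismatch concretely, here is a quick sketch with the GPT-2 tokenizer (chosen just because it is easy to download); common words tend to be single tokens, while rarer words get split into pieces.

```python
# Sketch: words vs. tokens, using the GPT-2 tokenizer as an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
for text in ["The swan", "interpretability", "Navier-Stokes"]:
    print(text, "->", tokenizer.tokenize(text))
# Common words tend to be single tokens; rarer words get split into pieces.
```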
I.e. so the model knows that the adjective “black” is modifying the word “swan.”
Activation functions are explained well here. They basically stop the model from being one big matrix multiplication.
LayerNorm is explained relatively well here, don’t worry about it too much though. It’s not that necessary to understand what Transformers are doing. We need LayerNorm for arcane training reasons, and in fact, we can remove it if we’re careful.
I know it sounds like some quantum woo thing. It’s not; they just chose “superposition” because you can never be certain which feature a certain vector corresponds to.
I know this is vague, but I really cannot go into more detail about this here. It would take very long to explain. There’s lots of good information about “computation in superposition” online though!
Is this loss?
By which, of course, we mean the values of workers paid below-minimum-wage to trawl through horrific model outputs somewhere in the Global South.
Since simple algorithms generalize better. This has been generally observed. It’s basically Occam’s razor.
The free energy is:

$$F_n = -\log \int_W e^{-n L_n(w)}\, \varphi(w)\, dw$$

where $n$ is the number of datapoints, the integral runs over all possible parameters $w$, $L_n(w)$ is the loss function for the weights, and $\varphi(w)$ is the prior probability of the parameters — but don't worry about it too much.
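The decomposition mentioned in the main text, stated loosely (this is the standard asymptotic result from singular learning theory): for large $n$,

$$F_n \approx n L_n(w^*) + \lambda \log n,$$

where $w^*$ is the best-fitting parameter and $\lambda$ (the “learning coefficient”) measures how complex the learned solution is. The first term is the “loss minimization” term and the second is the “simplicity” term.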
It says a lot of other things too, but much like the free-energy people, I’m going to approximate to first-order!