Since the release of ChatGPT in 2022, large language models (LLMs) based on the transformer architecture, such as ChatGPT, Gemini and Claude, have transformed the world with their ability to produce high-quality, human-like text and, more recently, images and videos. Yet behind this incredible capability lies a profound mystery: we don’t understand how these models work.
The reason is that LLMs aren't built like traditional software. A traditional program is designed by human programmers and written in explicit, human-readable code. LLMs are different: instead of being programmed, they are trained to predict the next word on vast amounts of internet text, growing a complex network of billions or trillions of connections that enables them to understand language and perform tasks. This training process produces emergent knowledge and abilities, but the resulting model is messy, complex and hard to understand, because training optimizes the model for performance, not for interpretability or ease of understanding.
The field of mechanistic interpretability aims to study LLMs and reverse engineer the knowledge and algorithms they use to perform tasks, a process that is more like biology or neuroscience than computer science.
The goal of this post is to provide insights into how LLMs work using findings from the field of mechanistic interpretability.
Today the transformer, an ML model architecture introduced in 2017, is the most popular architecture for building LLMs. How a transformer LLM works depends on whether the model is generating text (inference) or learning from training data (training).
Usually when we interact with an LLM such as ChatGPT, it’s in inference mode. This means that the model is not learning and is optimized for outputting tokens as efficiently as possible. This process is autoregressive: the model generates one token at a time, appends it to the input, and uses the new, longer sequence to generate the next token.
Initially, the model takes a sequence of N tokens as input, and its first task is to predict the token at position N+1. To do this, it processes the entire prompt in parallel and produces a prediction for the next token. This prediction takes the form of a probability distribution over the model's entire vocabulary, which can exceed 100,000 tokens: every possible token is assigned a probability indicating how likely it is to be the next token. A sampling strategy, such as greedy decoding (which simply chooses the token with the highest probability), is then used to select a single token from this distribution. The newly selected token is appended to the sequence, and the entire process repeats to generate the next token, continuing the loop until the full response is formed.
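As a rough illustration, here is a minimal sketch of this autoregressive loop with greedy decoding. The `model` here is a hypothetical stand-in for any transformer LLM that maps a batch of token IDs to next-token logits; real inference code adds batching, sampling strategies and a KV cache (discussed below).

```python
import torch

def generate(model, input_ids, max_new_tokens=20, eos_id=None):
    """Autoregressive greedy decoding: repeatedly pick the most likely next token."""
    tokens = input_ids                                    # (1, N) tokenized prompt
    for _ in range(max_new_tokens):
        logits = model(tokens)                            # (1, seq_len, vocab_size)
        next_logits = logits[:, -1, :]                    # prediction for the position after the last token
        probs = torch.softmax(next_logits, dim=-1)        # distribution over the whole vocabulary
        next_token = probs.argmax(dim=-1, keepdim=True)   # greedy: take the highest-probability token
        tokens = torch.cat([tokens, next_token], dim=1)   # append it and repeat
        if eos_id is not None and next_token.item() == eos_id:
            break                                         # stop once an end-of-sequence token is produced
    return tokens
```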
The inference process is broken up into two steps: prefill and decode.
This two-phase approach is made possible by a crucial optimization called the KV cache. The cache acts as short-term memory, storing intermediate calculations (the keys and values from the self-attention mechanism) generated during the parallel prefill phase. In the decoding phase, the model accesses this cache to get the contextual information from all previous tokens without having to re-process them, solving what would otherwise be a major performance bottleneck of redundant computation.
The input and output tensor (matrix) shapes are different for the prefill and decode steps. Assuming a batch size of 1 for a single user interaction, the prefill step takes the full prompt of N token IDs (shape (1, N)) and produces logits of shape (1, N, vocab_size), while each decode step takes only the single newest token (shape (1, 1)) and produces logits of shape (1, 1, vocab_size).
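The sketch below illustrates the idea with a toy single-head attention layer and an explicit KV cache (my own simplified construction, not any particular library's API; real models keep one cache per head and per layer):

```python
import torch

d_model = 64
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

def attend(q, K, V):
    scores = q @ K.T / d_model ** 0.5                  # (1, cached_len)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                                 # (1, d_model)

# Prefill: process all N prompt token vectors in parallel and cache their keys/values.
prompt = torch.randn(5, d_model)                       # (N, d_model) token activations
K_cache, V_cache = prompt @ W_k, prompt @ W_v          # (N, d_model) each

# Decode: only the newest token is processed; it attends over the cached keys/values,
# so the contextual information of previous tokens is reused instead of recomputed.
new_token = torch.randn(1, d_model)                    # (1, d_model)
q = new_token @ W_q
K_cache = torch.cat([K_cache, new_token @ W_k])        # the cache grows by one row per step
V_cache = torch.cat([V_cache, new_token @ W_v])
out = attend(q, K_cache, V_cache)                      # context from all previous tokens
```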
During training, the transformer produces N predictions, one for every token in the sentence. For each input position i, the output prediction at position i is the predicted token for position i + 1 (the next token). The ability to make multiple predictions increases training efficiency.
These predictions are compared to the actual words in the sentence and the prediction errors are used to update the parameters of the model and improve its performance.
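A minimal sketch of this training objective (again assuming a hypothetical `model` that returns next-token logits): the predictions are shifted by one position so that the output at position i is scored against the actual token at position i+1, and a single forward pass yields a loss over every position at once.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):                    # tokens: (batch, N) token IDs
    logits = model(tokens)                             # (batch, N, vocab_size)
    preds = logits[:, :-1, :]                          # predictions made at positions 1..N-1
    targets = tokens[:, 1:]                            # the actual "next token" at each position
    return F.cross_entropy(preds.reshape(-1, preds.size(-1)), targets.reshape(-1))

# loss = next_token_loss(model, batch); loss.backward(); optimizer.step()
# would then update the model's parameters to reduce the prediction error.
```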
In this section we will learn about the components that make up a transformer LLM rather than treating the whole model as a black box, as the first section did.
The following steps describe the sequence of events needed for a transformer to process an input sentence and output a new token.
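As a companion to these steps, here is a rough sketch of a simplified GPT-2-style forward pass in PyTorch (using the built-in multi-head attention for brevity; real implementations differ in many details, and the dimensions below are just GPT-2-like defaults):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer block: an attention sub-layer followed by an MLP sub-layer,
    each reading from and writing back into the residual stream."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):                                        # x: (batch, N, d_model)
        n = x.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)       # tokens attend only to earlier tokens
        x = x + attn_out                                         # attention output added to the residual stream
        x = x + self.mlp(self.ln2(x))                            # MLP output added to the residual stream
        return x

class TinyTransformer(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768, n_layers=12, n_heads=12, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)         # token embedding
        self.pos_emb = nn.Embedding(max_len, d_model)            # positional embedding
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_final = nn.LayerNorm(d_model)
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, token_ids):                                # (batch, N) token IDs
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)          # initial residual stream
        for block in self.blocks:
            x = block(x)
        return self.unembed(self.ln_final(x))                    # logits: (batch, N, vocab_size)
```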
Induction heads are specialized attention heads in transformer models that are considered a key mechanism for in-context learning. In-context learning is the ability of transformers to learn new tasks from examples given in the prompt (few-shot prompting).
Induction heads implement the following algorithm: “If token A was followed by token B earlier in the context, then when token A is seen again, the induction head strongly predicts that B will come next.”
Note that induction heads can genuinely be considered an algorithm, since they detect patterns in arbitrary repeated sequences even when those sequences never appeared in the training data.
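Here is a toy illustration of that rule operating directly on a token sequence (just the pattern-matching logic the head implements, not a real attention head):

```python
def induction_prediction(tokens):
    """If the last token A appeared earlier in the context followed by B, predict B."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):        # scan backwards through the context
        if tokens[i] == current:                    # found an earlier occurrence of A ...
            return tokens[i + 1]                    # ... predict the token that followed it (B)
    return None                                     # no repeated pattern found

print(induction_prediction(["Harry", "Potter", "said", "hello", "to", "Harry"]))  # -> "Potter"
```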
An induction head is a specialized attention head in a transformer that excels at in-context learning by detecting and completing patterns it has already seen in the prompt.
The induction circuit consists of two heads: a previous token head in the first layer and an induction head in the second layer:
Insight: induction heads are evidence that LLMs can learn algorithms rather than just memorizing data. Since induction heads can only form in transformers with at least two layers, this is also evidence that larger models can learn qualitatively different algorithms.
The off-center diagonal purple stripe in the triangular attention pattern (figure 7) is the result of induction heads. Each token in the repeated second sentence strongly attends to the token that followed it in the first sentence.
Another way to understand attention is by understanding how a task called indirect object identification (IOI) is implemented using attention heads.
Given a sentence like “When Mary and John went to the store, John gave a drink to” the answer is “Mary” and this task is called indirect object identification (IOI).
In 2022, researchers at Redwood Research reverse engineered a detailed circuit for implementing this task that involves several different types of attention heads.
The circuit implements a three-step process:
This algorithm is carried out by three main groups of specialized attention heads working in sequence:
The MLP (multi-layer perceptron) sub-layer is applied after the attention sub-layer in a transformer block. Each MLP sub-layer is a standard two-layer feed-forward neural network with two weight matrices and one activation function, and can be written mathematically as MLP(x) = W_out · f(W_in · x + b_in) + b_out, where f is the activation function (GELU in GPT-2-style models).
During both training and the initial pre-fill step of inference, the MLP layer processes all token vectors in parallel via matrix multiplication. However, in the subsequent decode step of the inference process, where new tokens are generated one by one, the MLP is applied only to the single newest token to help predict the next one.
About two thirds of the parameters in a transformer are in the MLP layers. The remaining third is found in the attention, embedding and unembedding layers.
One simple intuition for what MLPs do is that they store the knowledge the model needs to predict words and answer questions. For example, knowing that the sentence “The Eiffel Tower is in the city of” should be followed by “Paris” requires knowledge about the relationship between the Eiffel Tower and the city of Paris.
The paper Transformer Feed-Forward Layers Are Key-Value Memories describes transformer MLPs as key-value memories, where the first weight matrix corresponds to the keys and the second weight matrix to the values. Note that these keys and values should not be confused with the keys and values of the attention mechanism, which describe a completely different process.
1. Keys as pattern detectors: The first matrix in an MLP layer can be thought of as a collection of key vectors. Each key is trained to act as a pattern detector, activating when it encounters specific types of text in the input. An input vector from the residual stream is multiplied by all the keys to produce "memory coefficients" which indicate how strongly each pattern has been detected.
These patterns are often human-interpretable and range from simple to complex:
2. Values as next-token predictors: Corresponding to each key is a value vector stored in the second MLP matrix. Each value vector effectively holds a probability distribution over the model's entire vocabulary. The distribution for the value represents the tokens that are most likely to appear immediately following the pattern detected by its corresponding key.
The output of the FFN layer for a given input is the weighted sum of all its value vectors, where the weights are determined by the activations of the keys: FF(x) = f(x · Kᵀ) · V, where the rows of K are the key vectors, the rows of V are the value vectors, and f is the activation function.
Concrete example: given the sentence “Stay with you for a”, there is a key k2 that is triggered by inputs that describe a period of time and end with the letter “a”. Its corresponding value vector v2 represents a probability distribution that puts most of its probability mass on the word “while”. The input sentence strongly activates key k2 which triggers value v2 and increases the probability of the word “while” in the output distribution.
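A toy version of this key-value view (made-up dimensions; the real paper uses the model's own MLP weights and activation function):

```python
import torch

d_model, n_memories, vocab_size = 8, 4, 10

K = torch.randn(n_memories, d_model)       # each row: a key that acts as a pattern detector
V = torch.randn(n_memories, d_model)       # each row: the value written back to the residual stream

x = torch.randn(d_model)                   # input vector from the residual stream

coeffs = torch.relu(x @ K.T)               # "memory coefficients": how strongly each pattern fires
mlp_out = coeffs @ V                       # output = weighted sum of the value vectors

# Projecting a value vector through a (hypothetical) unembedding matrix W_U shows which
# next tokens that memory promotes, e.g. a value that puts most of its mass on "while".
W_U = torch.randn(d_model, vocab_size)
print(torch.softmax(V[2] @ W_U, dim=-1))   # the next-token distribution "stored" in value v2
```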
This paper indicates that a transformer's MLP (or feed-forward) layers, which constitute the majority of its parameters, function as a vast repository of key-value memories.
Insight: A model with more MLP parameters has a larger number of these key-value "memories." This increased capacity allows it to store a more extensive library of linguistic rules and semantic relationships, which can be interpreted as having more "knowledge of the world." Like many deep learning models and probably the human brain, this knowledge apparently involves hierarchical processing with shallower syntactic patterns forming the foundation for more complex semantic understanding in deeper layers of the network.
The LayerNorm (layer normalization) step normalizes the activations of each layer to have zero mean and a variance of one. Layer normalization is applied before each attention and MLP sub-layer and also before the unembedding layer. This is important for stable training using gradient descent though it’s not that relevant for understanding how the model works.
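For completeness, here is a minimal sketch of the normalization step itself (gamma and beta are the learned scale and shift parameters):

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each activation vector to zero mean and unit variance,
    then rescale and shift with the learned parameters gamma and beta."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

x = torch.randn(2, 768)                               # (batch, d_model) activations
out = layer_norm(x, torch.ones(768), torch.zeros(768))
```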
LLMs are trained to predict the next word in a large corpus of internet text. For each batch of text, the following steps occur:
At first the transformer LLM’s weights are random and the model makes random predictions resulting in a high loss. However, as the model is trained and its weights are updated, the loss rapidly decreases as its ability to model language and predict the next word in the text corpus improves.
Transformers are trained to minimize a loss function, typically cross-entropy loss, which measures how different the model's predicted probability distribution for the next token is from the actual next token in the training data. By observing the loss curve over time, we can gain insights into the algorithms the model acquires.
The transformer’s training run has several phases:
3. Learning induction heads and more advanced algorithms: After mastering simple frequencies, the improvements become more gradual. The model must learn more sophisticated, long-range dependencies and abstract rules. This is where complex circuits, like induction circuits, begin to form. The emergence of induction heads can cause a noticeable “bump” or sudden drop in the loss curve, as the model suddenly gains a new, powerful capability for in-context learning. This phase of training and beyond is where the model moves beyond simple statistics to generalizable algorithmic reasoning.
Insight: although the decrease in loss is a quantitative change, qualitatively different algorithms can form as the loss decreases (e.g. induction heads are very different from n-grams).
While the loss curve gives us a high-level view, recent research drawing on Singular Learning Theory (SLT) provides a more principled way to identify these developmental phases. This framework uses a metric called the Local Learning Coefficient (LLC) to quantify the “degeneracy” of the loss landscape.
Degeneracy refers to the presence of redundant parameters: directions in weight space that can be changed without affecting the model’s loss. The LLC can be understood as an inverse measure of this degeneracy, or more intuitively, as a measure of model complexity:
Researchers have shown that the distinct phases of learning described above coincide with significant changes and increases in the LLC:
Insight: Although the decreasing loss curve looks like a smooth continuous decline, research on SLT shows that the training process is actually made up of distinct phases separated by phase transitions where the model fundamentally changes how it processes information in each phase.
Another phenomenon that offers insights about the training dynamics of LLMs is grokking. Grokking occurs when a neural network suddenly and rapidly learns to generalize to unseen data after a long period of memorization and overfitting.
By studying a one-layer transformer trained on modular addition, researchers identified three distinct phases of training that lead to grokking:
Insight: LLMs can memorize data they were trained on and later form generalizing algorithms. This makes sense since a generalizing solution achieves a lower loss than a naive memorization approach.
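As a rough sketch of the kind of setup used in these grokking experiments (my own simplified construction; the original work trains a small transformer on pairs like these, with weight decay, for many epochs):

```python
import random

# Modular addition dataset: every pair (a, b) labelled with (a + b) mod p.
p = 113
pairs = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
random.shuffle(pairs)

split = int(0.3 * len(pairs))              # train on a fraction of all pairs, hold out the rest
train, test = pairs[:split], pairs[split:]

# A small transformer trained on `train` first memorizes it (train loss drops while
# test loss stays high) and only much later "groks", suddenly generalizing to `test`.
```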
Sparse auto-encoders are a novel and powerful technique for understanding the concepts used internally by LLMs.
To see why SAEs are so valuable, we first need to understand the problem of superposition.
LLMs exhibit a phenomenon called ‘superposition’ meaning the model learns many more features than it has neurons. By features we mean patterns in the input or concepts that the model might learn to detect. Examples of features:
For example, the residual stream of an LLM might have 8192 dimensions (neurons), meaning that each vector in the residual stream is composed of 8192 numbers. However, real-world text is complex and there could be tens or hundreds of thousands of useful features needed to understand it.
When there are far more learned features than neurons, neurons must be polysemantic, meaning each neuron must learn to respond to multiple different concepts. Additionally, each feature is represented by a weighted combination of multiple neurons. In other words, there is a complex many-to-many relationship between neurons and features.
How can a single neuron learn and process multiple features? One problem with polysemantic neurons is interference among the different learned features, meaning that the activation of one feature can modify how other features are processed. An analogy is a noisy room full of people, where the background noise makes it difficult to understand what any one person is saying.
Polysemantic neurons mostly avoid interference among multiple features by exploiting the fact that real-world training data is usually sparse: this means that it’s unlikely for multiple unrelated features to be active at the same time. For example, if an Arabic feature is activated because of an Arabic text input, it’s unlikely that a Chinese feature is also activating on the same text.
If the data were not sparse, polysemanticity would cause too much interference and the neural network would instead assign one feature to each neuron, learn only the most important features, and ignore the rest. However, real-world text is usually sparse, which implies that superposition and polysemanticity are common in LLMs.
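The toy sketch below illustrates the core idea: many sparse features can be packed into far fewer dimensions as nearly orthogonal directions, and as long as only a few features are active at once, each one can be read back with only a small amount of interference (a made-up construction, not a trained model):

```python
import torch

n_features, n_dims = 1000, 64
torch.manual_seed(0)

# Assign each feature a random direction in the smaller activation space.
directions = torch.nn.functional.normalize(torch.randn(n_features, n_dims), dim=-1)

# Sparse input: only a handful of features are active at the same time.
active = torch.zeros(n_features)
active[[3, 250, 801]] = 1.0

activation = active @ directions           # the compressed (superposed) representation

# Reading each feature back with a dot product recovers the active ones, with only
# small interference because random high-dimensional directions are nearly orthogonal.
readout = directions @ activation
print(readout[[3, 250, 801]])              # close to 1 for the active features
print(readout[active == 0].abs().mean())   # interference on the inactive features is small
```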
Polysemantic neurons are a problem because they are difficult to interpret: since each neuron could respond to multiple unrelated features, it’s difficult to identify the role of each neuron and reverse engineer the neural network.
The superposition hypothesis postulates that neural networks “want to represent more features than they have neurons” and that neural networks in superposition are noisy simulations of much larger sparser neural networks. A typical activation vector such as the residual stream is not interpretable since it’s holding many features in superposition and its neurons are typically polysemantic.
The goal of sparse autoencoders (SAEs) is to learn a much larger sparse activation vector composed of monosemantic features. The goal is for the SAE’s features to have the following interpretable properties:
The SAE is a neural network composed of two layers: the encoder and the decoder.
Given an input vector x of dimension M, the SAE tries to reconstruct the input vector using a weighted sum of learned feature direction vectors.
The encoder produces a sparse vector of coefficients which describe how active each feature is when processing an input token, typically c = ReLU(W_enc · x + b_enc).
The decoder reconstructs the original input vector as a weighted sum of the coefficients and the learned feature vectors, x̂ = W_dec · c + b_dec.
The encoder weight matrix has dimensions (F x M) and the decoder weight matrix has dimensions (M x F), where M is the dimension of the residual stream and F is the number of learned features in the SAE (typically F is ~10x larger than M). The feature directions are the columns of the decoder weight matrix, and therefore there are F learned feature directions.
The loss function of the autoencoder aims to minimize the reconstruction error between the original vector x and the reconstructed vector while also maximizing the sparsity of the hidden coefficient vector (the output of the encoder); there is a trade-off between these two goals.
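A minimal sketch of an SAE and its loss under these assumptions (one common formulation; details such as bias terms, normalization and the exact sparsity penalty vary between papers):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):               # n_features >> d_model
        super().__init__()
        self.W_enc = nn.Linear(d_model, n_features)         # encoder: activation -> feature coefficients
        self.W_dec = nn.Linear(n_features, d_model)          # decoder: its columns are the feature directions

    def forward(self, x):
        coeffs = torch.relu(self.W_enc(x))                   # sparse, non-negative feature activations
        x_hat = self.W_dec(coeffs)                           # reconstruction: weighted sum of feature directions
        return x_hat, coeffs

def sae_loss(x, x_hat, coeffs, l1_coef=1e-3):
    reconstruction = (x - x_hat).pow(2).mean()               # how well the original activation is reconstructed
    sparsity = coeffs.abs().mean()                           # L1 penalty pushes most coefficients to zero
    return reconstruction + l1_coef * sparsity               # trade-off between the two goals
```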
A key problem is identifying which concept each learned feature corresponds to, and since SAEs often learn thousands of features, this is not easy to do manually. Fortunately, it’s possible to automatically label each feature using an LLM. Here’s how it works:
Other ways to understand a feature include:
What kind of insights have SAEs provided about LLMs? In the Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet paper, researchers from Anthropic discovered the following insights:
SAEs are useful for identifying what concepts are used internally by an LLM but they don’t provide a full picture of how LLMs perform tasks. Fortunately, Anthropic’s recent research on Circuit Tracing offers explanations of how LLMs perform tasks in terms of “circuits”: graphs that show the flow of information between learned features across different tokens and layers.
Circuit tracing uses a technique called transcoders rather than sparse autoencoders; the two are similar but differ in important ways:
The goal of transcoders is to replace uninterpretable MLP blocks with transcoders composed of sparse and interpretable monosemantic features.
For circuit tracing, the researchers used cross-layer transcoders: each transcoder reads from the residual stream at layer L and writes to all subsequent layers. They chose cross-layer transcoders over per-layer transcoders because they achieved better performance on metrics like mean squared error (MSE), though either could be used.
As with SAEs, the goal is to train the transcoder to minimize reconstruction loss while also maximizing feature sparsity, and transcoder features can be labelled based on the texts that highly activate them and the output logits they increase or decrease when active.
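A minimal (per-layer) transcoder sketch under these assumptions: unlike an SAE, which reconstructs its own input, the transcoder reads the MLP block's input and is trained to predict the MLP block's output from sparse features (the cross-layer bookkeeping, where one set of features writes to all later layers, is omitted here):

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()
        self.W_enc = nn.Linear(d_model, n_features)          # MLP input -> sparse feature activations
        self.W_dec = nn.Linear(n_features, d_model)           # sparse features -> predicted MLP output

    def forward(self, mlp_input):
        coeffs = torch.relu(self.W_enc(mlp_input))
        return self.W_dec(coeffs), coeffs

def transcoder_loss(mlp_output, predicted, coeffs, l1_coef=1e-3):
    # Reconstruct the MLP block's output (not its input), while keeping features sparse.
    return (mlp_output - predicted).pow(2).mean() + l1_coef * coeffs.abs().mean()
```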
Attribution graphs explain a model’s output in terms of a sub-graph of interconnected transcoder features that are active and responsible for that output.
Attribution graphs are created from a “local replacement model”: an interpretable model that replaces the original, uninterpretable model and produces identical outputs for a given input prompt p. In the local replacement model, the MLP blocks are substituted with cross-layer transcoders. Additionally, the attention patterns from the original model are frozen, and an error term is added at each layer-position coordinate to correct the difference between each MLP block and its transcoder.
The interpretable local replacement model can then be used to create an attribution graph by pruning unimportant paths through the model so that the model can be explained by only a few transcoder features and connections between them. In addition, similar feature nodes may be merged into supernodes to reduce the number of features.
At a high level, the attribution graphs are composed of nodes and edges. There are four types of nodes in the graph:
An edge in the attribution graph quantifies how much one feature influences another in a later layer or position. A strong edge indicates that the activation of one feature is a significant causal factor for another. The calculation for an edge's strength depends on whether the connection is between two transcoder features or is mediated by an attention head. The flow of information in the attribution graph is up and to the right, as information flows from earlier to later layers and from earlier to later token positions. Note that the flow of information from earlier to later positions requires attention.
A key challenge when creating attribution graphs is ensuring they faithfully reflect the inner workings of the original model. To validate these graphs, researchers applied specific perturbations, such as suppressing the activation of certain features, and then checked if these interventions produced the expected effect on the model’s output.
For example, in a prompt like “the capital of the state containing Dallas is,” the unmodified local replacement model correctly outputs “Austin.” The attribution graph for this behavior shows that the “capital,” “state” and “Dallas” features are activated. Suppressing the “Dallas” feature causes the model to output random US state capitals like “Albany” or “Sacramento” which confirms that the “Dallas” feature specifically causes the model to output the capital of Texas, rather than any state capital.
Although LLMs are only trained to predict internet text, they are nonetheless able to add numbers. But how?
One simple approach to find out is to ask an LLM to explain its reasoning. For example, I asked Claude 3.5 Haiku “Answer in one word. What is 36+59?” and “Briefly, how did you get that?” and its response was “I added the digits: 6+9=15 (write 5, carry 1), then 3+5+1=9, giving 95.”
It sounds plausible, but this is really just a high-probability response to the prompt rather than a faithful description of how the LLM actually adds numbers. Fortunately, circuit analysis can offer some insights into how LLMs add numbers.
The two attribution graphs in this section show the features active at each position and layer for the prompt “calc: 36 + 59 = “. The first attribution graph is the original attribution graph and the second one is a simplified one showing the main supernodes in the graph.
We can understand the attribution graph by starting with the answer node “95” and working backwards. First it's important to understand the difference between input and output features:
Three separate pathways contribute to the final answer:
The overall graph is fairly complex but the basic idea is this:
So how does addition in LLMs work? One paper on the subject offers a succinct high-level explanation:
“Do LLMs rely on a robust algorithm or on memorization to solve arithmetic tasks? Our analysis suggests that the mechanism behind the arithmetic abilities of LLMs is somewhere in the middle: LLMs implement a bag of heuristics—a combination of many memorized rules—to perform arithmetic reasoning.”
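To make the “bag of heuristics” idea concrete, here is a toy caricature (my own construction, not the actual circuit): one heuristic estimates the rough magnitude of the answer, another tracks the ones digit, and intersecting them pins down 95. The real model combines many more such memorized rules.

```python
def magnitude_heuristic(a, b):
    estimate = round(a, -1) + round(b, -1)        # 36 -> 40, 59 -> 60: "roughly 100"
    return range(estimate - 9, estimate + 1)      # a crude window around the estimate

def ones_digit_heuristic(a, b):
    return (a % 10 + b % 10) % 10                 # "...6 plus ...9 gives something ending in 5"

a, b = 36, 59
candidates = [n for n in magnitude_heuristic(a, b) if n % 10 == ones_digit_heuristic(a, b)]
print(candidates)                                 # [95]
```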
In conclusion, I would like to try to answer the high-level question “How do LLMs work?”.
The first sections of the post focused on describing the high-level components of a transformer such as the embedding and unembedding layers, transformer blocks, and the attention and MLP sub-layers.
At a high level, we know that LLMs use the previous input tokens to predict the next token, which involves outputting a probability distribution over all possible tokens in the vocabulary. Although knowing the basic behavior and components of a transformer is useful, this knowledge alone does not explain how LLMs work, since each component is by default mostly a black box.
The middle and later sections of this post explore the components of a transformer in more depth with more detailed explanations of attention, MLP blocks, LLM training dynamics, superposition, and finding interpretable features using SAEs. We’ve also covered specific case studies of empirically observed phenomena such as induction heads, indirect object identification, SAE features, and using transcoder circuit tracing to understand addition.
Before offering an explanation of how LLMs work, it’s useful to first consider some common but imperfect analogies that could be used to explain LLMs:
So, what is a more accurate high-level explanation? From what I’ve read, circuits, specialized sub-networks of attention heads and MLP neurons that work together, are the best explanation we have for how LLMs perform high-level tasks.
During its training, the model learns to do several distinct things that are useful for predicting the next word and performing tasks:
These different mechanisms are combined into complex circuits to execute sophisticated, high-level behaviors such as indirect object identification, addition, factual recall and others.
Special thanks to the ARENA course, whose content was useful for writing this post.
Note that the explanation of the transformer architecture in this post is based on a design similar to GPT-2 (2019) which covers the core components of the transformer architecture without the added complexity of the many performance and efficiency optimizations that have been introduced since.
Much of the research described in this post was carried out on older models like GPT-2 and an important question is whether this research is and will continue to be relevant to understanding modern LLMs. I believe the answer is yes, because while many optimizations have been added, the foundational principles of the transformer architecture have remained consistent.
Here is a summary of the post From GPT-2 to gpt-oss: Analyzing the Architectural Advances, which describes the key differences between the older GPT-2 model and gpt-oss, a modern one:
Despite the differences, in 2025 the transformer architecture and its core components such as the embedding, self-attention and feed-forward layers are still used and therefore I think any interpretability research on transformers (introduced in 2017) is still relevant.
If the transformer is replaced by a new architecture, then I think some of this post’s content, such as the focus on attention heads, would no longer be relevant. That said, modern LLMs are still deep learning models, and the insights about fully connected neural networks (MLP blocks), features, superposition, circuits, and training dynamics seem more timeless. I believe they will still be useful in future AI architectures beyond transformers, just as they were relevant in pre-transformer architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
Post sources:
Interpretability tools:
Related posts and further reading: