This is the write-up for the capstone project @cozyfractal and I did during ARENA's 2023 summer iteration. Our project explored a novel approach for interpreting language models, focusing on understanding their internal flow of information. While the practical implementation was completed in just one week and lacks formal rigor, we believe it offers some interesting insights and holds promise as a foundation for future research in this area. The accompanying repository with code examples and more experiments can be found here.

We want to thank Alexandre Variengien whose original idea served as the inspiration for this work, and who provided extensive feedback as well as thought-provoking discussions. Additionally, we want to express our thanks to the organizers of ARENA and our fellow participants for fostering an environment that encouraged productive collaboration and learning.


Imagine a transformer tackling a lengthy prompt, like a code snippet. Within this prompt, there are subsets spanning multiple tokens that form semantic units, like function and variable definitions. But how does the transformer handle this distributed information while calculating subsequent token probabilities? Do specific attention heads copy information from all tokens in a function and aggregate it at the end? Or does aggregation occur earlier, as has been suggested by recent work on factual recall in large language models (LLMs)? Perhaps this involves tokens near the end of a semantic unit, or seemingly unrelated tokens like punctuation signs acting as "intermediate variables". Another interesting question is to what degree the distribution of information over tokens is redundant, allowing the network to make correct predictions even after removing individual tokens, like a bracket from a function definition.

These dynamics - involving intermediate variables, aggregation, and distributed information - are examples of what we term the "information flow" through an LLM. Our project's objective was to develop a tool that sheds light on this intricate process. By comprehending the inner workings of LLMs as they process information, especially during impressive feats like chain-of-thought reasoning, we hope to gain insights into long-range context processing. Understanding how information is moved over large token-distances is crucial for AI capabilities like long-term planning and deception, which are significant from a safety perspective. Knowing the information flow requirements for such capabilities could aid in their detection, and thus help with AI alignment.

1. Project Overview

In a nutshell, our project can be divided into three main parts. First, we devised interventions to apply while a model processes input, allowing us to determine the "connectedness" between tokens or groups of tokens. By connectedness, we mean how important the information transmitted between two tokens (or two groups of tokens) through attention heads is for a specific task, like indirect-object identification. With these interventions, we created "information-flow graphs" that use arrows to visualize the connectedness between tokens and its significance. Here's an example of such a graph for the input "When Mary and John went to the store, John gave a book to" (a more detailed explanation of the diagrams is given below).

Example of an information-flow diagram. Green arrows mean the connection makes the model better at the IOI task and red arrows mean the connection makes the model worse.

Next, we implemented optimizations to apply these interventions at scale. Our ultimate goal was to build a tool that enables us to grasp how large-scale models with billions of parameters process extensive input sequences. Clearly, some interventions, such as path patching, would be impractical to perform for every pair of components in such a context. Finally, to investigate the reliability of our tool, we cross-referenced our findings with existing knowledge about problems like indirect-object identification and docstring completion. 

We ended up doing a lot of exploratory analysis with various models, prompts, interventions, and optimization methods. To keep the scope of this post manageable, we will only present a subset of them. You can find more experiments, such as an evaluation on the docstring-completion task, in the notebooks in our repository.  

1.1 Measuring "interaction" between tokens via interventions.

We devised a variety of interventions to assess the connectedness of token pairs. To gauge the significance of information flow between token positions, we examined how relevant logit differences changed when applying these interventions. For instance, in the context of the Indirect-Object Identification (IOI) task, the connection between the S2 and END tokens plays a crucial role. Hence, if we intervene on the connection between those tokens, we would expect the logit difference between the subject and object names to decrease.

"Zero Pattern" intervention - knocking out attention between tokens

One of the simplest and most direct interventions we employ is to completely disable attention between certain token positions. In this approach, for each attention head, we set $A_{i,j}$ (representing the attention paid from token position $i$ to position $j$) to $0$ for the positions $i$ and $j$ between which we want to knock out attention. It's worth noting that we apply this intervention after calculating the attention patterns, which means it occurs after the softmax operation is applied to the attention scores.
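As a toy illustration in pure Python (not the project's actual implementation), here is the zero-pattern intervention applied to a single attention row, together with the pre-softmax alternative that the footnote argues is less principled, since masking before the softmax renormalizes and shifts attention onto the remaining positions:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of attention scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def zero_pattern(pattern_row, j):
    """Zero-pattern intervention: zero the attention paid to position j
    *after* the softmax; the other probabilities are left untouched."""
    out = list(pattern_row)
    out[j] = 0.0
    return out

def mask_before_softmax(score_row, j):
    """Alternative discussed in the footnote: set the *score* to -inf before
    the softmax, which renormalizes and boosts the remaining positions."""
    masked = [(-math.inf if k == j else s) for k, s in enumerate(score_row)]
    return softmax(masked)

scores = [1.0, 2.0, 0.5]            # toy scores for one head and query position
pattern = softmax(scores)           # post-softmax attention probabilities
knocked = zero_pattern(pattern, 1)  # knock out attention to position 1
renorm = mask_before_softmax(scores, 1)
```

In `knocked`, the probabilities at the untouched positions stay exactly as they were; in `renorm`, the knocked-out probability mass is redistributed to them.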

"Corruption" Intervention - patching in attention from a corrupted prompt

By knocking out attention, we essentially render token position $i$ less aware of position $j$. However, we also explored a different approach - what if we made $i$ see information from a different prompt at position $j$? This concept gave rise to the "corruption intervention", where we take the information flow from $j$ to $i$ for some corrupted prompt and patch it into a forward pass with our clean prompt. Essentially, we performed a form of path patching, as we specifically patch in the information flowing between these two components. A formal definition of the patching strategy used for the corruption intervention can be found in appendix A.

Further Interventions

In addition to the zero pattern and corruption interventions, we also implemented two other types of interventions, which we evaluated using a similar approach. However, to keep the scope of this post manageable, we won't delve into the details of these interventions or their specific evaluation results here. A brief description of the interventions can be found in appendix B. 

1.2 Optimizing the computation of interactions.

To efficiently evaluate the result of our interventions for large models and long prompts, we devised several computational strategies. The key idea behind these strategies is to employ heuristics that allow us to skip interventions for components where we have strong grounds to believe that their connection will be weak.

Most of these optimization strategies are not relevant to the results we will show, and so we include their descriptions only in appendix C. However, one optimization strategy worth briefly mentioning is the "splitting" strategy, as we use it in the final section of our analysis, where we investigate the information flow inside a model processing a long prompt. The splitting strategy partitions the input based on higher-level structures, such as sentences or paragraphs, and calculates the connection strength between these partitions rather than between individual tokens.

2. Evaluation and Results

Our metrics are intended to capture the importance of connections for a particular task. However, some of our interventions, such as knocking out attention between many residual stream positions, are quite blunt. Therefore, when we see a substantial impact due to an intervention, what has really happened lies somewhere between two extremes. On one hand, the intervention might precisely remove the critical information flow necessary for a specific task, akin to a surgical operation. On the other hand, it might result in a "hemorrhage" within the model, completely eroding its ability to reason effectively. To gain a clearer understanding of what it means if an intervention indicates a connection, we began by evaluating them on tasks where some ground truth about how an LLM tackles those tasks is already known.

2.1 Evaluation with the IOI task on GPT2-small

In our evaluations, we primarily focus on using the prompt "When Mary and John went to the store, John gave a book to" and measure the logit difference (LD) between " John" and " Mary". This approach allows us to identify both important connections, where the LD decreases upon their removal (shown in green), and harmful connections that lead to an LD increase when removed (shown in red). We show the ten connections with the highest absolute LD values.
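As a pure-Python sketch of how such a graph is assembled (the edge names and effect sizes below are made up for illustration), we rank intervened connections by the absolute change they cause in the logit difference and color them by sign:

```python
def logit_diff(logits, io_token, s_token):
    """LD metric: logit of the indirect object minus logit of the subject."""
    return logits[io_token] - logits[s_token]

def top_edges(edge_effects, n=10):
    """Given {(src, dst): change in LD when the connection is intervened on},
    return the n edges with the largest |effect|. A negative change (LD drops
    when the connection is removed) means the connection helps: green."""
    ranked = sorted(edge_effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [(src, dst, "green" if delta < 0 else "red", delta)
            for (src, dst), delta in ranked[:n]]

# Illustrative, made-up effect sizes for the IOI prompt.
effects = {("IO", "END"): -3.1, ("S2", "END"): -2.4, ("S1", "END"): +1.2,
           ("S1", "S1+1"): -0.9, ("S1+1", "S2"): -0.8, ("When", "and"): -0.3}
edges = top_edges(effects, n=3)
```

With these numbers, the three strongest edges are IO → END and S2 → END in green and S1 → END in red.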

Based on our knowledge of the Indirect-Object Identification (IOI) circuit, we anticipate the following connections:

  • A positive connection from the IO token (" Mary") to the END token (" to"). We expect this connection because for GPT2 to predict the IO as the next token, the name-mover heads must copy its value to the END position. Without information flow from IO to END, such prediction would not be possible, and hence it should show up for any reasonable intervention.
  •  A positive connection from S1 (" John") to S1+1 (" went"). This connection is significant as a previous-token head copies the first occurrence of the subject name to the next position (S1+1). Subsequently, an induction-head can send the value to an S-inhibition head, which reads from the S2 position (second occurrence of " John"). Thus, the connection between S1 and S1+1 is important for determining which name not to predict.
  • A positive connection from S1 to S2. Here, duplicate token heads copy the subject name directly from S1 to S2, where they are read by the S-inhibition head. This provides another way for the circuit to know which name not to predict.
  • A positive connection from S1+1 to S2. As explained earlier, once the subject name has been copied to S1+1, an induction head moves it from there to S2, where the S-inhibition head reads it.
  • A positive connection from S2 to END. This arrow represents the work of the S-inhibition head, which transfers information from S2 to END, enabling the model to know which name not to predict there.

We can sanity-check our proposed interventions by verifying that these expected connections actually show up in the resulting information-flow graphs.

Result of the Zero Pattern Intervention

Here are some of the key findings from visualizing the 10 most important connections based on absolute LD value, obtained through the zero pattern intervention applied to every pair of tokens:

  • The majority of the expected arrows are present in the graph, indicating that our method successfully captures important connections related to the IOI task. For the only expected arrow that is not shown, from S1 to S2, a corresponding positive connection was also found, though its absolute LD value was too small to place it among the ten strongest.
  • Interestingly, there is a red arrow from S1 to END, implying that a name-mover head could be copying the subject name from S1 to END, leading to an increase in the " John" logits at the END position.
  • The unexpected arrow from " gave" to END could suggest that the model might be performing a similar action at " gave" as it does at END. For sentences of the form "IO ... S1 ... S2 gave," the IO token may be a reasonable completion, similar to the completion for "IO ... S1 ... S2 gave ... to."
  • There is a mysterious connection from the "When" and IO positions to " and". This may be due to the fact that " and" is only the third token in the sequence, and completely knocking out attention to one of the earlier tokens could cause a "hemorrhage" in the model, leading to unexpected connections appearing to be important.
  • Other strong connections that we did not expect and that may be due to hemorrhage are " and" -> S1, " and" -> S2, and " gave" -> END.

This result shows that the zero pattern intervention is capable of finding most of the important connections for the IOI task. Compared to other interpretability methods, like automated circuit discovery, the information-flow graph we generated has the advantage of showing both positive and negative connections, such as the negative one between S1 and END. However, we also see some arrows in the graph which are likely due to hemorrhage and may confuse a practitioner. It is thus important to keep the possibility of hemorrhage in mind and, if possible, to cross-evaluate results with other interventions.

Corruption Intervention

When evaluating the corruption intervention, the results depend on how we choose to corrupt the prompt. In this case, we intervened with a corrupted prompt where the names were changed to three random distinct names: "When Felix and Sarah went to the store, Diego gave a book to". Notably, the total number of tokens in the corrupted prompt remains the same as in the clean prompt.

The graph obtained from this corruption intervention aligns with our expectations, and it presents several interesting findings:

  • We don't see any of the peculiar arrows observed in the zero pattern intervention, such as from "When" to " and" or " and" to " John", so we expect them to correspond to a hemorrhage, an indication that the model breaks when we do the intervention.
  • The name nodes in the graph have green arrows going into themselves. This is expected as we are changing what a token sees at its own position.
  • There is now a peculiar arrow going into " gave"  instead of out of it. See the next section for a possible explanation.

Overall, these results indicate that the corruption intervention is closer to the expected ground truth than the zero pattern intervention. However, it is important to note that achieving these more accurate results with the corruption intervention comes with the challenge of generating a suitable corrupted prompt. Selecting an appropriate prompt that accurately captures the desired changes while maintaining the same number of tokens as the clean prompt can be a non-trivial task, especially for larger prompts with complex structures.

Conclusion and Consistent Findings Across All Interventions

The evaluation results from both interventions demonstrate that our methods are capable of identifying connections that align with our understanding of the Indirect-Object Identification (IOI) circuit. While the expected connections are not always among the most crucial ones, our methods still manage to capture valuable ground truth about the underlying task's circuit.

However, it is crucial to acknowledge that both interventions can yield some misleading arrows, potentially caused by hemorrhage. Notable examples include arrows from "When" to "and" and from name tokens into themselves.

Interestingly, there are certain consistent connections that emerged across all interventions, which were not predicted solely from our knowledge of the IOI circuit:

  • The negative connection from S1 to END is likely the result of a name-mover head. It is worth noting that our methods can reveal such negative connections, whereas automated circuit detection, which focuses on components that positively contribute to a task, might not.
  • The token "gave" consistently appears as an important node, but the direction of the arrow going into or out of it varies. Given its position immediately after S2, it is plausible that some information is copied into "gave" from S2 and then moved to the END position, providing an alternative route for communicating the subject name.

These consistent connections offer valuable insights into the model's internal workings and suggest the presence of alternative paths and information flow patterns that contribute to the IOI task. Overall, while our interventions provide valuable information about the IOI circuit and the model's reasoning process, they also emphasize the need for careful interpretation and consideration of potential pitfalls.

2.2 Code Completion with Pythia 2.8B 

In this section, we explore the application of our methods to a different task on a larger language model. The main purpose of our tool is to analyze information flow in LLMs as they process complex input, particularly with longer prompts. We suspect that longer prompts could reveal interesting patterns, such as the use of specific tokens (e.g., punctuation marks) as intermediate variables to move information across significant token-distances. Our method aims to capture and exhibit such structures.

To test this hypothesis, we designed a code completion task that challenges the model to combine information scattered across a substantial chunk of Python code. The task requires the model to predict the correct variable to pass to a function call based on the type signature given in the function definition and the class shown during variable initialization. In this task, the variable and class names have no meaningful semantics, preventing the model from relying solely on variable names to make predictions. For example, when completing the code snippet provided below, the correct completion is " bar" since it is of type B, which is required by the function definition of calculate_circumference. Simply inferring from variable names like circle would not be sufficient for the model to predict the correct token.

We selected Pythia-2.8B for this task as even Pythia-1.4B struggled to reliably predict the correct token. For the model to perform well, it must effectively move information from the function definition and correct variable to the end position where the completion is expected. This could entail a considerable amount of information being moved and aggregated at the end, leading to numerous arrows from various nodes into the end node. Alternatively, the information might be aggregated in distinct chunks using intermediate variables. We expect that by applying our methods to this task, we will gain insights into how the model handles and processes complex information spread across long sequences of code.

from typing import List
from math import pi

class Point:
	def __init__(self, x: float, y: float) -> None:
		self.x = x
		self.y = y

class A:
	def __init__(self, bottom_left: Point, top_right: Point) -> None:
		self.bottom_left = bottom_left
		self.top_right = top_right

class B:
	def __init__(self, center: Point, radius: float) -> None:
		self.center = center
		self.radius = radius

class C:
	def __init__(self, points: List[Point]) -> None:
		self.points = points

def calculate_area(rectangle: A) -> float:
	height = rectangle.top_right.y - rectangle.bottom_left.y
	width = rectangle.top_right.x - rectangle.bottom_left.x
	return height * width

def calculate_center(rectangle: A) -> Point:
	center_x = (rectangle.bottom_left.x + rectangle.top_right.x) / 2
	center_y = (rectangle.bottom_left.y + rectangle.top_right.y) / 2
	return Point(center_x, center_y)

def calculate_distance(point1: Point, point2: Point) -> float:
	return ((point2.x - point1.x) ** 2 + (point2.y - point1.y) ** 2) ** 0.5

def calculate_circumference(circle: B) -> float:
	return 2 * pi * circle.radius

def calculate_circle_area(circle: B) -> float:
	return pi * (circle.radius ** 2)

def calculate_perimeter(polygon: C) -> float:
	perimeter = 0
	points = polygon.points + [polygon.points[0]] # Add the first point at the end for a closed shape
	for i in range(len(points) - 1):
		perimeter += calculate_distance(points[i], points[i + 1])
	return perimeter

foo = A(Point(2, 3), Point(6, 5))

bar = B(Point(0, 0), 5)

name = C([Point(0, 0), Point(1, 0), Point(0, 1)])

# Calculate circumference

Complete Information-Flow Graph

In practice, we advise against calculating the whole information-flow graph for such a large model as this is extremely computationally expensive. Instead, we recommend using one of the optimization strategies we implemented. However, for pedagogical purposes and to establish a baseline for comparison with our optimization strategies, we decided to calculate the complete information-flow graph for the aforementioned prompt once, using the zero-pattern intervention and no optimization strategy. A cropped version of the result is depicted below (the full diagram contains many more individual nodes and nodes with arrows from the first token).

This demonstrates how our information-flow diagrams become much harder to interpret as the number of tokens increases, which is why we usually only show the 10 most important connections. In the complete graph, we observe several inexplicable arrows originating from the first newline token and extending towards later tokens. While we cannot fully explain these findings, we suspect that they may be a result of hemorrhage, causing unintended and confusing information flow patterns. This would partially fit the hypothesis we formed in section 2.1, which anticipated misleading arrows when knocking out attention to very early tokens. However, some arrows extend far beyond what we would anticipate, such as reaching the newline at position 564.

Despite the confusing and inexplicable results, a significant portion of the graph does align with our expectations. Notably, we can observe two major clusters around the definitions of precisely the function and variable that are relevant for the code completion task. These clusters reflect the model's information aggregation around critical components, demonstrating its ability to focus on relevant regions of the code when making predictions.

A cluster of connected token positions, corresponding to the definition of the variable "bar".
A different cluster, corresponding to the definition of the calculate_circumference function.

However, it is worth pointing out that the connections between many of the nodes within these clusters are relatively weak. If we had followed our usual approach of looking at only the 10 most important connections, we might have missed these subtle connections. One possible explanation is that none of the individual tokens are particularly crucial on their own. Skilled Python programmers could likely solve the code completion task even with a few gaps in the function definition, so it is not surprising that knocking out attention to most individual tokens does not significantly harm Pythia's ability to perform the task. The only exception, of course, is the name of the variable, which exhibits by far the most critical connection towards the end position.

Newline-Split Information-Flow Graph

After observing the clustering of information flow in our previous analysis, where information about structures such as function or variable definitions seemed to be aggregated locally, we decided to apply the splitting strategy to investigate information flow between chunks of text. We obtain the following diagram by splitting at newlines:

In this information flow diagram, a node represents a range of tokens, with the numbers on the left indicating the positions of the start and end tokens. So the node beginning with 522:535 represents the range of tokens from positions 522 to 535.

In this graph, we can observe some interesting features:

  • By focusing on the 10 strongest connections, there is now a clearly visible information flow from the relevant definitions to the function call. This result is in line with our expectations, as it highlights how the model effectively combines information from function and variable definitions to make accurate predictions for the code completion task.
  • The fact that the connections go directly from the definitions to the function call suggests that information is aggregated close to the relevant semantic units, and not necessarily in arbitrary tokens like punctuation signs. 
  • Furthermore, the information appears to be highly distributed among tokens. When examining individual connections within important clusters, we noticed that many had a relatively low importance, typically just over 0.1. As a result, in the complete information-flow graph, connections that are part of an important cluster often did not stand out compared to those that are not.

Alas, not everything about this diagram is easily interpretable. It is not clear why there is an important connection from the definition of the calculate_area function (169:183) to the calculate_circumference function, nor why knocking out self-attention between some of the other function definitions improves performance.

3. Conclusion

Our exploration of information flow in LLMs has provided some interesting insights, yet there is clearly still much more to learn. Moreover, more experiments are needed to verify whether our findings generalize. Our results for Pythia on the code-completion task suggest two hypotheses about features of information flow, whose refinement and verification could be the subject of further research. Firstly, for larger prompts, information flow may be highly distributed across token positions, implying that individual tokens contribute minimally to the model's ability to solve a task. Instead, the model relies on the aggregation of information from various tokens to make accurate predictions. The second hypothesis concerns how information is aggregated. For a semantic unit like a function definition, it does not appear that information from all the tokens comprising the unit is copied to the END position for predicting the next token. Rather, information seems to flow into a token near the end of the semantic unit, and then be copied to the END position from there.

However, we acknowledge that our findings are based on a limited set of experiments, and further rigorous research is needed to solidify these insights. Our evaluation method involved comparing our intervention results with what is known about the IOI task and ensuring consistency among different interventions. The relatively blunt interventions, such as attention-knockout, have proven valuable in shedding light on the model's internal structure. Still, we must remain cautious about potential "hemorrhage" effects that may lead to misleading results.

Moving forward, exploring various types of interventions and their effects on information flow may hold the key to gaining a deeper understanding of how language models process information. Such research could offer valuable insights into model behavior, leading to improvements in AI safety, interpretability, and alignment.


A. Path-Patching for the attention from i to j  

We do the following intervention on every head. Let $K$, $Q$, and $V$ be the key, query, and value matrices of the clean prompt respectively. $K'$ and $V'$ are the key and value matrices of the corrupted prompt. We calculate a new key matrix whose $j$-th column is that of $K'$ and which is identical to $K$ otherwise: $K^*_k = K'_k$ if $k = j$ else $K_k$. We use this new key and our original query to calculate a new attention pattern. In the language of the mathematical framework for transformer circuits: $A^* = s(Q^\top K^*)$, where the queries and keys are computed from the residual stream vectors at each position and $s$ is the autoregressively-masked softmax function. $V^*$ is created from $V$ and $V'$ similarly to $K^*$. Usually, the output of an attention head into the residual stream at a position $i$ is $h_i = W_O v_i$, where $v_i$ is a linear combination of the value vectors at all residual stream positions, weighted by the attention paid to them: $v_i = \sum_k A_{i,k} V_k$. For our intervention, we set $v_i = \sum_k A^*_{i,k} V^*_k$ and leave the remaining $v_k$ unchanged.
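A minimal numeric sketch of this patching scheme (pure Python, with scalar queries, keys, and values, and omitting the output projection) might look like:

```python
import math

def masked_softmax_rows(scores):
    """Row-wise softmax over a [query, key] score matrix with an
    autoregressive mask (a query position can only attend backwards)."""
    out = []
    for q, row in enumerate(scores):
        masked = [s if k <= q else -math.inf for k, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

def patch_position(clean, corrupt, j):
    """K*/V*: take position j from the corrupted run, the rest from clean."""
    return [corrupt[k] if k == j else clean[k] for k in range(len(clean))]

def attn_values(pattern, values):
    """v_i = sum_k A[i][k] * V[k] (scalar 'value vectors' for simplicity)."""
    return [sum(a * v for a, v in zip(row, values)) for row in pattern]

def corruption_intervention(q, k, v, k_corr, v_corr, i, j):
    """Patch only the information flowing from position j into position i:
    recompute attention with K*, V*, then replace v_i alone."""
    clean_out = attn_values(
        masked_softmax_rows([[qq * kk for kk in k] for qq in q]), v)
    k_star = patch_position(k, k_corr, j)
    v_star = patch_position(v, v_corr, j)
    patched = attn_values(
        masked_softmax_rows([[qq * kk for kk in k_star] for qq in q]), v_star)
    out = list(clean_out)
    out[i] = patched[i]  # only position i sees the corrupted information
    return out, clean_out

q = [1.0, 1.0, 1.0]
k = [0.5, 1.0, 0.2]; v = [1.0, 2.0, 3.0]            # clean run
k_corr = [0.5, -2.0, 0.2]; v_corr = [1.0, 9.0, 3.0]  # corrupted run
out, clean = corruption_intervention(q, k, v, k_corr, v_corr, i=2, j=1)
```

Only the output at position $i$ changes; every other position still receives its clean values.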

B. Further Interventions

"Dampening" Intervention - multiplying attention with a scalar

Instead of knocking out attention completely, we can also dampen it by multiplying $A_{i,j}$ with a scalar $c$. This approach offers a valuable advantage: we can observe how our metric responds as we vary the attention from $i$ to $j$. Setting $c < 1$ allows us to assess the effects of decreasing attention, while setting $c > 1$ enables us to study situations where $i$ pays more attention to $j$ than usual.
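On a single attention row, the dampening intervention can be sketched as follows (pure Python; note that $c = 0$ recovers the zero pattern intervention):

```python
def dampen(pattern_row, j, c):
    """Scale the post-softmax attention paid to position j by c, leaving the
    other entries (and hence the rest of the forward pass) untouched."""
    return [c * a if k == j else a for k, a in enumerate(pattern_row)]

row = [0.2, 0.5, 0.3]             # toy post-softmax attention row
decreased = dampen(row, 1, 0.5)   # effect of paying less attention to j=1
increased = dampen(row, 1, 2.0)   # effect of paying more attention than usual
```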

"Crop" Intervention - corrupting a prompt by dropping a prefix

If the zero pattern intervention means making a residual stream position $i$ blind to the value at position $j$, the crop intervention takes this further by making $i$ blind to the whole residual stream up to and including position $j$. Another way to think of this is that when intervening on the information flowing from $j$ to $i$, we employ a corruption intervention where the corrupt prompt is the same as the clean prompt but with all tokens up to and including $j$ dropped. This intervention allows us to study how the absence of information from the preceding context influences the model's understanding and information flow.
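The corrupted prompt used by the crop intervention can be sketched as (the token strings are illustrative):

```python
def crop_prompt(tokens, j):
    """Corrupted prompt for the crop intervention: the clean prompt with
    every token up to and including position j removed, so that patching the
    flow into a later position hides the whole prefix from it."""
    return tokens[j + 1:]

clean = ["When", " Mary", " and", " John", " went"]
cropped = crop_prompt(clean, 1)  # drop everything up to and including " Mary"
```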

C. Optimization Strategies

Backtracking Strategy

Drawing from the insight that our metric depends on the predicted logits for the next token, we devised a simple yet efficient strategy to significantly reduce the number of required interventions. Here's how it works: we first compute the information flow from every preceding position towards the last position, as the output at this position provides the crucial logits we seek. Next, we select only those token positions whose connection to the final position surpasses a certain threshold. Then, we compute the connections into those positions from all preceding positions and again select only those whose strength surpasses the threshold. By repeating this, we iteratively compute paths of vital information flow leading to the final token.

For instance, let's consider the Indirect-Object Identification (IOI) task with the prompt "When Mary and John went to the store, John gave a book to". We start by computing the connections towards the final token " to". For a metric that accurately captures the task, we would expect strong connections from " Mary" (via name-mover heads) and the second " John" (via S-inhibition heads). If this is the case, we can safely skip computing the strength of any connections into " gave", " a," or " book." Instead, we proceed to compute information flowing into the second " John," where we might find that only connections from the first " John" (via duplicate-token heads) and " went" (via induction heads) are significant. Consequently, we can bypass examining any tokens related to "to the store". This iterative process continues until we either reach the first token or exhaust all remaining tokens for investigation.
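The backtracking loop can be sketched in pure Python; here `connection_strength` stands in for running one intervention and measuring the absolute effect on the metric, and the toy strengths below are made-up numbers loosely mirroring the IOI circuit:

```python
def backtrack(n_tokens, connection_strength, threshold):
    """Trace important information flow backwards from the final token:
    a position is expanded only once an incoming connection into an
    already-important position exceeds the threshold."""
    frontier = [n_tokens - 1]       # start at the final position
    seen = {n_tokens - 1}
    edges = []
    while frontier:
        dst = frontier.pop()
        for src in range(dst):      # connections into dst from earlier tokens
            strength = connection_strength(src, dst)
            if strength > threshold:
                edges.append((src, dst, strength))
                if src not in seen:  # schedule src for expansion in turn
                    seen.add(src)
                    frontier.append(src)
    return edges

# Made-up strengths on a 14-token prompt:
# IO -> END, S2 -> END, S1 -> S1+1, S1+1 -> S2, S1 -> S2.
toy = {(1, 13): 2.0, (9, 13): 1.5, (3, 4): 0.9, (4, 9): 0.8, (3, 9): 1.0}
edges = backtrack(14, lambda s, d: toy.get((s, d), 0.05), threshold=0.5)
```

With these numbers, all five strong edges are found while the weak connections into intermediate tokens like " a" or " book" are never expanded.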


Bisecting Strategy

For larger models, the backtracking strategy can still exhibit slow performance, as it involves conducting numerous interventions before narrowing down the search space: for each token we consider, we must calculate all connections from previous tokens. To address this, we introduced the bisecting strategy, which aims to narrow down potentially significant connections before conducting interventions for individual token pairs. This strategy performs interventions on partitions of the input rather than on individual tokens.

Initially, the input is split into two equal-sized partitions, and we calculate the importance of their connection. Subsequently, we further split these partitions into "child" partitions. We only calculate connections between partitions with different "parents" if their parent partitions exhibit a connection strength exceeding a specified threshold. This way, the bisecting strategy efficiently identifies which connections are more likely to be important, reducing the overall number of interventions needed.
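One way to sketch the bisecting recursion (pure Python; `strength` again stands in for an intervention, here on a pair of half-open token ranges, and the toy strength function is made up):

```python
def split(span):
    """Split a half-open (start, end) range into two halves."""
    start, end = span
    if end - start <= 1:
        return [span]
    mid = (start + end) // 2
    return [(start, mid), (mid, end)]

def bisect_connections(a, b, strength, threshold):
    """Recursively refine the connection between ranges a and b; child pairs
    are only examined if their parents' connection exceeds the threshold."""
    s = strength(a, b)
    if s <= threshold:
        return []
    if a[1] - a[0] <= 1 and b[1] - b[0] <= 1:    # down to single tokens
        return [(a, b, s)]
    found = []
    for child_a in split(a):
        for child_b in split(b):
            if child_a != a or child_b != b:     # guard against no-op splits
                found += bisect_connections(child_a, child_b, strength, threshold)
    return found

# Toy: only information flowing from token 1 into token 6 matters.
def toy_strength(a, b):
    return 1.0 if a[0] <= 1 < a[1] and b[0] <= 6 < b[1] else 0.05

edges = bisect_connections((0, 4), (4, 8), toy_strength, threshold=0.5)
```

Instead of testing all sixteen token pairs between the two halves, the recursion only descends into partitions whose connection is strong.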

Additionally, we implemented a combination of backtracking and bisecting. This hybrid approach calculates information flow backwards starting at the final token position and applies the bisecting strategy when determining important connections into a token.


Splitting Strategy

As we encountered longer prompts, we found that simply bisecting the input into equal partitions might not be the most appropriate approach. Instead, we sought to establish a more meaningful partitioning based on the hierarchical structure of the text's semantic units, such as paragraphs, function definitions, sentences, clauses, or statements.

To address this need, we devised the splitting strategy, which allows us to define a hierarchy of delimiters for creating partitions. For instance, at the highest level, we could use newlines to split the prompt into paragraphs, and at a lower level, we could employ various punctuation signs (e.g., ".", "!", "?") to split the text into sentences. By implementing such a hierarchical approach, we are able to construct partitions that align with the natural organization of the text. Once these partitions are defined, we proceed to calculate their connectedness using a similar method as with the bisecting strategy.
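A one-level sketch of the splitting step in pure Python (a full implementation would recurse with lower-level delimiters inside each range; the token strings are illustrative):

```python
def split_ranges(tokens, delimiter):
    """Partition a token sequence into half-open (start, end) ranges, cutting
    after each occurrence of the delimiter (e.g. a newline). The resulting
    ranges become the nodes of the split information-flow graph."""
    ranges, start = [], 0
    for i, tok in enumerate(tokens):
        if tok == delimiter:
            ranges.append((start, i + 1))
            start = i + 1
    if start < len(tokens):
        ranges.append((start, len(tokens)))  # trailing range without delimiter
    return ranges

toks = ["class", " B", ":", "\n", "def", " __init__", "\n", "..."]
parts = split_ranges(toks, "\n")
```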

  1. ^

    We believe that setting attention scores to $-\infty$ before applying softmax is less principled, as it increases the attention that is paid towards the remaining positions.
