Understanding Reasoning with Thought Anchors and Probes

JeaniceK; Matthew Robbins; Johannes Taraz

This project was conducted as a capstone for the ARENA 7.0 program by JeaniceK (Section 1), Matt Robbins (Section 2), and Johannes Taraz (Section 3). Equal contribution from all contributors.

The ARENA Capstone is a 5-day project during which participants dig deep into topics covered during the course. We focused on mechanistic interpretability, applying and adapting techniques from the recent paper on Thought Anchors (resampling, causal masking, receiver heads, etc.) to the context of legal reasoning in LLMs. The text below is a write up of our approaches and results. We cover resampling importance and attention analysis, early stopping, and probes to track the LLM’s judgment over the course of its reasoning.

TL;DR

Resampling: Thought anchors exist in legal reasoning but center on fact retrieval rather than planning. Attention weights show only weak correlation with causal importance.

Causal Masking: Sentence dependencies are dominated by local (adjacent) relationships, but long-range dependencies exist—though many may be artifacts of token overlap rather than genuine reasoning. We extract interpretable "reasoning chains" by following high-dependency paths. Masking facts from the indictment primarily affects sentences that restate those facts, suggesting the model echoes evidence rather than deeply transforming it, at least early in the trace.

Receiver Heads: R1-Distill-Llama-8B shows receiver heads concentrated in later layers with clear vertical attention stripes, consistent with Thought Anchors’ original findings on math reasoning.

Early Stopping: By injecting the string “VERDICT:” into the model while it’s reasoning, we see whether the model is leaning toward “innocent” or “guilty” throughout the reasoning, giving us the model’s “judgment-leaning”. In particular, we see that models often change their mind late in the reasoning process.

Probes: The judgment-leaning, obtained via early stopping, can be successfully learned by attention probes (other probe types fail). Probes also work well for highly ambiguous court cases.

Cross-Technique: Different importance metrics showed little qualitative overlap, unlike the high correlations reported in Thought Anchors (though we used different models for different techniques, limiting the comparisons we could make).

Introduction

We set out to investigate chain-of-thought reasoning in modern language models. Specifically, to understand when and how models "make up their mind" when reasoning over complex factual scenarios. Drawing on techniques from recent interpretability research, we replicate methods from Thought Anchors: Which LLM Reasoning Steps Matter? and probe architectures from Building Production-Ready Probes For Gemini to investigate the internal structure of legal reasoning.

Thought Anchors approach: The authors of the Thought Anchors paper showed that analyzing reasoning traces at the sentence level reveals how certain sentences, which they call "thought anchors", disproportionately impact the reasoning trajectory and final answer. The authors measured the importance of individual sentences in a model's reasoning trace by repeatedly sampling replacement sentences from the model, filtering for semantically different ones, and continuing the chain of thought from that point onwards to quantify the sentence’s impact on the distribution of final answers. The authors found thought anchors were typically planning sentences (establishing approach/strategy) or uncertainty management sentences (reconsidering/correcting reasoning). The authors also found that specialized attention heads consistently attend from subsequent sentences to thought anchors. The authors further show that examining sentence-sentence causal links within a reasoning trace gives insight into a model’s behavior.

Probes approach: Training probes (i.e., classifiers) on the activations of a neural network is an established, cheap method for gaining insight into the neural network’s inner workings. Recently, Google DeepMind (“Building Production-Ready Probes For Gemini”) investigated new probe types to better handle long context prompts, among other things.

Our ARENA capstone project: We replicate and extend the thought anchor-based methods to legal reasoning. We also train probes to read the model's internal "opinion" at each reasoning step. Along the way, we find some things that work, some things that don't, and a few surprises.

Task Description

We investigate the ability of DeepSeek-R1-Distill-Llama-8B (and some other models) to reason over synthetic criminal court cases. Our dataset consists of 92 cases generated using Claude 4.5 Opus, each containing a defendant, a charge, and a detailed indictment with ambiguous evidence.

Example case:

Case #21
Defendant: Steven Brooks
Charge: Possession with Intent to Distribute
Indictment: Defendant Steven Brooks is charged with possession with intent to distribute controlled substances after police discovered 47 grams of cocaine in his apartment during a search pursuant to a warrant. Brooks claims the cocaine was for personal use only and that he had no intent to distribute. The prosecution argues the quantity exceeds personal use amounts and that distribution paraphernalia found in the apartment indicates intent to sell. The 47 grams is above the statutory threshold creating a rebuttable presumption of intent to distribute, which is 28 grams in this jurisdiction. Brooks argues he purchased in bulk because his supplier offered a significant discount and that he has a high tolerance due to years of personal use, which he estimates at 3-4 grams daily[...]

The model receives a system prompt instructing it to act as an LLM assisting a judge, and a user prompt presenting the case. It then generates a reasoning trace in <think> tags and outputs a verdict (_guilty or _innocent). We use "innocent" rather than "not guilty" since the former is a single token, simplifying analysis.

*Example case description, reasoning trace and verdict.*

Ambiguous vs. unambiguous cases:

We label cases as ambiguous or unambiguous by sampling 10 reasoning traces for each case and noting the variance in the verdicts the model reaches. Cases which received between 3 and 7 (inclusive) innocent and guilty verdicts were labeled ambiguous. Most of our analysis focuses on the 18 ambiguous cases identified.

Research Questions

We primarily investigated:

How do models reason over legal cases? Taking inspiration from Thought Anchors, we chose sentences as our unit of analysis for "reasoning steps." We want to understand which steps matter and how these steps relate to each other.
When do models "make up their mind"? How important are early steps in determining final conclusions? Does variation early in the trace largely determine the ultimate verdict? In what ways can we measure the model's “judgment-leaning” over the steps?
Which techniques provide useful insights? We implement multiple methods from recent interpretability work to see which converge on similar findings and which reveal complementary structure.

Section 1: Resampling and attention analysis

Resampling: Do the resampling results from the Thought Anchors paper translate to legal reasoning?

Method: We adapted the codebase of the original Thought Anchors paper to legal reasoning and evaluated sentence importance by:

Taking a reasoning chain (step-by-step verdict delivery);
Replacing one sentence with an alternative sentence sampled from the model;
Continuing from after that sentence and observing whether the verdict changes.

Similar to the Thought Anchors paper, we measured the importance of a sentence by whether the resampling of this sentence switches the verdict between guilty and innocent. This frames sentence-level importance as a question of counterfactual influence: if we resample from this sentence onwards, how does this affect the verdict?

Thought anchor sentences derive their importance from disproportionately impacting the reasoning trajectory and final answer. The method to calculate counterfactual importance is described in detail in the original paper (section 3.2).

Our setup: We used Llama 70B and applied resampling to the 18 ambiguous cases. Similar to the Thought Anchors paper, we used 100 rollouts for each case and we used OpenAI GPT-4o (February 2026) to categorize the sentences. We also replicated the experiment with Qwen 1.5B, 5 cases with 10 rollouts, to consider whether the findings generalize across model scales and to conduct further mechanistic interpretability tests.

Results: We found that in the legal context, both models engage in a large amount of fact retrieval and result consolidation (aggregating results, summarizing, preparing). Unlike the original paper, we find that thought anchors were generally sentences related to fact retrieval (recalling facts, formulas, problem details), rather than sentences that reflect planning or uncertainty management.

*Categorization of sentences for Qwen 1.5B (left) and Llama 70B (right)*

We suspect this behavior reflects differences in the task. Where mathematical reasoning requires strict planning and sequential reasoning steps, legal reasoning requires synthesizing a broad array of facts. Thus, the model here must spend more of its reasoning steps sweeping over and consolidating the facts of the case to construct a final judgment.

Similar to the original Thought Anchors paper, our findings indicate the presence of thought anchors. We observed two patterns related to resampling importance: first, we observed ‘load-bearing anchors’, where a single sentence's resampling degrades an outcome otherwise consistent with the original verdict. This can be observed as valleys in the examples below.

*Example of* load-bearing anchors *pattern*

We also observed ‘asymmetric sentence sensitivity’: removing some sentences has little effect on verdict stability, while removing others is highly destabilizing, indicated by swings of 40–100% accuracy [deviation from the baseline]. This can be observed as peaks and valleys in the examples below.

*Example of* asymmetric sentence sensitivity *pattern*

Attention analysis: Do attention patterns reflect causal importance?

In investigating whether attention patterns reflect causal importance, we hypothesized that causally important sentences (i.e., those that cause large deviation from the baseline when resampled) should receive disproportionate attention across the reasoning trace. We tested this at two levels: generally, by measuring average attention to individual sentences across the full trace, and specifically, by examining whether the sentence in which the verdict is passed [this is always the final answer] pays particular attention to thought anchors.

Results: For the general analysis, i.e., measuring average attention across the full trace, results were consistent with the recency bias in transformer attention: high attention went to recent sentences regardless of resampling importance. This dominated the signal: most thought anchors received below-average attention, with only a weak positive correlation between attention and causal importance (Pearson r = 0.23).

*Sentence level attention heatmap (left) and resampling pattern for case 21, Qwen 1.5B (right)*

For the specific analysis, we similarly found no clear indicators that the verdict sentence pays particular attention to thought anchors, except for some cherry-picked instances, exemplified in the figure below.

Case 21, top 5 attention scores highlighted in red, with sentence 18 being a thought anchor

Future work could implement a more targeted approach, focusing on syntactically or semantically salient tokens such as those carrying the key legal claim of each sentence. This would avoid averaging attention across all tokens within a sentence, which may obscure token-level patterns in our current set up. Testing on a larger set of cases and models would also help clarify whether the weak correlation (Pearson r=0.23) reflects a genuine dissociation between attention and causal importance, or simply insufficient statistical power. In the following section, we dive deeper into causal masking and probes to shed light on how models make up their minds.

Section 2: Causal Masking and Sentence Dependencies

*Heatmap showing causal dependencies between sentences for Case 21, Sample 1*

Following Thought Anchors, we measure how masking sentence affects the probability of the model (R1-Distill-Llama-8B) generating sentence (where ):

For each sentence in a reasoning trace, we remove it from the input and measure the effect on subsequent token logits using KL divergence from the original logit outputs. Averaging token-level effects over the sentence gives us the direct causal effect of each sentence on every future sentence . As an example in the figure below, the value in row 2, column 5 in the array above is given by computing .

For implementation simplicity, we removed sentences entirely rather than masking attention to them (as done in Thought Anchors). The only difference is positional embeddings for downstream tokens.

Depiction of the inference steps used to generate the probabilities used in computing the masked importance score. Left: the base token probabilities for sentence 5 are computed via a single forward pass (green). Right: sentence 2 is masked in the reasoning trace, and token probabilities for sentence 5 are again generated via a single forward pass (green). The importance of sentence 2 on sentence 5 is then given by the average log-KL-divergence between the masked and unmasked probability distributions at each token position.

Local vs. Long-Range Dependencies

We find, unsurprisingly, that local dependencies dominate. Sentences depend most strongly on their immediate predecessor, producing a strong near-diagonal signal in our dependency matrices (seen above). This is expected but worth confirming.

However, long-range dependencies do exist. Some off-diagonal entries show substantial causal effects, indicating that certain early sentences continue to influence reasoning many steps later. For example, in case 21, sentence 3 shows a notable causal effect on sentence 19 — a gap of 16 sentences:

Sentence 3: "The prosecution says that in this jurisdiction, the threshold for rebutting the claim of personal use is 28 grams, which Brooks exceeds."
Sentence 19: "The prosecution says the amount is above the threshold and the paraphernalia points to distribution."

However, there's a caveat: these two sentences share substantial overlapping tokens ("The prosecution says," "the threshold"). This signal — and several other long-range dependencies we found — may be better explained as artifacts from components like induction heads rather than genuinely interesting reasoning behavior. Future work could investigate whether specific circuits consistently contribute to/explain these long-range dependencies.

Reasoning Chains

One application of causal attribution patterns is extracting reasoning "paths" or "chains" by connecting sequences of sentences with high causal dependence. If sentence pairs and both show strong dependencies, we can form a chain . We use a score threshold with an adjacency penalty to generate these chains.

For case 21, this technique produces:

[3] The prosecution says that in this jurisdiction, the threshold for rebutting the claim of personal use is 28 grams, which Brooks exceeds.
[17] So, the main issue is whether 47 grams, especially with the paraphernalia, indicates intent to distribute.
[18] Brooks argues it's for personal use because he uses a lot.
[19] The prosecution says the amount is above the threshold and the paraphernalia points to distribution.
[24] While the amount is over the threshold, Brooks's high usage and plausible financial situation might support his claim of personal use.
[25] The lack of distribution history and no evidence of actual sales or customers makes it harder for the prosecution.
[26] So, I'm leaning towards innocent because the evidence, while suggestive, isn't conclusive of distribution intent[...]</think>

Reading through this chain, it provides a fairly coherent, if not repetitive, reasoning through line from the full trajectory, moving from the key legal threshold, to the central question, to competing arguments, to the final weighing of evidence.

Masking Evidence from the Indictment

*Heatmap showing effect of masking indictment sentences on downstream reasoning trace sentences*

We also tried masking sentences from the original indictment rather than from the reasoning trace itself to see how evidentiary facts influence reasoning steps.

What we primarily find is that the model restates case facts early in its reasoning trace. This produces a strong approximately-diagonal pattern of red squares in our dependency matrix — the sentences most affected by masking a fact are simply those that restate that fact. Even apparent longer-range dependencies appear to be restatements of the same facts later in reasoning rather than genuinely using those facts for inference.

This suggests the model's reasoning trace is somewhat repetitive, making it difficult to uncover the ways in which the model is building novel inferences upon the base facts of the case. To uncover these inferences, we would likely need to simultaneously mask the restatements of these facts to remove them entirely from context. Unfortunately, we did not have time to investigate this approach during our project.

Receiver Heads

Kurtosis scores for case 21 for attention heads across layers. Red dots correspond to attention heads with highest average kurtosis across all cases and samples. Most receiver heads are found in layers 19-31.

Following Thought Anchors, we searched for "receiver heads", i.e., attention heads that consistently attend to specific source sentences from all downstream positions, effectively "broadcasting" certain sentences to the rest of the reasoning trace. We identify these by computing the kurtosis of each head's attention pattern: high kurtosis indicates a high degree of tailedness, meaning some source sentences receive disproportionate attention across all query positions.

*Example sentence-level attention patterns from top receiver heads, showing vertical striping pattern*

We found receiver heads in R1-Distill-Llama-8B concentrated in later layers (roughly layers 19–31). The sentence-level attention patterns show clear vertical stripes — individual source sentences attended to from many downstream positions. These patterns are qualitatively similar to those reported in the Thought Anchors paper for mathematical reasoning, suggesting this is a general architectural phenomenon rather than domain-specific.

Section 3: Early Stopping

To track the model's (R1-Distill-Llama-8B) judgment-leaning throughout reasoning, we use early stopping:

*Schematic of early stopping. The sentences S1, …, S5 form the normal chain-of-thought (CoT). The string “VERDICT:” can be injected into the model after* *(e.g., 3) sentences from the CoT, then the model produces an early, or premature, verdict* *(e.g.,* *). In fact, before producing a verdict, it produced logits for the tokens “_innocent” and “_guilty”. The difference between these logits is denoted as* *(e.g.,* ).

We iterate over all sentences and in the step we consider all sentences up-to (and including) the sentence together with the string “VERDICT: ” as the input sequence: "VERDICT:". Then, we record the logit difference: logit(_innocent)logit(_guilty), for all input sequences. This is a continuous measure of the model's current judgment. Positive values indicate the model leaning towards _innocent, negative values indicate leaning towards _guilty.

This method is adapted from Measuring Faithfulness in Chain-of-Thought Reasoning. The early stopping results reveal that models can change their mind late in the reasoning process. The logit difference is not monotonic: it oscillates as the model considers different pieces of evidence, and meaningful shifts can occur even in the final sentences. In some cases, the sign of the model's judgment-leaning flips partway through the trace. The following visualization of case 21 illustrates this clearly. It also includes the scores of a probe whose training is described in the next section.

The early stopping results show direct relation to the text, e.g., the sentence “So weighing all this: The quantity is high, more than the threshold, and he has paraphernalia.” is followed by a strong spike in the “guilty”-direction, whereas the next sentence “But he explains the scale as a diet tool and the bags for food.” is followed by a swing back to neutrality. Thus, the chain-of-thought and the early stopping verdicts are qualitatively faithful to each other.

Probes

Can we read the model's current judgment-leaning directly from its activations, without forcing it to emit a verdict? We trained probes on activations from R1-Distill-Llama-8B (at layers 8, 16, and 24) to predict the early stopping logit difference at each sentence position, i.e., the probe input is the set of activations for each token in at a given layer.

We tested six probe architectures from DeepMind’s probe-paper: linear with average pooling, linear with exponential moving average (EMA), multi-layer perceptron (MLP), attention-based, max of rolling means (MRMA), and multimax.

*Taxonomy of probes described in recent DeepMind probe-paper*

Training setup: Probes were trained with MSE loss on the logit difference targets of the reasoning sentences of 16 cases (containing a mix of both ambiguous and unambiguous cases) and evaluated on held-out cases. Or, more formally, a probe is trained to minimize the loss: .

We also varied a frac parameter controlling what proportion of each reasoning trace's sentences were used for training: frac=1.0 uses all sentences, frac=0.34 uses only the last 34%, and frac=0.1 uses only the last 10%. Lower frac values mean fewer training samples, so worse performance is partly expected. We measure accuracy of a probe as the fraction of sentences where the probe's sign matches the target's sign.

What works: Attention probes at layers 16 and 24 perform well, achieving up to 95% sign-match accuracy on held-out legal cases. Even with the restrictive frac=0.1 setting, attention probes at layer 16 maintain 85% accuracy. MLP probes show moderate performance, particularly at layers 16 and 24 with frac=1.0 (77–89% accuracy).

What doesn't: Linear probes are noticeably weaker (best 85% at layer 16, frac=1.0). EMA, Max of Rolling Means Attention (MRMA), and multimax probes essentially fail; their accuracies hover around 23–33%, suggesting they converge to trivially predicting one class. It is notable that attention probes succeed, while other non-linear probes do not. It suggests that the relevant signal in the hidden states may require attending over the full sequence of sentence representations rather than simple aggregation. It is possible that further hyperparameter optimization would yield better results; we adopted the default values from DeepMind’s probe-paper.

Probe	Linear, mean	EMA	MLP, mean	Attention	MultiMax	MRMA
Accuracy	0.72	0.24	0.81	0.95	0.33	0.5

Table: For each probe type we chose the best performing layer and frac=0.34

We find that probes do not systematically perform worse on ambiguous cases. This is a reassuring sign that probes aren't just picking up on easy cases. When we tested the probes trained on legal cases on other judgment domains, we got mixed to poor results.

Between working on this project and writing this report we found Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering; a work exploring similar techniques (probes), for entirely different question domains, finding they can predict the final answer using a probe, ahead of the CoT.

Conclusion

Cross-Technique Correlations

While we didn't run formal correlation analyses between our various importance metrics, we qualitatively found very little overlap between sentences scoring highly on different metrics, at least on the samples we investigated in detail.

The original Thought Anchors paper reports high correlation between resampling importance and causal-masking-based importance. However, since we used different models for these two analyses (DeepSeek-R1-Distill-Llama-8B for masking vs DeepSeek-R1-Distill-Llama-70B for resampling), we were unable to verify this finding in our setting. This remains an important direction for future work with unified infrastructure.

Takeaways

Our main takeaway is that the Thought Anchors framework — developed in the context of mathematical reasoning — translates meaningfully to legal deliberation. Reasoning models working through court cases exhibit critical junctures, long-range causal dependencies between reasoning steps, and characteristic attention patterns that mirror what has been found in the math setting. At the same time, the model's judgment-leaning (as measured by early stopping and probes) can shift throughout the chain-of-thought, suggesting genuine deliberation rather than early commitment followed by rationalization — at least in some cases.

Acknowledgements

This project was conducted as a capstone for the ARENA 7.0 program, where it was awarded best project. We are grateful to the entire ARENA team for making this possible: James Hindmarch (Programme Lead), Callum McDougall (Founder), Joly Scriven (Operations Lead), David Quarel (Head TA), Nicky Pochinkov (TA), Chloe Li (Strategy and Curriculum Developer), and James Fox (Advisor).

14