Good work! I'm curious why there's a sudden dip in the Gemma2-9B probe at the last token position, and why probes trained on Qwen don't seem to show any clear relationship.
Quite a bit of literature indicates that the intermediate activations output by the MLP block are a sum of several features in superposition, where each feature is some direction (vector) in activation space. I would be curious whether you could run an SAE or SNMF on them and see if one of these features is strongly associated with answering correctly.
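For concreteness, here is a rough sketch of the kind of decomposition I have in mind (a tiny sparse autoencoder in PyTorch; the names and hyperparameters are placeholders, not anything from your code):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct activations through a sparse, overcomplete feature basis."""
    def __init__(self, d_model, d_features, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(feats)
        loss = ((recon - x) ** 2).mean() + self.l1_coeff * feats.abs().mean()
        return feats, loss

# After training on the saved MLP activations, one could check whether any single
# feature's activation correlates with the model answering correctly.
```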
I think the probes might be less accurate on Qwen-1.7B, but I wouldn't conclude that definitively: the model itself scored 73% accuracy vs. Gemma3-27B's 86%, so it may be the model rather than the probes that underperforms. The confidence interval is also wider because I used 250 questions on Qwen instead of 500.
I think the sudden dip in Gemma2-9B happens because the last three predicted tokens are always ["%>", "<end_of_turn>", "<end_of_turn>"], so the model might not need any information about the answer to predict them. Interestingly, if you look at the probability ratio between the tokens "A" and "B" instead of the probe, accuracy recovers at the last position.
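For reference, by "probability ratio" I mean something like the following sketch (the exact token strings and ids depend on the tokenizer, so treat the details as illustrative):

```python
import torch

def answer_prob_ratio(next_token_logits, tok):
    """Ratio of next-token probabilities for the single tokens 'A' and 'B' at one position.

    next_token_logits: (vocab_size,) logits at the position of interest.
    tok: the model's tokenizer.
    """
    id_a = tok.encode("A", add_special_tokens=False)[0]
    id_b = tok.encode("B", add_special_tokens=False)[0]
    probs = torch.softmax(next_token_logits, dim=-1)
    return (probs[id_a] / probs[id_b]).item()
```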
I tried to use an SAE on the vectors extracted from Gemma2-9B (that's why I used that model), but I couldn't match the SAEs from HuggingFace to the ones in Neuronpedia (to look up the feature interpretations), so I ended up not using them.
This project was carried out as part of the “Carreras con Impacto” program during the 14-week mentorship phase.
Large language models (LLMs) often use Chain-of-Thought (CoT) to tackle complex problems, breaking them down into intermediate reasoning steps. This has dramatically improved their performance, but it raises a critical question for AI safety and interpretability: is this reasoning process faithful? Does the written rationale genuinely reflect the model's internal computation?
A powerful tool for investigating CoT faithfulness would be a simple, reliable method for tracking the model's "belief" about the final answer (or other relevant information) throughout its reasoning process. If we could train a simple probe to read out this belief from the model's internal state at each generated token, it would offer a new way to audit the model's reasoning. We could observe the dynamics of its conviction: does it start uncertain and gradually build confidence? Most importantly, we could pinpoint the exact tokens where the model's belief shifts, for instance, after processing a key piece of evidence or correcting a flawed assumption. Verifying that these belief changes align with the written rationale would provide evidence for the faithfulness of the chain of thought, and it would also enable new methods for monitoring LLMs.
The promise of this approach, however, hinges on a fundamental question: Do LLMs even maintain a coherent, easily decodable belief about the final answer during the generation of these intermediate steps? Is there a consistent signal we can extract, or is the final answer only synthesized at the very end of the process, making the intermediate states opaque to simple inspection?
This post investigates this question. I attempt to train simple linear and non-linear probes to predict the final answer to complex questions by looking only at the activations from intermediate CoT tokens. In the following sections, I will detail the experimental setup, present the results from probing multiple models, and discuss what these findings suggest about the nature of multi-step reasoning in LLMs.
My research sits at the intersection of two key areas in modern interpretability: using probes to read internal model states and investigating the faithfulness of chain-of-thought reasoning.
Activation Probing for High-Level Concepts: A significant line of research has shown that simple linear probes trained on internal activations can successfully decode high-level properties of a model's state. One example is the work on Designing a Dashboard for Transparency and Control of Conversational AI, where linear probes are used to extract the LLM's beliefs about the user, such as age, gender, and socioeconomic status.
The Unfaithfulness of Chain-of-Thought: While CoT has greatly improved model capabilities, a growing body of evidence, such as Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting, suggests the generated reasoning is not always a faithful representation of the model's actual computational process.
Tracking Beliefs with Active Intervention: One approach to auditing CoT is through active intervention, as explored in How LLM Beliefs Change During Chain-of-Thought Reasoning. In their experiments, they generate a partial CoT, truncate it, and then directly prompt the model for its confidence in the final answer. This method has a key vulnerability: it is not robust to strategic deception. A model could recognize it is being audited and report a plausible confidence level, regardless of its true internal state.
Tracking Beliefs with Passive Probing: Closer to the methodology of this post is the work in Measuring Beliefs of Language Models During Chain-of-Thought Reasoning. They similarly train linear probes on model activations to predict the final answer during CoT generation, providing a more passive and potentially deception-resistant measurement.
My work builds directly on this foundation but extends it in three key ways to provide a more thorough analysis:
The core of my experiment is to test whether a model's intermediate activations during chain-of-thought (CoT) generation contain decodable information about the final correct answer. To achieve this, I set up a probing task where a classifier tries to predict the ground-truth answer to a question using only the activation vector from a single token position in the model's reasoning process.
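As a rough sketch of this setup (variable names and the choice of classifier are illustrative, and the non-linear probes are not shown here):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(X, y, n_folds=5):
    """Cross-validated accuracy of a linear probe.

    X: (n_questions, d_model) activation vectors from one (layer, token position).
    y: (n_questions,) binary labels, e.g. 1 if Answer A is the ground-truth answer.
    """
    probe = LogisticRegression(max_iter=2000)
    return cross_val_score(probe, X, y, cv=n_folds, scoring="accuracy").mean()
```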
Model and Dataset
Tasks and Prompts
For each question, the model was prompted to generate a chain of thought before providing its final answer. I experimented with two different prompt formats to ensure the findings were not an artifact of a specific task structure.
Probing Methodology
For each of the 500 questions, I first generated the model's full CoT response and saved the corresponding activations. The probing analysis was then performed on these saved activations.
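A minimal sketch of how such activations can be cached during generation with the transformers library (the model name and the layer bookkeeping here are illustrative, not the exact code I used):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-9b-it"  # illustrative; swap in whichever model is being probed
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate_with_activations(prompt, max_new_tokens=512):
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        return_dict_in_generate=True,
        output_hidden_states=True,
    )
    # out.hidden_states: one entry per generated token, each a tuple over layers of
    # (batch, seq, d_model) tensors. Keep the hidden state at the position that
    # produces each new token, for every layer: (n_new_tokens, n_layers + 1, d_model).
    acts = torch.stack(
        [torch.stack([h[0, -1] for h in step]) for step in out.hidden_states]
    )
    return tok.decode(out.sequences[0], skip_special_tokens=True), acts.float().cpu()
```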
Evaluation
My evaluation process was tailored to the probe type to ensure robustness.
To investigate when a model forms its belief, I trained probes on activations from the final 100 tokens of the reasoning chain. For our primary model, Gemma-3-27B, on the A/B prompt, probe accuracy remained near random chance until the final 40 tokens, where it started to increase slowly, only spiking up to the model's own accuracy in the last 10 tokens. For comparison, the chain of thought is on average 260 tokens long. This suggests the final answer is mainly computed at the final position.
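The per-position sweep behind this analysis looks roughly like the following (reusing the probe_accuracy helper sketched above; the alignment of activations from the end of each CoT is assumed bookkeeping, not shown):

```python
import numpy as np

# acts: (n_questions, 100, d_model) activations at the final 100 CoT token positions
# y:    (n_questions,) binary ground-truth labels
per_position_accuracy = np.array(
    [probe_accuracy(acts[:, pos, :], y) for pos in range(acts.shape[1])]
)
```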
For the probability prompt, the fact that the linear probes can extract the answer earlier in the CoT is likely because the prompt instructs the model to state its answer early and then calculate the probability of that answer being correct.[1]
While the "late emergence" of the belief in time is one part of the story, the performance of the probes across the model's depth tells us something about what is being learned. If the probes were merely learning a simple input feature or just predicting the final output tokens, we would expect their accuracy to be highest at the very beginning or the very end of the model.
Instead, the results suggest that the probes are detecting a weak, internal representation of the answer that the model constructs during its computation. The figure below, showing peak probe accuracy by layer, provides evidence for this hypothesis.
Finally, to understand what features the probes were using, I projected the learned probe directions onto the model's vocabulary using a logit lens (the unembedding matrix). Interestingly, for both the probability and the A/B prompt probes, one of the tokens with the highest negative weight (predicting Answer B) was ' second'. However, this token only appears after the most predictive layers, and the tokens at the most predictive earlier layers look essentially random, suggesting the extracted representation of the concept is more complex than a single token and only becomes associated with the token ' second' in later layers.
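A sketch of that projection (assuming probe_direction is the weight vector of a linear probe trained at a given layer; the sign convention depends on how the labels were encoded):

```python
import torch

# probe_direction: (d_model,) learned weight vector of a linear probe at some layer
# W_U: (vocab_size, d_model) unembedding matrix, e.g. model.get_output_embeddings().weight
scores = W_U.float() @ probe_direction.float()  # alignment of each vocab token with the probe direction
top = torch.topk(scores, k=10)                  # most aligned (push the probe toward one answer)
bottom = torch.topk(-scores, k=10)              # most anti-aligned (push toward the other answer)
print([tok.decode([i]) for i in top.indices.tolist()])
print([tok.decode([i]) for i in bottom.indices.tolist()])
```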
Layer 39: Bottom 10 Tokens (Anti-aligned with the 'Probability Classification Probes' direction):
Layer 39: Bottom 10 Tokens (Anti-aligned with the 'Yes/No Classification Probes' direction):
My results strongly suggest that for complex, multi-step reasoning tasks, transformer models do not maintain a linearly decodable, evolving "belief" about the final answer throughout their chain-of-thought (CoT) generation; this is partly a consequence of the architecture, which only lets information pass through a fixed number of layers at each token. Instead, it is more likely that the model primarily uses the CoT as a computational scratchpad, laying out intermediate steps as context, and only computes its final answer, and thus its belief, very late in the process by attending over the entire generated sequence. However, in the process of predicting each next token, the model does form a rough estimate of what the answer will be, which is why it is linearly extractable at all, but only with low accuracy.
6. Future Work
This research opens several directions for further investigation:
You are given a question with two possible answers. Your task is to determine if ANSWER A is correct.
Instructions:
First, write your chain of thought analyzing the question and both answers
Consider all relevant information and reasoning
End with ONLY <YES> if answer A is correct or <NO> if it is not correct
Example format:
[Chain of thought analyzing the question and evaluating both possible answers based on factual accuracy, logical reasoning, and available evidence]
<YES> or <NO>
Now analyze this question:
Question: {question}
FIRST ANSWER: {first_answer}
SECOND ANSWER: {second_answer}
You are given a question with two possible answers. Your task is to analyze the question and determine the probability that ANSWER A is correct.
Important context:
- The two answers have been randomly ordered, so without any other information, the base probability is 50%
- Your prediction will be evaluated using the Brier score, which rewards well-calibrated probabilities
- Focus on being accurate rather than confident
Instructions:
1. First, determine which answer you believe is correct (A or B)
2. Then determine your confidence level in that choice (as a percentage)
3. Finally, convert this to the probability that Answer A is correct:
- If you chose Answer A: use your confidence level directly
- If you chose Answer B: use (100 - your confidence level)
Chain of thought format:
- Analyze the question and evaluate both answers
- State which answer you believe is correct and why
- State your confidence level in that choice
- Convert to final probability for Answer A
Example 1 (choosing Answer A):
Question: What is 15 + 27?
Answer A: 42
Answer B: 41
Chain of thought: This is a basic arithmetic question. 15 + 27 = 42. I can verify this: 15 + 20 = 35, then 35 + 7 = 42. Answer A (42) is mathematically correct, while Answer B (41) is incorrect by 1. This is straightforward arithmetic with no ambiguity.
I believe Answer A is correct with 95% confidence.
Since I chose Answer A, the probability that Answer A is correct is 95%.
<95%>
Example 2 (choosing Answer B):
Question: Which gas makes up the largest percentage of Earth's atmosphere?
Answer A: Oxygen
Answer B: Nitrogen
Chain of thought: This is about atmospheric composition. While oxygen is essential for life and very important, nitrogen actually makes up about 78% of Earth's atmosphere, while oxygen makes up about 21%. The remaining 1% consists of argon and other gases. This is well-established scientific knowledge about atmospheric composition.
I believe Answer B is correct with 90% confidence.
Since I chose Answer B, the probability that Answer A is correct is 100 - 90 = 10%.
<10%>
Now analyze this question:
Question: {question}
ANSWER A: {first_answer}
ANSWER B: {second_answer}
I used a probability prompt because, in earlier experiments with Gemma3-27B, I also measured the Brier score to see whether the output probability from the linear probe was better calibrated than the model's own stated probability. However, for some reason the model's accuracy with a plain probability prompt was much lower than expected, so I had to modify it to include the "probability calculation" steps.
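For reference, the Brier score is just the mean squared error between the predicted probability and the binary outcome; a minimal sketch of the comparison I had in mind (variable names are placeholders):

```python
import numpy as np

def brier_score(predicted_prob_a, answer_a_correct):
    """Mean squared error between P(Answer A) predictions and the 0/1 outcomes (lower is better)."""
    p = np.asarray(predicted_prob_a, dtype=float)
    y = np.asarray(answer_a_correct, dtype=float)
    return float(np.mean((p - y) ** 2))

# e.g. compare brier_score(probe_probs, labels) against brier_score(model_stated_probs, labels)
```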