Some reasoning steps in a large language model's chain of thought, such as those that generate plans or manage uncertainty, have disproportionately more influence on final-answer correctness and downstream reasoning than others. Recent work (Macar & Bogdan) discovered that models contain attention heads ("receiver heads") that attend solely to these important sentences.
Counterfactual importance is a resampling-based measure of how much a sentence contributes to generating the correct answer. This post extends Thought Anchors and tests whether ablating the receiver heads changes the counterfactual importance of plan generation or uncertainty management. [1]
Using the original full-text chains of thought for problem 6481 from the Math Rollouts dataset, I created new rollouts with DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill Qwen 14B.
I generated 12 rollouts and measured counterfactual importance under three conditions: no ablation, random ablation, and receiver-head ablation. The code used to generate the rollouts and to analyze them accompanies this post.
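As a concrete picture of the resampling step, here is a minimal sketch of regenerating rollouts from a truncated chain of thought with Hugging Face transformers; the sampling parameters and function names are illustrative assumptions, not the exact code used.

```python
# Illustrative sketch: resample continuations of a chain of thought that has
# been truncated just before a target sentence. Sampling settings here are
# assumptions, not the exact configuration used in this post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

def resample_rollouts(prefix_sentences, n_rollouts=12, max_new_tokens=512):
    """Continue the CoT from the sentences preceding the removed one,
    sampling n_rollouts independent completions."""
    prompt = " ".join(prefix_sentences)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.6,
        num_return_sequences=n_rollouts,
        max_new_tokens=max_new_tokens,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tok.decode(seq[prompt_len:], skip_special_tokens=True) for seq in out]
```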
The code used to find the receiver heads takes the full_cots from the Math Rollouts dataset, collects each attention head's activations across the questions, and calculates the kurtosis of those activations across the questions.
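A minimal sketch of what that kurtosis screen might look like, assuming the attention weights have already been pooled into a per-head, per-sentence matrix (the array shape and `top_k` are illustrative assumptions):

```python
# Illustrative kurtosis screen for receiver heads: heads whose attention mass
# concentrates on a few sentences have heavy-tailed (high-kurtosis)
# attention distributions.
import numpy as np
from scipy.stats import kurtosis

def find_receiver_heads(attn, top_k=20):
    """attn: array of shape (n_layers, n_heads, n_sentences), each head's
    average attention onto each CoT sentence, pooled over questions.
    Returns the top_k (layer, head) pairs by excess kurtosis."""
    n_layers, n_heads, _ = attn.shape
    scores = kurtosis(attn, axis=-1, fisher=True).reshape(-1)
    top = np.argsort(scores)[::-1][:top_k]
    return [(int(i) // n_heads, int(i) % n_heads) for i in top]
```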
The counterfactual-importance calculations in Thought Anchors use the cosine similarity of embeddings from a sentence model. I also explored counterfactual importance using cosine-similarity values computed from the tokenizer embeddings of DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill Qwen 14B, respectively.
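Here is a sketch of the two embedding routes, reading "tokenizer embeddings" as the model's input-embedding table mean-pooled over a sentence's tokens; the sentence-model name is an assumption, and in practice one would load only the embedding matrix rather than the full 8B model:

```python
# Two embedding routes for cosine similarity between an original and a
# resampled sentence. The "LLM" route mean-pools the model's own
# input-embedding vectors; the sentence-model name below is assumed.
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

sent_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sentence model
llm_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tok = AutoTokenizer.from_pretrained(llm_name)
embed_table = AutoModelForCausalLM.from_pretrained(llm_name).get_input_embeddings()

def cos_sim_sent(a: str, b: str) -> float:
    """'Sent' route: sentence-model embeddings."""
    ea, eb = sent_model.encode([a, b], convert_to_tensor=True)
    return F.cosine_similarity(ea, eb, dim=0).item()

def cos_sim_llm(a: str, b: str) -> float:
    """'LLM' route: mean-pooled token embeddings from the model itself."""
    def embed(text):
        ids = torch.tensor(tok(text)["input_ids"])
        return embed_table(ids).mean(dim=0)
    return F.cosine_similarity(embed(a), embed(b), dim=0).item()
```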
Counterfactual importance is calculated by re-running rollouts multiple times without a specific sentence, sorting the resampled sentences into similar and not-similar groups, and calculating the KL divergence between the accuracies of the similar and not-similar rollouts. This measures how much an embedding similar to the original sentence affects the accuracy of the answer.
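Putting the pieces together, here is a minimal sketch of that calculation, assuming each resampled rollout carries the cosine similarity of its replacement sentence to the original and whether it reached the correct answer; treating the two accuracies as Bernoulli distributions for the KL term is my reading of the description above:

```python
import math

def bernoulli_kl(p, q, eps=1e-6):
    """KL(Bern(p) || Bern(q)) between two accuracy rates."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def counterfactual_importance(rollouts, threshold=0.8):
    """rollouts: list of (cos_sim_to_original_sentence, is_correct) pairs
    from resampling at one sentence position."""
    similar = [c for s, c in rollouts if s >= threshold]
    dissimilar = [c for s, c in rollouts if s < threshold]
    if not similar or not dissimilar:
        return 0.0  # one group is empty; no comparison possible
    acc_sim = sum(similar) / len(similar)
    acc_dis = sum(dissimilar) / len(dissimilar)
    return bernoulli_kl(acc_sim, acc_dis)
```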
Dropdown Sentences: a behavior of the model after receiver-head or random ablation, where a contiguous group of sentences has counterfactual importance close to 0 and the group's range increases after ablation.
Ablating Receiver Heads
| | Sent | LLM |
|---|---|---|
| No Ablation | [figure] | [figure] |
| Ablating Receiver Heads | [figure] | [figure] |
| Difference (Non-Ablated - Ablated) | [figure] | [figure] |
Ablating Random Heads
| | Sent | LLM |
|---|---|---|
| No Ablation | [figure] | [figure] |
| Random Ablation | [figure] | [figure] |
| Difference (Non-Ablated - Random Ablated) | [figure] | [figure] |
Sent: refers to the sentence-model embeddings, whose cos_similarity is compared against a similarity threshold of 0.8 to calculate counterfactual importance.
LLM: cos_similarity computed from embeddings taken directly from the model's tokenizer rather than from the sentence model used in the Thought Anchors paper.[2]
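For reference, here is a sketch of how ablating a receiver head could be implemented with a PyTorch forward pre-hook that zeros that head's slice of the attention output before the output projection; the Llama-style module path and `head_dim` handling are assumptions about the setup, not the exact code used:

```python
import torch

def make_head_ablation_hook(head_idx, head_dim):
    """Pre-hook that zeros one head's slice of the concatenated attention
    output before it enters o_proj."""
    def hook(module, args):
        hidden = args[0].clone()  # (batch, seq, n_heads * head_dim)
        hidden[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0
        return (hidden,) + args[1:]
    return hook

def ablate_heads(model, heads, head_dim):
    """heads: iterable of (layer, head) pairs, e.g. from the kurtosis screen.
    Returns hook handles; call .remove() on each to restore the model."""
    handles = []
    for layer, head in heads:
        o_proj = model.model.layers[layer].self_attn.o_proj  # Llama-style path
        handles.append(
            o_proj.register_forward_pre_hook(make_head_ablation_hook(head, head_dim))
        )
    return handles
```

Random ablation would be the same procedure applied to (layer, head) pairs drawn uniformly at random instead of the kurtosis-selected ones.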
Graphs: Counterfactual Importance by Sentence Category (Box Plot)
| | No Ablation | Ablating Receiver Heads |
|---|---|---|
| Sent | [box plot] | [box plot] |
| LLM | [box plot] | [box plot] |
Random Ablation on Counterfactual Importance (Box Plot)
| | Random Ablation |
|---|---|
| Sent | [box plot] |
| LLM | [box plot] |
The sentences with the greatest increase and decrease in counterfactual importance line up with the category-level results for plan generation and active computation: active computation was more common among the sentences with the greatest decrease, while plan generation was absent from that group.[3]
| Greatest Increase in Counterfactual Importance | No Ablation vs. Receiver-Head Ablation | Random Ablation |
|---|---|---|
| Sent | [figure] | [figure] |
| LLM | [figure] | [figure] |
| Greatest Decrease in Counterfactual Importance | No Ablation vs. Receiver-Head Ablation | Random Ablation |
|---|---|---|
| Sent | [figure] | [figure] |
| LLM | [figure] | [figure] |
These findings suggest several interesting directions for future research:
Building attribution graphs for the models used in Thought Anchors is, to my knowledge, not possible, as I could not find transcoders for the specific DeepSeek-distilled models.
Distillation could be considered a type of fine-tuning, so in principle transcoders for Llama 8B or Qwen 14B could be reused. However, the available Qwen 14B transcoder targets Qwen 3, whereas DeepSeek R1 Distill Qwen 14B is based on Qwen 2.5. Finding the compute to either distill DeepSeek R1 into Qwen 3 14B or to train new transcoders for Llama 8B was out of scope for this project.
Ablating the receiver heads for Qwen 3 14B had no significant effect. The following is my attribution-graph code.
Thanks to Uzay Macar and Abdur Raheem Ali for helpful feedback on an earlier version of this draft.
Glossary (Words from Thought Anchors):
Kurtosis - the degree of "tailedness" of a distribution.
Receiver Heads - attention heads, identified in Thought Anchors, that narrow attention toward specific sentences.
Counterfactual Importance - a black-box method used in Thought Anchors to determine a sentence's importance.
Glossary (Original Terminology Introduced in this experiment):
Sent: refers to the sentence-model embeddings, whose cos_similarity is compared against a similarity threshold of 0.8 to calculate counterfactual importance.
LLM: cos_similarity computed from embeddings taken directly from the model's tokenizer rather than from the sentence model used in the Thought Anchors paper.
Dropdown Sentences: a behavior of the model after receiver-head or random ablation, where a contiguous group of sentences has counterfactual importance close to 0 and the group's range increases after ablation.
ex) The top graphs show the full counterfactual-importance series; the bottom graphs zoom in on sentences 100-125. After ablation there is a contiguous run of near-0 counterfactual-importance values, and that run grows to span sentences 105-115.
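To make the definition concrete, here is a small sketch for finding those contiguous near-zero runs so their lengths can be compared before and after ablation; the tolerance is an illustrative assumption:

```python
def near_zero_runs(ci_values, tol=1e-3):
    """Return inclusive (start, end) index pairs for contiguous runs of
    sentences whose counterfactual importance is within tol of 0.
    Dropdown sentences show up as runs that grow after ablation, e.g.
    near_zero_runs(ci_after) containing (105, 115) where
    near_zero_runs(ci_before) had only a shorter run."""
    runs, start = [], None
    for i, v in enumerate(ci_values):
        if abs(v) <= tol:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(ci_values) - 1))
    return runs
```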
Ablating the receiver heads doubled the counterfactual importance of plan generation and decreased the counterfactual importance of active computation. This appears to be behavior unique to receiver-head ablation: under random ablation, the counterfactual importance of every sentence category goes to 0 except plan generation, which still shows an increase.
Exploring Changes in Counterfactual Importance After Ablation
The counterfactual importance across all sentences decreases after ablation (github)
Ablation in general caused the cosine similarity of the problem-setup category to go from consistently near 1 to varying significantly more.
Using cosine similarity, the final-answer-emission category shows more variance when the embedding values are taken directly from the tokenizer. (github)
Absolute Difference of Counterfactual Importance
All sentence categories except final answer emission were present among both the highest and lowest absolute differences in counterfactual importance.
Fact retrieval was the most common category in both the greatest- and lowest-change groups.