Why do Mechanistic Interpretability?

by Prudhviraj Naidu
17th Jul 2025

Any critique is welcome and highly encouraged. Please ask for clarifications if something does not make sense. Feel free to comment on this article or email me at prnaidu@ucsd.edu.

Transformers are circuits we don't understand. The AI field has leapt to finding "interpretable" circuits in transformers to explain their behaviour. However, it is unclear what kind of explanations we want. We know the weights and we can manually calculate a transformer's forward pass, yet even if we do so, we still can't say much about the transformer.

In this article, I want to expand on what types of statements we want to make about transformers and, where possible, what kinds of explanations would allow us to make those statements.

What is interpretability / explanation? 

Say we have a parameterized model Mθ (for example, a weather model) which makes predictions for some system W, where θ is from some language L1[1]. We can view θ as describing some theory T. The model acts according to T; if we understood T, we would understand the model.

We say that we can interpret T in another language L2 if we can describe T with some γ∈L2. Furthermore, we should be able to use γ to make the same predictions as Mθ. Trivially, we can view θ itself as an interpretation of T. Why is this interpretation not useful? What more do we want from our interpretation?

Suppose we want to make a statement S (for example, "Mθ makes predictions independent of the number of people living in a location"). If L2 comes with some rules of inference, then we can use γ to reason about whether statement S is true. That reasoning becomes our explanation for why S is true.

If statement S is true based on γ, we should have high confidence that S is also true for Mθ, since both θ and γ describe the same theory T.
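
As a toy illustration of this setup (the model, features, and rule below are all hypothetical): let Mθ be a tiny linear weather model, let L2 be symbolic expressions over named features, and let γ be a short rule in L2. The statement S above can then be checked by inspecting γ instead of probing Mθ.

```python
import numpy as np

# Hypothetical parameterized model M_theta: a linear map from features
# [temperature_yesterday, humidity, population] to tomorrow's temperature.
theta = np.array([0.9, 0.05, 0.0])  # learned weights (the "language L1": floats)

def M_theta(x):
    """Forward pass of the model: a dot product with the learned weights."""
    return theta @ x

# gamma: an interpretation of the same theory T in a human-readable language L2.
gamma = "prediction = 0.9 * temperature_yesterday + 0.05 * humidity"

# Statement S: "M_theta makes predictions independent of population."
# Reasoning in L2: 'population' does not appear in gamma, so S should hold.
# Sanity check that gamma and M_theta agree, and that S holds on these inputs:
x = np.array([20.0, 50.0, 1_000_000.0])
x_more_people = np.array([20.0, 50.0, 5_000_000.0])
assert np.isclose(M_theta(x), 0.9 * 20.0 + 0.05 * 50.0)  # gamma predicts like M_theta
assert np.isclose(M_theta(x), M_theta(x_more_people))    # S holds on these inputs
```

The point is that once we trust that θ and γ describe the same theory T, establishing S reduces to reasoning in L2, which is far easier than reasoning directly over the raw weights.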

So, what languages L2 and what rules of inference should we be interested in? This would depend on the statements S we would like to make. 

What statements do we want to make? 

  1. Generalization: For a given task, we would like to state whether the model will generalize over a set of inputs X. Say we have a γ for Mθ. Then:
    1. If we have a theory TH for how the task can be generally solved,
      1. Let γH be the description of TH in L2. We can compare γ and γH. If they are the same, then we can state that our model should generalize.
      2. If they are not the same, then we can examine the difference and hope to come up with a counter-example where our model fails to generalize.
    2. If we do not have a theory TH, then
      1. We want to generate counter-examples based on γ where T should fail. Then, we can see whether our model fails on these counter-examples (see the sketch after this list).
      2. We can reason based on how simple γ is. If γ is complex, then maybe we have less confidence that the model will generalize; if γ is simple, maybe we have more confidence?
  2. Knowledge extraction: Presume the theory contains new insights I which humans are not aware of. We want to extract these insights. Why would humans want to know them? I am a little confused here about the need and about what kind of interpretability we would want.
    1. If humans understand I, then they can reconcile it with known human theory and hope to deduce new knowledge?
    2. Will these new insights I help humans verify / trust AI? 
  3. Verifiability: Given an input, we would like to verify whether the AI outputs the correct response.
    1. If the task is easy to verify based on external metrics, then we shouldn't need interpretability?
    2. If the task is hard to verify, γ might make it easier by allowing us to break the task down into easily verifiable subtasks.
    3. Maybe we can't verify the answer, but γ leads us to believe the model is behaving differently on this input (not using the T described by γ but some other theory); thus we should have low confidence that the model will behave correctly.
  4. Edit theory T[2]: We want to edit T (change model behaviour) and be confident that the model will behave according to our new expectations. This seems like it can be achieved through fine-tuning (SFT / RLFT), so I am not sure why interpretability is needed here. Maybe when we do SFT / RLFT, we would want to know whether the model changed beyond what we intended. Let's say γ was the description of the original model's theory, and γ′ is the description of the new model's theory. Then we can hope to compare γ and γ′ and catch unexpected changes.
  5. Speeding up model computations[2]: If we have an interpretation γ, then we can remove the parts of the model which do not correspond to γ, speeding up computation by discarding parts of the model that are not required. However, it would be easier to use pruning, quantization, or other methods to achieve speedups than to go through interpretability. I do not view interpretability as a helpful means towards this goal.
  6. Predicting emergence of model capabilities: Imagine we are training a model. We are interested in whether the model will solve a task if we keep training it. There is some evidence[3] that external metrics such as cross-entropy do not predict future model capabilities for some tasks. van Rossem and Saxe[4] interpret an RNN as an automaton and show that the number of automaton states decreases when the model learns the task. However, it is unclear whether they can predict a priori that the task will be learnt. Establishing this as a general method for predicting emergence remains open.
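
To make item 1 concrete, here is a minimal sketch in which the model, the task, and γ are all hypothetical. Suppose γ says "the model adds two-digit numbers digit-wise and ignores carries". That description immediately yields counter-examples (pairs whose ones digits produce a carry), which we can then feed to the model; a stand-in function simulates the behaviour γ describes so the sketch runs end to end.

```python
# Hypothetical setup: `model_add(a, b)` stands in for querying the real model;
# here it simulates exactly the behaviour that gamma describes.
def model_add(a: int, b: int) -> int:
    """Stand-in for the model: adds digit-wise, dropping carries (as gamma claims)."""
    ones = (a % 10 + b % 10) % 10
    tens = ((a // 10 + b // 10) % 10) * 10
    return tens + ones

# gamma: "the model adds digit-wise and ignores carries".
# Counter-examples implied by gamma: any pair whose ones digits sum to >= 10.
counter_examples = [(a, b) for a in range(10, 100) for b in range(10, 100)
                    if a % 10 + b % 10 >= 10]

failures = [(a, b) for a, b in counter_examples if model_add(a, b) != a + b]
print(f"{len(failures)} of {len(counter_examples)} predicted counter-examples fail")
```

If the real model fails on the counter-examples that γ predicts, that is evidence γ captures T; if it succeeds on them, γ is wrong or incomplete, and our confidence in statements derived from γ should drop.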

Do we need mechanistic interpretability? 

Let's define mechanistic interpretability as interpreting the circuit trace to show how the model arrived at a logit prediction from its token inputs. We hope to see which tokens from the input and which parts of the network were relevant to the model's output. For the relevant parts of the network, we want to generate an interpretation of what each module is doing and an interpretation of the connections between different modules.
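
As a toy picture of the kind of decomposition such a circuit trace aims for (everything below is hypothetical; a real trace would be run on an actual transformer): when the readout from the residual stream is linear, the logit of the predicted token splits exactly into per-component contributions, so we can rank components by how much they push that logit (direct logit attribution).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "transformer": two components (say, an attention head and an
# MLP block) each write a vector into a shared residual stream at the last
# position, which is then read out into logits by an unembedding matrix.
d_model, n_vocab = 8, 5
component_outputs = {
    "head_0": rng.normal(size=d_model),
    "mlp_0": rng.normal(size=d_model),
}
W_U = rng.normal(size=(d_model, n_vocab))  # unembedding

# Because the readout is linear, the predicted token's logit decomposes exactly
# into a sum of per-component contributions.
resid = sum(component_outputs.values())
logits = resid @ W_U
pred = int(np.argmax(logits))

contributions = {name: out @ W_U[:, pred] for name, out in component_outputs.items()}
assert np.isclose(sum(contributions.values()), logits[pred])
for name, c in contributions.items():
    print(f"{name} contributes {c:+.3f} to the logit of token {pred}")
```

A real circuit trace does more than this (it also attributes contributions back to input tokens and interprets what each component computes), but the linear decomposition above is the basic bookkeeping it rests on.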

Does mechanistic interpretability have an edge over other methods?

I think the overall goal is to form some theory about what the model is doing and then, based on this theory, reason about the statements we want to prove (such as generalization). It remains unclear whether mechanistic interpretability is better for forming theories than Chain-of-Thought, or than using influence functions[5] to identify the relevant training data from which to form theories.

Comparing theories formed using mechanistic interpretability to theories formed using other methods is a promising direction in my view. Particular benchmarks I would be interested in:

  1. Generalization: How well do interpretability methods do at distinguishing memorizing from generalizing neural networks? You have two networks that get 100% accuracy on your train set; based on this, you have to predict whether they will generalize to your test set (a minimal sketch follows this list). Demonstrating this for a wide variety of tasks would be a great result in my view. Or, for existing benchmarks, can you reliably come up with adversarial examples of those tasks where the model fails (for example, a pair of numbers that the model fails to add / multiply)?
  2. Knowledge Extraction: Given a set of data (generated using some program or theory), we train models on the data (this could just be few-shot prompting). Then we use different methods to retrieve the knowledge rules and compare them against the original program / theory. I am not sure whether interpretability will do better than asking / training the LLM to come up with a theory which, when given to a human / AI, solves the task.
  3. Verifiability: I am not sure what a benchmark would look like here.
  4. Edit theory T: Anthropic worked on this[6]: the model's behaviour was altered and teams had to detect what the change might have been. It would also be interesting if we could predict emergent misalignment[7].
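
To make the first benchmark concrete, here is a minimal, entirely hypothetical sketch (the tasks, the networks, and the weight-norm proxy are all stand-ins, not a proposed method): two small networks are trained to fit their training set, one on a learnable rule and one on random labels, and we ask whether some quantity read off the trained networks predicts which of the two will generalize.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 20)).astype(float)
X_test = rng.integers(0, 2, size=(200, 20)).astype(float)

# Task with a simple underlying rule: label = XOR of the first two bits.
y_rule_train = X_train[:, 0].astype(int) ^ X_train[:, 1].astype(int)
y_rule_test = X_test[:, 0].astype(int) ^ X_test[:, 1].astype(int)
# Task with no rule at all: random labels, which can only be memorized.
y_random_train = rng.integers(0, 2, size=200)

def fit(y):
    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=10_000, random_state=0)
    net.fit(X_train, y)
    return net

rule_net, random_net = fit(y_rule_train), fit(y_random_train)

# Both should reach (near-)100% train accuracy, so train accuracy alone
# cannot tell us which network will generalize.
print("train acc:", rule_net.score(X_train, y_rule_train),
      random_net.score(X_train, y_random_train))

# A crude stand-in for "how simple gamma is": total weight norm. The hope
# (not a guarantee) is that an interpretability-derived measure like this
# separates the generalizing network from the memorizing one.
def weight_norm(net):
    return sum(np.linalg.norm(w) for w in net.coefs_)

print("weight norms:", weight_norm(rule_net), weight_norm(random_net))
print("test acc of rule-trained net:", rule_net.score(X_test, y_rule_test))
```

An actual benchmark would replace the weight-norm proxy with whatever description γ an interpretability method produces, and score the method on how well γ predicts test-set behaviour across many tasks.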

If you believe I have got something wrong or missed something, please reach out. You can comment below or email me at prnaidu@ucsd.edu.

  1. ^

    Language is defined as a set of strings. You can view θ as a binary string corresponding to floating point numbers.

  2. ^

    Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., ... & McGrath, T. (2025). Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496.

  3. ^

    Naidu, P., Wang, Z., Bergen, L., & Paturi, R. (2025). Quiet Feature Learning in Algorithmic Tasks. arXiv preprint arXiv:2505.03997.

  4. ^

    van Rossem, L., & Saxe, A. M. Algorithm Development in Neural Networks: Insights from the Streaming Parity Task. In Forty-second International Conference on Machine Learning.

  5. ^

    Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., ... & Bowman, S. R. (2023). Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296.

  6. ^

    Marks, S., Treutlein, J., Bricken, T., Lindsey, J., Marcus, J., Mishra-Sharma, S., ... & Hubinger, E. (2025). Auditing language models for hidden objectives. arXiv preprint arXiv:2503.10965.

  7. ^

    Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., ... & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424.