Any critique is welcome and highly encouraged. Please ask for clarifications if something does not make sense. Feel free to comment on this article or email me at prnaidu@ucsd.edu.
Transformers are circuits we don't understand. The AI field has leapt to finding "interpretable" circuits in transformers to explain their behaviour. However, it is unclear what kind of explanations we actually want. We know the weights, and we can manually calculate a transformer's forward pass, yet even after doing so we still can't say much about the transformer.
In this article, I want to expand on what types of statements we want to make about transformers and, if possible, what kinds of explanations would allow us to make those statements.
Say we have a parameterized model $M_\theta$ (for example, a weather model) which makes predictions for some system $W$, where $\theta$ is from some language $L$ [1]. We can view $\theta$ as describing some theory $T$. The model acts according to $T$; if we understood $T$, we would understand the model.
We say that we can interpret $\theta$ in another language $L'$ if we can describe $T$ with some $\theta'$ in $L'$. Furthermore, we should be able to use $\theta'$ to make the same predictions as $\theta$. Trivially, we can view $\theta$ as an interpretation of itself. Why is this interpretation not useful? What more do we want from our interpretation?
Presume we want to make a statement $s$ (for example, "$M_\theta$ makes predictions independent of the number of people living in a location"). If $L'$ is a language along with some rules of inference, then we can use $\theta'$ and those rules to reason about whether statement $s$ is true. The reasoning becomes our explanation for why $s$ is true.
If statement $s$ is true based on $\theta'$, we would want high confidence that $s$ is also true for $\theta$, since both $\theta$ and $\theta'$ describe the same theory $T$.
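To keep the setup in one place, here is a minimal formal sketch of the above; the specific symbols ($M_\theta$, $T$, $L'$, $\theta'$, $s$) are just my bookkeeping notation, not established terminology.

```latex
% A minimal sketch of the interpretation setup (illustrative notation only).
\[
\begin{aligned}
  &M_\theta:\ \text{inputs} \to \text{predictions about } W, \qquad \theta \in L \text{ describes a theory } T,\\
  &\textit{interpretation:}\quad \theta' \in L' \text{ interprets } \theta \text{ if } \theta' \text{ also describes } T
      \text{ and yields the same predictions as } \theta,\\
  &\textit{usefulness:}\quad \theta' \vdash_{L'} s \;\Longrightarrow\; \text{high confidence that } s \text{ holds for } M_\theta.
\end{aligned}
\]
```

Here $\vdash_{L'}$ stands for "derivable under the inference rules of $L'$". The identity interpretation ($L' = L$, $\theta' = \theta$) satisfies the first two conditions, but the raw weights come with essentially no inference rules for deriving interesting statements $s$, which is why it is not useful.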
So, what languages and what rules of inference should we be interested in? This would depend on the statements we would like to make.
Let's define mechanistic interpretability as interpreting the circuit trace that shows how the model arrived at a logit prediction from its token inputs. We hope to see which tokens from the input and which parts of the network were relevant to the model's output. For the relevant parts of the network, we want to generate an interpretation of what each module is doing and an interpretation of the connections between modules.
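As a toy illustration of the kind of circuit trace I have in mind, here is a hedged sketch of direct logit attribution on a made-up one-layer, attention-only model with a frozen attention pattern. All names, shapes, and weights below are invented for illustration; this is not the API of any real interpretability library.

```python
# A toy sketch of a "circuit trace" via direct logit attribution, assuming a
# made-up one-layer, attention-only model with a fixed attention pattern.
# Shapes, weights, and names are illustrative, not from any real library.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab, n_heads, seq_len = 16, 10, 2, 5

W_E = rng.normal(size=(d_vocab, d_model)) * 0.1          # token embedding
W_U = rng.normal(size=(d_model, d_vocab)) * 0.1          # unembedding
W_V = rng.normal(size=(n_heads, d_model, d_model)) * 0.1 # per-head value weights
W_O = rng.normal(size=(n_heads, d_model, d_model)) * 0.1 # per-head output weights

tokens = rng.integers(0, d_vocab, size=seq_len)
x = W_E[tokens]                                          # residual stream, (seq, d_model)

# Freeze the attention pattern (uniform over earlier positions) so the whole
# map from tokens to logits is linear and decomposes cleanly.
attn = np.tril(np.ones((seq_len, seq_len)))
attn /= attn.sum(axis=-1, keepdims=True)

# Each head writes attn @ (x W_V) W_O into the residual stream.
head_out = np.stack([attn @ x @ W_V[h] @ W_O[h] for h in range(n_heads)])

final_resid = x[-1] + head_out[:, -1].sum(axis=0)        # last position
logits = final_resid @ W_U
target = logits.argmax()

# Direct logit attribution: split the target logit into the embedding's and
# each head's contribution, then split each head's share by source token.
emb_contrib = x[-1] @ W_U[:, target]
print(f"embedding contribution: {emb_contrib:.4f}")
for h in range(n_heads):
    per_source = attn[-1, :, None] * (x @ W_V[h] @ W_O[h])   # (seq, d_model)
    contribs = per_source @ W_U[:, target]
    print(f"head {h}: total {contribs.sum():.4f}, per source token {np.round(contribs, 4)}")

# The decomposition is exact: the pieces sum back to the original logit.
total = emb_contrib + head_out[:, -1].sum(axis=0) @ W_U[:, target]
assert np.isclose(total, logits[target])
```

The point is only that, in this linearised toy, the final logit decomposes exactly into named contributions (embedding, heads, source tokens). Real mechanistic work additionally has to deal with LayerNorm, MLP nonlinearities, and attention patterns that themselves depend on the input, and then attach interpretations to the pieces.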
Does mechanistic interpretability have an edge over other methods?
I think the overall goal is to form some theory about what the model is doing and then, based on this theory, to reason about the statements we want to prove (such as claims about generalization). It remains unclear whether mechanistic interpretability is better for forming theories than Chain-of-Thought or than using influence functions [5] to identify the relevant training data from which to form theories.
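For concreteness, the quantity that influence-function methods estimate (roughly the one approximated at scale in [5]) is how much a single training example $z_m$ changes the loss on a query $z_q$ at the trained parameters $\hat\theta$; the notation below is mine.

```latex
% Classical influence-function estimate (sketch; H is the training-loss Hessian
% at \hat\theta, which in practice is approximated rather than inverted exactly).
\[
  \mathcal{I}(z_m, z_q) \;\approx\;
    -\,\nabla_\theta \mathcal{L}(z_q, \hat{\theta})^{\top}\, H^{-1}\,
      \nabla_\theta \mathcal{L}(z_m, \hat{\theta}),
  \qquad
  H \;=\; \nabla_\theta^{2}\,\frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(z_i, \hat{\theta}).
\]
```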
Comparing theories formed using mechanistic interpretability to theories formed using other methods is, in my view, a promising direction. Particular benchmarks I would be interested in:
If you believe I have got something wrong or missed something, please reach out. You can comment below or email me at prnaidu@ucsd.edu.
[1] A language is defined as a set of strings. You can view $\theta$ as a binary string corresponding to its floating-point numbers.
[2] Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., ... & McGrath, T. (2025). Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496.
[3] Naidu, P., Wang, Z., Bergen, L., & Paturi, R. (2025). Quiet Feature Learning in Algorithmic Tasks. arXiv preprint arXiv:2505.03997.
[4] van Rossem, L., & Saxe, A. M. (2025). Algorithm Development in Neural Networks: Insights from the Streaming Parity Task. In Forty-second International Conference on Machine Learning.
[5] Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., ... & Bowman, S. R. (2023). Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296.
[6] Marks, S., Treutlein, J., Bricken, T., Lindsey, J., Marcus, J., Mishra-Sharma, S., ... & Hubinger, E. (2025). Auditing language models for hidden objectives. arXiv preprint arXiv:2503.10965.
[7] Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., ... & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424.