Léo Dana - Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, as well as an internship at FAR AI, both under the mentorship of Claudia Shi.

Would you trust the IOI circuit?

Interpretability is envisioned by many as the main source of alignment tools for future AI systems, yet I claim that interpretability’s Theory of Change has a central problem: we don’t trust interpretability tools. Why is that?

  • No proof of generalization: interpretability has the same problem as the models we study. We don’t know whether our tools will generalize (and we will likely never have proofs of generalization).
  • Security mindset: there is always something that could go wrong. If we don’t have proof, we rely on several assumptions that we can test, but nothing tells us that there is not a criterion we forgot to test or a critical environment in which the algorithm fails.
  • Distrust is fair: We already have examples of interpretability tools that were, in fact, not robust.
    • Saliency maps, a tool to visually observe which pixels or features of an image are important to the prediction, failed to generalize to simple transformations that didn’t affect the CNNs.
    • ROME and MEMIT use causal tracing to find and change the location of some information in LLMs. However, causal tracing was shown not to find where the information is, and the editing method failed to generalize in some specific unseen contexts.
    • Recently, it was found that naive optimization for interpretable directions in an LLM could result in some Interpretability Illusions.


Although some people don’t trust interpretability, others seem to me to take published results for granted, even when the authors acknowledge limitations in the robustness of their methods:

  • Chris Olah et al, in The Building Blocks of Interpretability: “With regards to attribution, recent work suggests that many of our current techniques are unreliable.”
  • Neel Nanda, about Othello-GPT: “How can we find the true probe directions, in a robust and principled way?”

Note that I’m not saying such findings are not robust, only that their robustness has not been fully evaluated.


I read Against Interpretability and think that most of its criticisms hold because we cannot yet trust our interpretability tools. If we had robust tools, interpretability would be at least instrumentally useful for understanding models.

Moreover, robustness is not unachievable! It is a gradient: how strongly the arguments support your hypothesis. I think Anthropic’s papers Towards Monosemanticity and In-context Learning provide good examples of thorough research that clearly states its hypotheses and claims, and gives several independent lines of evidence for its results.

For those reasons, I think we should work out what a robust circuit is, and evaluate how robust the circuits we have found are. In particular, we should do this while trying to find newer or more complex circuits, not afterwards.


Interpretability Tech Tree

Designing an interpretability tool with all the important desiderata it needs to satisfy (robustness, scales to larger models, …) “first try” is very optimistic. Instead one can use a Tech Tree to know which desideratum to try next.

In Transparency Tech Tree, Hubinger details a path to follow to find a tool for transparency. A Tech Tree approximately says: “Once you have a proof of concept, independently pursue two improvements, and then combine them”, but it also specifies which improvements should be pursued, and in what order. Not only does this enable parallelization of research, but each improvement is also easier since it is based on simpler models or data.

In Hubinger’s case, the tree starts with Best-case inspection transparency, which can be improved into either Worst-case inspection transparency or Best-case training process transparency; the two are finally combined into Worst-case training process transparency.

Example for robustness: Here is a 2-D table with complexity on one side and robustness on the other. New tools start as a Proof of concept, and the goal is to turn them into a Final product that one could use on a SOTA model. The easier path is to do 1 -> 2 and 1 -> 3 in parallel, followed by 2 + 3 -> 4, combining tools rather than finding new ones.




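|                 | less robust            | more robust            |
|-----------------|------------------------|------------------------|
| complex circuit | 2. Proof of complexity | 4. Final product       |
| easy circuit    | 1. Proof of concept    | 3. Proof of robustness |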





A Methodology for Interpretability

In general, interpretability lacks a methodology: a set of tools, principles, or methods that we can use to increase our trust in the interpretations we build. The goal is not to set certain ways of doing research in stone, but to make explicit that we need to trust our research, and to highlight bottlenecks for the community. The recent Towards Best Practices of Activation Patching is a great example of standards for activation patching![1] For example, they analyze whether logits or probabilities (normalized logits) should be used as the metric to obtain robust results, and how this choice may impact your findings.
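To make the logit-versus-probability question concrete, here is a minimal toy sketch. It is not taken from the paper: the three-token vocabulary, the logit values, and the helper names (`logit_diff`, `prob`, `normalized_effect`) are all assumptions made up for illustration. It scores the same patched run with two metrics and shows that they can disagree on how much of the clean behavior was restored.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def logit_diff(logits, correct, wrong):
    # Metric 1: logit of the correct token minus logit of the wrong token.
    return float(logits[correct] - logits[wrong])

def prob(logits, correct):
    # Metric 2: probability assigned to the correct token.
    return float(softmax(logits)[correct])

def normalized_effect(patched, clean, corrupted):
    # Fraction of the clean-vs-corrupted gap recovered by the patch
    # (0 = no better than the corrupted run, 1 = clean behavior restored).
    return (patched - corrupted) / (clean - corrupted)

# Hypothetical output logits; token 0 is the correct answer, token 1 the wrong one.
clean_logits = np.array([6.0, 0.0, 0.0])
corrupted_logits = np.array([0.0, 6.0, 0.0])
patched_logits = np.array([4.0, 6.0, 0.0])  # corrupted run with one component patched

effect_logit_diff = normalized_effect(
    logit_diff(patched_logits, 0, 1),
    logit_diff(clean_logits, 0, 1),
    logit_diff(corrupted_logits, 0, 1),
)
effect_prob = normalized_effect(
    prob(patched_logits, 0),
    prob(clean_logits, 0),
    prob(corrupted_logits, 0),
)

print(f"recovered according to logit difference: {effect_logit_diff:.2f}")
print(f"recovered according to probability:      {effect_prob:.2f}")
```

On these made-up numbers, the logit difference attributes about a third of the effect to the patched component, while the probability metric attributes much less, because the softmax compresses changes that happen while the wrong token still dominates. Which metric to report is exactly the kind of choice a methodology should make explicit.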

Creating such a methodology is not just about technical research: we also need to be clear on what an interpretation is, and what counts as proof. This part is much harder[2] since it involves operationalizing "philosophical" questions.

Fortunately, researchers at Stanford are trying to answer those questions with the Causal Abstraction framework. It tries to formalize the hypothesis that “a network implements an algorithm”, and how to test it. So given our network, and an explicit implementation of the algorithm we think it implements, there exists a method to link their parameters and test how close our algorithm is to the network.
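As a rough illustration of how such a test can work, here is a toy sketch of an interchange intervention, the kind of experiment used in the Causal Abstraction line of work. Everything in it is an assumption for illustration: the "network" is a hand-written two-neuron function rather than a trained model, the hypothesized algorithm computes `(a + b) * c` through an intermediate variable `s = a + b`, and the function names are hypothetical. The idea is to run the network on a base input while overwriting one hidden neuron with its value from a source input, and check whether the output matches what the algorithm predicts when its variable `s` is swapped in the same way.

```python
import itertools
import numpy as np

def high_level(x, s_override=None):
    # Hypothesized algorithm: s = a + b, output = s * c.
    a, b, c = x
    s = a + b if s_override is None else s_override
    return s * c

def high_level_s(x):
    # Value of the intermediate variable s on input x.
    return x[0] + x[1]

def low_level_hidden(x):
    # Toy "network" hidden layer: neuron 0 stores 2 * (a + b), neuron 1 stores c.
    a, b, c = x
    return np.array([2.0 * (a + b), c])

def low_level(x, patch=None):
    # Toy "network" forward pass, optionally overwriting one hidden neuron.
    h = low_level_hidden(x)
    if patch is not None:
        neuron, value = patch
        h[neuron] = value
    return 0.5 * h[0] * h[1]

def interchange_accuracy(inputs, neuron):
    # Fraction of (base, source) pairs where patching `neuron` from the source
    # run gives the output the algorithm predicts when s is taken from source.
    hits, total = 0, 0
    for base, source in itertools.product(inputs, repeat=2):
        expected = high_level(base, s_override=high_level_s(source))
        got = low_level(base, patch=(neuron, low_level_hidden(source)[neuron]))
        hits += bool(np.isclose(expected, got))
        total += 1
    return hits / total

inputs = [(1.0, 2.0, 3.0), (0.0, 1.0, 2.0), (2.0, 2.0, 1.0), (3.0, 0.0, 4.0)]
iia_neuron_0 = interchange_accuracy(inputs, neuron=0)  # hypothesis: s lives in neuron 0
iia_neuron_1 = interchange_accuracy(inputs, neuron=1)  # wrong hypothesis: s lives in neuron 1

print("interchange accuracy, s aligned with neuron 0:", iia_neuron_0)
print("interchange accuracy, s aligned with neuron 1:", iia_neuron_1)
```

The correct alignment agrees with the algorithm on every pair, while the wrong one only agrees when base and source are the same input. On a real network the alignment is not known in advance and has to be searched for, which is where most of the difficulty lies.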


Personal research

At SERI MATS, my research[3] goal was to test the robustness of activation steering. The idea was to test the tradeoffs a direction faces between being good at:

  • Concept classification: from activations in the model, can the direction effectively separate those originating from male vs. female sentences?
  • Concept steering: given a gendered sentence, can you use the direction to make the model use the other gender?
  • Concept erasure: given a gendered sentence, can you use the direction to make the model confused about which gender to use in the rest of the text?

However, I chose to apply it to gender, which turned out to be too complex/not the right paradigm. Using a dataset with the concept of utility, Annah and shash42 were able to test these tradeoffs and found interesting behavior: it is harder to remove a concept than to steer a model, and the direction needs to be chosen precisely to make the removal happen. I really liked their work and I think it is important to understand directions in LLMs!
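The three criteria listed above (classification, steering, erasure) can be sketched on synthetic data. This is only an illustration under strong assumptions, not the setup used in my research or in Annah and shash42's work: the "activations" are two Gaussian clusters rather than real model activations, the direction is a simple difference of class means, and erasure is a plain projection onto the orthogonal complement of that direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical activation dimension

# Synthetic "activations": two classes separated along the first coordinate.
concept_axis = np.zeros(d)
concept_axis[0] = 1.0
acts_a = rng.normal(size=(200, d)) + 3.0 * concept_axis
acts_b = rng.normal(size=(200, d)) - 3.0 * concept_axis

# Candidate direction: normalized difference of class means (an assumed choice).
mean_a, mean_b = acts_a.mean(axis=0), acts_b.mean(axis=0)
direction = mean_a - mean_b
direction /= np.linalg.norm(direction)
midpoint = 0.5 * (mean_a + mean_b) @ direction

def classify(acts):
    # Concept classification: True means "class A" according to the direction.
    return acts @ direction > midpoint

def steer(acts, alpha):
    # Concept steering: push activations along the direction.
    return acts + alpha * direction

def erase(acts):
    # Concept erasure: remove the component of each activation along the direction.
    return acts - np.outer(acts @ direction, direction)

accuracy = 0.5 * (classify(acts_a).mean() + (~classify(acts_b)).mean())
steered_to_a = classify(steer(acts_b, alpha=8.0)).mean()
residual = np.abs(erase(acts_a) @ direction).max()

print(f"classification accuracy: {accuracy:.3f}")
print(f"class-B activations read as class A after steering: {steered_to_a:.3f}")
print(f"largest remaining component along the direction after erasure: {residual:.2e}")
```

In this linear toy setting a single direction does well on all three criteria at once. The interesting question for real models is where that stops being true, for example when the concept is not encoded along a single direction, in which case projecting one direction out would not be enough to erase it.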



I hope to have convinced you of the importance of Interpretability’s Robustness as a research strategy, and especially that the most efficient way to create such a methodology is not to stumble upon it serendipitously.

Thanks for reading. I’ll be happy to answer comments elaborating on and against this idea. I will likely, by default, pursue this line of research in the future.


  1. ^

    It would have helped me a lot during my internship!

  2. ^

    The work that has already been done in cognitive science and philosophy might be useful for creating this methodology.

  3. ^

    It is not available at the moment; I will add a link when I am done with it.
