I was interested to hear what more of the mech interp crowd thinks of this, so I am crossposting this here as well:
https://interpretability.blog/2025/07/22/interpretability-for-science-or-control/
...I will take this opportunity to describe my personal and subjective experience of the “Actionable Interpretability” workshop at ICML 2025 and then connect my thoughts with some of the larger trends in the field of interpretability.
I want to start, however, at the end of the workshop, which, like most, closed with a panel. At one point, the panel was asked:
Given that non-interpretability researchers often consider interpretability not that useful, how can we make our interpretability research more interesting to other fields, especially CV/NLP? In other words, are there interesting high-level directions which would allow us to better show the utility of our research?
One of the panelists gave the rather facetious answer that there was no need to convince anyone to join interpretability, and that if you didn’t believe in the scientific understanding which interpretability promises, then you could “go add more GPUs or something”. This was, of course, met with a round of applause from the audience, before every panelist admitted the matter was more nuanced than that.
In this blog, I want to dig into that point a little harder, especially since this year’s workshop was specifically designed to focus on the “actionability” of interpretability research, a framing that seems to be in direct contradiction with that opinion.
To do this, I first want to bring up the recent direction pushed by Been Kim, an extremely well-known interpretability researcher who gave the opening talk of this workshop. She has recently been advocating for thinking in terms of the “M” space of machine understanding and the “H” space of human understanding.
I think this is a very helpful framework for emphasizing the goal of interpretability. In particular, there are points in the “M – H” space which the machine ‘knows’ but humans do not. The goal of interpretability is then to bring those points into collective human understanding. This is exactly aligned with the goals of science.
In this most recent talk, Been Kim further suggested that moving knowledge from M to H is what we would call “interpretability”, whereas the opposite goal, bringing the machine’s representation closer to human understanding, would be better called “controllability”, with the most famous examples being alignment and safety problems.
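To keep the two directions straight, here is a rough, purely notational sketch of how I read the framework (my own shorthand, not taken from the talk itself), treating M and H as sets of concepts:

```latex
% Rough shorthand for the M/H framing (my own notation, not Been Kim's slides):
% M = concepts the machine represents, H = concepts humans understand.
\[
  \text{interpretability:}\quad M \setminus H \;\longrightarrow\; H
  \qquad\qquad
  \text{controllability:}\quad M \;\longrightarrow\; H
\]
% Interpretability moves machine-only knowledge (M \ H) into human understanding;
% controllability pulls the machine's representations toward what humans
% already understand and want.
```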
It is in this framework that we can begin to see hints of actionability emerging. In particular, controllability is clearly one of the possible “downstream actions” that were the focus of the workshop, and it is the motivation of many of the newer researchers in interpretability (many of whom come from mechanistic interpretability, which is itself often motivated by the alignment problem). I accordingly feel it is worth emphasizing in this blog post how this differs from the earlier interpretability literature, so that we can distinguish the different motivations of subpopulations within the interpretability community and better understand how those motivations lead to varying degrees of actionability. In particular, I believe this alignment motivation stands in direct contrast with the more classical downstream tasks of interpretability.
I myself work on generalized additive models (GAMs) for tabular datasets. Since I usually approach them from a theoretical perspective, I frankly do not think about actionability at all in much of my day-to-day work, even though these actionability questions are very dear to me. Note that the actions I have in mind are less of the ‘immediate action’ sort implied by this workshop and more of the ‘exploratory data analysis’ sort, which teases out problems as they arise.
For example, the largest application of additive models today is in all likelihood the intelligible healthcare research spearheaded by Rich Caruana. Here, the interpretable GAM is able to surface concerning trends, such as asthma patients appearing to be at lower risk of pneumonia mortality (a correlation rather than a causation). This is an absolutely critical insight into the medical dataset; moreover, I do not believe it is straightforward to automate away the medical professional who helps interpret the GAM and recognize such insights. In fact, I believe this scarcity of experts is exactly the driving force behind the lack of actionability in classical interpretability.
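To give a flavor of the exploratory workflow I have in mind, here is a minimal sketch using InterpretML’s Explainable Boosting Machine (a GAM variant from Caruana’s group). The dataset file and column names (`pneumonia.csv`, `died`, `has_asthma`) are hypothetical placeholders, and this is my own illustration rather than code from the original papers:

```python
# Minimal sketch: fit a GAM-style model on a tabular medical dataset and
# inspect the learned per-feature shape functions, so that a domain expert
# can spot suspicious trends such as "asthma lowers pneumonia risk".
# Assumes the InterpretML package (`pip install interpret`) and a hypothetical
# pneumonia.csv with a binary `died` outcome column.
import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

df = pd.read_csv("pneumonia.csv")            # hypothetical dataset
X, y = df.drop(columns=["died"]), df["died"]

# EBMs are tree-boosted GAMs: one learned shape function per feature.
ebm = ExplainableBoostingClassifier(interactions=0)
ebm.fit(X, y)

# Global explanation: per-feature shape functions the expert can read directly.
# This is where a clinician might notice that, e.g., `has_asthma` receives a
# negative (risk-lowering) score and flag it as a data artifact, not causation.
show(ebm.explain_global())
```

The point of the sketch is that the model produces the readable shape functions, but it is the clinician looking at them who turns the curve into an actionable insight.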
The final invited talk of the workshop, given by Eric Wong, explored exactly these kinds of classical downstream tasks, focusing on his collaborations with two experts in astronomy and three experts in healthcare. The specific applications he presented were universe simulation with the detection of voids and clusters for cosmological-constant prediction, and robust explanation and labeling of gallbladder-surgery video for live surgeon assistance. It is my impression that it is hard to overstate the value of these expert collaborators in making Professor Wong’s research actionable. Although other interpretability researchers may learn the actionability of these specific tasks from the papers associated with the talk, developing new actionable directions would require experts much like Prof. Wong’s collaborators.
It is for these reasons that I wish to emphasize what I think is the great importance of domain experts for focusing on the “actionability” of interpretability, a topic I felt was critically underemphasized in the discussions at this workshop. To summarize my rather strong opinions on this in three parts:
...
**sorry, not that sorry**, hot takes and last paragraphs can be found at the original location:
https://interpretability.blog/2025/07/22/interpretability-for-science-or-control/