The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Makes me think about what precisely we want from interpretability, since that seems relevant in grounding it.

As a first go at the answer, it seems to me that we want to know what assumptions the model relies on, in order to give correct results, as well as what properties are satisfied by intermediate results in the network's calculations. This is useful because we can then look at under what conditions these assumptions are known to fail or succeed.

For instance with the cat classification, why would we want to learn that it "has learned to use ears and whiskers to identify cats"? Well, we might translate this into the assumption "something is a cat iff it has cat ears and whiskers". There's two ways this could fail: if some non-cat entity has cat ears and whiskers (for instance because someone is trying to trick the network), or some cat does not exhibit cat ears and whiskers (for instance maybe they are hidden by the camera angle). These failures tell us important things about what would happen if we used the network for high-stakes real-world decisions.

However, there are a number of problems with my stab at the goal of interpretability. First, there will often be multiple different sets of conditions that can make a model valid, rather than a single unique one. Second, the conditions themselves will in reality be so incredibly complex and plentiful that they need interpretation. Third, there is an actual answer to what assumptions the network makes, and it is a useless answer: it makes the assumptions that the data fed into the network is IID with the same distribution as the training data and that the cost of misclassifications is proportional to the loss.

I think these flaws point towards that when we do interpretability, we more want to impose some structure on the network. That is, we want to find some set of conditions that could occur in reality, where we can know that if these conditions occur, the network satisfies some useful property (such as "usually classifies things correctly").

The main difficulty with this is, it requires a really good understanding of reality? This is something humans sometimes have good ability to figure out, but it is hard to make an automated tool for this. In fact it seems like an AGI-complete problem: understand in depth how reality works and distill this into a small set of useful conditions that can be applied to deduce things about how systems perform across a wide variety of conditions.

... 😅 which might be indicative that I have made an overly strong requirement somewhere along the line, because surely one can do useful interpretability without an AGI.

[-]tailcalled3y70

[-]johnswentworth3y43

I think these flaws point towards that when we do interpretability, we more want to impose some structure on the network. That is, we want to find some set of conditions that could occur in reality, where we can know that if these conditions occur, the network satisfies some useful property (such as "usually classifies things correctly").
The main difficulty with this is, it requires a really good understanding of reality?

There we go!

So, one item on my list of posts to maybe get around to writing at some point is about what's missing from current work on interpretability, what bottlenecks would need to be addressed to get the kind of interpretability we ideally want for application to alignment, and how True Names in general and natural abstraction specifically fit into the picture.

The OP got about half the picture: current methods mostly don't have a good ground truth. People use toy environments to work around that, but then we don't know how well tools will generalize to real-world structures which are certainly more complex and might even be differently complex.

The other half of the picture is: what would a good ground truth for interpretability even look like? And as you say, the answer involves a really good understanding of reality.

Unpacking a bit more: "interpret" is a two-part word. We see a bunch of floating-point numbers in a net, and we interpret them as an inner optimizer, or we interpret them as a representation of a car, or we interpret them as fourier components of some signal, or .... Claim: the ground truth for an interpretability method is a True Name of whatever we're interpreting the floating-point numbers as. The ground truth for an interpretability method which looks for inner optimizers is, roughly speaking, a True Name of inner optimization. The ground truth for an interpretability method which looks for representations of cars is, roughly speaking, a True Name of cars (which presumably routes through some version of natural abstraction). The reason we have good ground truths for interpretability in various toy problems is because we already know the True Names of all the key things involved in those toy problems - like e.g. modular addition and Fourier components.

LESSWRONG
LW

LESSWRONG
LW

27

The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

27

27