Tags: Interpretability (ML & AI), Kolmogorov Complexity, AI
An Analogy for Interpretability

by Roman Malov · 24th Jun 2025 · 3 min read
2 comments, sorted by top scoring
Viliam · 4mo

The ultimate chess playing Giant Look-Up Table is huge, but it has small Kolmogorov complexity.

(Or rather, it depends on some technical details. Sometimes there is more than one winning move; which one will the table generator choose? If there is a simple rule, such as "choose the move that comes first in the alphabet", then the Kolmogorov complexity of the table is low. But if the rule is "choose a random move", then the Kolmogorov complexity of the table is high, because it encodes not only the algorithm that created the table, but also all the random numbers that were used.)

I am not sure what the experts mean when they say "interpretability", but I would expect something like small Kolmogorov complexity to be involved.

Roman Malov · 4mo

Yes, the point of this post is that low Kolmogorov complexity doesn't automatically yield high interpretability.


I.

Imagine a newly minted interpretability researcher, Namor, who claims to have solved interpretability. He says: “This is a bunch of matrices filled with floating-point values that get multiplied together; what’s the big deal?” His point of view might be considered myopic, but I think it is deeper than it seems.

Zermelo’s theorem says that in a finite two-player game with perfect information and alternating moves, either one of the players can force a win, or both can force a draw. Its proof also gives an algorithm for finding such a strategy (backward induction): build the whole game tree; then, for every node all of whose children are terminal, replace it with a terminal node carrying the best outcome available to the player who moves there; repeat until you reach the root. A game is considered solved when we have such an optimal strategy in hand.
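
To make the procedure concrete, here is a minimal sketch of that backward-induction recursion in Python. It is applied to a toy game (single-pile Nim: take 1 or 2 stones, whoever takes the last stone wins) rather than chess, purely because the tree is small; the game choice and function names are my own illustration, not something from the post.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def solve(stones):
    """Best achievable outcome for the player to move: +1 = forced win, -1 = forced loss."""
    if stones == 0:
        return -1  # the previous player took the last stone, so the player to move has lost
    # Negamax form of backward induction: pick the move that is worst for the opponent.
    return max(-solve(stones - take) for take in (1, 2) if take <= stones)

print([n for n in range(1, 13) if solve(n) == -1])  # forced losses: [3, 6, 9, 12]
```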

Why, then, isn’t every finite game (chess, say) solved? Because game trees grow exponentially. In chess there are 20 possible first moves, 400 possible positions after one move by each player, 8,902 after three plies, and so on. No amount of computing power in the universe can solve chess by brute force. But I suggest considering it anyway: imagine a program P running the Zermelo algorithm on an impossibly large computer. It will beat (or at least draw against) any opponent, including any AGI trained with deep RL and SGD. This program is tiny: I just described it in one paragraph, and I’m sure almost anybody could understand how it works. It is perfectly interpretable. Why, then, does it keep winning against everyone?
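
The branching numbers above are easy to check directly. A quick sketch, assuming the third-party python-chess package is installed (it is not part of the standard library):

```python
import chess  # third-party: pip install python-chess

def count_positions(board, depth):
    """Count the game-tree nodes `depth` plies down from `board` (a 'perft' count)."""
    if depth == 0:
        return 1
    total = 0
    for move in board.legal_moves:
        board.push(move)
        total += count_positions(board, depth - 1)
        board.pop()
    return total

board = chess.Board()
for depth in range(1, 5):
    print(depth, count_positions(board, depth))
# 1 20
# 2 400
# 3 8902
# 4 197281   <- roughly 200,000 positions after just two moves per side
```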

The obvious answer is that it has much more computing power: its opponent just can’t see all of the moves this tiny program sees.

I suggest looking at this from another angle: it is not, in fact, a tiny understandable program that wins against any opponent; it is the Giant Look-Up Table (which is what the game tree amounts to), incomprehensible to the opponent, that does the trick. And to actually interpret the inner workings of this program, the opponent would somehow have to fit that whole tree into their head.
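
To see the gap concretely, the same toy Nim solver from the earlier sketch can be materialized into an explicit {position: best move} table. The generator stays a handful of lines, while the table grows with the game; for chess the analogous table would not fit in any physical memory, let alone a brain. (Again my own illustration on the toy example, not anything from the post.)

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def solve(stones):  # same toy Nim solver as in the earlier sketch
    return -1 if stones == 0 else max(-solve(stones - take) for take in (1, 2) if take <= stones)

def best_move(stones):
    # The move that leaves the opponent in the worst position.
    return max((take for take in (1, 2) if take <= stones), key=lambda take: -solve(stones - take))

# The Giant Look-Up Table: the strategy written out explicitly, one entry per position.
table = {n: best_move(n) for n in range(1, 10_001)}
print(len(table), "entries, produced by a generator a few lines long")
```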

II.

Why, then, should we expect interpretability to do us any good? Why wouldn’t the resulting program be an incomprehensible mess of facts and heuristics which somehow (and incomprehensibly) does the trick? My guess is that this is where simplicity bias comes in. When we train models, we add a regularization term: it punishes the model for having overly large weights, steering it toward being simpler and more general. You could say that this creates (or at least deepens) an attractor well pulling the model toward more general capabilities.
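
As an illustration of that regularization term, here is a minimal sketch in PyTorch; the model, data, and coefficient are placeholders of my own, not anything from the post.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                         # stand-in model
x, y = torch.randn(32, 10), torch.randn(32, 1)   # stand-in batch
lam = 1e-3                                       # hypothetical regularization strength

task_loss = nn.functional.mse_loss(model(x), y)
# L2 penalty: large weights are punished, nudging the model toward simpler functions.
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
loss = task_loss + lam * l2_penalty
loss.backward()

# Equivalently, most optimizers can fold the same penalty into the update step, e.g.
# torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=lam).
```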

Turning back to Namor, I think his opinion points to a more general question: what do we even mean by a system being interpretable? If I memorize all the weights of GPT-2, can I claim to have interpreted it? My intuition says no (and I assume the consensus is the same). But, as we’ve seen, that is exactly what is needed to interpret the program P. By this analogy, even if we disentangle all the matrices into nice, human-understandable circuits, it might still be that our brains don’t have enough computing power to hold the “disentangled version” and “win”.

As to why this question arises at all, my guess is that it is an artifact of how humans carve reality at its joints. It is possible to know the underlying rule perfectly and still be surprised by a particular example (in our case, the interpretable program is the rule, and the specific way it beats an opponent is the example); the feeling of perfect interpretability comes only when a person knows both the rule and the examples and can see the connection between them.