What makes a neural network interpretable?

One response is that a neural network is interpretable if it is *human simulatable*. That is, it is interpretable if and only if a human could step through the procedure that the neural network went through when given an input, and arrive at the same decision (in a reasonable amount of time). This is one definition of interpretable provided by Zachary Lipton.

This definition is not ideal, however. It misses a core element of what alignment researchers consider important in understanding machine learning models. In particular, in order for a model to be simulatable, it must also be at a human-level or lower. Otherwise, a human would not be able to go step by step through the decision procedure.

Under this definition, a powerful Monte Carlo Tree Search would not be interpretable since that would imply that a human could beat an MCTS algorithm by simply simulating its decision procedure. So this definition appears to exclude things that we humans would consider to be interpretable, and labels them uninterpretable.

A slight modification of this definition yields something more useful for AI alignment. We could distinguish *decision simulatability* with *theory simulatability*. In decision simulatability, a human could step through the procedure of what an algorithm is doing, and arrive at the same output for any input.

In theory simulatability, the human would not necessarily be able to simulate the algorithm perfectly in their head, but they would still say that they algorithm is simulatable in their head, "given enough empty scratch paper and time." Therefore, MCTS is interpretable because a human could in theory sit down and work through an entire example on a piece of paper. It may take ages, but the human would eventually get it done; at least, that's the idea. However, we would not say that some black box ANN is interpretable, because even if the human had several hours to stare at the weight matrices, once they were no longer acquainted with the exact parameters of the model, they would have no clue as to why the ANN was making decisions.

I define theory simulatability as something like, the ability for a human to operate the algorithm given a pen and a blank sheet of paper after being allowed to study the algorithm for a few hours ahead of time. After the initial few hours, the human would be "taken off" the source of information about the algorithm, which means that they couldn't simply memorize some large set of weight matrices: they'd have to figure out exactly how the thing actually makes decisions.

Given a notion of theory simulatability, we could make our models more interpretable via a variety of approaches.

In the most basic approach, we limit ourselves to only using algorithms which have a well-understood meaning, like MCTS. The downside of this approach is that it limits our capabilities. In other words, we are restricted to using algorithms that are not very powerful in order to obtain the benefit of theory simulatability.

By contrast, we could try to alleviate this issue by creating small interpretable models which attempt to approximate the performance of large uninterpretable models. This method falls under the banner of *model compression*.

In a more complex ad-hoc approach, we could instead design a way to extract a theory simulatable algorithm that our model is implementing. In other words, given a neural network, we run some type of meta-algorithm that analyzes the neural network and spits out psuedocode which describes what the neural network uses to make decisions. As I understand, this is roughly what Daniel Filan writes about in Mechanistic Transparency for Machine Learning. Unfortunately, I predict that the downside of this approach is that it is really hard to do in general.

One way we can overcome the limitations of either approach is by analyzing transparency using the tools of regularization. Typically, regularization schemes have the intended purpose of allowing models to generalize better. Another way of thinking about regularization is that it is simply our way of telling the learning procedure that we have a preference for models in some region of model-space. Under this way of thinking, an penalty is a preference for models which are close to the origin point in model space. Whether this has the effect of allowing greater generalization is secondary to the regularization procedure; we can pose additional goals.

We can therefore ask, is there some way to put a preference on models that are interpretable, so that the learning procedure will find them? Now we have a concrete problem, namely, the problem of defining which parts of our model-space yield interpretable models.

Rather than thinking about model space in the abstract, I find it helpful to imagine that we first take a known interpretable algorithm and then plot how well that known algorithm can approximate the given neural network. If the neural network is not well approximated by any known interpretable algorithm, then we give it a high penalty in the training procedure.

This approach is essentially the approach that Mike Wu et al. have taken in their paper Beyond Sparsity: Tree Regularization of Deep Models for Interpretability. Their known algorithm is that of a *decision tree*. Decision trees are very simple algorithms -- they simply ask a series of yes-no questions about the data and return an answer in some finite amount of time. The full decision tree is the plot of all possible yes-no questions and the resulting leaves of decisions. The paper defines the complexity of any particular decision tree as the *average path length*, or the expected number of yes-no questions needed to obtain an answer, across input space. The more complex a decision tree needs to be in order to approximate the model, the less interpretable the model is. Specifically, the paper initially defines the penalty over the neural network parameters by the following algorithm

Since this penalty is not differentiable with respect to the model parameters, , we must modify the penalty to incorporate it while training. In order to define the penalty on general neural networks, Wu et al. introduce an independent surrogate neural network which estimates the above algorithm, while being differentiable. Therefore, the penalty for the neural network is defined by yet another neural network.

This surrogate neural network can be trained simultaneously with the base neural network trained to predict labels, with restarts after the model parameters have drifted sufficiently far in some direction. The advantage of simultanuous training and restarting is that it allows the surrogate penalty network to be well suited for estimating penalties for base networks near the one that it is penalizing.

According to the paper, this method produces neural networks that are competitive with state of the art approaches and therefore trade off little in terms of capacity. Perhaps surprisingly, these neural networks perform much better than simple decision trees trained on the same task, providing evidence that this approach is viable for creating interpretable models. Unfortunately, this approach has a rather crucial flaw: it is expensive to train; the paper claims that it nearly doubles the training time for a neural network.

One question remains: are these models simulatable? Strictly speaking, no. A human given the decision tree would still be able to get a rough idea of why the neural network was performing a particular decision. However, without the model weights, a human would still be forced to make an approximate inference rather than follow the decision procedure exactly. That's because after the training procedure, we can only extract a decision tree that approximates the neural network decisions, not extract a tree that perfectly simulates it. But this is by design: if we wanted perfect interpretability then we would be doing either model compression or mechanistic transparency anyway.

In my own opinion, the cognitive separation of *decision* and *theory* simulatability provides a potentially rich agenda for machine learning transparency research. Currently, most research that focuses on creating simulatable models, such as tree regularization, focus exclusively on decision simulatability. This is useful for present-day researchers because they just want a powerful method of extracting the reasoning behind ML decisions. However, it's not as useful for safety, because in the long term we don't really care that much about why specific systems made decisions, just as long as we know they aren't running any bad cognitive policies.

To be useful for alignment, we need something more powerful, and more general than tree regularization. Still, the basic insight of regularizing neural networks to be interpretable might be useful for striking a middle ground between building a model from the ground up, and analyzing it post-hoc. Is there a way to apply this insight to create more transparent models?