What makes a neural network interpretable?

One response is that a neural network is interpretable if it is *human simulatable*. That is, it is interpretable if and only if a human could step through the procedure that the neural network went through when given an input, and arrive at the same decision (in a reasonable amount of time). This is one definition of interpretable provided by Zachary Lipton.

This definition is not ideal, however. It misses a core element of what alignment researchers consider important in understanding machine learning models. In particular, in order for a model to be simulatable, it must also be at a human-level or lower. Otherwise, a human would not be able to go step by step through the decision procedure.

Under this definition, a powerful Monte Carlo Tree Search would not be interpretable since that would imply that a human could beat an MCTS algorithm by simply simulating its decision procedure. So this definition appears to exclude things that we humans would consider to be interpretable, and labels them uninterpretable.

A slight modification of this definition yields something more useful for AI alignment. We could distinguish *decision simulatability* with *theory simulatability*. In decision simulatability, a human could step through the procedure of what an algorithm is doing, and arrive at the same output for any input.

In theory simulatability, the human would not necessarily be able to simulate the algorithm perfectly in their head, but they would still say that they algorithm is simulatable in their head, "given enough empty scratch paper and time." Therefore, MCTS is interpretable because a human could in theory sit down and work through an entire example on a piece of paper. It may take ages, but the human would eventually get it done; at least, that's the idea. However, we would not say that some black box ANN is interpretable, because even if the human had several hours to stare at the weight matrices, once they were no longer acquainted with the exact parameters of the model, they would have no clue as to why the ANN was making decisions.

I define theory simulatability as something like, the ability for a human to operate the algorithm given a pen and a blank sheet of paper after being allowed to study the algorithm for a few hours ahead of time. After the initial few hours, the human would be "taken off" the source of information about the algorithm, which means that they couldn't simply memorize some large set of weight matrices: they'd have to figure out exactly how the thing actually makes decisions.

Given a notion of theory simulatability, we could make our models more interpretable via a variety of approaches.

In the most basic approach, we limit ourselves to only using algorithms which have a well-understood meaning, like MCTS. The downside of this approach is that it limits our capabilities. In other words, we are restricted to using algorithms that are not very powerful in order to obtain the benefit of theory simulatability.

By contrast, we could try to alleviate this issue by creating small interpretable models which attempt to approximate the performance of large uninterpretable models. This method falls under the banner of *model compression*.

In a more complex ad-hoc approach, we could instead design a way to extract a theory simulatable algorithm that our model is implementing. In other words, given a neural network, we run some type of meta-algorithm that analyzes the neural network and spits out psuedocode which describes what the neural network uses to make decisions. As I understand, this is roughly what Daniel Filan writes about in Mechanistic Transparency for Machine Learning. Unfortunately, I predict that the downside of this approach is that it is really hard to do in general.

One way we can overcome the limitations of either approach is by analyzing transparency using the tools of regularization. Typically, regularization schemes have the intended purpose of allowing models to generalize better. Another way of thinking about regularization is that it is simply our way of telling the learning procedure that we have a preference for models in some region of model-space. Under this way of thinking, an penalty is a preference for models which are close to the origin point in model space. Whether this has the effect of allowing greater generalization is secondary to the regularization procedure; we can pose additional goals.

We can therefore ask, is there some way to put a preference on models that are interpretable, so that the learning procedure will find them? Now we have a concrete problem, namely, the problem of defining which parts of our model-space yield interpretable models.

Rather than thinking about model space in the abstract, I find it helpful to imagine that we first take a known interpretable algorithm and then plot how well that known algorithm can approximate the given neural network. If the neural network is not well approximated by any known interpretable algorithm, then we give it a high penalty in the training procedure.

This approach is essentially the approach that Mike Wu et al. have taken in their paper Beyond Sparsity: Tree Regularization of Deep Models for Interpretability. Their known algorithm is that of a *decision tree*. Decision trees are very simple algorithms -- they simply ask a series of yes-no questions about the data and return an answer in some finite amount of time. The full decision tree is the plot of all possible yes-no questions and the resulting leaves of decisions. The paper defines the complexity of any particular decision tree as the *average path length*, or the expected number of yes-no questions needed to obtain an answer, across input space. The more complex a decision tree needs to be in order to approximate the model, the less interpretable the model is. Specifically, the paper initially defines the penalty over the neural network parameters by the following algorithm

Since this penalty is not differentiable with respect to the model parameters, , we must modify the penalty to incorporate it while training. In order to define the penalty on general neural networks, Wu et al. introduce an independent surrogate neural network which estimates the above algorithm, while being differentiable. Therefore, the penalty for the neural network is defined by yet another neural network.

This surrogate neural network can be trained simultaneously with the base neural network trained to predict labels, with restarts after the model parameters have drifted sufficiently far in some direction. The advantage of simultanuous training and restarting is that it allows the surrogate penalty network to be well suited for estimating penalties for base networks near the one that it is penalizing.

According to the paper, this method produces neural networks that are competitive with state of the art approaches and therefore trade off little in terms of capacity. Perhaps surprisingly, these neural networks perform much better than simple decision trees trained on the same task, providing evidence that this approach is viable for creating interpretable models. Unfortunately, this approach has a rather crucial flaw: it is expensive to train; the paper claims that it nearly doubles the training time for a neural network.

One question remains: are these models simulatable? Strictly speaking, no. A human given the decision tree would still be able to get a rough idea of why the neural network was performing a particular decision. However, without the model weights, a human would still be forced to make an approximate inference rather than follow the decision procedure exactly. That's because after the training procedure, we can only extract a decision tree that approximates the neural network decisions, not extract a tree that perfectly simulates it. But this is by design: if we wanted perfect interpretability then we would be doing either model compression or mechanistic transparency anyway.

In my own opinion, the cognitive separation of *decision* and *theory* simulatability provides a potentially rich agenda for machine learning transparency research. Currently, most research that focuses on creating simulatable models, such as tree regularization, focus exclusively on decision simulatability. This is useful for present-day researchers because they just want a powerful method of extracting the reasoning behind ML decisions. However, it's not as useful for safety, because in the long term we don't really care that much about why specific systems made decisions, just as long as we know they aren't running any bad cognitive policies.

To be useful for alignment, we need something more powerful, and more general than tree regularization. Still, the basic insight of regularizing neural networks to be interpretable might be useful for striking a middle ground between building a model from the ground up, and analyzing it post-hoc. Is there a way to apply this insight to create more transparent models?

I endorse this as a description of how I currently think about mechanistic transparency, although I haven't reread the post (and imagine finding it somewhat painful to), so can't fully endorse your claim.

That seems like a important feeling to pay attention to, as a signal that you should update the post with new insights. :)

Update: I reread the post (between commenting that and now, as prep for another post currently in draft form). It is better than I remember, and I'm pretty proud of it.

Presumably if the extraction procedure is good enough, then the decision tree gets about as much accuracy as the neural network, and if inference times are similar, then you could just use the decision tree instead, and think of this as a neat way of training decision trees by using neural networks as an intermediate space where gradient descent works nicely.

I don't think that's quite true. It's possible in theory that your training procedure discovers a radically simple way to model the data that the human wasn't aware of, and the human can easily step through this radically simple model once it's discovered. In which case the model could be characterized as both super-human and simulatable. Of course, deep learning doesn't work this way.

The paper says:

How to interpret what the neural network is doing the rest of the time?

Do we need to know that, or is 85-90% agreement with a decision tree proxy good enough? I guess it depends on how interpretability is going to be used to achieve overall safety. Do you have a view on that?

In the case of tree regularization, it's hard to see how we could use it to benefit alignment. Luckily, I'm not endorsing tree regularization as a specific safety strategy.

Instead, I see it more like this: how can we encourage our models to be easy to inspect? In a way, it's not clear that we need to actually extract the exact penalty-inducing algorithm (that is, the known-algorithm which is used to generate the penalty term). We just want our model to be doing the type of things that would make it easier rather than harder to peek inside and see what's going on, so if something starts going wrong, we know why.

The hope is that including a regularization penalty can allow us to create such a model without sacrificing performance too heavily.

To generalize my question, what if something goes wrong, we peek inside and find out that it's one of the 10-15% of times when the model doesn't agree with the known-algorithm which is used to generate the penalty term?

I interpreted your question differently than you probably wanted me to interpret it. From my perspective, we are hoping for greater transparency as an end result, rather than treating it as "similar enough" to some other algorithm and using the other algorithm to interpret it.

If I wanted to answer your generalized question within the context of comparing it to the known algorithm, I'd have to think for much longer. I don't have a good response on hand.

I'm not sure what distinction you're drawing here - in both cases, you can simulate the algorithm in your head given enough scratch paper and time. To me, the natural distinction between the two is compressibility, not simulability, since all algorithms that can be written in standard programming languages can be simulated by a Turing machine, which can be simulated by a human with time and scratch paper.

Also, to note: simulatability is not my word here. I agree that every algorithm is simulatable in principle, but I am using a different usage of simulatable than mere Turing simulatable. Perhaps a better phrase is practically simulatable.

If you were told that you could study the weight matrices of a neural network for one hour, and then after that hour you'd be given pen and paper and a bunch of testing data, I don't think you'd be able to classify images well (unless the network was very small and so you could memorize the weights). By contrast, one can study MCTS for one hour and can use it to play games. That's the relevant distinction, though I perhaps phrased it poorly.

I agree that this is similar to compressibility. I will have to think longer for whether they are similar enough for a distinction to be meaningless.

Ah, gotcha. I think this is a bit different to compressibility: if you formalise it as Kolmogorov complexity, then you can have a very compressible algorithm that in fact you can't compress given time limits, because it's too hard to figure out how to compress it. This seems more like 'de facto compressibility', which might be formalised using the speed prior or a variant.

Looking back on this, the relevant notion isn't going to be those distributions, but just plain old Kt complexity: the minimum over programs p that take time t to compute the data of len(p)+log(t).

I'm still not sure about the distinction. A human with an arbitrarily large amount of time & paper could "train" a new NN (instead of working to "extract a decision tree"), and then "use" that NN.

Unless the human could memorize the initialization parameters, they would be using a different neural network to classify. In other words, the training procedure might be simulatable, but not the actual neural networks.

Rather than thinking about simulatability in a Turing sense, ask whether it makes sense in an informal sense: would you know what to do in order to operate the algorithm on paper.

Why wouldn't the "human-trained network" be identical to the "original network"? [EDIT: sorry, missed your point. If the human knows the logic of the random number generator that was used to initialize the parameters of the original network, they can manually run the same logic themselves.]

By the same logic, any decision tree that is too large for a human to memorize does not allow theory simulatability as defined in the OP.

Presumably the random seed is going to be big and complicated.

It also wouldn't just be the random seed you'd need to memorize. You'd have to memorize all the training data and labels too. The scenario I gave was intended to be one where you were asked to study an algorithm for a while, and then afterwards operate it on a

blanksheet of paper, given testing data. Even if you had the initialization seed, you wouldn't just be able to spin up neural networks to make predictions...(just noting that same goes for a decision tree that isn't small enough s.t. the human can memorize it)

Yes, exactly. That's why the authors of tree regularization put large penalties for large trees.

I have little to say except that I didn't want people to take this definition too seriously and now it looks like I've done a bad job communicating what I meant. In my head I have a simple notion corresponding to "could I operate this algorithm on a piece of paper if I was asked to study it for a while beforehand." This is different in my mind from asking whether it is possible to do so on a Turing machine.

But like, I could not operate a large decision tree on a piece of paper if I could study it for a while beforehand, because I wouldn't remember all of the yes/no questions and their structure.

I could certainly build a decision tree given data, but I could also build a neural network given data.

(Well, actually I think large decision trees and neural nets are both uninterpretable, so I mostly do agree with your definition. I object to having this definition of interpretability

andbelieving that decision trees are interpretable.)Yes, I wrote in the main post that the authors of tree regularization penalized large trees. Certainly small decision trees are interpretable, and large trees much less so.

Oh, sure, but if you train on a complex task like image classification, you're only going to get large decision trees (assuming you get decent accuracy), even with regularization.

(Also, why not just use the decision tree if it's interpretable? Why bother with a neural net at all?)

The hope is that we can design some regularization scheme that provides us the performance of neural networks while providing an interpretation that is understandable. Tree regularization does not do this, but it is one approach that uses regularization for interpretability.

If you have access to the training data, then DNNs are basically theory simulatable, since you can just describe the training algorithm and the initialization scheme. The use of random initialization seems like an obstacle, but we use pseudo-random numbers and can just learn the algorithms for generating those as well.

Yes, we would say that DNNs are theory simulatable given the training data. But that's just another way of saying that the training procedure is transparent. We really care about understanding the model, not just the training process.

Also see another comment on here where I clarified that this defintion is not intended to be precise. It might be better described as "practically simulatable."