The success of prosaic A.I. seems to have generally surprised the rationalist community, myself included. Rationalists have correctly observed that deep learning is a process of mesa-optimization: stochastic gradient descent by backpropagation finds a powerful artifact (an artificial neural net, or ANN) for a given task. If that task requires intelligence, the artifact is often successfully imbued with intelligence. Some rationalists seem to have updated in the direction that there is no simple algorithm for intelligence, for instance giving Hanson's worldview credit for predicting this better (to be clear, I do not know whether the author actually endorses this viewpoint).

There is an argument to be made that deep learning's dependence on huge datasets shows that intelligence is an accumulation of knowledge about the world plus many little useful algorithms. An example line of reasoning is as follows: there is no one central learning algorithm that bootstraps itself to superintelligence; all such attempts, like Eurisko, have failed. Instead, modern deep learning algorithms succeed through familiarity with a huge number of examples, compressed into the weights to form a strong background understanding of the world, and through many locally useful sub-algorithms, each found to perform well on a different type of example, whose results are aggregated to reach high overall performance. Note that the former claim is along the lines of the Cyc philosophy (with a more flexible representation) and the latter is sometimes expressed by Robin Hanson; I think they are distinct but connected ideas.

I will argue that this viewpoint is not necessarily the correct takeaway, and that it apparently arises from a misinterpretation of the reasons for the success of deep learning. Further, this misinterpretation has led mechanistic interpretability to be pursued for the wrong reasons and with the wrong objectives. In combination, these problems have corrupted the content of lesswrong.

My model of deep learning is that neural networks are circuits in the sense of theoretical computer science. That is, to understand trained models, forget the details of neural circuits in the brain. Forget the analogue nature of neural networks. Those are unnecessary complications. Floating point numbers have finite precision, and artificial neural nets actually become more efficient, with similar performance, when that precision is decreased (sometimes to the point of replacing floating point operations with analogous operations on integers). If the approximately analogue nature of artificial neural nets were important for their performance, we would not see that trend. Deep learning learns boolean circuits. The analogue approximation is important for differentiability, so that viewpoint should not be discarded, but it is part of the optimizer, not the mesa-optimizer.
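
As a small illustration of the precision point, here is a minimal sketch (with made-up layer sizes, not a claim about any particular model): quantizing a layer's weights to 8-bit integers barely changes the layer's output.

```python
# Minimal sketch: symmetric int8 quantization of a weight matrix.
# All sizes and distributions below are placeholders chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, (256, 256)).astype(np.float32)  # a "trained" weight matrix
x = rng.normal(0, 1.0, (1, 256)).astype(np.float32)     # an input activation vector

scale = np.abs(W).max() / 127.0                  # map the weight range onto int8
W_int8 = np.round(W / scale).astype(np.int8)     # integer weights

y_fp32 = x @ W                                   # full-precision output
y_int8 = (x @ W_int8.astype(np.float32)) * scale # integer-weight output, rescaled

# Small relative error (around a percent for these made-up sizes).
rel_err = np.linalg.norm(y_fp32 - y_int8) / np.linalg.norm(y_fp32)
print(f"relative error from 8-bit weights: {rel_err:.4%}")
```

If the analogue character of the network were doing real work, this kind of discretization would be expected to hurt far more than it does in practice.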

It is important to separate our thinking about the optimizer from our thinking about the mesa-optimizer. Imagine a model in training as a block of clay. The optimizer is a blind sculptor with many hands, who always pushes and bends the clay greedily towards a better shape, without knowing what it is he is trying to make. When training stops, the clay is fired, and the result is an intelligent artifact: the mesa-optimizer.
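
Here is a toy version of the metaphor, as a hedged sketch (a hand-rolled two-layer network on XOR, nothing from the post itself): SGD blindly pushes the weights around, and once training stops we can "fire the clay" by replacing each sigmoid with a hard threshold, leaving a fixed circuit of threshold gates.

```python
# Toy illustration of optimizer vs. mesa-optimizer; a minimal sketch, not a claim
# about how real models are trained or discretized.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

# A 2-4-1 sigmoid network; the weights are the clay.
W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(20_000):  # greedy local pushes; usually converges (rerun another seed if not)
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    g_out = (out - y) * out * (1 - out)   # gradient at the output pre-activation (squared error)
    g_h = (g_out @ W2.T) * h * (1 - h)    # gradient at the hidden pre-activation
    W2 -= lr * h.T @ g_out;  b2 -= lr * g_out.sum(axis=0)
    W1 -= lr * X.T @ g_h;    b1 -= lr * g_h.sum(axis=0)

# "Fire the clay": replace each sigmoid with a hard threshold at 0. What remains is a
# fixed circuit of threshold gates; with a well-trained net it usually still computes XOR.
h_bits = (X @ W1 + b1 > 0).astype(float)
out_bits = (h_bits @ W2 + b2 > 0).astype(int)
print("thresholded circuit:", out_bits.ravel(), "target:", y.astype(int).ravel())
```

Nothing about the training loop survives in the fired circuit: the loop is the optimizer, and the frozen threshold gates are the mesa-optimizer it leaves behind.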

The success of deep learning shows two somewhat related facts:

First, there are smart, practical-size circuits (good mesa-optimizers) for many problems we care about. Now, a circuit with a billion gates is not very small, and does not necessarily represent a very simple algorithm. But through pruning, a smaller network can often be found within it. And perhaps there are symmetries within these circuits; simple algorithms roll out to pretty large circuits when restricted to a fixed input size. The number of gates in a rolled-out circuit scales with the runtime and space, as well as the number of states, of a Turing machine. Even if neural nets were learning fairly simple algorithms, great depth may still be required to run those algorithms to completion. Current ANNs tend to have on the order of hundreds to thousands of layers, which would correspond to an algorithm running for (roughly) the same number of steps. The number of neurons per layer, though I believe this varies more by application, is on a similar scale. A rough meta-order-of-magnitude estimate suggests to me that this is around the number of gates one would need to naively roll out a TM for a Python script of perhaps a hundred lines running for perhaps a hundred steps. The fact that we are only starting to see really impressive behavior around our current level of scale therefore does little to rule out the existence of simple algorithms for intelligence. On balance, it seems to indicate that there are intelligent algorithms of moderate complexity (in terms of both length and runtime). I would be very interested to see a more careful, explicit analysis. However, it must be noted that there may be efficient circuits at a fixed input size even where there are no efficient algorithms, which somewhat spoils the usefulness of making these estimates precise.
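
To make the kind of estimate gestured at above concrete, here is a back-of-envelope sketch with placeholder numbers of my own choosing (not figures from any real model or program): the standard tableau construction turns a Turing machine running in time T and space S with Q states into a circuit of roughly c·T·S·Q gates, which lands in the same ballpark as the connection counts of large ANNs for quite modest T, S, and Q.

```python
# Back-of-envelope sketch; every number below is a placeholder chosen for
# illustration, not a measurement of any real model or program.
def rolled_out_gates(runtime, space, states, gates_per_cell=10):
    """Rough gate count for a naive tableau construction simulating a TM:
    one block of ~gates_per_cell gates per (time step, tape cell) pair,
    scaled by the number of machine states encoded in each cell."""
    return gates_per_cell * runtime * space * states

# Hypothetical figures for a small program compiled down to a TM.
tm_gates = rolled_out_gates(runtime=1_000, space=1_000, states=1_000)

# Hypothetical ANN of ~1,000 layers with ~1,000 neurons each, densely connected.
ann_weights = 1_000 * 1_000 * 1_000

print(f"rolled-out TM circuit : ~{tm_gates:.1e} gates")
print(f"ANN connection count  : ~{ann_weights:.1e} weights")
```

The point is not the particular numbers, which are invented, but that a simple program does not roll out to a circuit astronomically larger than the networks we already train.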

Second, SGD is powerful enough to find some of these smart circuits (a good optimizer). This second fact is not as important for understanding cognition as it is usually made out to be. The optimizer does not learn the way humans learn. It learns slowly, from a huge number of examples, and it never learns to learn more efficiently. It resembles evolution more than it resembles a human mind, so the workings of the optimizer say little about the workings of the human mind.

It is true that the current best way to find intelligent artifacts is through a blind optimization process at scale, but this does not mean that intelligent artifacts are themselves blind optimization processes at scale. It also doesn't mean that there is no simple algorithm for intelligence. It perhaps provides a little Bayesian evidence in that direction, because if there were very simple algorithms for intelligence, we likely would have found them by other means (e.g. by trying many intuitively appealing short programs or deriving them from decision theory), whereas SGD on enough parameters might be expected to find both small and relatively large algorithms for intelligence. But this is not very strong evidence, because it is quite plausible that we would not have invented small-to-medium-sized algorithms for intelligence by hand. In fact, the existence of many such algorithms would go a long way toward explaining the success of deep learning.

I believe there is some truth to the argument that A.I. systems need a lot of knowledge to get off the ground. In particular, I think a strong inductive bias is necessary to perform well in our world. This is why AIXI has no nontrivial optimality guarantees: in worlds where it is impossible to recover from some mistakes, you need to start out with beliefs in the ballpark of the truth so that you don't saw your own arm off to see how your muscles work. However, in deep learning, compressed knowledge and inference algorithms get mixed together. I suggest reading this explanation from Yoshua Bengio before continuing. It is possible that entangling knowledge and inference is essential to the success of deep learning, and if so, the thesis of this post would probably be wrong. However, I contend that, though the inference engine of course needs to be trained alongside the knowledge base so that knowledge is represented in a usable form, Yoshua Bengio's approach can be made to work, and that the inference algorithms discovered by deep learning are simple and not themselves strongly tied to big data.

As is widely accepted by rationalists with regard to alignment, we should not assume that properties of the optimizer are shared by the mesa-optimizer.

I think this point has been missed, which has led to subtle mistakes in reasoning about ANNs. In particular, the field of mechanistic interpretability seeks to understand the inner workings of ANNs by decoding and explaining the circuits of trained models. The hope is that by inventing techniques for understanding ANNs, we will be able to "read the thoughts" of both existing and future A.I. systems. This is sometimes referred to as "interpretability gives you a mulligan." The problem with this idea is that it will not work in practice. Understanding general circuits is simply too hard.

To give you an idea of how hard understanding general circuits is, I will refer to one of my textbooks on computational complexity, Arora and Barak. In particular, consider circuit lower bounds, the subfield of computational complexity that attempts to prove what problems small circuits CAN'T solve. The chapter on circuit lower bounds is titled "Circuit Lower Bounds: Complexity Theory's Waterloo." In case you need a history lesson, Waterloo was a battle that Napoleon did NOT win. And in fact, I am not aware of any superlinear lower bounds on (general) circuit complexity for any explicit problem, after decades of attempts. This means that either humans are terrible at proving things about circuits, or there are damn good circuits for a lot of problems and we aren't smart enough to find them. I tend to suspect that both are true. Clearly neither is a good sign for our ability to understand circuits produced by an optimizer that is not trying to make them easy for us to understand, and whose circuits might not even be correct in all cases.

Constructing a general approach for understanding arbitrary circuits sounds, on the face of it, like an undecidable problem. Worse, even understanding MOST smallish circuits looks unlikely to me. It seems to be at least as hard as understanding everything about arithmetic mod 2 (which can also encode arithmetic mod p, though I believe the efficiency of that encoding isn't exactly settled). I don't think this is likely to be achieved any time soon, or at all; certainly not soon enough to be relevant to A.I. alignment. The amount of mathematics behind the smallest circuits for arbitrary problems seems to be massive.
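
To unpack the "arithmetic mod 2" comparison with a minimal, self-contained example (the particular function below is my own, chosen only for illustration): over GF(2), AND is multiplication and XOR is addition, so every boolean circuit built from those gates is a polynomial mod 2 and vice versa, and a general theory of small circuits would subsume a theory of such polynomials.

```python
# Tiny illustration that gate-level circuits and polynomials mod 2 are the same objects.
from itertools import product

def circuit(x, y, z):
    # a small gate-level circuit: (x AND y) XOR (y AND z) XOR x
    return (x & y) ^ (y & z) ^ x

def polynomial(x, y, z):
    # the same function written as a polynomial over GF(2)
    return (x * y + y * z + x) % 2

# Exhaustively check agreement on all 8 inputs.
assert all(circuit(*v) == polynomial(*v) for v in product((0, 1), repeat=3))
print("the gate circuit and the mod-2 polynomial agree on all 8 inputs")
```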

I think the mistake perhaps comes from taking the simplicity of the optimizer for a property of the mesa-optimizer. SGD by backprop is one algorithm, so people put a single label, "deep learning," on all the models it produces. But there is no reason that all of these models must use similar circuits. They may each use an array of unique, fantastically complex circuits. Understanding every circuit that SGD can produce, all at once, is not a cohesive research program, and it is not a plan that will succeed.

Mechanistic interpretability is still an important research direction, because we don't need to understand every ANN to learn something about intelligence. We can learn useful algorithms for inference, ones we could not easily have invented by hand, by studying just one ANN that exhibits behavior we wouldn't have been able to program ourselves: for instance GPT-4 or AlphaGo, though perhaps smaller models are a better starting point. We don't even need to understand everything going on inside these ANNs. We just need to identify the circuits doing inference or planning and reverse engineer them. This is not an easy goal, but I believe it may be realistically achievable.

In other words, study particular mesa-optimizers. Do not try to study all mesa-optimizers, and do not expect investigations of the optimizer to yield much insight into the mesa-optimizer.

By extracting insight on inference algorithms from ANNs, we can conceivably improve our own cognition. This was the original goal of lesswrong: to intentionally cultivate rationality. It has been somewhat drowned out by empirical investigation of transformers, LLMs, etc., and intuitive attempts to understand their inner workings as if there were a single coherent thing to understand. But perhaps we should return to directly studying rationality, with these new models as tools instead of distractions. This will give us a chance of building similarly powerful artifacts by hand instead of by blind optimization, which we will be able to understand and perhaps align.
