In his List of Lethalities, Eliezer writes that "matrices [in neural networks] are opaque". But I think our understanding of how neural networks actually work has improved significantly in the last year with the publication of the deep learning theory book. I will briefly outline the general idea of the book below, in case closing this knowledge gap inspires any viable path towards alignment.

Everything that follows is paraphrased from Roberts, Yaida, and Hanin's book, unless I indicate otherwise. To understand the book, you need decent linear algebra and analysis skills, with a sprinkle of information theory. Most of the difficulty comes from the length of the calculations, which requires some practice to get used to. Luckily, the book is exceptionally well written and guides the reader through the math step by step.

## Effective theories

Artificial neural networks are usually described in terms of activations, weights, and biases. This is the "atomic" view. But just as describing a gas in terms of its individual atoms is too fine-grained for most practical purposes, so is the atomic view of neural networks. To understand neural networks, we need to know the *effective degrees of freedom* of the network - akin to the temperature, pressure, and volume of a gas. This is not just a practical matter; it is essential for a true understanding of anything. You haven't fully understood a gas until you have derived the ideal gas law - even if you know all about the quantum field theory that ultimately implies it.

A beautiful thing about nature is that effective theories work. While nature always operates on the lowest level, most of the low-level details don't matter for the high-level description. Physicists use perturbation theory and "renormalisation group flow" to successfully link the low-level description (e.g. atoms) with the high-level one (e.g. a gas). This not only gives you better concepts to work with at large scales, but also tells you *when those concepts will break down and what to do in that case*. The idea in the book is to apply the physicist's tools to find an effective theory description of neural networks.

## The ensemble

The weights and biases of a neural network are typically initialised randomly before training. During training, all of these parameters are updated (e.g. step by step via gradient descent) until the network produces a desired output from a given input. The function that the trained network represents therefore usually depends on the particular initialisation that we started with^{[1]}. To fully understand neural networks, it is essential to think in terms of an ensemble of networks over all possible initialisations, instead of a single trained network.
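To make the ensemble picture concrete, here is a toy sketch (the architecture, hyperparameters, and all names are my own choices for illustration, not from the book): train the same tiny tanh network from several random initialisations and compare the trained functions at a point outside the training set.

```python
import numpy as np

def init_params(rng, n_in=1, n_hidden=16, n_out=1):
    # One random draw from the initialisation distribution
    return {
        "W1": rng.normal(0, 1 / np.sqrt(n_in), (n_hidden, n_in)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 1 / np.sqrt(n_hidden), (n_out, n_hidden)),
        "b2": np.zeros(n_out),
    }

def forward(p, x):
    h = np.tanh(p["W1"] @ x + p["b1"][:, None])
    return p["W2"] @ h + p["b2"][:, None]

def train(p, x, y, lr=0.1, steps=2000):
    # Plain gradient descent on the mean-squared error, gradients by hand
    for _ in range(steps):
        h = np.tanh(p["W1"] @ x + p["b1"][:, None])
        out = p["W2"] @ h + p["b2"][:, None]
        d_out = 2 * (out - y) / x.shape[1]
        p["W2"] -= lr * d_out @ h.T
        p["b2"] -= lr * d_out.sum(axis=1)
        d_h = (p["W2"].T @ d_out) * (1 - h**2)
        p["W1"] -= lr * d_h @ x.T
        p["b1"] -= lr * d_h.sum(axis=1)
    return p

x_train = np.linspace(-1, 1, 8)[None, :]
y_train = np.sin(3 * x_train)

x_test = np.array([[0.37]])          # a point not in the training set
preds = []
for seed in range(5):                # five draws from the initialisation ensemble
    p = train(init_params(np.random.default_rng(seed)), x_train, y_train)
    preds.append(forward(p, x_test)[0, 0])

print(preds)
```

Each ensemble member typically fits the training data closely, yet the predictions away from the training points differ from seed to seed - which is why one has to reason about the distribution over trained functions rather than a single network.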

So we are dealing with a probability distribution over network outputs (those of the final layer or any intermediate ones), given an input and an initialisation distribution. To build an effective theory of a feed-forward network, Roberts et al. mainly do three things:

- Start with the infinite layer width limit and then do perturbation theory on the width to get to large but finite width
- Use the recursion relation between layers
- Marginalise over the parameter initialisation distribution (whenever this is desirable)
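As an illustration of the second bullet, here is a minimal sketch (my own construction, not code from the book) of the layer-to-layer recursion that the two-point kernel of the preactivations obeys at exactly infinite width, for a ReLU network, with the Gaussian expectation estimated by Monte Carlo:

```python
import numpy as np

def kernel_recursion(x1, x2, depth=3, C_W=2.0, C_b=0.0, n_samples=200_000):
    """At infinite width the preactivations at each layer are Gaussian,
    and their kernel obeys the recursion
        K^{(l+1)}(x, x') = C_b + C_W * E[ relu(z) relu(z') ],
    where (z, z') ~ N(0, K^{(l)}).  The expectation is estimated by sampling."""
    # First-layer kernel from the inputs themselves
    K = C_b + C_W * np.array([[x1 @ x1, x1 @ x2],
                              [x2 @ x1, x2 @ x2]]) / len(x1)
    rng = np.random.default_rng(0)
    for _ in range(depth - 1):
        z = rng.multivariate_normal(np.zeros(2), K, size=n_samples)
        a = np.maximum(z, 0.0)                      # ReLU activations
        K = C_b + C_W * (a.T @ a) / n_samples       # next layer's kernel
    return K

x1 = np.array([1.0, 0.0])
x2 = np.array([0.6, 0.8])
K = kernel_recursion(x1, x2)
print(K)
```

With `C_W = 2` (the critical choice for ReLU), the diagonal of the kernel stays fixed while the off-diagonal correlation flows layer by layer; the book then computes the finite-width corrections to this picture perturbatively in 1/width.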

All the math is pretty standard in physics, and as far as I can tell it all checks out (but I haven't done all the calculations myself, yet). What we get in the end is an effective description of a feed-forward network that lets you derive all kinds of interesting things.

## Over-parametrisation and generalisation

Modern deep neural networks are usually trained to convergence, where the training error vanishes. Here, we are working in the over-parametrised regime, where the network has many more parameters than are necessary to describe the training data. (See p. 391 of the book for why this is not in conflict with Occam's razor.) As a result, the loss function (what we try to minimise during training) does not have a single global minimum, but a high-dimensional sub-manifold of global minima. The art of building and training neural networks lies in finding not just any global minimum of the loss function, but the global minimum that *also minimises the error on the unseen test data*.
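A stripped-down linear analogue (my own toy example, not from the book) makes the manifold of global minima tangible: with more parameters than data points, the minimum-norm solution plus any null-space direction also fits the training data exactly, yet different minima disagree on test inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50))      # 5 training points, 50 parameters
y = rng.normal(size=5)

# Minimum-norm global minimum via the pseudoinverse ...
w_min = np.linalg.pinv(X) @ y
# ... plus any vector in the null space of X is also a global minimum
null_basis = np.linalg.svd(X)[2][5:]      # rows spanning null(X)
w_other = w_min + 3.0 * null_basis[0]

print(np.allclose(X @ w_min, y), np.allclose(X @ w_other, y))  # both fit the data
x_test = rng.normal(size=50)
print(x_test @ w_min, x_test @ w_other)   # but they disagree off the training set
```

Which point on this manifold the training dynamics selects, and how well that point generalises, is exactly what the initialisation and learning algorithm determine.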

Roberts et al. express this error on the test set explicitly with the effective theory and show how it is affected by the choice of initialisation and learning algorithm. Specifically, they show how an object called the neural tangent kernel is the main driver of the function-approximation dynamics and how its components determine the generalisation behaviour of the model. They also show explicitly how and why representation learning works.

The theory described in the book is derived for feed-forward and residual networks, but the same techniques should apply to transformers and any other kind of network that has some recursive / modular structure. Doing this is a lot of hard work, but it can be fun if you like physics.

## Dealing with the distributional shift

The question now is whether we can express the change of a network's behaviour under distributional shift of the input data. If we can do this, maybe we can say something that might help with alignment. Having said that, such a research direction is probably advancing capability more than alignment, as usual.

^{[1]} In the over-parametrised, non-convex regime that we are interested in.

I suspect you use the word "opaque" in a different way than Eliezer Yudkowsky here. At least I fail to see from your summary how this would contradict my interpretation of Eliezer's statement (and your title and introduction seem to imply that it is a contradiction).

Consider the hypothetical example where GPT-3 states (incorrectly) that Geneva is the capital of Switzerland. Can we look at the weights of GPT-3 and see if it was just playing dumb or if it genuinely thinks that Geneva is the capital of Switzerland? If the weights/"matrices"/"giant wall of floating point numbers" are opaque (in the sense of Eliezer, according to my guess), then we would look at it and shrug our shoulders. I fail to see from your summary how the effective theories would help in this example. (Disclaimer: In this specific example or similar examples, I would not be surprised if it was actually possible to figure out whether it was playing dumb or what caused GPT-3 to play dumb. Also, I do not expect GPT-3 to actually believe that Geneva is the capital of Switzerland.)

My guess of your meaning of "opaque" would be something like "we have no idea why deep learning works at all" or "we have no mathematical theory for the training of neural nets", which your summary disproves.

Hmm, you may be right, sorry. I somehow read the opaqueness problem as a sub-problem of lie-detection. To do lie-detection we need to formulate mathematically what lying means, and for that we need theoretical understanding of what's going on in a neural net in the first place, so we have the right concepts to work with.

I think lie-detection in general is very hard, although it might be tractable in specific cases. The general problem seems hard because I find it difficult to define lying mathematically. Thinking about it for five minutes, I hit several dead ends. The "best" one was this: If the agent (for lack of a better term) lies, it would not be surprised about a contrary outcome. That is, I think it would be a bad sign if the agent wasn't surprised to find me dead tomorrow, despite stating the contrary. And surprisal is something that we have an information-theoretical handle on. However, even if we could design the agent such that we can feed it with such input that it actually "believes" it is tomorrow and I am dead (even though it is today and I am still alive), we would still need to distinguish surprisal about the fact that I'm dead from surprisal about the way the operator has formulated the question, or any other thing. (A clever agent might expect the operator to ask this question and deliberately forget that one can ask the question in this particular way, so it'd be surprised to hear this formulation, etc.) The latter issue might become more tractable now that we better understand how and why representations are forming, so we could potentially distinguish surprisal about form and surprisal about content. I still see this as a probable dead end because of the "make it believe" part. If a solution exists, I expect it to be specific to a particular agent architecture.
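For reference, the information-theoretic handle mentioned above is surprisal, −log₂ p in bits. A trivial sketch (the token probabilities are made up for illustration, not from any real model):

```python
import numpy as np

def surprisal_bits(p):
    """Surprisal of an event with probability p, in bits: -log2(p)."""
    return -np.log2(p)

# Hypothetical next-token probabilities for "the capital of Switzerland is ..."
probs = {"Bern": 0.70, "Geneva": 0.25, "Zurich": 0.05}
for token, p in probs.items():
    print(f"{token}: {surprisal_bits(p):.2f} bits")
```

The hard part discussed above is not computing this quantity but attributing it: separating surprisal about the content of an observation from surprisal about its form.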

I would count that as substantial progress on the opaqueness problem.

To be clear: I don't have strong confidence that this works, but I think this is something worth exploring.

Indeed, it does seem possible to figure out where simple factual information is stored in the weights of an LLM, and to distinguish whether it "knows" a fact or is simply parroting it.

Thanks for this. It's useful to me.

I found this talk by Sho Yaida, one of the authors of the book: https://youtu.be/XAuz08GuY9A

I'm about halfway through listening. He's a good speaker, but I'm still trying to understand the level of predictive power gained from this theory and to what extent it could help with alignment.

Very interesting. Can you say more about what sorts of things can be predicted by this theory of neural networks? What kind of dent do you think this knowledge can make into interpretability research?

The predictions laid out in the book are mostly about how to build a perceptron such that representation learning works well in practice and the generalisation error gets minimised. For example, the depth-to-width ratio of the network has to be non-zero but finite for representation learning to work. An optimal ratio is also approximated.

Many more concrete, testable, numeric results like this are derived. The idea is that this is just the beginning and a lot more could potentially be derived. You can use the theory to express any observable (analytic combination of pre-activations anywhere in the network) that you might be interested in and study its statistics.

One more thing I should probably have added: I am only talking about the distributional shift in input data, which is important. But I think Eliezer is also talking about another kind of distributional shift that comes from a change in ontology. I am confused about how to think of this. Intuitively it is "the world hasn't changed, just how I look at it", whereas I discuss "the world has changed" (because the agent is doing things that haven't occurred during training).

This feels important but after the ideal gas analogy it's a bit beyond my vocabulary. Can you (or another commenter) distill a bit for a dummy?

The ideal gas law describes relations between macroscopic gas properties like temperature, volume, and pressure. E.g. "if you raise the temperature and keep the volume the same, the pressure will go up". The gas is actually made up of a huge number of individual particles, each with its own position and velocity at any one time, but trying to understand the gas's behavior by looking at a long list of particle positions/velocities is hopeless.
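For reference, the relation being paraphrased is the ideal gas law:

$$ PV = N k_B T $$

where $P$ is pressure, $V$ volume, $T$ temperature, $N$ the number of particles, and $k_B$ Boltzmann's constant. At fixed $V$ and $N$, pressure is proportional to temperature, which is exactly the behavior described above.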

Looking at a list of neural network weights is analogous to looking at particle positions/velocities. This post claims there are quantities analogous to pressure/volume/temperature for a neural network (AFAICT it does not offer an intuitive description of what they are).

I did not write down the list of quantities because you need to go through the math to understand most of them. One very central object is the neural tangent kernel, but there are also algorithm projectors, universality classes, etc., each of which require a lengthy explanation that I decided to be beyond the scope of this post.

I think 3blue1brown's videos give a good first introduction to neural nets (the "atomic" description).

Does this help?