Introduction

For a SERI MATS research sprint under John Wentworth, teams were given the following research prompt:

Everyone's assignment will be to figure out how to operationalize the "broad peaks induce data compression" argument. Here's the intuitive argument: I train some neural net to reproduce the training data. Conceptually, the trained net has to "store" all the relevant information from the dataset in its weights. If it "compresses the data", then it should be able to store the data using fewer weights, so more weights are free to vary without increasing loss much. "Lots of weights left free to vary without increasing loss much" = broad peaks. Or, to put it differently: there are more parameter-values for which the net reproduces the training data using a more compressed representation, so training is more likely to find a more compressed representation.

So, the assignment is to figure out how to operationalize things like:

  • "the representation of the data" (or the embedding of that representation in the weights)
  • what it means for the representation to "use" some parameter-degrees-of-freedom but not others (bear in mind that "weights" might not be the right ontology, e.g. maybe it uses some directions-in-parameter-space but not others, or maybe something else entirely)
  • what it means for two different parameter-settings of the net to "use the same representation" (or, more generally, to "do the same thing" internally)

When operationalizing this, we're looking for three main things:

  • Testability: the operationalization should be in terms of things which you can directly look for in a trained net (possibly relying on some approximations). This is mainly to make sure the operationalization stays grounded.
  • Generalizability: the operationalization should be clearly generalizable to new kinds of systems which don't look like the kinds of systems we're currently working with (maybe imagine biological systems). We're not looking for a narrow ad-hoc model, here.
  • Provability: using the operationalization, we should be able to turn the intuitive argument into a mathematical proof (up to approximation/events of high probability). We should be able to argue mathematically that training probably finds more compressed representations.

Our team decided to tackle what it means for two different parameter-settings of the net to use the same representation, since this seemed like the most likely route to formalizing the full argument, and since the straightforward approach (using Hessians and basin volumes to look at peak broadness) would be what most other teams would do.

Summary

A motivating example for our first attempt to operationalize "doing the same thing internally" was the Rubik's cube. Rubik's cubes are solved using set algorithms developed to get certain pieces into certain positions without disturbing the already-solved portions. Two algorithms are similar when you can make a simple change to one and get the other. One example: instead of solving the white face first and the yellow face last, you can solve the yellow face first and the white face last. Another: whenever the original algorithm makes a 90-degree clockwise turn, you instead make three 90-degree counter-clockwise turns.

The primary idea was that two things do the same thing in the same way when you could easily predict the kinds of operations one of them performs given knowledge about the other.

For neural networks, we thought this notion arose via symmetries in parameter space. There are functions you can apply to a network's parameters which leave its outputs the same. Networks resulting from the application of simple functions should be "doing the same thing" in a meaningful sense, while networks resulting from the application of very complex functions have no guarantee of doing the same thing. A simple function would be rearranging the neurons and their corresponding weights; a complex function would be training a new network on the outputs of the given one.
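
As a minimal sketch of the simple case (a small fully connected ReLU network with illustrative sizes, not one from our experiments), permuting the hidden neurons, along with the corresponding rows of the first weight matrix and columns of the second, leaves the outputs unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer MLP: y = W2 @ relu(W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(8, 5)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def forward(W1, b1, W2, b2, x):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

# A "simple" symmetry: permute the hidden neurons, i.e. the rows of W1/b1
# and the corresponding columns of W2.
perm = rng.permutation(8)
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=5)
print(forward(W1, b1, W2, b2, x))     # original network
print(forward(W1p, b1p, W2p, b2, x))  # permuted network: same output
```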

At this point, we were very naive in the ways of information theory, and thought to ourselves: 'hey, we're saying there are functions between networks, right? Well, if there's an invertible function between them, then their mutual information should be just as large as their total information, so we can just measure the mutual information between two networks!' Unfortunately for us, this either doesn't make any sense, or makes sense only if you're willing to train several billion networks on several trillion different tasks (both estimates of the true cost).

So instead we thought more about the basic properties we'd like these algorithm-translation functions to have. When you apply such a function to a Rubik's cube algorithm and turn every right-hand turn into three left-hand turns, it has the property that after executing the three alternative steps, the cube is in the same state as it would have been after executing the original step. So one might think the key lies in looking for steps in the computational process which return the same result. However, this notion doesn't capture the idea that if you start solving the cube from the yellow face rather than the white face, but otherwise use the same algorithm, then even though this feels like the same way of solving the cube, at no point do the two algorithms put you in the same state except at the very beginning and very end. There isn't even a particularly simple function translating the intermediate states of the white-first method into the intermediate states of the yellow-first method.

This is when we started to lose our way a little bit. It might have been a good idea to spend a few more days developing the above theory and then run an experiment, but we thought we might be making some earlier mistake in reasoning, so we decided to try to implement an algorithm which matches similar computational states to each other. We reasoned that there should be a bijection between similar parts of similar computational states, and that the extent of this bijection could be approximated by looking at the mutual information between (in the case of neural networks) different activation layers.

We also thought that because it's easy to understand the fundamental structure of simple functions, and hard to find the fundamental structure of complex functions, a network trained on a simple function should "store" little of the data and more of the fundamental structure, while a network trained on a complex function should be more liable to store all of the data, or at least implement something equivalent to a lookup table on the data. Thus, a simple function can be done in many different ways, since it can be implemented using many different methods, while a complex function can be done in essentially only one way (a lookup table).

So we chose our simple function: return the first bit of a 5-bit sequence. And our complex function: return a random (but consistent) bit after seeing a 5-bit sequence. Then we trained 20 networks on each function and plotted various mutual-information measures of the networks' activations, considering two activation vectors the same if their L1 distance, divided by the number of entries, was less than 1e-5[1].
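
A rough sketch of this kind of setup (the architecture, optimizer, and hyperparameters below are illustrative placeholders rather than our exact configuration):

```python
import torch
import torch.nn as nn
from itertools import product

torch.manual_seed(0)

# All 32 possible 5-bit inputs.
X = torch.tensor(list(product([0.0, 1.0], repeat=5)))

# Simple task: the label is the first bit.
y_simple = X[:, :1].clone()
# Complex task: a random but fixed label for each of the 32 inputs.
y_complex = torch.randint(0, 2, (32, 1)).float()

def train_net(y, steps=5000, lr=1e-2):
    net = nn.Sequential(nn.Linear(5, 16), nn.ReLU(),
                        nn.Linear(16, 16), nn.ReLU(),
                        nn.Linear(16, 1))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(net(X), y).backward()
        opt.step()
    return net

simple_nets  = [train_net(y_simple)  for _ in range(20)]
complex_nets = [train_net(y_complex) for _ in range(20)]
```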

We figured the layers in the networks trained on the complex function would have more similar information content than the layers in the networks trained on the simple function, and we chose two methods of testing how true this was (a sketch of the entropy estimate both methods rely on follows the list).

  1. Graph the total information contained in each layer of each simple network, and on a separate graph the information contained in each layer of each complex network
  2. Compare the mutual information between corresponding layers of the simple networks and corresponding layers of the complex networks
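
Both methods rest on a discrete entropy / mutual-information estimate over bucketed activations. Here is a sketch of how such an estimate can be computed, assuming the tolerance-based equality described above (the function names and structure are ours, not the exact code we used):

```python
import numpy as np

def bucket(activations, tol=1e-5):
    """Label activation vectors, treating two as equal when their L1 distance
    divided by the number of entries is below tol."""
    labels, reps = [], []
    for a in activations:
        for j, r in enumerate(reps):
            if np.mean(np.abs(a - r)) < tol:
                labels.append(j)
                break
        else:
            labels.append(len(reps))
            reps.append(a)
    return np.array(labels)

def entropy(labels):
    """Discrete entropy (in bits) of a label sequence."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(labels_a, labels_b):
    """I(A;B) = H(A) + H(B) - H((A,B)) over the bucketed labels."""
    joint = np.unique(np.stack([labels_a, labels_b], axis=1),
                      axis=0, return_inverse=True)[1]
    return entropy(labels_a) + entropy(labels_b) - entropy(joint)
```

Method 1 is then entropy(bucket(layer_acts)) for each layer's activations over all 32 inputs, and method 2 is mutual_information applied to the bucketed activations of corresponding layers from two networks (the same machinery gives MI with the input, by bucketing the inputs themselves).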

The problem with this method, which we only realized after actually trying it, was that it tests a proxy for the theory using a proxy for the predicted result.

Here, our theory was that certain classes of functions for mapping algorithms onto each other represent ways you can change a particular algorithm without changing what it is fundamentally doing, and our proxy for this was the information content within and between activation layers. Our anticipated result was that algorithms which do similar things will often, but not always, have similar computational states, while our proxy for this prediction was that there are more ways to copy the first bit of an input than there are ways to memorize input-output pairs.

I said before that this is where we lost our way. This is because there were plausibly more dubious reasoning steps in this experimental setup than in our original theoretical reasoning, and we were running the experiment to verify that our original theoretical reasoning was valid!

Our results are evidence against the theory, but plausibly stronger evidence against our ability to come up with high-information-value experiments. Further evidence for the latter is that after the experiment, our overall reaction was confusion about where to go next. At the time we took this to mean our original theory was incorrect, but we were still stuck.

Each of us then used the babble technique to generate a list of ten possible new directions (three lists in total), and Ian had the idea that we could use interpretability-like techniques to look at whether similar sets of neurons have high activations after feeding in similar types of inputs. This was similar to our previous information measures, but seemed like it would give richer data and correspondences. The experiments for this approach have been written, but we did not have time to analyze the generated data.

Mutual information details

The idea is that if the mutual information between the activations of two layers of a neural network is high, then they are representing the same information. The more layers in two networks that represent the same information, the more similar the programs implemented by the networks are, since they pass through congruent intermediate representations.

In particular, we compute the mutual information between two layers X and Y, I(X;Y), and the total information stored in those two layers, H((X,Y)) = H(X) + H(Y) - I(X;Y), and take the ratio

I(X;Y) / H((X,Y)),

which we call the information ratio.

If this ratio is exactly 1, then there is a bijection between the activations and they represent the same ‘state’ in the computation. If the ratio is close to 1, then we expect the two sets of activations to represent approximately the same computational state, since they share almost all the same information.

The similarity score we propose for networks A and B is the mean information ratio across corresponding layers, (1/L) Σ_ℓ I(A_ℓ ; B_ℓ) / H((A_ℓ, B_ℓ)), where ℓ ranges over the L layers.

(Notably this similarity score isn't at its maximum when comparing a network to itself, so while we used it as a test method, we were not satisfied with it.)
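
Concretely, here is a sketch of the information ratio and the mean similarity score computed over already-bucketed layer activations (the handling of the all-constant edge case is our own convention; the write-up does not specify it):

```python
import numpy as np
from collections import Counter

def H(labels):
    """Discrete entropy (in bits) of a sequence of hashable labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_ratio(labels_x, labels_y):
    """I(X;Y) / H((X,Y)) for two (already bucketed) activation layers."""
    joint = list(zip(labels_x, labels_y))
    h_joint = H(joint)
    if h_joint == 0:
        return 1.0  # both layers constant; an assumption, not from the post
    mi = H(labels_x) + H(labels_y) - h_joint
    return mi / h_joint

def similarity_score(layers_a, layers_b):
    """Mean information ratio over corresponding layers of two networks."""
    return float(np.mean([information_ratio(a, b)
                          for a, b in zip(layers_a, layers_b)]))
```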

Mutual information based results

Mutual information with the input

It is hard to take too much away from these graphs, but it seems that the simple networks have the potential to throw away more information than the complex ones.

Figure: MI with input data of the layers of 200 nets trained on the simple dataset (first bit is the label).
Figure: MI with input data of the layers of 200 nets trained on the complex dataset (random labels). Note that the minimum MI is higher than for simple (axes are not even).
Figure: MI with input data of the layers of 200 nets trained on the simple dataset, represented with the 20th, 40th, 60th, and 80th quantiles at each layer.
Figure: MI with input data of the layers of 200 nets trained on the complex dataset, represented with the 20th, 40th, 60th, and 80th quantiles at each layer.

Mutual information between corresponding layers of different neural networks

Similarly to the previous section, the minimum amount of information shared between the complex networks in their final layers is slightly higher than for the simple nets, and may on average be a few bits higher (although we would need to check this more to be sure). There also seems to be structure in this data, indicating the metric captures an interesting aspect of computation, but the graphs for the complex function look messier than those for the simple function, possibly providing evidence that our original theory was wrong, although it's really hard to tell which way this points.

Figure: MI with corresponding layers of 20 nets trained on the simple dataset (first bit is the label).
Figure: MI with corresponding layers of 20 nets trained on the complex dataset (random labels). Note that the minimum MI is higher than for simple (axes are not even).
Figure: Similarity scores of 30 networks with one reference network, all trained on the simple task.
Figure: Similarity scores of 30 networks with one reference network, all trained on the complex task.

Explanation of the interpretability idea

After we got into a bit of an idea rut, we decided to each generate 10 dumb ideas using the babble technique, with the hope that one of them would turn out not to be dumb. Ian came up with the idea of using mechanistic-interpretability-like techniques to distinguish between the functions a network may be running, and this seemed the most promising idea at the time, since it would give us an independent check on any particular notion of two networks 'doing the same thing'. The idea is as follows.

In general, we can't expect particular neurons to correspond to particular, nicely defined functions. For example, the results of the nicely defined functions might be written along arbitrary directions in activation space rather than along the particular set of orthogonal directions represented by the individual neurons. This makes it difficult to tell, in general, which particular, nicely defined functions the network is implementing.

However, if we only have binary inputs, and we limit our focus to the first layer of the network, we know that this layer can't be implementing anything more complex than an xor function. More generally, we can break our network up into simple pieces for which we can place an upper bound on the complexity of the functions they could be implementing. Then we can list all of the functions a piece could be implementing and look at which neurons take which values. This gives us a map from 'functions the network could be implementing' to neuron values.
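
A sketch of what this matching could look like for a single first-layer neuron with 5-bit binary inputs; the candidate set, the restriction to pairs of input bits, and the thresholding at zero are illustrative assumptions rather than our actual methodology:

```python
import numpy as np
from itertools import product, combinations

X = np.array(list(product([0, 1], repeat=5)))  # all 32 binary inputs

# Candidate "nicely defined" functions of a pair of input bits.
CANDIDATES = {
    "and": lambda a, b: a & b,
    "or":  lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "copy_first": lambda a, b: a,
}

def label_neuron(neuron_activations, threshold=0.0):
    """Return the candidate functions (and input-bit pairs) whose truth table
    matches the neuron's thresholded activations on all 32 inputs."""
    fired = (np.asarray(neuron_activations) > threshold).astype(int)
    matches = []
    for i, j in combinations(range(5), 2):
        for name, f in CANDIDATES.items():
            if np.array_equal(fired, f(X[:, i], X[:, j])):
                matches.append((name, i, j))
    return matches
```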

Now, suppose we hand-code several nice networks, so that each neuron implements a very cleanly defined function, and so that these networks encompass all the possible 'ways' one could compute a given function. By optimizing inputs for those functions (or, equivalently, for the neurons, since the networks are nice), and then feeding those inputs into a different, possibly messy, network, we can gain insight into what kinds of functions each activation layer in the messy network is implementing.
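
A sketch of the 'optimize inputs on the clean network, then probe the messy network' step, assuming PyTorch nn.Sequential networks; clean_net, messy_net, the layer indexing, and the optimization settings are placeholders:

```python
import torch

def optimize_input_for_neuron(net, layer, neuron, dim=5, steps=500, lr=0.1):
    """Gradient-ascend on a continuous (fuzzy-bit) input so that it maximally
    activates one neuron of a hand-coded 'clean' network."""
    x = torch.rand(dim, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        h = x
        for i, module in enumerate(net):
            h = module(h)
            if i == layer:
                break
        (-h[neuron]).backward()      # maximize the neuron's activation
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)       # keep the input in the fuzzy-bit range [0, 1]
    return x.detach()

# Probe a messy network with an input optimized on the clean one, e.g.:
# x_star = optimize_input_for_neuron(clean_net, layer=0, neuron=3)
# with torch.no_grad():
#     messy_acts = messy_net[:2](x_star)  # activations after the first layer + ReLU
```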

We decided to test how fruitful this approach was by training many networks to compute a fuzzy-logic version of a ⊕ b,[2][3] but we were only able to train 2 before the project completion date arrived, and we never got to develop any form of methodology we could scale up to larger networks.

Individual reflections

Matthias

  • We experimented without having developed much of a theory (only a little) of what we expected the results to look like, or of how we would update if the results turned out a certain way.
  • The similarity measure seemed (and turned out to be) arbitrary.
  • There are a lot of variables in the experiments, like when to consider a net to have 0 loss (less than 1e-3, 1e-5, or 1e-7?). These choices were made pretty randomly, but it could very well be the case that they are very important, especially for the information measures. We did not have a back-and-forth about these approximations between the theory work and the experiment work.

Ian

  • We moved away from the original prompt to focus on a related question, which we hoped would be tractable and allow us to pursue a different line of investigation than other groups. However, it feels like this new question was somewhat disconnected from the original prompt and left the space of possibilities extremely broad such that it was hard to focus on a specific approach.
  • We came up with some promising-sounding ideas fairly early on but rushed to test them rather than think through the implications of the theory or find counter-examples. We didn’t understand our experimental set-up well enough to disentangle evidence about our theory from details of the set-up (such as the specific datasets and architectures used, the effect of the tolerance used to decide when two things were equal, the number of runs we would need to extract signal from the noise even if our set-up was solid). 
  • Developing the theory to the point that it made very concrete predictions about small set-ups with very few degrees of freedom would have helped us to validate that we were on the right track.
  • For all the experiments, having a better sense of how we would update given the evidence collected after an experiment would have been useful.
  • This is a general problem I have – if all the experiments you run give inconclusive, messy, noisy results, how are you supposed to draw conclusions and make progress? Often it feels like small experiments with clear outcomes are not representative enough to be informative, and larger experiments rarely point in a clear direction.

Garrett

  • Basically the same as Ian, except for the first point. I stand by our decision to take the question in the direction we took it, and if we had chosen a different direction, I'd want us to try something like operationalizing what it means for data to be encoded in a network, rather than what it means for two networks to be doing the same thing.
  1. ^

    We played around with a bunch of measures, and decided on the one which had the invariance H((X,Y)) = H(X) + H(Y) - I(X;Y).

  2. ^

    ⊕ is the xor binary operator, for those unfamiliar with this notation.

  3. ^

    This fuzzy-logic version being one where , and where .
