Produced As Part Of The SERI ML Alignment Theory Scholars Program 2022 Research Sprint Under John Wentworth
The "broad basins induce data compression" argument goes like this: If we train a neural network to memorize some data, it has to "store" all the relevant information in its weights. If it "compresses the data", then it should be able to store the data using fewer weights, and leave more weights free to vary without increasing the loss. This means that around the optima we found, there should be a broad flat basin in the loss landscape.
We spent eight days trying to make this argument more specific and clear. We made some progress but probably none of it is new to people who have studied this area.
- Rashomon dimension experiments: Our most interesting result is that most of the time the highest dimension basin is not the one found, it tends to most commonly find a slightly lower dimension basin.
- PAC-Bayes operationalization: An attempt at finding the “true name” of the Rashomon set dimension, which ends up being the KL divergence between the “prior” and “posterior”, for a particular definition of those words.
One common idea that we keep coming back to is that we can think of “using some parameter-degrees of freedom” as the degrees of freedom that don’t change the loss on the training set. The set of hypotheses that has zero loss is called the Rashomon set, and we know that every hypothesis in the Rashomon set has learned every training data label. Hence, the area of this set is a nice operationalization of the degrees of freedom that aren’t relevant to memorizing the training labels.
One way of operationalizing the amount of “compression” the network has done is to measure the dimensionality of the Rashomon set, and subtract that from the total number of parameters. This vaguely gives us “the number of parameters the network is using to store the data”. We try to generalize this notion in the last section.
We define the set of points in parameter space that end up in the same Rashomon set after training, assuming gradient flow, as the "attractor well" of that Rashomon set.
Visualizing the basin
We started with some experiments with the following goals:
- Qualitatively understanding the shape of basins
- Understanding the relative likelihoods of each basin.
The following graph is a visualization of two randomly sampled axes through the loss basin of a 9 parameter network trained on an XOR dataset (which is just 4 points:).
The next graph does the same visualization, except slicing the space through the two eigenvectors that correspond to the lowest eigenvalues.
This graph is significantly flatter, because moving in the low-eigenvalue directions lets us remain in the Rashomon set.
We also put together a program that finds paths between connected minima. It does this by making steps in the 0-eigenspace of the Hessian, allowing it to trace a path between networks in the same Rashomon set by projecting the difference between their parameters into the 0-eigenspace of the Hessian, and then stepping in that direction. This requires computing the eigendecomposition of the Hessian, which takes with our implementation, but can be improved up to , where n is the number of parameters and k is the number of eigenvalues we need. We have been able to find these paths in the XOR experiment by discretizing the space and doing a depth-first search of this Rashomon set.
The image below is another way of visualizing the loss landscape around the minima, by taking the minima and nudging it in a bunch of random directions, measuring the loss at all those points and then projecting the set of parameterizations into 2d space using tSNE. It is useful in that it shows us that a large proportion of the random directions give the same loss.
Broad basin experiment
On the XOR dataset, Thomas set up experiments that measured the number of almost-zero eigenvalues of the Hessian of the loss at the minima. Each graph below shows the distribution of near-zero eigenvalues over 150 training runs with different random initializations (the weights were initialized with a Gaussian).
The number of near zero eigenvalues of the Hessian are a measure for the dimensionality of the Rashomon set, a manifold in the loss landscape with constant loss. This dimensionality is one way of defining the broadness of the basin in the loss landscape.
We were expecting the broad basins to be much more likely, due to their increased dimensionality in the space.We expect higher dimensional basins to tend to have higher dimensional “attractor wells”, which means that the highest dimensional basins should attract most of the points in the initialization distribution. So we would predict most of the basins found to have the maximum possible dimension allowable while still fitting the data points.
Loss attractor wells with dimension smaller than the total number of parameters seem like they would usually have 0 measure. The intuition for this is that the higher dimensional analogue of a line has 0 measure in a 2D space or a square has 0 measure in a 3D space.
However, we observe a much more symmetrical distribution than we predicted, as you can see in the graphs above — the mode is a little lower than the maximum possible dimension, and the number of maximum dimension minima found decreases as we increase the width of the network.
There are several possible explanations for this:
- The pushforward can sometimes just push lots of mass onto a lower-dimensional subset of parameter space, if there is a higher dimensional attractor well to the lower dimensional hessian.
- There might be more symmetries (in parameter space) of the slightly lower dimensional basins, meaning there are more of these basins available.
- This may be related to the asymptotic equipartition property?
- The randomness in the free parameters tends to give us high rank Hessians, which correspond to low rank 0-eigenspaces.
Below are synthetically generated plots of example loss landscapes to illustrate the dimensionality of attractor wells. On the left, we have just plotted . This Rashomon set is given in red, and it is the points for which (as these are the points for which is minimized. This is a one dimensional set, but it has a two dimensional attractor basin.
On the right, we have plotted if , 0 if , and if . This is a two dimensional set, but it still has a 2d attractor set.
PAC-Bayes inspired operationalization
Let’s now try to find a better version of the “dimension of the Rashamon set” measure of information storage. One way we could try to do it is using the ratio of the volume of the Rashomon set compared to the entire hypothesis space. However, “volume” of the Rashamon set itself isn’t the right concept, because usually the Rashamon set has a lower dimension than the parameter space it is embedded in, and hence it has zero volume. In some directions it is “flat”, hence overall it has 0 volume. To resolve this, we’ll have to use a different measure of “volume”.
Types of “compression”
When trying to be more specific about what we are talking about, it’s useful to make the following distinction. There are two types of “compression” that we need to consider:
- Program length complexity of a specific function, i.e. Kolmogorov Complexity of a function, given some Turing machine U. This is equivalent to the message length required to specify a particular outcome, given some prior distribution over outcomes. It is the negative log of the prior probability of a discrete outcome.
- Vagueness complexity, i.e. when you have a set of functions and it doesn’t matter which one is chosen. Intuitively, this allows you to be less specific with a description of the function, which makes it more compressible. We can measure this with the size of the set, or more generally, with the entropy of a distribution over the set.
Trying to fit this into the PAC-Bayes framework:
One natural way of combining these two aspects of compression is to take the KL divergence between the posterior and the prior (defined below), with the posterior being the distribution that captures the “vagueness” of which function we want to specify. This works nicely because the KL divergence is the expected number of bits required to encode a hypothesis from the posterior (using a code optimized for the prior) minus the entropy of the posterior, which we conjecture is a good measure of the number of bits we don’t need to specify.
To define a posterior distribution with neural networks, we can consider a distribution over the hypothesis space that tells us how likely a particular hypothesis is to be chosen by the training procedure, given a particular dataset. This is the pushforward measure of the initialization distribution across the training procedure. We can call this the “posterior”.
We can define a “prior” as some distribution over hypotheses that captures our beliefs about what hypothesis will be chosen, before we have observed any training data.
We conjecture that the KL divergence between the prior and posterior distributions is analogous to subtracting the number of dimensions of the Rashomon set from the total number of parameters. KL(Q||P) = CrossEntropy(Q,P) - H(Q), where if P is uniform, the number of bits required to encode any sample in the posterior CrossEntropy(Q,P) is proportional to the number of parameters. H(Q) is the entropy of the posterior, which would be proportional to the number of dimensions of P if P was uniform.
One piece of evidence that suggests that the KL divergence is the right way to measure this is that this term is present in the PAC-Bayes generalization bound, which has been used to prove non-vacuous (~10x the empirical error) guarantees on the generalization performance of overparameterized neural networks (Dziugaite & Roy, 2017). Here’s an example of the basic form of a PAC-Bayes bound on the expected test loss:
This operationalization (using KL to compare prior and posterior) gives us a more general definition of what we mean by the parameters “using” some degrees of freedom and not others, which can be applied to situations where we have multiple disconnected basins, or connected basins with different dimensionality.
Another thing that this operationalization gives us is the sense in which a larger “posterior” implies a shorter description length. Unfortunately, it is kind of a vacuous argument, because we have to define “shorter description length” using the implicit prior distribution. But it does reframe the problem to one of understanding the implicit prior, which we might do via understanding the biases of the parameter -> function space map.
One thing that is missing from this operationalization is why neural networks usually generalize well, even though they are sampling from the posterior instead of marginalizing over it. We would expect this to be the case if most hypotheses in the posterior generalized in a similar way (e.g. had similar prior probability), however it’s unclear whether this is true.
Another thing that is missing is a better description of the “prior”. It seems clear that there is an implicit bias in the parameter -> function space mapping or in gradient descent, which biases neural networks to tend to output certain types of functions. We will call this bias the implicit “prior” of NNs. Some evidence suggests that the details of the optimization algorithm are irrelevant to the generalization of neural networks (Video, related paper: Goldstein et al, 2020), so we expect that the bias is mostly in the parameter space -> function space mapping.
So we conjecture that we can define the implicit “prior” of neural networks as an almost uniform distribution over parameter space, with some limit on the distance from initialization. We might also use a normal distribution centered at zero. We know that this implies a highly non-uniform distribution over functions, because in the infinite width limit, the distribution is a Gaussian Process with mean zero and kernel function that depends on the architecture.
Together, the above paragraphs gives us the prediction that the posterior has most of the mass in wide basins, with the set of hypotheses in a given wide basin being dominated by smooth simple hypotheses kind of like those sampled from the Gaussian Process approximation due to the parameter -> function mapping.