Race Along Rashomon Ridge

Peter S. Park; MichaelEinhorn

Various takes on this research direction, from someone who hasn't thought about it very much:

Given two optimal models in a neural network’s weight space, is it possible to find a path between them comprised entirely of other optimal models?

I think that the answer here is definitely no, without further assumptions. For example, consider the case where you have a model with two ReLU neurons that, for optimal models, are doing different things; the model where you swap the two neurons is equally performant, but there's no continuous path between these models.

More concretely, imagine that we want to learn (y1, y2) = (ReLU(x1), ReLU(x2)) with a model of the form

def model(x1, x2):
    h1 = ReLU(a * x1 + b * x2)
    h2 = ReLU(c * x1 + d * x2)
    return (e * h1 + f * h2, g * h1 + h * h2)

where a through h are parameters. In Pytorch we'd write this as W2 @ relu(W1 @ x), for W1 and W2 of shape [2, 2].

We can model the true function perfectly by either using the first neuron for x1 and the second for x2 or vice versa, but there's no continuous path between them.

(As a simpler example, consider the task of choosing parameters a and b such that a*b = 1. There's no continuous path from the (1, 1) solution to the (-1, -1) solution.)

Plausibly you could choose some further assumptions that make this true. You gesture at this a little:

However, when the class of models is sufficiently overparametrized (i.e., large-scale neural net architecture), we eventually have just one connected ridge of local minima: the ridge of global minima.

But IMO a whole lot of the work here is going to be choosing a definition of "sufficiently overparameterized" such that the theorem isn't easily counterexampled.

And from a relevance-to-alignment standpoint, I think it's pretty likely that if you succeed at finding a definition of "sufficiently overparameterized" such that this is true, that definition will require that the model is sufficiently "larger" than the task it's modeling that it will seem implausible that the theorem will apply in the real world (given that I expect the real world to be "larger" than AGIs we train, and so don't expect AGIs to be wildly overparameterized).

[-]Buck3y62

Another way of phrasing my core objection: the original question without further assumptions seems equivalent to "given two global minima of a function, is there a path between them entirely comprised of global minima", which is obviously false.

[-]Stephen Fowler3y20

Hey Buck, love the line of thinking.

We definitely aren't trying to say "any two ridges can be connected for any loss landscape" but rather that the ridges for overparameterised networks are connected.

[-]Thomas Larsen3y10

TL;DR: I agree that the answer to the question above definitely isn't always yes, because of your counterexample, but I think that moving forward on a similar research direction might be useful anyways.

One can imagine partitioning the parameter space into sets that arrive at basins where each model in the basin has the same, locally optimal performance, this is like a Rashomon set (relaxing the requirement from global minima so that we get a partition of the space). The models which can compress the training data (and thus have free parameters) are generally more likely to be found by random selection and search, because the free parameters means that the dimensionality of this set is higher, and hence exponentially more likely.

Thus, we can move within these high-dimensional regions of locally optimal loss, which could allow us to find more interpretable (or maybe more desirable along another axis), which is the stated motivation in the article:

Ultimately, we hope that the study of equivalently optimal models would lead to advances in interpretability: for example, by producing models that are simultaneously optimal and interpretable.

This seems super relevant to alignment! The default path to AGI right now to me seems like something like a LLM world model hooked up to some RL to make it more agenty, and I expect this kind of theory applied to LLMs, because of the large number of parameters. I'm hoping that this theory gets us better predictions on which Rashomon sets are found (this would look like a selection theorem), and the ability to move within a Rashomon set towards parameters that are better. Such a selection theorem seems likely because of the dimensionality argument above.

[-]Peter S. Park3y*21

Edit: Adding a link to "Git Re-Basin: Merging Models modulo Permutation Symmetries," a relevant paper that has recently been posted on arXiv.

Thank you so much, Thomas and Buck, for reading the post and for your insightful comments!

It is indeed true that some functions have two global minimizers that are not path-connected. Empirically, very overparametrized models which are trained on "non-artificial" datasets ("datasets from nature"?) seem to have a connected Rashomon manifold. It would definitely be helpful to know theoretically why this tends to happen, and when this wouldn't happen.

One heuristic argument for why two disconnected global minimizers might only happen in "artificial" datasets might go something like this. Given two quantities, one is larger than the other, unless there is a symmetry-based reason why they are actually secretly the same quantity. Under this heuristic, a non-overparametrized model's loss landscape has a global minimum achieved by precisely one point, and potentially some suboptimal local minima as well. But overparametrizing the model makes the suboptimal local minima not local minima anymore (by making them saddle points?) while the single global minimizer is "stretched out" to a whole submanifold. This "stretching out" is the symmetry; all optimal models on this submanifold are secretly the same.

One situation where this heuristic fails is if there are other types of symmetry, like rotation. Then, applying this move to a global minimizer could get you other global minimizers which are not connected to each other. In this case, "modding out by the symmetry" is not decreasing the dimension, but taking the quotient by the symmetry group which gives you a quotient space of the same dimension. I'm guessing these types of situations are more common in "artificial" datasets which have not modded out all the obvious symmetries yet.

[+][comment deleted]3y20

[-]DaemonicSigil3y40

In your example, I think even adding just one more node, h3, to the hidden layer would suffice to connect the two solutions. One node per dimension of input suffices to learn the function, but it's also possible for two nodes to share the task between them, where the share of the task they are picking up can vary continuously from 0 to 1. So just have h3 take over x2 from h2, then h2 takes over x1 from h1, and then h1 takes over x2 from h3.

[-]Buck3y30

Yeah, but you have the same problem if you were using all three of the nodes in the hidden layer.

[-]Quintin Pope3y10

Actual trained neural networks are known to have redundant parameters, as demonstrated by the fact that we can prune them so much.

[-]Lech Mazur3y10

Have you seen this paper? They find that "SGD solutions are connected via a piecewise linear path, and the increase in loss along this path vanishes as the number of neurons grows large."

[-]Charlie Steiner3y74

Interesting stuff. Prunability might not be great for interpretability - you might want something more like sparsity, or alignment of the neuron basis with specified human-interpretable classifications of the data.

[-]Peter S. Park3y20

Thanks so much, Charlie, for reading the post and for your comment! I really appreciate it.

I think both ways to prune neurons and ways to make the neural net more sparse are very promising steps towards constructing a simultaneously optimal and interpretable model.

I completely agree that alignment of the neuron basis with human-interpretable classifications of the data would really help interpretability. But if only a subset of the neuron basis are aligned with human-interpretability, and the complement comprises a very large subset of abstractions (which, necessarily, people would not be able to learn to interpret), then we haven't made the model interpretable.

Suppose 100% is the level of interpretability we need for guaranteed alignment (which I am convinced of, because even 1% uninterpretability can screw you over). Then low-dimensionality seems like a necessary, but not sufficient condition for intepretability. It is possible, but not always true, that each of a small number of abstractions will either already familiar to people or can be learned by people in a reasonable amount of time.

[-]Algon4mo40

Looking at your map, you notice there's a long ridge that leaves your mountain at roughly the same height. Does this ridge to reach your friend's mountain? possible to follow this ridge to reach your friend's mountain?

"ridge to" -> "ridge",
"possible" to "Is it possible".

However, when the class of models is sufficiently overparametrized (i.e., large-scale neural net architecture), we suspect that eventually have just one connected ridge of local minima: the ridge of global minima. The dimension of this single ridge is smaller than the dimension of the whole weight space, but still quite high-dimensional. Intuitively, the large number of linearly independent zero-loss directions at a global minimum allow for many opportunities to path-connect towards a local minimum while staying within the ridge, making all local minima globally minimal.

"that eventually have" -> "they eventually have",

"path-connect towards" - > "path connect to".

Considering getting a linter for your browser, like https://languagetool.org, to avoid these sorts of errors in future.

One potential choice of a subset of interpretable models $W_{interpretable}$ which is geometrically "nice'' is the submanifold of models which is prunable in a certain way. For example, the submanifold defined by the system of equations $w_{j} = w_{j + 1} = \dots = w_{k} = 0$ , where $w_{j}, \dots, w_{k}$ are the output connections of a given neuron $n$ , is comprised of models which can be pruned by removing neuron number $n$ . Thus, we may be able to maximally prune a neural net (with respect to the given dataset) by using an algorithm of the following form:
Find in the Rashomon manifold an optimal model that can be pruned in some way.
Prune the neural net in that way.
Repeat Steps 1 and 2 until there are no simultaneously prunable and optimal models anymore.

This is clever, and also seems like the core idea of the post. You should put this info at the start.

[-]tgb3y42

This is a nice write-up, thanks for sharing it.

It seems like it's possible to design models that always have this property. For any model M, consider two copies of it, M' and M''. We can construct a third model N that is composed of M' and M'' plus a single extra weight p (with 0 <= p <= 1). Define N to have output by running the input on M' and M'' and choose the result of M' with probability p and otherwise the result of M''. Any N achieves minimal loss if and only if it's one of the following cases:

M' and M'' are both individually optimal and 0 < p < 1,
M' is optimal and p = 1 (M'' arbitrary), or
M'' is optimal and p = 0 (M' arbitrary).

Then to get a path between any optimal weights, we just move to p = 0, modify M'' as desired, then move to p=1 and modify M' as desired, then set p as desired. I think there's a few more details than this; it probably depends upon some properties of the loss function, for example, that it doesn't depend upon the weights of M only the output.

Unfortunately this doesn't help us at all in this case! I don't think it's any more interpretable than M alone. So maybe this example isn't useful.

Also, don't submersions never have (local or global) minima? Their derivative is surjective so can never be zero. Pretty nice-looking loss functions end up not having manifolds for their minimizing sets, like x^2y^2. I have a hard time reasoning about whether this is typical or atypical though. I don't even have an intuition for why the global minimizer isn't (nearly always) just a point. Any explanation for that observation in practice?

[-]harfe3y33

Also, don't submersions never have (local or global) minima?

I agree, and believe that the post should not have mentioned submersions.

Pretty nice-looking loss functions end up not having manifolds for their minimizing sets, like x^2 y^2. I have a hard time reasoning about whether this is typical or atypical though. I don't even have an intuition for why the global minimizer isn't (nearly always) just a point. Any explanation for that observation in practice?

I agree that the typical function has only one (or zero) global minimizer. But in the case of overparametrized smooth Neural Networks it can be typical that zero loss can be achieved. Then the set of weights that lead to zero loss is typically a manifold, and not a single point. Some intuition: Consider linear regression with more parameters than data. Then the typical outcome is a perfect match of the data, and the possible solution space is a (possibly high-dimensional) linear manifold. We should expect the same to be true for smooth nonlinear neural networks, because locally they are linear.

Note that the above does not hold when we e.g. add a l2-Penalty for the weights: Then I expect there to be a single global minimizer in the typical case.

[-]Peter S. Park3y30

Thank you so much for this suggestion, tgb and harfe! I completely agree, and this was entirely my error in our team's collaborative post. The fact that the level sets of submersions are nice submanifolds has nothing to do with the level set of global minimizers.

I think we will be revising this post in the near future reflecting this and other errors.

(For example, the Hessian tells you what the directions whose second-order penalty to loss are zero, but it doesn't necessarily tell you about higher-order penalties to loss, which is something I forgot to mention. A direction that looks like zero-loss when looking at the Hessian may not actually be not actually be zero-loss if it applies, say, a fourth-order penalty to the loss. This could only be probed by a matrix of fourth derivatives. But I think a heuristic argument suggests that a zero-eigenvalue direction of the Hessian should almost always be an actual zero-loss direction. Let me know if you buy this!)

[-]Tom Lieberum3y10

If the subset of interpretable models is also "nice" in the differential-geometric sense (say, also a smooth submanifold of $W$ ), then the intersection $W_{optimal} \cap W_{interpretable}$ is also similarly "nice."

Do you have any intuition for why we should expect $W_{interpretable}$ to be "nice"? I'm not super familiar with differential geometry but I don't really see why this should be the case..