Produced As Part Of The SERI ML Alignment Theory Scholars Program 2022 Research Sprint Under John Wentworth
Two Deep Neural Networks with wildly different parameters can produce equally good results. Not only can a tweak to parameters leave performance unchanged, but in many cases, two neural networks with completely different weights and biases produce identical outputs for any input.
The motivating question:
Given two optimal models in a neural network's weight space, is it possible to find a path between them comprised entirely of other optimal models?
In other words, can we find a continuous path of tweaks from the first optimal model to the second without reducing performance at any point in the process?
Ultimately, we hope that the study of equivalently optimal models would lead to advances in interpretability: for example, by producing models that are simultaneously optimal and interpretable.
Introduction
Your friend has invited you to go hiking in the Swiss Alps. Despite extensive planning, you completely failed to clarify which mountain you were supposed to climb. So now you're both currently standing on different mountaintops Communicating via yodels, your friend requests that you hike over to the mountain he's already on; after all, he was the one who organised the trip.
How should you reach your friend? A straight line path would certainly work, but it's such a hassle. Hiking all the way back down and then up another mountain? Madness.
Looking at your map, you notice there's a long ridge that leaves your mountain at roughly the same height. Does this ridge lead to reach your friend's mountain and is it possible to follow this ridge to reach your friend?
Similarly, we can envision two optimal models in a neural net's weight space that share a "ridge" in the loss landscape. There is a natural intuition that we should be able to "walk" between these two points without leaving the ridge. Does this intuition hold up? And what of two neural net