LESSWRONG

[ Question ]

(When) do high-dimensional spaces have linear paths down to local minima?

by Thomas Kwa
22nd Apr 2022
1 min read
World Modeling · Frontpage

When training (to convergence) using gradient descent in high dimensions, it's common for there to be monotonically decreasing loss on the linear path between the initialization weights and the local minimum found, even if the actual path followed by training is highly nonlinear. Is this true for high-dimensional spaces in general?

  • Does the energy of a protein conformation decrease monotonically when you move on a linear path in the high-dimensional folding space (parameterized by the angles between all the amino acids) between the unfolded configuration and the folded configuration?
  • When a food is developed from a simpler food, does an intermediate food halfway in all ingredient and spice concentrations taste better than the simpler food?
  • It seems like when going from a random image to an image of a dog on a linear path, the image looks monotonically more like a dog.

If this doesn't usually happen, what's the underlying fact about the high-dimensional space which determines whether monotonically decreasing loss holds?
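One concrete way to probe the deep-learning version of the question is to evaluate the loss at points on the straight line between the initial and final weights and check whether it ever increases (this is sometimes called monotonic linear interpolation). Below is a minimal sketch of that check on a made-up non-convex toy loss; the loss function, dimension, learning rate, and stopping rule are arbitrary choices for illustration, not anything from the post:

    import numpy as np

    # Toy non-convex loss on R^n: a quadratic bowl plus a sinusoidal ripple.
    # This is only a stand-in for a training loss, chosen so the example
    # runs without any ML framework.
    def loss(w):
        return 0.5 * np.sum(w ** 2) + 0.3 * np.sum(np.sin(3 * w))

    def grad(w):
        return w + 0.9 * np.cos(3 * w)

    rng = np.random.default_rng(0)
    n = 1000                        # a "high-dimensional" parameter vector
    w_init = rng.normal(size=n)

    # Plain gradient descent until the gradient is small.
    w = w_init.copy()
    for _ in range(10_000):
        g = grad(w)
        if np.linalg.norm(g) < 1e-6:
            break
        w -= 0.05 * g
    w_final = w

    # Evaluate the loss along the straight line from w_init to w_final
    # and check whether it decreases monotonically.
    alphas = np.linspace(0.0, 1.0, 201)
    path_loss = np.array([loss((1 - a) * w_init + a * w_final) for a in alphas])
    print("loss(init) =", round(float(path_loss[0]), 3),
          " loss(final) =", round(float(path_loss[-1]), 3))
    print("monotone along the linear path:", bool(np.all(np.diff(path_loss) <= 1e-9)))

The same check applies to any of the settings above, so long as the objective can be evaluated at arbitrary points of the space.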

3 Answers (sorted by top scoring)

Razied

Apr 22, 2022*


This is certainly true for all convex loss landscapes in all dimensions, almost by definition of "convex". I don't think anyone understands very much about the properties of non-convex high-dimensional loss landscapes, but I can say that in the deep-learning case, where the loss does decrease monotonically on the linear path between the initialization and the end of training, the weights we obtain when we arbitrarily decide to stop gradient descent aren't anywhere close to a local minimum of the landscape. Basically all networks of any useful size get stuck at saddle points at best, or we stop training before they even get the chance to get stuck. So it might be the case that a linear path to an actual minimum would not have monotonic loss at all; it's just that high-dimensional spaces are so mindbogglingly large that we never explore far enough from the init point to have the chance to get behind a mountain in the landscape, so to speak.
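To spell out the convex case, since it takes one short step beyond the definition: restrict the loss to the segment from the starting point to a global minimizer. That restriction is a one-variable convex function whose minimum sits at the far endpoint, and such a function cannot increase anywhere along the segment. A sketch of the argument, assuming only that f is convex and x* is a global minimizer:

    % Restrict f to the segment from x_0 to x^*:
    \[
      g(t) = f\bigl((1-t)\,x_0 + t\,x^*\bigr), \qquad t \in [0,1].
    \]
    % g is convex (f composed with an affine map), and g(1) = f(x^*) \le g(s)
    % for every s, because x^* is a global minimizer.
    % For 0 \le s < t < 1, write t as a convex combination of s and 1:
    \[
      t = \lambda s + (1-\lambda) \cdot 1, \qquad \lambda = \frac{1-t}{1-s} \in (0,1),
    \]
    % so convexity of g together with g(1) \le g(s) gives
    \[
      g(t) \;\le\; \lambda\, g(s) + (1-\lambda)\, g(1) \;\le\; g(s)
    \]
    % (the case t = 1 is immediate). Hence g is non-increasing on [0,1]:
    % for convex f, the loss never rises along the straight line to the minimizer.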


Dennis Towne

Apr 22, 2022


Specifically for protein folding:  no, it does not decrease monotonically, unless you look at it from such a large distance that you can ignore thermal noise.

Proteins fold in a soup of water and other garbage, and for anything complicated there are going to be a lot of folding steps which are only barely above the thermal noise energy.  Some proteins may even "fold" by performing a near-perfect random walk until they happen to fall into a valley that makes escape unlikely.

There may even be folding steps which are slightly disfavored, e.g. steps that require energy from the environment.  Thermal noise can provide this energy for long enough that a second step can occur, leading to a more stable configuration.
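As a toy illustration of a slightly disfavored step that thermal noise can still pay for, here is a minimal Metropolis-style random walk on an invented one-dimensional energy landscape: a shallow starting basin, a small uphill bump, then a much deeper valley that is hard to escape. The landscape, temperature, and step size are arbitrary assumptions for illustration, not a model of any real protein:

    import numpy as np

    # Invented 1-D "folding coordinate": a shallow basin near x = 0, a small
    # uphill bump near x = 1 (the slightly disfavored step), and a deep,
    # hard-to-escape valley near x = 2.
    def energy(x):
        return (0.3 * x**2
                + 0.4 * np.exp(-25.0 * (x - 1.0) ** 2)    # small bump
                - 3.0 * np.exp(-4.0 * (x - 2.0) ** 2))    # deep valley

    rng = np.random.default_rng(1)
    kT = 0.25                  # thermal energy scale (arbitrary units)
    x = 0.0                    # start in the shallow "unfolded" basin
    xs = [x]
    for _ in range(50_000):
        x_new = x + rng.normal(scale=0.1)        # small random move
        dE = energy(x_new) - energy(x)
        # Metropolis rule: always accept downhill moves; accept uphill moves
        # with probability exp(-dE / kT), so thermal noise occasionally pays
        # for steps that cost a bit of energy.
        if dE <= 0 or rng.random() < np.exp(-dE / kT):
            x = x_new
        xs.append(x)

    xs = np.array(xs)
    print("fraction of time spent past the bump (x > 1.5):",
          round(float(np.mean(xs > 1.5)), 2))
    print("energy along the walk is monotone:",
          bool(np.all(np.diff(energy(xs)) <= 0)))

The walk gets past the bump only because uphill moves are sometimes accepted, and the energy along the actual trajectory is anything but monotone.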

Add Comment
tailcalled · 3y

The folding steps aren't linear paths though.


Razied · 3y

"gradient descent is not an optimization algorithm, it's an algorithm for finding local-minima-with-somewhat-monotone-paths."

If gradient descent isn't an "optimization algorithm", then I'm not sure what those words mean, unless you restrict "optimization" to only ever mean "global optimization". A systematic way of finding the local minima of a loss function is pretty much the definition of "optimization".
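For what it's worth, the "systematic way of finding local minima" picture is easy to see in the simplest possible case: plain gradient descent on a made-up one-dimensional non-convex function (all numbers below are arbitrary) ends in whichever minimum its starting point flows into, which need not be the global one:

    # Invented non-convex objective with two minima: a deeper (global) one
    # near x = -1.7 and a shallower local one near x = +1.5.
    def f(x):
        return 0.1 * x**4 - 0.5 * x**2 + 0.2 * x

    def df(x):
        return 0.4 * x**3 - x + 0.2

    def gradient_descent(x0, lr=0.05, steps=2000):
        x = x0
        for _ in range(steps):
            x -= lr * df(x)
        return x

    for x0 in (-3.0, 0.5, 3.0):
        x_end = gradient_descent(x0)
        print(f"start {x0:+.1f} -> local minimum near x = {x_end:+.2f}, f = {f(x_end):+.2f}")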

Cedar · 3y

I was trying to point attention to this fact:

"A systematic way of finding the local minima of a loss function"

But I thought about it a bit and I realized that I misunderstood the question, so I deleted my answer.

Comments (showing 2 of 4, sorted by top scoring)
ACrackedPot · 3y

There's a phenomenon in multidimensional motion called "gimbal locking", in which the number of effective dimensions decreases over time under motion, owing to local correlations between the dimensions; I believe it may be relevant here.

Cedar · 3y

https://www.lesswrong.com/posts/Hna2P8gcTyRgNDYBY/race-along-rashomon-ridge

This feels related. It also talks about paths in hyperparameter space, but instead of linear paths it considers paths between two optimal models that consist entirely of optimal models.
