When training (to convergence) using gradient descent in high dimensions, it's common for there to be monotonically decreasing loss on the linear path between the initialization weights and the local minimum found, even if the actual path followed by training is highly nonlinear. Is this true for high-dimensional spaces in general?

  • Does the energy of a protein conformation decrease monotonically when you move on a linear path in the high-dimensional folding-space (parameterized by the angles between all the amino acids) between the unfolded configuration and the folded configuration?
  • When a food is developed from a simpler food, does an intermediate food halfway in all ingredient and spice concentrations taste better than the simpler food?
  • It seems like, when going from a random image to an image of a dog along a linear path, the image looks monotonically more like a dog.

If this doesn't usually happen, what's the underlying fact about the high-dimensional space which determines whether monotonically decreasing loss holds?

3 Answers

This is certainly true for all convex loss landscapes in all dimensions, almost by definition of "convex": restricting a convex function to the segment from the init point to its minimizer gives a one-dimensional convex function whose minimum sits at the far endpoint, so it can't go up anywhere along the way. I don't think anyone understands very much about the properties of non-convex high-dimensional loss landscapes, but I can say that in the deep learning case (monotonically decreasing loss on the linear path between the initialization and the end of training), the weights we obtain when we arbitrarily decide to stop gradient descent aren't anywhere close to a local minimum of the landscape. Basically all networks of any useful size get stuck at saddle points at best, or we stop training before they even get the chance to get stuck. So it might be the case that a linear path to the actual minimum would not have monotonic loss at all; it's just that high-dimensional spaces are so mind-bogglingly large that we never explore far enough from the init point to have the chance to get behind a mountain in the landscape, so to speak.
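For anyone who wants to check the deep-learning version of this directly, here is a minimal sketch of the linear-interpolation experiment in PyTorch. The names `model_init`, `model_final`, `loss_fn`, and `data_loader` are placeholders you'd supply yourself (nothing from the thread), and the snippet assumes `loss_fn` returns a per-batch mean:

```python
import copy
import torch

def interpolation_losses(model_init, model_final, loss_fn, data_loader, steps=21):
    """Average loss along the line theta(a) = (1 - a) * theta_init + a * theta_final."""
    theta_init = torch.nn.utils.parameters_to_vector(model_init.parameters()).detach()
    theta_final = torch.nn.utils.parameters_to_vector(model_final.parameters()).detach()

    probe = copy.deepcopy(model_final)  # scratch copy to load interpolated weights into
    probe.eval()
    losses = []
    with torch.no_grad():
        for alpha in torch.linspace(0.0, 1.0, steps):
            torch.nn.utils.vector_to_parameters(
                (1 - alpha) * theta_init + alpha * theta_final, probe.parameters()
            )
            total, n = 0.0, 0
            for x, y in data_loader:
                total += loss_fn(probe(x), y).item() * len(y)  # assumes loss_fn returns a mean
                n += len(y)
            losses.append(total / n)
    return losses
```

If the returned list is (roughly) decreasing, that reproduces the phenomenon in the question; there's no guarantee it will be for an arbitrary model, or for a run stopped at a different point.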

Specifically for protein folding: no, it does not decrease monotonically, unless you look at it from such a large distance that you can ignore thermal noise.

Proteins fold in a soup of water and other garbage, and for anything complicated there are going to be a lot of folding steps that are only barely above the thermal noise energy. Some proteins may even "fold" by performing a near-perfect random walk until they happen to fall into a valley that makes escape unlikely.

There may even be folding steps that are slightly disfavored, e.g., steps that require energy from the environment. Thermal noise can provide this energy for long enough that a second step can occur, leading to a more stable configuration.
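To put a rough number on "thermal noise can provide this energy": the relative likelihood of taking an uphill step scales like the Boltzmann factor exp(-ΔE/RT). A quick sketch with an illustrative, made-up barrier of 2 kcal/mol at roughly body temperature:

```python
import math

R_KCAL = 1.987e-3   # gas constant, kcal/(mol*K)
T = 310.0           # roughly body temperature, K
delta_e = 2.0       # hypothetical mildly disfavored step, kcal/mol (illustrative, not measured)

p_uphill = math.exp(-delta_e / (R_KCAL * T))
print(f"Boltzmann factor for the uphill step: {p_uphill:.3f}")  # ~0.04
# Rare per attempt, but thermal motion supplies attempts constantly, so mildly
# disfavored steps still happen on folding timescales.
```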

The folding steps aren't linear paths, though.

Gradient descent is not an optimization algorithm; it's an algorithm for finding local-minima-with-somewhat-monotone-paths.

If gradient descent isn't an "optimization algorithm", then I'm not sure what those words mean, unless you restrict "optimization" to only ever mean "global optimization". A systematic way of finding the local minima of a loss function is pretty much the definition of "optimization".
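For concreteness, here's plain gradient descent doing exactly that, i.e. systematically walking downhill to whichever local minimum is nearby. The one-dimensional function, step size, and starting points below are arbitrary choices for illustration:

```python
# Plain gradient descent on f(x) = x**4 - 3*x**2 + x, a non-convex function
# with two local minima. Which one we land in depends only on where we start;
# the update rule just walks downhill.
def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

print(gradient_descent(1.0))   # -> roughly  1.13 (the higher local minimum)
print(gradient_descent(-1.0))  # -> roughly -1.30 (the lower, global minimum)
```

Starting on different sides of the hump lands you in different minima; nothing in the update rule knows or cares which one is globally best.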

I was trying to draw attention to this fact:

A systematic way of finding the local minima of a loss function

But I thought about it a bit and I realized that I misunderstood the question, so I deleted my answer.

2 comments

There's a phenomenon in multidimensional motion called "gimbal locking", in which the number of effective dimensions decreases over time under motion owing to local correlations between the dimensions, which I believe may be relevant here.
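For readers who haven't run into the term: in the usual rotation-space sense, gimbal lock is when an Euler-angle parameterization loses a degree of freedom because two rotation axes line up. A small numpy sketch (ZYX convention chosen just for illustration):

```python
import numpy as np

def rot_x(a):  # roll
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):  # pitch
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):  # yaw
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def euler_zyx(yaw, pitch, roll):
    return rot_z(yaw) @ rot_y(pitch) @ rot_x(roll)

# At 90 degrees of pitch, yaw and roll rotate about the same axis, so only
# their difference matters: one degree of freedom has disappeared.
a = euler_zyx(yaw=0.3, pitch=np.pi / 2, roll=0.1)
b = euler_zyx(yaw=0.4, pitch=np.pi / 2, roll=0.2)
print(np.allclose(a, b))  # True
```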

https://www.lesswrong.com/posts/Hna2P8gcTyRgNDYBY/race-along-rashomon-ridge

This feels related. It also talks about paths in parameter space, but instead of linear paths it considers paths that consist entirely of optimal models, connecting two optimal models.