When training (to convergence) using gradient descent in high dimensions, it's common for there to be monotonically decreasing loss on the linear path between the initialization weights and the local minimum found, even if the actual path followed by training is highly nonlinear. Is this true for high-dimensional spaces in general?

  • Does the energy of a protein conformation decrease monotonically when you move on a linear path in the high-dimensional folding-space (parameterized by the angles between all the amino acids) between the unfolded configuration and the folded configuration?
  • When a food is developed from a simpler food, does an intermediate food halfway in all ingredient and spice concentrations taste better than the simpler food?
  • It seems like, when going from a random image to an image of a dog along a linear path, the image looks monotonically more like a dog.

If this doesn't usually happen, what's the underlying fact about the high-dimensional space which determines whether monotonically decreasing loss holds?

3 Answers

This is certainly true for all convex loss landscapes in all dimensions, almost by definition of "convex": restricting a convex function to the segment from the init point to its minimizer gives a one-dimensional convex function whose minimum sits at the far endpoint, so it can't go up anywhere along the way. I don't think anyone understands very much about the properties of non-convex high-dimensional loss landscapes, but I can say that in the deep learning case (monotonically decreasing loss on the linear path between the initialization and the end of training), the weights we obtain when we arbitrarily decide to stop gradient descent aren't anywhere close to a local minimum of the landscape. Basically all networks of any useful size get stuck at saddle points at best, or we stop training before they even get the chance to get stuck. So it might be the case that a linear path to the actual minimum would not have monotonic loss at all; it's just that high-dimensional spaces are so mind-bogglingly large that we never explore far enough from the init point to have the chance to get behind a mountain in the landscape, so to speak.
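For anyone who wants to check the deep-learning version of this directly, here is a minimal sketch of the linear-interpolation experiment in PyTorch. The names `model_init`, `model_final`, `loss_fn`, and `data_loader` are placeholders you'd supply yourself (nothing from the thread), and the snippet assumes `loss_fn` returns a per-batch mean:

```python
import copy
import torch

def interpolation_losses(model_init, model_final, loss_fn, data_loader, steps=21):
    """Average loss along the line theta(a) = (1 - a) * theta_init + a * theta_final."""
    theta_init = torch.nn.utils.parameters_to_vector(model_init.parameters()).detach()
    theta_final = torch.nn.utils.parameters_to_vector(model_final.parameters()).detach()

    probe = copy.deepcopy(model_final)  # scratch copy to load interpolated weights into
    probe.eval()
    losses = []
    with torch.no_grad():
        for alpha in torch.linspace(0.0, 1.0, steps):
            torch.nn.utils.vector_to_parameters(
                (1 - alpha) * theta_init + alpha * theta_final, probe.parameters()
            )
            total, n = 0.0, 0
            for x, y in data_loader:
                total += loss_fn(probe(x), y).item() * len(y)  # assumes loss_fn returns a mean
                n += len(y)
            losses.append(total / n)
    return losses
```

If the returned list is (roughly) decreasing, that reproduces the phenomenon in the question; there's no guarantee it will be for an arbitrary model, or for a run stopped at a different point.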

Specifically for protein folding: no, it does not decrease monotonically, unless you look at it from such a large distance that you can ignore thermal noise.

Proteins fold in a soup of water and other garbage, and for anything complicated there are going to be a lot of folding steps that are only barely above the thermal noise energy. Some proteins may even "fold" by performing a near-perfect random walk until they happen to fall into a valley that makes escape unlikely.

There may even be folding steps that are slightly disfavored, e.g., steps that require energy from the environment. Thermal noise can provide this energy for long enough that a second step can occur, leading to a more stable configuration.
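To put a rough number on "thermal noise can provide this energy": the relative likelihood of taking an uphill step scales like the Boltzmann factor exp(-ΔE/RT). A quick sketch with an illustrative, made-up barrier of 2 kcal/mol at roughly body temperature:

```python
import math

R_KCAL = 1.987e-3   # gas constant, kcal/(mol*K)
T = 310.0           # roughly body temperature, K
delta_e = 2.0       # hypothetical mildly disfavored step, kcal/mol (illustrative, not measured)

p_uphill = math.exp(-delta_e / (R_KCAL * T))
print(f"Boltzmann factor for the uphill step: {p_uphill:.3f}")  # ~0.04
# Rare per attempt, but thermal motion supplies attempts constantly, so mildly
# disfavored steps still happen on folding timescales.
```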

The folding steps aren't linear paths, though.

Gradient descent is not an optimization algorithm; it's an algorithm for finding local-minima-with-somewhat-monotone-paths.

If gradient descent isn't an "optimization algorithm", then I'm not sure what those words mean, unless you restrict "optimization" to only ever mean "global optimization". A systematic way of finding the local minima of a loss function is pretty much the definition of "optimization".
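For concreteness, here's plain gradient descent doing exactly that, i.e. systematically walking downhill to whichever local minimum is nearby. The one-dimensional function, step size, and starting points below are arbitrary choices for illustration:

```python
# Plain gradient descent on f(x) = x**4 - 3*x**2 + x, a non-convex function
# with two local minima. Which one we land in depends only on where we start;
# the update rule just walks downhill.
def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

print(gradient_descent(1.0))   # -> roughly  1.13 (the higher local minimum)
print(gradient_descent(-1.0))  # -> roughly -1.30 (the lower, global minimum)
```

Starting on different sides of the hump lands you in different minima; nothing in the update rule knows or cares which one is globally best.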

I was trying to draw attention to this fact:

A systematic way of finding the local minima of a loss function

But I thought about it a bit and I realized that I misunderstood the question, so I deleted my answer.

2 comments

There's a phenomenon in multidimensional motion called "gimbal locking", in which the number of effective dimensions decreases over time under motion owing to local correlations between the dimensions, which I believe may be relevant here.
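For readers who haven't run into the term: in the usual rotation-space sense, gimbal lock is when an Euler-angle parameterization loses a degree of freedom because two rotation axes line up. A small numpy sketch (ZYX convention chosen just for illustration):

```python
import numpy as np

def rot_x(a):  # roll
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):  # pitch
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):  # yaw
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def euler_zyx(yaw, pitch, roll):
    return rot_z(yaw) @ rot_y(pitch) @ rot_x(roll)

# At 90 degrees of pitch, yaw and roll rotate about the same axis, so only
# their difference matters: one degree of freedom has disappeared.
a = euler_zyx(yaw=0.3, pitch=np.pi / 2, roll=0.1)
b = euler_zyx(yaw=0.4, pitch=np.pi / 2, roll=0.2)
print(np.allclose(a, b))  # True
```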

https://www.lesswrong.com/posts/Hna2P8gcTyRgNDYBY/race-along-rashomon-ridge

This feels related. It also talks about paths in parameter space, but instead of linear paths it considers paths that consist entirely of optimal models, connecting two optimal models.