LDL 2: Nonconvex Optimization

[-]Ben Pace8y70

(Note: to post images, hit the 'palette' icon in the text toolbar, and drag some image boxes in. They don't naturally go inline, you'll have to add a textbox *after* them to continue writing after them.)

[-]magfrump8y20

Thanks for this explanation, but tbh I'm way too lazy to add them in now.

[-]Neuroff8y20

I am sad. I really wanted to see the pictures. (I don't fully get the post as it is.)

[-]magfrump8y20

Edit: pictures added, though in retrospect I'm not sure that they really add that much to the post.

Fair enough; if your comment is at +5 or more by Monday I'll go back and figure out the formatting.

[-]roystgnr8y40

This argument doesn't seem to take into account selection bias.

We don't get into a local optimum because we picked a random point and wow, it's a local optimum, what are the odds!?

We get into a local optimum because we used an algorithm that specifically *finds* local optima. If they're still there in higher dimensions then we're still liable to fall into them rather than into the global optimum.

[-]magfrump8y20

The point is that they really are NOT still there in higher dimensions.

[-]roystgnr8y20

Oh, well in that case the point isn't subtly lacking, it's just easily disproven. Given any function from I^N to R, I can take the tensor product with cos(k pi x) and get a new function from I^{N+1} to R which has k times as many non-globally-optimal local optima. Pick a decent k and iterate, and you can see the number growing exponentially with higher dimension, not approaching 0.
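The construction can be checked numerically on a small case. The sketch below is only an illustration under assumptions not in the comment: the base function `f`, the grid size `N`, and `K = 6` are hypothetical choices, `f` is kept negative everywhere so the sign bookkeeping stays simple, and only interior grid points are counted. Under those assumptions the number of local minima is multiplied by the number of interior maxima of the cosine, which grows linearly with k.

```python
import math

def f(x):
    # Hypothetical 1-D base function on [0, 1]: negative everywhere,
    # with two interior local minima of different depth.
    return -3.0 + math.cos(4.0 * math.pi * x) + 0.5 * x

def count_minima_1d(values):
    # Strict local minima at interior grid points.
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] < values[i + 1]
    )

def count_minima_2d(fx, cy):
    # Strict local minima of g(x, y) = f(x) * cos(K*pi*y) at interior
    # grid points, compared against the four axis neighbours.
    count = 0
    for i in range(1, len(fx) - 1):
        for j in range(1, len(cy) - 1):
            g = fx[i] * cy[j]
            if (g < fx[i - 1] * cy[j] and g < fx[i + 1] * cy[j]
                    and g < fx[i] * cy[j - 1] and g < fx[i] * cy[j + 1]):
                count += 1
    return count

N = 300  # grid intervals per axis (assumption)
K = 6    # frequency of the cosine factor (assumption)
grid = [i / N for i in range(N + 1)]
fx = [f(x) for x in grid]
cy = [math.cos(K * math.pi * y) for y in grid]

minima_1d = count_minima_1d(fx)
minima_2d = count_minima_2d(fx, cy)
print(minima_1d, minima_2d)
```

Repeating the product with a fresh coordinate multiplies the count again, which is the exponential growth the comment describes. Nothing here says such functions resemble the ones deep learning is applied to.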

Perhaps there's something special about the functions we try to optimize in deep learning, a property that rules out such cases? That could be. But you've said nothing special about (or even defined) a particular class of deep learning problems, rather you've made a claim about all higher dimensional optimization problems, a claim which has an infinite number of counterexamples.

[-]magfrump8y20

I definitely intended the implied context to be 'problems people actually use deep learning for,' which does impose constraints which I think are sufficient.

Certainly the claim I'm making isn't true of literally all functions on high dimensional spaces. And if I actually cared about all functions, or even all continuous functions, on these spaces, then I believe there are no-free-lunch theorems that prevent machine learning from being effective at all (e.g. what about those functions that have a vast number of huge oscillations right between those two points you just measured?!)

But in practice deep learning is applied to problems that humans care about. Computer vision and robotics control problems, for example, are very widely used. In these problems there are some distributions of functions that empirically exist, and a simple model of those types of problems is that they can be locally approximated over an area with positive size by a Taylor series at any point of the domain that you care about, but these local areas are stitched together essentially at random.

In that context, it makes sense that maybe the directional second derivatives of a function would be independent of one another and rarely would they all line up.
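One way to get a feel for "rarely would they all line up" is a toy random-matrix experiment. This is a sketch under a strong assumption that is not in the comment: the Hessian at a critical point is modelled as a random symmetric Gaussian matrix, and the code estimates how often all of its curvatures are positive (a true local minimum). The function names, seed, and trial count are hypothetical.

```python
import math
import random

def random_symmetric(n, rng):
    # Symmetrise an n x n matrix of independent standard normals.
    m = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    return [[0.5 * (m[i][j] + m[j][i]) for j in range(n)] for i in range(n)]

def is_positive_definite(a):
    # Attempt a Cholesky factorisation; it succeeds exactly when the
    # symmetric matrix is positive definite (all curvatures positive).
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True

def fraction_minima(n, trials=2000, seed=0):
    # Fraction of sampled "Hessians" that are valley-like in every direction.
    rng = random.Random(seed)
    hits = sum(is_positive_definite(random_symmetric(n, rng)) for _ in range(trials))
    return hits / trials

fractions = {n: fraction_minima(n) for n in (1, 2, 3, 4)}
print(fractions)
```

In this toy model the fraction falls off quickly as the dimension grows, so almost every sampled critical point is a saddle. Whether real loss surfaces behave like this model is exactly the point under dispute in the thread.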

Beyond that I'd expect that if you impose a measure on the space of such functions in some way (maybe limiting by number of patches and growth rate of power series coefficients) that the density of functions with even one critical point would quickly approach zero, even while infinitely many such functions exist in an absolute sense.

I got a little defensive thinking about this, since I felt like the context of 'deep learning as it is practiced in real life' was clear, but looking back at the original post it maybe wasn't outlined in that way. Even so, I think your reply feels disingenuous because you're explicitly constructing adversarial examples rather than sampling functions from some space to suggest that functions with many local optima are "common." If I start suggesting that deep learning is robust to adversarial examples I have much deeper problems.

[-]whales8y40

Hm. Thinking of this in terms of the few relevant projects I've worked on, problems with (nominally) 10,000 parameters definitely had plenty of local minima. In retrospect it's easy to see how. Saddles could be arbitrarily long, where many parameters basically become irrelevant depending on where you're standing, and the only way out is effectively restarting. More generally, the parameters were very far from independent. Besides the saddles, for example, you had rough clusters of parameters where you'd want all or none but not half to be (say) small in most situations. In other words, the problem wasn't "really" 10,000-dimensional; we just didn't know how or where to reduce dimensionality. I wonder how common that is.

[-]whales8y10

Two more thoughts: the above is probably more common in [what I intuitively think of as] "physical" problems where the parameters have some sort of geometric or causal relationship, which is maybe less meaningful for neural networks?

Also, for optimization more broadly, your constraints will give you a way to wind up with many parameters that can't be changed to decrease your function, without requiring a massive coincidence. (The boundary of the feasible region is lower-dimensional.) Again, I guess not something deep learning has to worry about in full generality.

[-]α8y40

But if, for any parameter, there’s some probability p that it’s an inconvenient valley instead of a convenient hill, in order to get stuck you need to have ten thousand valleys.

I don't understand this part. Is this the probability of a given point in parameter space being an optimum (min or max)? Is this the probability of a point, given that it is a critical or near-critical point, being an optimum?

Instead of hills or valleys, it seems like the common argument is in favor of most critical points in deep neural networks being saddle points, and a fair amount of analysis has gone into what to do about that.

This paper (https://arxiv.org/pdf/1406.2572.pdf) argues that the issue is saddle points, but given that it's been three years and those methods have not been widely adopted, I don't think it's really that much of an issue.

Most of how modern gradient descent is accomplished (e.g. momentum) tends to steamroll over a lot of these problems. Distill has a beautiful set of interactive explanations on how and why momentum affects the gradient descent process here: https://distill.pub/2017/momentum/. I'd highly recommend checking it out.
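For readers who haven't seen it written out, here is a minimal sketch of heavy-ball momentum next to plain gradient descent. The test problem is an assumption chosen for simplicity (a badly conditioned convex bowl, not a neural network loss), and the step size, `beta`, and step count are hypothetical values. It only illustrates how accumulated velocity speeds progress along a flat direction.

```python
def grad(w):
    # Gradient of f(x, y) = 0.5 * (x^2 + 100 * y^2).
    x, y = w
    return (x, 100.0 * y)

def loss(w):
    x, y = w
    return 0.5 * (x * x + 100.0 * y * y)

def run(beta, steps=200, lr=0.01):
    # beta = 0.0 is plain gradient descent; beta > 0 adds momentum.
    w = (1.0, 1.0)
    v = (0.0, 0.0)
    for _ in range(steps):
        g = grad(w)
        # Velocity is a decaying sum of past gradients.
        v = (beta * v[0] + g[0], beta * v[1] + g[1])
        w = (w[0] - lr * v[0], w[1] - lr * v[1])
    return loss(w)

loss_plain = run(beta=0.0)
loss_momentum = run(beta=0.9)
print(loss_plain, loss_momentum)
```

With these settings the momentum run ends at a much lower loss, because the flat x direction is where plain gradient descent crawls.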

Additionally, for many deep neural network problems we explicitly don't want the global optimum! Reaching it usually corresponds to dramatically overfitting the training distribution/dataset.

[-]magfrump8y20

Instead of hills or valleys, it seems like the common argument is in favor of most critical points in deep neural networks being saddle points

I agree, the point of the digression is that a saddle point is a hill in one direction and a valley in the other.

The point is that because it's a hill in at least one direction, a small perturbation (like the change in your estimate of a cost function from one mini-batch to the next) gets you out of it, so it's not a problem.
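A toy version of this escape can be sketched on the textbook saddle f(x, y) = x^2 - y^2. Everything specific here is an assumption for illustration: the function, the Gaussian noise standing in for mini-batch variation, the step size, and the seed. The function is unbounded below, so "escaping" only means leaving the neighbourhood of the saddle.

```python
import random

def saddle(x, y):
    # Valley along x, hill along y, critical point at the origin.
    return x * x - y * y

def descend(noise, steps=100, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Start exactly on the ridge that leads into the saddle.
    x, y = 1.0, 0.0
    for _ in range(steps):
        gx, gy = 2.0 * x, -2.0 * y
        if noise > 0.0:
            # Perturb the gradient estimate, as a stand-in for mini-batch noise.
            gx += rng.gauss(0.0, noise)
            gy += rng.gauss(0.0, noise)
        x -= lr * gx
        y -= lr * gy
    return x, y

x_clean, y_clean = descend(noise=0.0)
x_noisy, y_noisy = descend(noise=1e-3)
print(saddle(x_clean, y_clean), saddle(x_noisy, y_noisy))
```

Without noise the iterate slides straight into the saddle and stays there, since the gradient along y is exactly zero. With even tiny noise, the y component is amplified at every step and the iterate rolls off the hill.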

Is this the probability of a point, given that it is a critical or near-critical point, being an optimum?

There, p is the probability that, at a given near-critical point, the function is valley-like rather than hill-like in a given direction. If any direction is hill-like you can roll down it, so to get stuck you need the critical point to be valley-like in every direction. It's a stupid computation that isn't actually well defined (the probability I'm estimating is dumb, and I'm only considering one critical point when I should be asking how many points are "near critical" and factoring that in, among other things), so don't worry too much about it!
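Taken at face value, the back-of-the-envelope arithmetic looks like this. The sketch assumes the directions are independent, which is the simplification the comment itself flags as not well defined, and the values of p are arbitrary examples. Logarithms are used because p^n is far too small for a float when n is ten thousand.

```python
import math

def log10_prob_all_valleys(p, n):
    # log10 of p**n: the chance that all n independent directions are valleys.
    return n * math.log10(p)

n = 10_000
for p in (0.5, 0.9, 0.99):
    print(p, log10_prob_all_valleys(p, n))

log_half = log10_prob_all_valleys(0.5, n)
log_high = log10_prob_all_valleys(0.99, n)
```

Even with p = 0.99, the result is a probability with dozens of zeros after the decimal point, which is the intuition behind "you need ten thousand valleys".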

[-]Elo8y20

Excellently explained! I previously had limited knowledge around all of these areas but now I understand why it's unlikely for deep learning to get stuck in local peaks.
