> But if, for any parameter, there’s some probability p that it’s an inconvenient valley instead of a convenient hill, in order to get stuck you need to have ten thousand valleys.

I don't understand this part. Is this the probability of a given point in parameter space being an optimum (a min or a max)? Or is it the probability of a point being an optimum, given that it is a critical or near-critical point?

Instead of hills or valleys, the common argument seems to be that most critical points in deep neural networks are saddle points, and a fair amount of analysis has gone into what to do about that.

This paper argues that the issue is saddle points: https://arxiv.org/pdf/1406.2572.pdf. But given that it's been three years and the methods it proposes have not been widely adopted, I don't think saddle points are really that much of an issue in practice.
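
To make the saddle-point picture concrete, here's a toy sketch (my own illustration, not from the paper): plain gradient descent on f(x, y) = x^2 - y^2, which has a saddle at the origin. The coordinate that actually escapes the saddle only grows geometrically from whatever tiny value it starts with, which is why progress can stall for a long time:

```python
# Toy illustration (not from the cited paper): vanilla gradient descent near
# the saddle point of f(x, y) = x^2 - y^2 at the origin.
def grad(x, y):
    return 2 * x, -2 * y  # df/dx, df/dy

x, y, lr = 1.0, 1e-6, 0.1
for step in range(50):
    gx, gy = grad(x, y)
    x -= lr * gx
    y -= lr * gy
    if step % 10 == 0:
        print(f"step {step:2d}: x={x:.6f}, y={y:.6f}")

# x shrinks quickly toward the saddle, but y (the direction that actually
# escapes it) grows only geometrically from its tiny starting value, so the
# iterates look "stuck" near the origin for many steps.
```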

Most of what goes into modern gradient descent (e.g. momentum) tends to steamroll over a lot of these problems. Distill has a beautiful set of interactive explanations of how and why momentum affects the gradient descent process here: https://distill.pub/2017/momentum/. I'd highly recommend checking it out.
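
For reference, the update itself is tiny. Here's a rough sketch of SGD with classical (heavy-ball) momentum, the variant the Distill article visualizes; the names and hyperparameters (`grad_fn`, `beta`, etc.) are just mine, not from any particular library:

```python
import numpy as np

# Minimal sketch of SGD with heavy-ball momentum. Illustrative names and
# hyperparameters, not from any particular library.
def sgd_momentum(grad_fn, w0, lr=0.01, beta=0.9, steps=100):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)       # velocity: a decaying sum of past gradients
    for _ in range(steps):
        g = grad_fn(w)
        v = beta * v - lr * g  # accumulate velocity; beta controls how much history is kept
        w = w + v              # step along the accumulated direction, not just the local gradient
    return w

# Example: minimize a narrow quadratic valley f(w) = 0.5 * w @ A @ w.
A = np.diag([1.0, 50.0])
w_star = sgd_momentum(lambda w: A @ w, w0=[1.0, 1.0])
print(w_star)  # should end up close to the minimum at the origin
```

The accumulated velocity is also what lets the iterates coast through flat regions and saddle points instead of slowing to a crawl the way plain gradient descent does above.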

Additionally, for many deep neural network problems we explicitly don't want the global optimum! Reaching it usually corresponds to dramatically overfitting the training distribution/dataset.

For what it's worth, I found the "re-implement backprop" exercise extremely useful in developing a gears-level model of what's going on under the hood.

Andrej Karpathy's "A Hacker's Guide to Neural Networks" is really good, and I think it focuses on building a good intuitive understanding of what's going on: https://karpathy.github.io/neuralnets/
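
If it helps, here's roughly the shape of the exercise in Python (Karpathy's guide itself works in JavaScript): forward and backward passes through a single sigmoid neuron, with the chain rule written out by hand rather than handled by an autograd library. The specific numbers are just illustrative:

```python
import math

# Forward and backward pass for a single sigmoid neuron
# f(x, y) = sigmoid(a*x + b*y + c), with gradients computed by hand.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass (illustrative parameter and input values).
a, b, c = 1.0, 2.0, -3.0
x, y = -1.0, 3.0
z = a * x + b * y + c
out = sigmoid(z)

# Backward pass: chain rule applied by hand.
dout_dz = out * (1 - out)  # derivative of the sigmoid at z
da = dout_dz * x           # d(out)/da = d(out)/dz * dz/da
db = dout_dz * y
dc = dout_dz * 1.0
dx = dout_dz * a
dy = dout_dz * b

print(out, da, db, dc, dx, dy)
```

Writing this out for a few gates and then composing them is most of what the exercise asks for, and it makes the "gradient flowing backward through the graph" picture very concrete.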

I've also found Coursera and other MOOCs somewhat watered down in the past, but YMMV.