All of Ketil M's Comments + Replies

Paradoxical Advice Thread

Early bird gets the worm, but the second mouse gets the cheese. (From Steven Pinker, I think, not sure if it's original)

1jmh3y
Was thinking of the same comment. Heard it from my nephew when he was living in my house shortly after he graduated. FYI -- https://quoteinvestigator.com/2013/01/25/second-mouse/ -- the quip appears to go back to at least 1994.

Why Gradients Vanish and Explode

I think the problem with vanishing gradients is usually linked to repeated applications of the sigmoid activation function. The gradient in backpropagation is calculated from the chain rule, where each factor d\sigma/dz in the "chain" is always less than one (at most 1/4), and close to zero for inputs of large magnitude. So for feed-forward networks, the problem is a little different from recurrent networks, which you describe.
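
To spell that chain out for a toy feed-forward net with one unit per layer -- notation mine, added for illustration, with a^{(l)} = \sigma(z^{(l)}) and z^{(l)} = w_l a^{(l-1)}:

$$\frac{\partial L}{\partial a^{(0)}} = \frac{\partial L}{\partial a^{(L)}} \prod_{l=1}^{L} w_l\,\sigma'(z^{(l)}), \qquad \sigma'(z) = \sigma(z)\bigl(1-\sigma(z)\bigr) \le \tfrac{1}{4},$$

so unless the weights w_l grow to compensate, an L-layer chain picks up a factor of at most (1/4)^L, and the gradient at the early layers shrinks geometrically with depth.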

The usual mitigations are ReLU activations, L2 regularization, and/or batch normalization.
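
As a rough numerical illustration of the difference ReLU makes (a sketch I'm adding here, not anyone's actual experiment; it assumes a plain 1/sqrt(width) Gaussian init, and the init scale matters a lot), this NumPy snippet backpropagates a unit gradient through a deep feed-forward net and prints the gradient norm at the input for sigmoid vs. ReLU activations:

```python
# Sketch: how the input-layer gradient norm scales with depth for
# sigmoid vs. ReLU activations, using manual backprop in NumPy.
# Assumes an illustrative 1/sqrt(width) Gaussian init (not He/Glorot).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_grad_norm(depth, width=64, activation="sigmoid"):
    """Run `depth` dense layers forward, backprop a gradient of ones from
    the top, and return the norm of the gradient w.r.t. the input."""
    x = rng.standard_normal(width)
    weights = [rng.standard_normal((width, width)) / np.sqrt(width)
               for _ in range(depth)]
    pre_acts, a = [], x
    for W in weights:
        z = W @ a
        pre_acts.append(z)
        a = sigmoid(z) if activation == "sigmoid" else np.maximum(z, 0.0)
    g = np.ones(width)                      # dL/da at the top layer
    for W, z in zip(reversed(weights), reversed(pre_acts)):
        if activation == "sigmoid":
            s = sigmoid(z)
            local = s * (1.0 - s)           # sigma'(z) <= 1/4
        else:
            local = (z > 0).astype(float)   # ReLU': exactly 1 where active
        g = W.T @ (g * local)               # one chain-rule factor per layer
    return np.linalg.norm(g)

for depth in (5, 20, 50):
    for act in ("sigmoid", "relu"):
        print(f"depth={depth:3d} {act:8s} grad norm = "
              f"{input_grad_norm(depth, activation=act):.2e}")
```

With this init the sigmoid variant typically collapses toward zero within a few tens of layers, while the ReLU variant decays far more slowly; a He-style init (variance 2/width) or batch normalization would keep the ReLU chain closer to constant, which is part of why initialization and normalization get discussed alongside the choice of activation.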

A minor point: the gradient doesn't... (read more)

2Matthew Barnett3y
That's what I used to think too. :) If you look at the post above, I even linked to the reason why I thought that. In particular, the vanishing gradients problem was taught as intrinsically related to the sigmoid function on page 105 of these lecture notes [https://people.eecs.berkeley.edu/~jrs/papers/machlearn.pdf], which is where I initially learned about the problem.

However, I no longer think gradient vanishing is fundamentally linked to sigmoid or tanh activations. I think there is probably some confusion in terminology, and some people use the words differently than others. If we look in the Deep Learning Book [https://www.deeplearningbook.org/], there are two sections that talk about the problem, namely section 8.2.5 and section 10.7, neither of which brings up sigmoids as being related (though they do bring up deep weight-sharing networks). Goodfellow et al. cite Sepp Hochreiter's 1991 thesis as the original document describing the issue, but unfortunately it's in German so I cannot comment on whether it links the issue to sigmoids.

Currently, when I Ctrl-F "sigmoid" on the Wikipedia page for vanishing gradients [https://en.wikipedia.org/wiki/Vanishing_gradient_problem], there are no mentions. There is a single subheader which states, "Rectifiers [https://en.wikipedia.org/wiki/Rectifier_(neural_networks)] such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction." However, the citation for this statement comes from this paper [http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf], which mentions vanishing gradients only once and explicitly states, (Note: I misread the quote above -- I'm still confused). I think this is quite strong evidence that I was not taught the correct usage of vanishing gradients.

Interesting you say that. I actually wrote a post on rethinking batch normalization [https://www.lesswrong.com/posts/aLhuuNiLCrDCF5QTo/rethinking-batch-normalization], and I no longer think it