Ketil M
Comments

Paradoxical Advice Thread
Ketil M · 6y · 10

The early bird gets the worm, but the second mouse gets the cheese. (From Steven Pinker, I think; not sure if it's original.)

Why Gradients Vanish and Explode
Ketil M · 6y · 40

I think the problem with vanishing gradients is usually linked to repeated applications of the sigmoid activation function. The gradient in backpropagation is calculated from the chain rule, where each factor d\sigma/dz in the "chain" is always less than one (at most 1/4), and close to zero for large positive or negative inputs. So for feed-forward networks, the problem is a little different from the recurrent networks you describe.
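
For concreteness, here is a minimal numpy sketch (the depth and the pre-activations are made up) of how the chain-rule product of these d\sigma/dz factors shrinks with depth:

```python
# Toy illustration: the product of sigmoid derivatives across many layers
# becomes vanishingly small.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # always <= 0.25, close to 0 for large |z|

rng = np.random.default_rng(0)
depth = 30                      # hypothetical network depth
z = rng.normal(size=depth)      # made-up pre-activations, one per layer
factors = sigmoid_deriv(z)      # the d(sigma)/dz terms in the chain
print("first few factors:", factors[:5])
print("product over all layers:", np.prod(factors))  # vanishingly small
```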

The usual mitigation is to use ReLU activations, L2 regularization, and/or batch normalization.
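
As an illustration, a minimal PyTorch sketch (the layer sizes are arbitrary) combining those three mitigations:

```python
# Hypothetical feed-forward net using ReLU and batch normalization; L2
# regularization is applied via the optimizer's weight_decay parameter.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),  # keeps pre-activations in a well-scaled range
    nn.ReLU(),            # derivative is 1 on the active side, so it doesn't shrink the chain
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# weight_decay adds an L2 penalty on the weights during optimization.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```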

A minor point: the gradient doesn't necessarily become small as you approach a local minimum; that depends on the higher-order derivatives (and on how smooth the loss is there). Imagine a local minimum at the bottom of a funnel or spike, for instance, or a very spiky, fractal-like landscape. On the other hand, a local minimum in a region with a small gradient is a desirable property, since it means small perturbations in the input data don't change the output much. But such a point will be difficult to reach, since learning depends on the gradient...
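
A toy 1-D illustration (the losses are made up): near a smooth minimum such as x^2 the gradient shrinks as you approach it, while near a "spike" minimum such as |x| it stays at magnitude 1 arbitrarily close to the bottom:

```python
# Gradients approaching x = 0 for a smooth loss (x^2) vs. a spiky loss (|x|).
import numpy as np

xs = np.array([0.5, 0.1, 0.01, 0.001])
print("grad of x^2 :", 2 * xs)       # 1.0, 0.2, 0.02, 0.002 -> shrinks
print("grad of |x| :", np.sign(xs))  # 1, 1, 1, 1 -> does not shrink
```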

(Thanks for the interesting analysis, I'm happy to discuss this but probably won't drop by regularly to check comments - feel free to email me at ketil at malde point org)
