Early bird gets the work, but the second mouse gets the cheese. (From Steven Pinker, I think, not sure if it's original)
I think the problem with vanishing gradients is usually linked to repeated applications of the sigmoid activation function. The gradient in backpropagation is calculated from the chain rule, where each factor d\sigma/dz in the "chain" will always be less than zero, and close to zero for large or small inputs. So for feed-forward network, the problem is a little different from recurrent networks, which you describe.
The usual mitigation is to use ReLU activations, L2 regularization, and/or batch normalization.
A minor point: the gradient doesn... (read more)