You may want to look at what happens with test data never shown to the network or used to make decisions about its training. Pruning often improves generalization when data are abundant compared to the complexity of the problem space because you are reducing the number of parameters in the model.

Reply

[-]aog3yΩ010

“…nodes with the smallest standard deviation.” Does this mean nodes whose weights have the lowest absolute values?

Reply

[-]Donald Hobson3yΩ130

Not quite. It means running the network on the training data. For each node, look at the values. (which will always be , as the activation function is relu) and taking the empirical standard deviation. So consider the randomness to be a random choice of input datapoint.

Reply

[-]aog3yΩ010

Ah okay. Are there theoretical reasons to think that neurons with lower variance in activation would be better candidates for pruning? I guess it would be that the effect on those nodes is similar across different datapoints, so they can be pruned and their effects will be replicated by the rest of the network.

Reply

[-]Donald Hobson3yΩ150

Well if the node has no variance in its activation at all, then its constant, and pruning it will not change the networks behavior at all.

I can prove an upper bound. Pruning a node with standard deviation X should increase the loss by at most KX, where K is the product of the operator norm of the weight matrices. The basic idea is that the network is a libshitz function, with libshitz constant K. So adding the random noise means randomness of standard deviation at most KX in the logit prediction. And as the logit is , and an increase in $log (p)$ means a decrease in $log (1 - p)$ ,, then each of those must be perterbed by at most KX.

What this means in practice is that, if the kernals are smallish, then the neurons with small standard deviation in activation aren't effecting the output much. Of course, its possible for a neuron to have a large standard deviation and have its output totally ignored by the next layer. Its possible for a neuron to have a large standard deviation and be actively bad. Its possible for a tiny standard deviation to be amplified by large values in the kernels.

Reply

Moderation Log

LESSWRONG
LW

LESSWRONG
LW

18

Train first VS prune first in neural networks.

18

Ω 7

18

Ω 7

What exactly is pruning.

Random Pruning

Nonrandom Pruning