The Lipschitz constant of a function gives an indication of how horizontal it is rather than how locally linear it is. Naively I'd expect that the second of those things matters more than the first. Has anyone looked at what batch normalization does to that?

More specifically: Define the 2-Lipschitz constant of function $f$ at $x$ to be something like $\inf \{ a_2 : \exists a_0, a_1 : \| f(x+u) - a_0 - a_1 \cdot u \| \leq \frac{1}{2} a_2 \| u \|^2 \}$ and its overall 2-Lipschitz constant to be the sup of these. This measures how well $f$ is locally approximable by linear functions. (I expect someone's already defined a better version of this, probably with a different name, but I think this'll do.) Does batch normalization tend to reduce the 2-Lipschitz constant of the loss function?

[EDITED to add:] I think having a 2-Lipschitz constant in this sense may be equivalent to having a derivative which is a Lipschitz function (and the constant may be its Lipschitz constant, or something like that). So maybe a simpler question is: For networks with activation functions making the loss function differentiable, does batchnorm tend to reduce the Lipschitz constant of its derivative? But given how well rectified linear units work, and that they have a non-differentiable activation function (which will surely make the loss functions fail to be 2-Lipschitz in the sense above) I'm now thinking that if anything like this works it will need to be more sophisticated...
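One direction of that suspected equivalence can be sketched with a standard argument. This assumes $f$ is scalar-valued and differentiable, and that its gradient is Lipschitz with constant $L$ (the symbol $L$ is introduced here, not part of the definition above). Writing the linearization error as an integral along the segment from $x$ to $x + u$:

$$f(x+u) - f(x) - \nabla f(x) \cdot u = \int_0^1 \big( \nabla f(x + t u) - \nabla f(x) \big) \cdot u \, dt$$

$$\big| f(x+u) - f(x) - \nabla f(x) \cdot u \big| \leq \int_0^1 L \, t \, \| u \|^2 \, dt = \frac{1}{2} L \, \| u \|^2$$

So taking $a_0 = f(x)$ and $a_1 = \nabla f(x)$ shows the 2-Lipschitz constant at $x$ is at most $L$. This only covers one direction; whether the two constants coincide is not settled by this sketch.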


Matthew Barnett (6y)

The Lipschitz constant of a function gives an indication of how horizontal it is rather than how locally linear it is. Naively I'd expect that the second of those things matters more than the first. Has anyone looked at what batch normalization does to that?

Yeah, in fact I should have been clearer in the post. A very simple way of reducing the Lipschitz constant of a function is to scale it by some constant factor. The original paper attempts to show theoretically that batchnorm is doing more than simply scaling. See theorem 4.2 in the paper, and the subsequent observation in section 4.3.
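To make the scaling point concrete, here is a minimal sketch that estimates a Lipschitz constant on a grid for a toy function and for the same function multiplied by 0.1. The function `f`, the interval, and the helper `lipschitz_estimate` are hypothetical choices for illustration only; nothing here comes from the paper.

```python
import math


def lipschitz_estimate(f, lo, hi, n=2000):
    """Largest slope |f(b) - f(a)| / (b - a) between neighbouring grid points."""
    step = (hi - lo) / n
    xs = [lo + i * step for i in range(n + 1)]
    return max(abs(f(b) - f(a)) / (b - a) for a, b in zip(xs, xs[1:]))


def f(x):
    # Hypothetical toy function; its true Lipschitz constant is 3.
    return math.sin(3.0 * x)


def scaled_f(x):
    # Same shape, just multiplied by a constant factor.
    return 0.1 * f(x)


L_f = lipschitz_estimate(f, -2.0, 2.0)
L_scaled = lipschitz_estimate(scaled_f, -2.0, 2.0)
print(L_f, L_scaled, L_scaled / L_f)
```

The estimated constant shrinks by the same factor of 0.1, but the function's shape is unchanged, which is why a smaller Lipschitz constant on its own says little about whether optimization got easier.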

If you think about it though, we can already kind of guess that batch normalization isn't simply scaling the function. That's because we measured the gradient predictiveness and discovered that, with batch normalization enabled, the gradient ended up being much closer to the empirically observed change in loss than without it. This gives us evidence that the function is locally linear in the way that you described. Of course, this can be criticized if you disagree with the way that they measured gradient predictiveness, which focused on the variability of the gradient minus the actual difference in loss; see figure 4 in the paper.
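As a rough illustration of what "gradient predictiveness" is getting at, the sketch below compares the first-order prediction of the loss change after a gradient step with the actual change. It is not the paper's measurement procedure: the toy `loss`, its `grad`, the starting point, and the step sizes are all hypothetical.

```python
def loss(w):
    # Hypothetical toy loss, chosen only for illustration.
    return sum(wi ** 4 for wi in w)


def grad(w):
    # Exact gradient of the toy loss above.
    return [4.0 * wi ** 3 for wi in w]


def linearization_error(w, lr):
    """Gap between the gradient's predicted loss change and the actual one."""
    g = grad(w)
    stepped = [wi - lr * gi for wi, gi in zip(w, g)]
    predicted = -lr * sum(gi * gi for gi in g)  # first-order Taylor prediction
    actual = loss(stepped) - loss(w)
    return abs(actual - predicted)


w = [1.0, -0.5]
errors = {lr: linearization_error(w, lr) for lr in (0.1, 0.01, 0.001)}
for lr, err in errors.items():
    print(lr, err)
```

The smaller this gap is for a given step size, the more trustworthy the gradient is as a guide to the actual change in loss, which is the sense of "locally linear" being discussed.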

Does batch normalization tend to reduce the 2-Lipschitz constant of the loss function?

That's a good question. My guess would be yes due to what I said above, but I am not in a position to say confidently either way. I would have to think more about the exact way that you have defined it. :)


Pattern (6y)

And to top that off, they found that even in networks where they artificially increased ICS, performance barely suffered.

All networks, or just ones with batch normalization?


Matthew Barnett (6y)

That's a good point of clarification which perhaps weakens the point I was making there. From the paper,

adding the same amount of noise to the activations of the standard (non-BatchNorm) network prevents it from training entirely


philip_b (6y)

I want to clarify in what domain this theory of batch normalization holds.

The evidence we have is mostly about batch normalization in those types of feedforward neural networks that are often used in 2019, right? So, residual CNNs, VGG-like CNNs, other CNNs, transformers. Maybe other types of feedforward neural networks. But not RNNs.

Has anyone explored the applicability of batch normalization or similar techniques to non-neural-network functions which we optimize by gradient-descent-like algorithms? Perhaps to tensor networks?



Rethinking Batch Normalization
