Understanding Batch Normalization

How do we check empirically or otherwise whether this explanation of what batch normalization does is correct?

I am imagining this internal covariate shift thing like this: the neural network together with its loss is a function which takes parameters θ as input and outputs a real number. Large internal covariate shift means that if we choose ε>0, perform some SGD steps, get some θ, and look at the function's graph in ε-area of θ, it doesn't really look like a plane, it's more curvy like. And small internal covariate shift means that the function's graph is more like a plane. Hence gradient descent works better. Is this intuition correct?

Why does the internal covariate shift become less, even though we have μ and β terms?

About the $y = x w_{1} w_{2} w_{3} \dots w_{l}$ example, it seems to me that the main problem here is that if a gradient descent step changes the sign of an even number of weights, then it might be that the step didn't really achieve anything. Can we fix it somehow? What if we make an optimizer that allows only 1 weight to change sign at each iteration? For actual neural networks allow only weights from one layer to change sign at any given step. Which layer to choose? The one where there are most components want to change sign. (I am not sure what to do about the fact that we use activation functions and biases)

Does batch normalization really cause the distribution of activations of a neuron be more like a Gaussian? Is that like an empirical observation of what happens when a neural network with batch normalization is optimized by an SGD-like optimizer?

P.S. I like the fact that you are posting about deep learning on LessWrong. Maybe there are many rationalists who practice machine learning but they are not sure if there are other people like that on LessWrong, so they don't post about it here?

[-]Matthew Barnett6y40

How do we check empirically or otherwise whether this explanation of what batch normalization does is correct?

Great question. I should be giving a partial answer in tomorrow's post. The bare minimum we can do is check if there's a way to define internal covariate shift (ICS) rigorously, and then measure how much the technique is reducing it. What Shibani Santurkar et. al. found was

Surprisingly, we observe that networks with BatchNorm often exhibit an increase in ICS (cf. Figure 3). This is particularly striking in the case of [deep linear networks]. In fact, in this case, the standard network experiences almost no ICS for the entirety of training, whereas for BatchNorm it appears that G and G0 are almost uncorrelated. We emphasize that this is the case even though BatchNorm networks continue to perform drastically better in terms of attained accuracy and loss.

Large internal covariate shift means that if we choose ε>0, perform some SGD steps, get some θ, and look at the function's graph in ε-area of θ, it doesn't really look like a plane, it's more curvy like. And small internal covariate shift means that the function's graph is more like a plane. Hence gradient descent works better. Is this intuition correct?

Interesting. If I understand your intuition correctly, you are essentially imagining internal covariate shift to be a measure of the smoothness of the gradient (and its loss) around the parameters $θ$ . Is that correct?

In that case, you are in some sense already capturing the intuition (as I understand it) for why batch normalization really works rather than why I said it works above. The newer paper puts a more narrow spin on this, by saying roughly that the gradient around $ϵ$ has an improvement in the Lipschitzness.

Personally, I don't view internal covariate shift that way. Of course, until it's rigorously defined (which it certainly can be) there's no clear interpretation either way.

Why does the internal covariate shift become less, even though we have μ and β terms?

This was the part I understood the least, I think. But the way that I understand it is that by allowing the model to choose $μ$ and $β$ , it can choose from a variety of distributions, while maintaining structure (specifically, it is still normalized). As long as $μ$ and $β$ don't change too rapidly, I think the idea is that it shouldn't contribute too heavily towards shifting the distribution in a way that is bad.

Can we fix it somehow? What if we make an optimizer that allows only 1 weight to change sign at each iteration?

This is an interesting approach. I'd have to think about it more, and how it interacts with my example. I remember reading somewhere that researchers once tried to only change one layer at a time, but this ended up being too slow.

Does batch normalization really cause the distribution of activations of a neuron be more like a Gaussian? Is that like an empirical observation of what happens when a neural network with batch normalization is optimized by an SGD-like optimizer?

I will admit to being imprecise in the way I worded that part. I wanted a way of conveying that the transformation was intended to control the shape of the distribution, in order to make it similar across training steps. A Gaussian is a well behaved shape, which is easy for the layer to have as its distribution.

In fact the original paper responds to this point of yours fairly directly,

In reality, the transformation is not linear, and the normalized values are not guaranteed to be Gaussian nor independent, but we nevertheless expect Batch Normalization to help make gradient propagation better behaved.

As for posting about deep learning, I was just hoping that there would be enough people here who would be interested. Looks like there might be, given that you replied. :)

[-]Matthew Barnett6y*30

Some things I didn't explain about batch normalization in this post:

Why batch normalization reduces the need for regularization (see section 3.4 in the paper).

New techniques which build on batch normalization (such as layer normalization), and the corresponding limitations of batch normalization.

Things I'm not sure about:

I may have messed up my explanation of why we use the learned parameters $γ$ and $β$ . This was something I didn't quite understand well. Also, there may be an error in the way I have set up batch normalization step; in particular, I'm unsure whether I am using "input distribution" accurately and consistently.

I might have been a bit unclear for the one dimensional neural network example. If that example doesn't make sense, try reading the citation from the Deep Learning Book.

LESSWRONG
LW

LESSWRONG
LW

19

Understanding Batch Normalization

19

19