Batch normalization is a technique which has been successfully applied to neural networks ever since it was introduced in 2015. Empirically, it decreases training time and helps maintain the stability of deep neural networks. For that reason practitioners have adopted the technique as part of the standard toolbox.

However, while the performance boosts produced by using the method are indisputable, the underlying reason *why* batch normalization works has generated some controversy.

In this post, I will explore batch normalization and will outline the steps of how to apply it to an artificial neural network. I will cover what researchers initially suspected were the reasons why the method works. Tomorrow's post will investigate new research which calls these old hypotheses into question.

To put it in just a few sentences, batch normalization is a transformation that we can apply at each layer of a neural network. It involves normalizing the input of a layer by subtracting the activation mean from the layer input and then dividing by the activation standard deviation. After batch normalization is applied, it is recommended to apply an additional transformation to the layer with learned parameters which allow the neural network to learn useful representations of the input. All of these steps are then incorporated into the backpropagation algorithm.

The mechanics of batch normalization can be better understood with an example. Here, I have illustrated a simple feed-forward neural network. Our goal is to apply batch normalization to the hidden layer, indicated in green.

Let the vector $h$ stand for the input to the hidden layer. The input is processed in the layer by applying an activation function element-wise to the vector computed from the previous layer. Let $H$ stand for a mini-batch of activations for the hidden layer, each row corresponding to one example in the mini-batch.

What batch normalization does is subtract the activation unit mean value from each input to the hidden layer, and divide this expression by the activation unit standard deviation. For a single unit, we replace $h_i$ with

$$h_i' = \frac{h_i - \mu_i}{\sigma_i}$$

where $\mu_i$ is the mean input value for $h_i$ across the mini-batch. In symbolic form, in order to calculate $\mu_i$ we compute

$$\mu_i = \frac{1}{m} \sum_{j=1}^{m} H_{j,i}$$

where $m$ is the number of examples in the mini-batch. Similarly, we calculate $\sigma_i$ by computing

$$\sigma_i = \sqrt{\delta + \frac{1}{m} \sum_{j=1}^{m} \left(H_{j,i} - \mu_i\right)^2}$$

The above expression is the standard deviation for the input to the $i$th activation unit with an additional constant value $\delta$. This delta component is kept at a small positive value and is added only to avoid the gradient becoming undefined where the true standard deviation is zero.

At test time, we can simply use the running averages of the mean $\mu$ and standard deviation $\sigma$ discovered during training, as mini-batch samples will not always be available.
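The normalization step and the test-time running averages can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: it uses plain Python lists to stay dependency-free, and the names `batch_stats`, `normalize` and `RunningStats`, the default `delta=1e-8`, and the `momentum=0.9` update rule are all choices made for this example rather than anything fixed by the text above.

```python
import math

def batch_stats(H, delta=1e-8):
    """Per-unit mean and standard deviation over a mini-batch.

    H is a list of rows: one row per example, one column per unit.
    delta is a small positive constant (the value here is an assumption).
    """
    m = len(H)
    n_units = len(H[0])
    mu = [sum(row[i] for row in H) / m for i in range(n_units)]
    sigma = [
        math.sqrt(delta + sum((row[i] - mu[i]) ** 2 for row in H) / m)
        for i in range(n_units)
    ]
    return mu, sigma

def normalize(H, mu, sigma):
    """Subtract the mean, then divide by the standard deviation, per unit."""
    return [[(h - mu_i) / s_i for h, mu_i, s_i in zip(row, mu, sigma)] for row in H]

class RunningStats:
    """Exponential running averages of mu and sigma, for use at test time."""

    def __init__(self, n_units, momentum=0.9):
        self.momentum = momentum  # hypothetical smoothing factor
        self.mu = [0.0] * n_units
        self.sigma = [1.0] * n_units

    def update(self, mu, sigma):
        k = self.momentum
        self.mu = [k * r + (1 - k) * b for r, b in zip(self.mu, mu)]
        self.sigma = [k * r + (1 - k) * b for r, b in zip(self.sigma, sigma)]

# Training step: normalize with the mini-batch statistics and track them.
H = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
mu, sigma = batch_stats(H)
H_norm = normalize(H, mu, sigma)
running = RunningStats(n_units=2)
running.update(mu, sigma)

# Test time: normalize a single example with the running averages instead.
x_test = [[2.5, 25.0]]
x_test_norm = normalize(x_test, running.mu, running.sigma)
```

After the training step, each column of `H_norm` has mean close to zero and standard deviation close to one, even though the two columns of `H` differ in scale by a factor of ten.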

The above computations put the distribution of the input to a layer into a regime where the gradients for each layer are all reasonably sized. This is useful for training because we don't want the gradient descent step to vanish or blow up. Batch normalization accomplishes this because the weights no longer have an incentive to grow to extremely large or small values. In the process, batch normalization therefore also increases our ability to train with activations like the sigmoid, which were previously known to fall victim to vanishing gradients.

I will borrow an example from the Deep Learning Book (section 8.7.1) to illustrate the central issue, and how we can use batch normalization to fix it. Consider a simple neural network consisting of one neuron per layer. We denote the length of this network by $l$.

Imagine that we designed this neural network such that it did not have an activation function at each step. If we chose this implementation then the output would be $y = x w_1 w_2 w_3 \dots w_l$. Suppose we were to subtract a gradient vector $g$, scaled by a learning rate $\epsilon$, obtained in the course of learning. The new value for $y$ will now be $x (w_1 - \epsilon g_1)(w_2 - \epsilon g_2) \dots (w_l - \epsilon g_l)$. Due to the potential depth of this neural network, the gradient descent step could now have altered the function in a disastrous way. If we expand the product for the updated expression, we find that there are *n*-th order terms which could blow up if they are too large. One of these terms, of second order, is $x \epsilon^2 g_1 g_2 \prod_{i=3}^{l} w_i$. This expression now appears in the updated value of $y$, which can cause an issue. In particular, if the terms from $w_3$ to $w_l$ are all greater than one, then this previous expression becomes exponentially large.

Since a small mistake in choosing the learning rate can result in an exponential blow up, we must choose the rate at which we propagate updates wisely. And since this network is so deep, the effects of an update to one layer may dramatically affect the other layers. For example, a learning rate that is appropriate at one layer might simultaneously cause a vanishing gradient at another layer!
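To make the blow-up concrete, here is a small sketch of the one-neuron-per-layer chain. It compares the change in the output that a first-order (gradient) approximation predicts with what actually happens after every weight is updated at once. The function names, the weight value of 1.5, and the learning rate of 0.01 are hypothetical choices for illustration only.

```python
import math

def output(x, w):
    """y = x * w_1 * w_2 * ... * w_l for a chain of linear one-neuron layers."""
    return x * math.prod(w)

def gradient(x, w):
    """dy/dw_i = x * (product of all weights except w_i)."""
    return [x * math.prod(w[:i] + w[i + 1:]) for i in range(len(w))]

def step_report(x, w, eps):
    """Return (old output, first-order prediction, actual new output)."""
    g = gradient(x, w)
    w_new = [w_i - eps * g_i for w_i, g_i in zip(w, g)]
    y_old = output(x, w)
    # First-order prediction: y changes by -eps * sum(g_i ** 2).
    predicted = y_old - eps * sum(g_i ** 2 for g_i in g)
    actual = output(x, w_new)
    return y_old, predicted, actual

x = 1.0
eps = 0.01  # hypothetical learning rate
for depth in (2, 10, 20):
    w = [1.5] * depth  # every weight greater than one
    y_old, predicted, actual = step_report(x, w, eps)
    print(f"l={depth:2d}  y={y_old:.3g}  predicted={predicted:.3g}  actual={actual:.3g}")
```

For the shallow network the prediction and the actual result agree closely, and the gap is exactly the second-order term. For the deep network the higher-order terms dominate and the actual output is nowhere near the first-order prediction.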

Previous approaches to dealing with this problem focused on adjusting the learning rate $\epsilon$ at each layer in order to ensure that the effect of the gradient was small enough to cancel out the large product, while remaining large enough to learn something useful. In practice, this is quite a difficult problem. The *n*-th order terms which affect the output are too numerous for any reasonably quick model to take into account all of them. By using this technique, the only options we have left are to shrink the model so that there are few layers, or to slow down our gradient computation excessively.

The above difficulty of coordinating gradients between layers is really a specific case of a more general issue which arises in deep neural networks. In particular, the issue is termed an *internal covariate shift* by the original paper. In general a *covariate shift* refers to a scenario in which the input distribution for some machine learning model changes. Covariate shifts are extremely important to understand for machine learning because it is difficult to create a model which can generalize beyond the input distribution that it was trained on. Internal covariate shifts are covariate shifts that happen within the model itself.

Since neural networks can be described as function compositions, we can write a two-layer feedforward neural network as $f_2(f_1(x, \theta_1), \theta_2)$ where $x$ is the input to the network, and $\theta_i$ defines the parameters at each layer. Writing this expression where $u = f_1(x, \theta_1)$ we obtain $f_2(u, \theta_2)$. We can see, therefore, that the final layer of the network has an input distribution defined by the output of the first layer, $f_1$. Whenever the parameters $\theta_1$ and $\theta_2$ are modified simultaneously, then $f_2$ has experienced an internal covariate shift. This shift is due to the fact that $f_1$ now has a different output distribution.

It's as if, after the final layer has been told how to change in response to its input distribution, the very ground under its feet has changed. This has the effect of partially canceling out the assumption we are making about the gradient, which is that each element of the gradient is defined as the rate of change of a parameter *with everything else held constant*. Gradients are only defined as measuring some slope over an infinitesimal region of space, and in our case, we are only estimating the gradient using mini-batch stochastic gradient descent. This implies that we should automatically assume that this basic assumption for our gradient estimate will be false in practice. Even still, a difference in the way that we approach the gradient calculation can help alleviate this problem to a large degree.
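A toy version of this situation can be written down directly. The sketch below assumes two scalar linear layers named `f1` and `f2` with parameters `theta1` and `theta2`, a squared-error loss, and a made-up update that moves `theta1` from 1.0 to 3.0. None of these specifics come from the paper; they are only meant to show how a change in the first layer alters both the input distribution and the gradient seen by the second layer.

```python
import statistics

def f1(x, theta1):
    """First layer: a hypothetical scalar linear map."""
    return theta1 * x

def f2(u, theta2):
    """Second layer: another hypothetical scalar linear map."""
    return theta2 * u

def grad_theta2(xs, ts, theta1, theta2):
    """Gradient of the mean squared error with respect to theta2."""
    us = [f1(x, theta1) for x in xs]
    return statistics.fmean([2 * (f2(u, theta2) - t) * u for u, t in zip(us, ts)])

xs = [0.5, 1.0, 1.5, 2.0]   # a fixed mini-batch of inputs
ts = [1.0, 2.0, 3.0, 4.0]   # targets (here t = 2x)
theta1, theta2 = 1.0, 1.0
theta1_new = 3.0            # pretend a gradient step moved theta1 here

# The distribution of the second layer's input before and after the update.
u_before = [f1(x, theta1) for x in xs]
u_after = [f1(x, theta1_new) for x in xs]
std_before = statistics.pstdev(u_before)
std_after = statistics.pstdev(u_after)

# The gradient the second layer was given, versus the one that now applies.
grad_before = grad_theta2(xs, ts, theta1, theta2)
grad_after = grad_theta2(xs, ts, theta1_new, theta2)
print(std_before, std_after, grad_before, grad_after)
```

In this contrived case the spread of the second layer's input triples, and the gradient with respect to `theta2` changes sign: the direction the second layer was told to move in is no longer the right one once the first layer has also moved.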

One way to alleviate the issue would be to encourage each layer to output similar distributions across training steps. For instance, we *could* try to add a penalty to the loss function to encourage the activations from each layer to more closely resemble a Gaussian distribution, in particular the *same* Gaussian at each step during the training process, like a whitened distribution. This would have the intended effect of keeping the underlying distribution of each layer roughly similar across training steps, minimizing the downsides of internal covariate shift. However this is an unnecessarily painful approach, since it is difficult to design a loss penalty which results in the exact desired change.

Another alternative is to modify the parameters of a layer after each gradient descent step in order to point them in a direction that will cause their output to be more Gaussian. Experiments attempting this technique resulted in neural networks that would waste time repeatedly proposing an internal covariate shift only to be reset by the intervention immediately thereafter (see section 2 in the paper).

The solution that the field of deep learning has settled on is roughly to use batch normalization as described above, and to take the gradients while carefully taking into account these equations. The batch normalization directly causes the input activations to resemble a Gaussian, and since we are using backpropagation through these equations, we don't need any expensive tug of war with the parameters.

One more step is however needed in order to keep the layers from losing their representation abilities after having their output distributions normalized.

Once we have obtained the batch of normalized activations $H'$, we in fact use $\gamma H' + \beta$ as the input for the layer, where $\gamma$ and $\beta$ are learned scalar parameters. In total, we are applying a transform at each layer, the *Batch Normalizing Transform*. If you consider a layer computation defined by $\phi(XW + b)$ where $\phi$ is an activation function, $X$ is the output from the previous layer, $W$ is the weight matrix, and $b$ is the bias term, then the layer is now written as $\phi(\text{BN}(XW))$ where $\text{BN}$ is the batch transformation defined by $\text{BN}(H) = \gamma H' + \beta$. The bias term is removed because the distribution shift is now fully defined by $\beta$.
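The claim that the bias term becomes redundant can be checked numerically. The following sketch applies the transform to a single unit over a mini-batch, once with a bias added to the pre-activations and once without. The function name `batch_norm`, the sample values, and the fixed `gamma` and `beta` are assumptions for illustration; in a real network `gamma` and `beta` would be learned.

```python
import math

def batch_norm(h, gamma, beta, delta=1e-8):
    """Batch Normalizing Transform for one unit over a mini-batch h (a list)."""
    m = len(h)
    mu = sum(h) / m
    sigma = math.sqrt(delta + sum((v - mu) ** 2 for v in h) / m)
    return [gamma * (v - mu) / sigma + beta for v in h]

xw = [0.2, -1.3, 2.4, 0.7]   # hypothetical pre-activations XW for one unit
b = 5.0                      # a bias that would normally be added
gamma, beta = 2.0, 0.5       # learned in practice; fixed here for illustration

without_bias = batch_norm(xw, gamma, beta)
with_bias = batch_norm([v + b for v in xw], gamma, beta)

# The mean subtraction cancels b, so both outputs agree up to rounding.
max_diff = max(abs(p - q) for p, q in zip(without_bias, with_bias))
# The mean of the output is set by beta alone.
out_mean = sum(without_bias) / len(without_bias)
print(max_diff, out_mean)
```

Because the mean is subtracted before scaling, any constant added to every example in the batch disappears, and the mean of the output is determined by `beta`.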

It may seem paradoxical that after normalizing we would now alter the matrix so that its standard deviation is no longer 1 and its mean is no longer 0. Didn't we want to minimize the effect of a covariate shift? However, this new step allows more freedom in the way the input activations can be represented. From the paper,

> Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.

With the new learned parameters, the layers are more expressive. With this additional parameterization the model can figure out the appropriate mean and standard deviation for the input distribution, rather than having it set to a single value automatically, or worse, having it be some arbitrary value characterized by the layers that came before.
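One way to see that no representational power is lost: if the scale is set to the batch standard deviation and the shift to the batch mean, the transform returns the original activations. The sketch below checks this for one unit. The helper names and the sample values are assumptions made for the example.

```python
import math

def stats(h, delta=1e-8):
    """Mean and (delta-stabilized) standard deviation of one unit over a batch."""
    mu = sum(h) / len(h)
    sigma = math.sqrt(delta + sum((v - mu) ** 2 for v in h) / len(h))
    return mu, sigma

def scale_shift(h, gamma, beta, delta=1e-8):
    """Normalize h, then apply the scale gamma and the shift beta."""
    mu, sigma = stats(h, delta)
    return [gamma * (v - mu) / sigma + beta for v in h]

h = [3.0, 7.0, 4.0, 10.0]    # hypothetical activations for one unit
mu, sigma = stats(h)

normalized = scale_shift(h, gamma=1.0, beta=0.0)    # plain normalization
recovered = scale_shift(h, gamma=sigma, beta=mu)    # undoes the normalization
print(normalized, recovered)
```

With `gamma=1` and `beta=0` the layer sees zero-mean, unit-variance inputs, while with `gamma=sigma` and `beta=mu` it sees the original values again, so the learned parameters span both extremes and everything in between.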

With these properties batch normalization strikes a balance between minimizing the internal covariate shift in the neural network while keeping the representation power at each layer. All we need to do is incorporate the new learned parameters, compute the gradients via a new set of backpropagation equations and apply the normalization transformation at each layer. Now we hit run and our neural network trains faster, and better. It's that easy.
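For readers who want to see what backpropagating through the transform involves, here is a sketch of the backward pass for one unit, together with a finite-difference check. The compact expression for the input gradient is obtained by applying the chain rule to the normalization, scale and shift steps; it is written for this example rather than copied from the paper, and the function names, test values and linear test loss are all assumptions.

```python
import math

def bn_forward(x, gamma, beta, delta=1e-8):
    """Forward pass for one unit over a mini-batch; also returns a cache."""
    m = len(x)
    mu = sum(x) / m
    sigma = math.sqrt(delta + sum((v - mu) ** 2 for v in x) / m)
    x_hat = [(v - mu) / sigma for v in x]
    y = [gamma * v + beta for v in x_hat]
    return y, (x_hat, sigma)

def bn_backward(dy, cache, gamma):
    """Gradients of a scalar loss w.r.t. x, gamma and beta, given dL/dy."""
    x_hat, sigma = cache
    m = len(dy)
    dgamma = sum(d * xh for d, xh in zip(dy, x_hat))
    dbeta = sum(dy)
    mean_dy = dbeta / m
    mean_dy_xhat = dgamma / m
    # The two mean terms account for mu and sigma depending on every x_i.
    dx = [gamma / sigma * (d - mean_dy - xh * mean_dy_xhat)
          for d, xh in zip(dy, x_hat)]
    return dx, dgamma, dbeta

def loss(x, gamma, beta, c):
    """A hypothetical linear test loss: sum_i c_i * y_i, so dL/dy = c."""
    y, _ = bn_forward(x, gamma, beta)
    return sum(ci * yi for ci, yi in zip(c, y))

x = [0.5, -1.2, 3.3, 0.1]
c = [1.0, -2.0, 0.5, 3.0]
gamma, beta = 1.5, -0.3

_, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(c, cache, gamma)

# Central finite differences as an independent check on dx.
step = 1e-6
numeric_dx = []
for i in range(len(x)):
    x_plus = x[:i] + [x[i] + step] + x[i + 1:]
    x_minus = x[:i] + [x[i] - step] + x[i + 1:]
    diff = loss(x_plus, gamma, beta, c) - loss(x_minus, gamma, beta, c)
    numeric_dx.append(diff / (2 * step))
max_err = max(abs(a - n) for a, n in zip(dx, numeric_dx))
print(max_err)
```

The point of the extra terms in `dx` is that each input affects the output both directly and through the batch mean and standard deviation, which is what "taking the gradients while taking these equations into account" amounts to.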

Did my explanation above make perfect sense? Is internal covariate shift really the big issue that batch normalization solves? Tomorrow I will investigate potential flaws with the reasons I gave above.

Tune in to find out what those issues might be. Or just read the paper I'll be summarizing.

How do we check empirically or otherwise whether this explanation of what batch normalization does is correct?

I am imagining this internal covariate shift thing like this: the neural network together with its loss is a function which takes parameters θ as input and outputs a real number. Large internal covariate shift means that if we choose ε>0, perform some SGD steps, get some θ, and look at the function's graph in ε-area of θ, it doesn't really look like a plane, it's more curvy like. And small internal covariate shift means that the function's graph is more like a plane. Hence gradient descent works better. Is this intuition correct?

Why does the internal covariate shift become less, even though we have μ and β terms?

About the $y = x w_1 w_2 w_3 \dots w_l$ example, it seems to me that the main problem here is that if a gradient descent step changes the sign of an even number of weights, then it might be that the step didn't really achieve anything. Can we fix it somehow? What if we make an optimizer that allows only 1 weight to change sign at each iteration? For actual neural networks, allow only weights from one layer to change sign at any given step. Which layer to choose? The one where the most components want to change sign. (I am not sure what to do about the fact that we use activation functions and biases.)

Does batch normalization really cause the distribution of activations of a neuron to be more like a Gaussian? Is that like an empirical observation of what happens when a neural network with batch normalization is optimized by an SGD-like optimizer?

P.S. I like the fact that you are posting about deep learning on LessWrong. Maybe there are many rationalists who practice machine learning but they are not sure if there are other people like that on LessWrong, so they don't post about it here?

Great question. I should be giving a partial answer in tomorrow's post. The bare minimum we can do is check if there's a way to define internal covariate shift (ICS) rigorously, and then measure how much the technique is reducing it. What Shibani Santurkar et al. found was

Interesting. If I understand your intuition correctly, you are essentially imagining internal covariate shift to be a measure of the smoothness of the gradient (and its loss) around the parameters θ. Is that correct?

In that case, you are in some sense already capturing the intuition (as I understand it) for why batch normalization *really works* rather than why I said it works above. The newer paper puts a more narrow spin on this, by saying roughly that the gradient around ϵ has an improvement in the Lipschitzness.

Personally, I don't view internal covariate shift that way. Of course, until it's rigorously defined (which it certainly can be) there's no clear interpretation either way.

This was the part I understood the least, I think. But the way that I understand it is that by allowing the model to choose μ and β, it can choose from a variety of distributions, while maintaining structure (specifically, it is still normalized). As long as μ and β don't change too rapidly, I think the idea is that it shouldn't contribute too heavily towards shifting the distribution in a way that is bad.

This is an interesting approach. I'd have to think about it more, and how it interacts with my example. I remember reading somewhere that researchers once tried to only change one layer at a time, but this ended up being too slow.

I will admit to being imprecise in the way I worded that part. I wanted a way of conveying that the transformation was intended to control the shape of the distribution, in order to make it similar across training steps. A Gaussian is a well behaved shape, which is easy for the layer to have as its distribution.

In fact the original paper responds to this point of yours fairly directly,

As for posting about deep learning, I was just hoping that there would be *enough* people here who would be interested. Looks like there might be, given that you replied. :)

Some things I didn't explain about batch normalization in this post:

- Why batch normalization reduces the need for regularization (see section 3.4 in the paper).
- New techniques which build on batch normalization (such as layer normalization), and the corresponding limitations of batch normalization.

Things I'm not sure about:

- I may have messed up my explanation of why we use the learned parameters γ and β. This was something I didn't quite understand well. Also, there may be an error in the way I have set up the batch normalization step; in particular, I'm unsure whether I am using "input distribution" accurately and consistently.
- I might have been a bit unclear for the one dimensional neural network example. If that example doesn't make sense, try reading the citation from the Deep Learning Book.