Batch normalization is a technique which has been successfully applied to neural networks ever since it was introduced in 2015. Empirically, it decreases training time and helps maintain the stability of deep neural networks. For that reason practitioners have adopted the technique as part of the standard toolbox.

However, while the performance boosts produced by using the method are indisputable, the underlying reason *why* batch normalization works has generated some controversy.

In this post, I will explore batch normalization and will outline the steps of how to apply it to an artificial neural network. I will cover what researchers initially suspected were the reasons why the method works. Tomorrow's post will investigate new research which calls these old hypotheses into question.

To put it in just a few sentences, batch normalization is a transformation that we can apply at each layer of a neural network. It involves normalizing the input of a layer by dividing the layer input by the activation standard deviations and subtracting the activation mean. After batch normalization is applied it is recommended to apply an additional transformation to the layer with learned parameters which allow the neural network to learn useful representations of the input. All of these steps are then incorporated into the backpropagation algorithm.

The mechanics of batch normalization can be better understood with an example. Here, I have illustrated a simple feed-forward neural network. Our goal is to apply batch normalization to the hidden layer, indicated in green.

Let the vector stand for the input to the hidden layer. The input is processed in the layer by applying an activation function element-wise to the vector computed from the previous layer. Let stand for a mini-batch of activations for the hidden layer, each row corresponding to one example in the mini-batch.

What batch normalization does is subtract the activation unit mean value from each input to the hidden layer, and divides this expression by the activation unit standard deviation. For a single unit, we replace with

where is the mean input value for across the mini-batches. In symbolic form, in order to calculate we compute

similarly, we calculate by computing

The above expression is the standard deviation for the input to the th activation unit with an additional constant value . This delta component is kept at a small positive value, like , and is added only to avoid the gradient becoming undefined where the true standard deviation is zero.

At test time, we can simply use the running averages for and discovered during training, as mini-batch samples will not always be available.

The above computations put the distribution of the input to a layer into a regime where the gradients for each layer are all reasonably sized. This is useful for training because we don't want the gradient descent step to vanish or blow up. Batch normalization accomplishes this because the weights no longer have an incentive to grow to extremely large or small values. In the process, batch normalization therefore also increases our ability to train with activations like the sigmoid, which were previously known to fall victim to vanishing gradients.

I will borrow an example from the Deep Learning Book (section 8.7.1) to illustrate the central issue, and how we can use batch normalization to fix it. Consider a simple neural network consisting of one neuron per layer. We denote the length of this network by .

Imagine that we designed this neural network such that it did not have an activation function at each step. If we chose this implementation then the output would be . Suppose we were to subtract a gradient vector obtained in the course of learning. The new value for will now be . Due to the potential depth of this neural network, the gradient descent step could now have altered the function in a disastrous way. If we expand the product for the updated expression, we find that there are *n-*order terms which could blow up if they are too large. One of these *n-*order terms is . This expression is now subtracted from , which can cause an issue. In particular, if the terms from to are all greater than one, then this previous expression becomes exponentially large.

Since a small mistake in choosing the learning rate can result in an exponential blow up, we must choose the the rate at which we propagate updates wisely. And since this network is so deep, the effects of an update to one layer may dramatically affect the other layers. For example, whereas an appropriate learning rate at one layer might be , this might simultaneously cause a vanishing gradient at some another layer!

Previous approaches to dealing with this problem focused on adjusting at each layer in order to ensure that the effect of the gradient was small enough to cancel out the large product, while remaining large enough to learn something useful. In practice, this is quite a difficult problem. The *n*-order terms which affect the output are too numerous for any reasonably quick model to take into account all of them. By using this technique, the only options we have left are to shrink the model so that there are few layers, or to slow down our gradient computation excessively.

The above difficulty of coordinating gradients between layers is really a specific case of a more general issue which arises in deep neural networks. In particular, the issue is termed an *internal covariate shift* by the original paper. In general a *covariate shift* refers to a scenario in which the input distribution for some machine learning model changes. Covariate shifts are extremely important to understand for machine learning because it is difficult to create a model which can generalize beyond the input distribution that it was trained on. Internal covariate shifts are covariate shifts that happen within the model itself.

Since neural networks can be described as function compositions, we can write a two layer feedforward neural network as where is the input to the network, and defines the parameters at each layer. Writing this expression where we obtain . We can see, therefore, that the final layer of the network has an input distribution defined by the output of the first layer, . Whenever the parameters and are modified simultaneously, then has experienced an internal covariate shift. This shift is due to the fact that now has a different output distribution.

It's as if after being told how to change in response to its input distribution, the very ground under 's feet has changed. This has the effect of partially canceling out assumption of we are making about the gradient, which is that each element of the gradient is defined as the the rate of change of a parameter *with everything else held constant*. Gradients are only defined as measuring some slope over an infinitesimal region of space — and in our case, we are only estimating the gradient using stochastic mini-batch descent. This implies that we should automatically assume that this basic assumption for our gradient estimate will be false in practice. Even still, a difference in the way that we approach the gradient calculation can help alleviate this problem to a large degree.

One way to alleviate the issue would be to encourage each layer to output similar distributions across training steps. For instance, we *could* try to add a penalty to the loss function to encourage the activations from each layer to more closely resemble a Gaussian distribution, in particular the *same* Gaussian at each step during the training process, like a whitened distribution. This would have the intended effect of keeping the underlying distribution of each layer roughly similar across training steps, minimizing the downsides of internal covariate shift. However this is an unnecessarily painful approach, since it is difficult to design a loss penalty which results in the exact desired change.

Another alternative is to modify the parameters of a layer after each gradient descent step in order to point them in a direction that will cause their output to be more Gaussian. Experiments attempting this technique resulted in neural networks that would waste time repeatedly proposing an internal covariate shift only to be reset by the intervention immediately thereafter (see section 2 in the paper).

The solution that the field of deep learning has settled on is roughly to use batch normalization as described above, and to take the gradients while carefully taking into account these equations. The batch normalization directly causes the input activations to resemble a Gaussian, and since we are using backpropagation through these equations, we don't need any expensive tug of war with the parameters.

One more step is however needed in order to keep the layers from losing their representation abilities after having their output distributions normalized.

Once we have obtained the batch of normalized activations , we in fact use as the input for the layer, where and are learned scalar parameters. In total, we are applying a transform at each layer, the *Batch Normalizing Transform*. If you consider a layer computation defined by where is an activation function, is the output from the previous layer, is the weight matrix, and is the bias term, then the layer now becomes written as where is the batch transformation defined by . The bias term is removed because the distribution shift is now fully defined by .

It may seem paradoxical that after normalizing we would now alter the matrix to make its standard deviation 1 and its mean 0. Didn't we want to minimize the effect of a covariate shift? However, this new step allows more freedom in the way the input activations can be represented. From the paper,

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.

With the new learned parameters, the layers are more expressive. With this additional parameterization the model can figure out the appropriate mean and standard deviation for the input distribution, rather than having it set to a single value automatically, or worse, having it be some arbitrary value characterized by the layers that came before.

With these properties batch normalization strikes a balance between minimizing the internal covariate shift in the neural network while keeping the representation power at each layer. All we need to do is incorporate the new learned parameters, compute the gradients via a new set of backpropagation equations and apply the normalization transformation at each layer. Now we hit run and our neural network trains faster, and better. It's that easy.

Did my explanation above make perfect sense? Is internal covariate shift really the big issue that batch normalization solves? Tomorrow I will investigate potential flaws with the reasons I gave above.

Tune in to find out what those issues might be. Or just read the paper I'll be summarizing.