The distribution of the inputs to a deep network may change after each mini-batch when the weights are updated. This can cause the learning algorithm to chase a moving target indefinitely. This change in the distribution of inputs to the layers of the network is referred to by the technical name “internal covariate shift.”
Batch normalization [1] is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. This stabilizes the learning process and dramatically reduces the number of training epochs required to train deep networks. Now let us walk through the batch normalization algorithm. This is going to be slightly tricky, so take a break and read this with a fresh mind.
Algorithm
STEP 1
Let’s say we’ve got a mini-batch of a dataset like the table below, with N features and M samples. Each feature has one activation vector Ai; for example, for the first feature, the activation vector A1 is (1, 3, 5, 7, 9) across the M = 5 samples.
|    | A1 | A2 | .. | AN |
|----|----|----|----|----|
| S1 | 1  |    |    |    |
| S2 | 3  |    |    |    |
| .. | 5  |    |    |    |
| .. | 7  |    |    |    |
| SM | 9  |    |    |    |
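As a rough sketch, the mini-batch can be stored as an M × N array. The NumPy snippet below (with made-up values for the features beyond A1) mirrors the table above:

```python
import numpy as np

# Mini-batch of M = 5 samples and N = 3 features (illustrative values).
# Each column is one activation vector; the first column is A1 = (1, 3, 5, 7, 9).
X = np.array([
    [1.0, 2.0, 0.5],
    [3.0, 4.0, 1.5],
    [5.0, 6.0, 2.5],
    [7.0, 8.0, 3.5],
    [9.0, 10.0, 4.5],
])
print(X.shape)  # (5, 3): (M samples, N features)
```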
STEP 2
For each activation vector, calculate the mean and standard deviation across the mini-batch.
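A minimal sketch of this step in NumPy, reusing the mini-batch layout from Step 1 (rows are samples, columns are features):

```python
import numpy as np

X = np.array([[1.0, 2.0, 0.5],
              [3.0, 4.0, 1.5],
              [5.0, 6.0, 2.5],
              [7.0, 8.0, 3.5],
              [9.0, 10.0, 4.5]])

# Per-feature statistics over the mini-batch: axis=0 averages over the M samples.
mu = X.mean(axis=0)      # shape (N,), one mean per activation vector
var = X.var(axis=0)      # shape (N,), one variance per activation vector
sigma = np.sqrt(var)     # standard deviation per activation vector
print(mu, sigma)
```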
STEP 3
Normalize each activation to have zero mean and unit standard deviation.
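Continuing the NumPy sketch, the normalization divides by the standard deviation with a small epsilon added for numerical stability (the value below is an assumed default):

```python
import numpy as np

X = np.array([[1.0, 2.0, 0.5],
              [3.0, 4.0, 1.5],
              [5.0, 6.0, 2.5],
              [7.0, 8.0, 3.5],
              [9.0, 10.0, 4.5]])

mu = X.mean(axis=0)
var = X.var(axis=0)

eps = 1e-5                                   # small constant to avoid division by zero
X_hat = (X - mu) / np.sqrt(var + eps)        # zero mean, unit variance per feature
print(X_hat.mean(axis=0), X_hat.std(axis=0)) # roughly 0 and 1 for every feature
```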
STEP 4
Unlike the input layer, where the normalized values are kept at zero mean and unit variance, batch normalization allows its outputs to be shifted to a different mean and scaled to a different variance.
γ and β are trainable parameters, so each batch normalization layer can learn the scale and shift that work best for it.
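A minimal sketch of the scale-and-shift step; the identity initialization (γ = 1, β = 0) shown here is a common convention, not something mandated by the algorithm:

```python
import numpy as np

# Normalized activations from Step 3 (any (M, N) array with ~0 mean and ~1 variance works here).
X_hat = np.random.randn(5, 3)

# Trainable parameters, one pair per feature, initialized to the identity transform.
gamma = np.ones(3)   # scale
beta = np.zeros(3)   # shift

# The layer can learn any mean (beta) and spread (gamma) it finds useful.
Y = gamma * X_hat + beta
```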
STEP 5
During training, we also keep track of an exponential moving average of the mean and standard deviation.
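A sketch of the running-statistics update; the momentum value below is an assumed hyper-parameter, and the mu/var values stand in for the batch statistics produced in Step 2:

```python
import numpy as np

momentum = 0.9                  # assumed smoothing factor (a hyper-parameter in most frameworks)
running_mean = np.zeros(3)
running_var = np.ones(3)

# Updated once per mini-batch with the batch statistics (mu, var) from Step 2.
mu = np.array([5.0, 6.0, 2.5])
var = np.array([8.0, 8.0, 2.0])
running_mean = momentum * running_mean + (1 - momentum) * mu
running_var = momentum * running_var + (1 - momentum) * var
```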
STEP 6
The above steps are for training. During validation (and inference), the layer uses the stored exponential moving averages of the mean and variance instead of the batch statistics, so the output for a sample does not depend on whatever else happens to be in the batch.
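A sketch of the inference-time computation, assuming the running statistics and the γ/β parameters saved during training (the helper name batchnorm_inference and all values are illustrative):

```python
import numpy as np

def batchnorm_inference(x, running_mean, running_var, gamma, beta, eps=1e-5):
    # Use the stored running statistics instead of batch statistics, so the
    # output for a sample does not depend on the rest of the batch.
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

x = np.array([4.0, 5.0, 2.0])   # a single validation sample with N = 3 features
y = batchnorm_inference(x,
                        running_mean=np.array([5.0, 6.0, 2.5]),
                        running_var=np.array([8.0, 8.0, 2.0]),
                        gamma=np.ones(3),
                        beta=np.zeros(3))
```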
At first, this algorithm can be overwhelming, so maybe note it down and go through it again at the end of the day.
Why does Batch normalization work?
Well, there are 2 theories of why batch normalization works [2]. These are relatively simple to understand.
Internal covariate shift
Sometimes the model is fed data with a very different distribution than it was previously trained with, even though the data still conforms to the same target function.
Now the model has to re-learn some of its features according to the new distribution, which slows down training. In other words, each layer ends up trying to learn from a constantly shifting input. Batch normalization normalizes the incoming data so that the network retains the feature properties it has already learned.
Loss and gradient smoothening
In a typical neural network, the loss landscape isn’t a smooth convex surface. It has sharp cliffs and flat regions. Thus, gradient descent can hit an obstacle in what it thought was a promising direction to follow.
Batch normalization smooths the loss landscape substantially by changing the distribution of the network weights.
Advantages
- The model converges faster, which speeds up training.
- Less sensitive to weight initialization and to precise tuning of hyper-parameters.
- We can use a higher learning rate because batch norm reduces the effect of outlier gradients.
- Adds regularization to training.
The main drawback of the batch normalization layer is that it doesn’t work well with small batch sizes, because the estimates of the mean and variance become too noisy.