Let us say you've built your model, but even after many epochs the accuracy stays constant and you're wondering what went wrong. Let us look at some issues or roadblocks that we face while training a deep network. The most common ones are the vanishing and exploding gradients.
Vanishing Gradient
As more and more layers are added to a neural network, the gradients of the loss with respect to the early layers approach zero, making the network practically impossible to train.
Activation functions like sigmoid and tanh squash a large input range into a small output range, so even a large change in input produces only a small change in output, i.e. a small derivative. By the chain rule, the derivatives of each layer are multiplied down the network to compute the gradients of the initial layers.
When n layers use activation functions like sigmoid, n small derivatives are multiplied together. Thus the gradient decreases exponentially with depth, and during backpropagation the weights of the initial layers are barely updated.
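A minimal sketch of this effect, assuming a toy chain of layers whose only contribution to the chain-rule product is the sigmoid derivative (the layer counts are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Simulate the chain-rule product through n sigmoid layers, using the
# best-case derivative of 0.25 (in practice the factors are even smaller).
for n in (1, 5, 10, 20):
    grad = sigmoid_derivative(0.0) ** n
    print(f"{n:2d} layers -> gradient factor ~ {grad:.1e}")
# 20 layers -> gradient factor ~ 9.1e-13, effectively zero
```

Even in this best case, twenty sigmoid layers shrink the gradient by twelve orders of magnitude, which is why the early layers stop learning.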
Solutions
- Use the ReLU activation function.
- Use batch normalization layers.
- Redesign the network to reduce the number of layers.
- Weight initialization (Xavier initialization when sigmoid/tanh is used).
- Residual networks (skip connections); see the sketch after this list.
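The sketch below (PyTorch, with arbitrary layer sizes chosen for illustration) combines several of these fixes: ReLU activations, batch normalization, a residual (skip) connection, and Xavier weight initialization.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A block whose output is input + F(input), so gradients can skip through."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),   # batch normalization keeps activations well scaled
            nn.ReLU(),             # ReLU avoids the sigmoid's saturating derivative
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)     # skip connection

model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    ResidualBlock(64),
    ResidualBlock(64),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)

# Xavier (Glorot) initialization keeps the variance of activations
# roughly constant from layer to layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)
```

The skip connection gives the gradient a path around each block, so even if the block's own derivatives are small, the signal still reaches the early layers.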
Exploding Gradient
This is similar to the vanishing gradient, except that the gradient grows to a very large number rather than diminishing. It usually happens when we use unbounded activation functions like ReLU, which do not cap the layer outputs. The explosion comes from exponential growth: gradients are repeatedly multiplied through layers whose factors are larger than 1.0.
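The same kind of toy chain-rule calculation as before, but now with a per-layer factor larger than 1 (the value 1.5 is an arbitrary illustration), shows the opposite behaviour:

```python
# Each layer contributes a factor larger than 1 to the chain-rule product,
# e.g. a weight of 1.5 combined with a ReLU derivative of 1 in the active region.
factor = 1.5
for n in (5, 10, 20, 50):
    grad = factor ** n
    print(f"{n:2d} layers -> gradient factor ~ {grad:.1e}")
# 50 layers -> gradient factor ~ 6.4e+08: weight updates blow up and training diverges
```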
Solutions
- Gradient clipping (see the sketch after this list).
- Redesign the network to reduce the number of layers.
- Use an activation function with a bounded range.
- Weight initialization (Kaiming/He initialization).
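A minimal sketch of the first and last items in PyTorch; the model, the dummy data, and the clipping threshold of 1.0 are placeholders, not prescribed values.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

# Kaiming/He initialization, suited to ReLU layers.
for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x, y = torch.randn(16, 32), torch.randn(16, 1)  # dummy batch
for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Gradient clipping: rescale gradients so their global norm is at most 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

Clipping is applied after `backward()` and before `step()`, so the update uses the rescaled gradients rather than the exploded ones.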