Adagrad – Adaptive gradient algorithm
Adagrad adapts the learning rate for each parameter individually, by dividing the learning rate by the square root of the sum of the squares of the gradients for that parameter.
The gradient of the objective with respect to the parameters at time t: $g_t = \nabla_\theta J(\theta_t)$
Accumulate the sum of squared gradients: $G_t = G_{t-1} + g_t^2$
Update step, dividing the learning rate by the square root of the accumulated term: $\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{G_t + \epsilon}}\, g_t$
This means Adagrad applies a larger effective learning rate to parameters that have been updated infrequently (their accumulated squared gradient is small) and a smaller effective learning rate to parameters that are updated often (their accumulated squared gradient is large, shrinking the step). This lets the algorithm make faster progress on sparse data. The epsilon term is only for numerical stability.
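As a minimal sketch of the rule above (the toy objective, learning rate, and step count are illustrative choices, not part of the algorithm):

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One Adagrad update: accumulate squared gradients, then scale the step."""
    accum += grads ** 2                          # G_t = G_{t-1} + g_t^2
    params -= lr * grads / np.sqrt(accum + eps)  # theta_{t+1} = theta_t - eta / sqrt(G_t + eps) * g_t
    return params, accum

# Toy example: minimize f(x) = x^2, whose gradient is 2x.
params = np.array([5.0, -3.0])
accum = np.zeros_like(params)
for _ in range(100):
    grads = 2 * params
    params, accum = adagrad_step(params, grads, accum)
print(params)  # both entries move toward 0, with steps shrinking as the accumulator grows
```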
Advantages
- Well suited for sparse data, since parameters tied to less frequently occurring features receive larger updates.
- Removes the need to manually tune the learning rate.
Disadvantage
One downside of Adagrad is that the accumulated sum of squared gradients grows monotonically, so the effective learning rate keeps shrinking and the model may converge too slowly or stop learning altogether. RMSprop and Adadelta, described below, address this by replacing the full sum with a decaying average.
RMS prop – Root Mean Square Propagation
RMSprop modifies Adagrad by keeping an exponentially decaying moving average of the squared gradients instead of the full accumulated sum.
So the gradient accumulation step is modified to: $E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$, and the update divides by $\sqrt{E[g^2]_t + \epsilon}$ instead of $\sqrt{G_t + \epsilon}$.
The moving average prevents the effective learning rate from shrinking toward zero over time, as it does with Adagrad.
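A minimal sketch of the modified accumulation step (the learning rate and gamma below are commonly used values, chosen here for illustration):

```python
import numpy as np

def rmsprop_step(params, grads, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSprop update: decaying average of squared gradients instead of a full sum."""
    avg_sq = gamma * avg_sq + (1 - gamma) * grads ** 2  # E[g^2]_t
    params = params - lr * grads / np.sqrt(avg_sq + eps)
    return params, avg_sq
```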
Adadelta
Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w.
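In the usual formulation this fixed window is realized as an exponentially decaying average rather than by storing the last w gradients, and a second decaying average of past parameter updates takes the place of a global learning rate. A rough sketch of that form (the rho and epsilon values are illustrative):

```python
import numpy as np

def adadelta_step(params, grads, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One Adadelta update: scale gradients by RMS of past updates over RMS of past gradients."""
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grads ** 2
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grads
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
    return params + delta, avg_sq_grad, avg_sq_delta
```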
Adam – Adaptive moment estimation
The idea is to combine SGD with momentum and the adaptive per-parameter learning rates of RMS prop.
Estimate of the first moment (mean) – $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
Estimate of the second moment (uncentered variance) – $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
Bias correction – $\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$
Because $m_t$ and $v_t$ are initialized at zero, they are biased toward zero in the first iterations; for example, at $t = 1$, $m_1 = (1 - \beta_1)\, g_1$, and dividing by $1 - \beta_1^1$ recovers $g_1$. The correction keeps the moving averages accurate at the start of optimization, before they have had time to stabilize.
Optimization step – $\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$
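Putting the four steps together, a minimal sketch of one Adam update (the default-style hyperparameter values below are illustrative):

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grads        # first moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grads ** 2   # second moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)               # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```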
Advantages
- Converges quickly in practice.
- Rectifies the vanishing learning rate problem and reduces the high variance of the updates.