
Optimizers 2

Adagrad – Adaptive gradient algorithm

Adagrad adapts the learning rate for each parameter individually by dividing the learning rate by the square root of the sum of squared gradients for that parameter.

The gradient at time t:

    \[g_{t, i} = \bigtriangledown_{\theta}J(\theta_{t, i})\]

Storing square of gradients:

    \[G_{t, i} = G_{t-1, i} + (g_{t, i})^{2}\]

Optimize the learning rate by dividing with the accumulated term:

    \[\theta_{t+1, i} = \theta_{t, i} - \frac{\alpha}{\sqrt{G_{t, i} + \epsilon}} \cdot g_{t, i}\]

This means that Adagrad uses a higher effective learning rate for parameters that have been updated infrequently (their accumulated squared gradient is small) and a lower effective learning rate for parameters that have been updated often (their accumulated squared gradient is large, which shrinks the step). This allows the algorithm to converge more quickly on sparse data. Epsilon is just for numerical stability.
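
To make the update concrete, here is a minimal NumPy sketch of one Adagrad step; the names (params, grad, grad_accum) and the default hyperparameters are illustrative placeholders rather than any particular library's API.

    import numpy as np

    def adagrad_update(params, grad, grad_accum, lr=0.01, eps=1e-8):
        """One Adagrad step: accumulate squared gradients, then scale the step per parameter."""
        grad_accum += grad ** 2                            # G_t = G_{t-1} + g_t^2
        params -= lr / np.sqrt(grad_accum + eps) * grad    # theta_{t+1} = theta_t - lr / sqrt(G_t + eps) * g_t
        return params, grad_accum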

Advantages

  • Well suited for sparse data, since updates are larger for the less frequently occurring features.
  • Removes the need to manually tune the learning rate.

Disadvantage

One downside of Adagrad is that the accumulated squared gradients only grow, so the effective learning rate keeps shrinking over time and training can slow down or effectively stop. This is what the decaying-average variants below address.

RMS prop – Root Mean Square Propagation

RMSprop modifies Adagrad by adding a decay parameter, i.e. it keeps an exponentially weighted moving average of the squared gradients instead of their running sum.

So in the storing of gradient step, it is modified to:

    \[G_{t, i} = \beta \cdot G_{t-1, i} + (1 - \beta) \cdot (g_{t, i})^{2}\]

The moving average keeps the effective learning rate from shrinking towards zero over time, as it does with Adagrad.
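
A minimal sketch of the same step with the decaying accumulator swapped in; as before, the names and the beta of 0.9 are assumptions for illustration.

    import numpy as np

    def rmsprop_update(params, grad, grad_avg, lr=0.001, beta=0.9, eps=1e-8):
        """One RMSprop step: decaying average of squared gradients instead of a running sum."""
        grad_avg = beta * grad_avg + (1 - beta) * grad ** 2   # G_t = beta*G_{t-1} + (1-beta)*g_t^2
        params -= lr / np.sqrt(grad_avg + eps) * grad
        return params, grad_avg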

Adadelta

Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w.
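
The sketch below follows the update from Zeiler's Adadelta paper, where the fixed window is realized as an exponentially decaying average and a second accumulator over past squared parameter updates replaces the global learning rate; the variable names and the default rho are illustrative.

    import numpy as np

    def adadelta_update(params, grad, sq_grad_avg, sq_delta_avg, rho=0.95, eps=1e-6):
        """One Adadelta step: decaying averages of squared gradients and squared updates."""
        sq_grad_avg = rho * sq_grad_avg + (1 - rho) * grad ** 2
        delta = -np.sqrt(sq_delta_avg + eps) / np.sqrt(sq_grad_avg + eps) * grad
        sq_delta_avg = rho * sq_delta_avg + (1 - rho) * delta ** 2
        params += delta
        return params, sq_grad_avg, sq_delta_avg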

Adam – Adaptive moment estimation

The idea is to combine SGD with momentum (the first-moment estimate) and the adaptive per-parameter learning rate of RMSprop.

Estimates of the first moment (mean) –

    \[m_{t} = \beta_{1} m_{t-1} + (1 - \beta_{1})\bigtriangledown w_{t}\]

Estimates of the second moment (uncentered variance) –

    \[v_{t} = \beta_{2} v_{t-1} + (1 - \beta_{2})(\bigtriangledown w_{t})^{2}\]

Bias correction –

    \[\hat{m_{t}} = \frac{m_{t}}{1 - \beta_{1}^{t}}\]

    \[\hat{v_{t}} = \frac{v_{t}}{1 - \beta_{2}^{t}}\]

This step helps ensure that the moving averages are accurate even at the start of the optimization process before the averages have had time to stabilize.
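
For example, at t = 1 (with m_0 = 0) the raw estimate is only (1 - \beta_1) times the gradient, and the correction exactly undoes that shrinkage:

    \[m_{1} = (1 - \beta_{1})\bigtriangledown w_{1}, \qquad \hat{m_{1}} = \frac{m_{1}}{1 - \beta_{1}^{1}} = \bigtriangledown w_{1}\]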

Optimization step –

    \[\theta_{t+1} = \theta_{t} - \frac{\alpha}{\sqrt{\hat{v_{t}} + \epsilon}} \cdot \hat{m_{t}}\]
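
Putting the four steps together, a minimal NumPy sketch of one Adam update (placeholder names; the 0.9 / 0.999 betas are the commonly used defaults, and t counts update steps starting from 1):

    import numpy as np

    def adam_update(params, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam step: first/second moment estimates, bias correction, then the update."""
        m = beta1 * m + (1 - beta1) * grad             # first moment (mean)
        v = beta2 * v + (1 - beta2) * grad ** 2        # second moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)                   # bias correction
        v_hat = v / (1 - beta2 ** t)
        params -= lr / np.sqrt(v_hat + eps) * m_hat    # matches the update rule above
        return params, m, v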

Advantages

  • This method is fast and converges rapidly.
  • Rectifies the vanishing learning rate of Adagrad and reduces the high variance of the updates.
