
Optimizers 1

Optimizers in deep learning are algorithms that adjust the model parameters to minimize a loss function. Gradient descent is the most common optimizer, and you have probably heard of it. There are further optimizers catering to needs that gradient descent cannot fulfill. Let's start with gradient descent anyway.

Gradient descent optimizer

repeat until convergence:

    \[\theta_{j} := \theta_{j} - \alpha \frac{\partial }{\partial \theta_{j}} J(\theta)\]

(simultaneously for every parameter j = 0 … n, where α is the learning rate and J(θ) is the loss function)
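To make this concrete, here is a minimal NumPy-style sketch of the loop. J_grad is a hypothetical helper (not from the update rule itself) that is assumed to return the gradient of the loss at theta:

    def gradient_descent(theta, J_grad, alpha=0.01, num_iters=1000):
        """Vanilla gradient descent. theta is a NumPy array; J_grad(theta)
        is assumed to return the gradient of the loss J at theta."""
        for _ in range(num_iters):
            # theta_j := theta_j - alpha * dJ/dtheta_j, for all j at once
            theta = theta - alpha * J_grad(theta)
        return theta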

Batch gradient descent

Batch gradient descent is an optimization algorithm that updates the model's parameters by computing the gradient of the loss function with respect to the parameters over the entire training dataset. We take the average of the gradients of all the training examples and then use that mean gradient to update the parameters.

The main drawback of this method is that, for large datasets, convergence takes more time and memory utilization is high, since every training example must be processed before a single update.
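To illustrate, here is a hedged sketch of batch gradient descent on a toy linear-regression problem with mean-squared-error loss (the model, data, and step size are assumptions for demonstration, not from the post):

    import numpy as np

    def batch_gd_step(theta, X, y, alpha=0.1):
        # Average the MSE gradient over the ENTIRE training set,
        # then make one parameter update.
        m = len(y)
        grad = X.T @ (X @ theta - y) / m
        return theta - alpha * grad

    # Toy data: y = 3*x plus noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1))
    y = 3 * X[:, 0] + 0.1 * rng.normal(size=100)

    theta = np.zeros(1)
    for _ in range(200):
        theta = batch_gd_step(theta, X, y)
    print(theta)  # converges close to [3.]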

Stochastic gradient descent

Stochastic gradient descent (SGD) is a variant of gradient descent that uses a single training example to update the model’s parameters at each iteration. This makes the algorithm very sensitive to the choice of training examples, which can sometimes lead to suboptimal solutions. However, SGD has the advantage of being computationally efficient and can also escape local minima more easily than batch or mini-batch gradient descent.
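A minimal sketch of one SGD epoch under the same toy linear-regression assumptions as above; note that the gradient now comes from a single example:

    import numpy as np

    def sgd_epoch(theta, X, y, alpha=0.01):
        # One pass over the data, updating after EACH single example.
        for i in np.random.permutation(len(y)):
            xi, yi = X[i], y[i]
            grad = xi * (xi @ theta - yi)  # gradient from one example only
            theta = theta - alpha * grad
        return theta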

Mini-batch gradient descent

Mini-batch divides the training dataset into small subsets called mini-batches. The model’s parameters are then updated using the gradients of the loss function with respect to the model’s parameters, calculated using a single mini-batch. This allows the model to make faster progress towards the minimum of the loss function, as the gradients are calculated more frequently.

Here we need to find a good batch size: large enough that the noisy updates don't overshoot the global minimum, yet small enough that the optimizer doesn't get stuck in local minima. A sketch of this variant follows below.
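Here is a sketch of one mini-batch epoch under the same linear-regression assumptions, with batch_size as the tunable knob discussed above:

    import numpy as np

    def minibatch_gd_epoch(theta, X, y, alpha=0.01, batch_size=32):
        # One pass over the data in shuffled mini-batches.
        m = len(y)
        idx = np.random.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            # Average the gradient over this mini-batch only.
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)
            theta = theta - alpha * grad
        return theta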

To overcome these issues and further optimize gradient descent, several tweaks have been made to the algorithm that improve convergence speed and model accuracy.

Momentum

The basic idea behind momentum is to add a term to the update that accumulates past gradients. This term acts as a "memory" of the past gradients, which helps the optimizer move more smoothly through the parameter space.

The first step is to track this accumulation of gradients, called the velocity V:

    \[V(t) = \beta V(t-1) + (1 - \beta)\nabla J(\theta)\]

Initially, the velocity is 0. At every step, we calculate the gradient and update the exponential moving average as above. The momentum optimizer uses a hyperparameter, often denoted by the Greek letter beta (β), which controls the weight of the past gradients in the updates. A high value of beta (e.g. 0.9) means that the optimizer pays more attention to past gradients, while a low value (e.g. 0.1) means that it pays more attention to the current gradient.

Now for the next step:

    \[\theta = \theta - \alpha V(t)\]

Notice that the momentum optimizer has simply replaced the raw gradient with an exponential moving average of the gradients.
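Putting the two equations together, here is a minimal sketch of one momentum step. J_grad is the same hypothetical gradient helper as before, and v must start as a zero array of theta's shape:

    def momentum_step(theta, v, J_grad, alpha=0.01, beta=0.9):
        # V(t) = beta * V(t-1) + (1 - beta) * grad J(theta)
        v = beta * v + (1 - beta) * J_grad(theta)
        # The update uses the velocity in place of the raw gradient.
        theta = theta - alpha * v
        return theta, v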

Nesterov accelerated gradient descent

Here we give the momentum term a kind of prescience. Instead of evaluating the derivative of the cost function at θ, we evaluate it at the look-ahead point θ − βV(t−1), which approximates the next position of the parameters. So the velocity term is modified as,

    \[V(t) = \beta V(t-1) + (1 - \beta)\nabla J(\theta - \beta V(t-1))\]

Momentum first computes the gradient and then makes a big jump in the direction of the accumulated gradient. NAG first makes a big jump in the direction of the accumulated gradient and then makes the correction.
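A sketch of one NAG step under the same assumptions; compared with the momentum step above, the only change is where the gradient is evaluated:

    def nag_step(theta, v, J_grad, alpha=0.01, beta=0.9):
        # Gradient is evaluated at the look-ahead point theta - beta * V(t-1).
        v = beta * v + (1 - beta) * J_grad(theta - beta * v)
        theta = theta - alpha * v
        return theta, v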
