Stochastic gradient descent

Stochastic gradient descent (SGD) is an implementation of gradient descent that approximates the true gradient of the loss function. The true gradient is computed from all the training examples at once. SGD instead approximates it with the gradient of a single training example at a time, updating the weights after each one, until it has gone through all the training examples.
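
As a rough illustration of the per-example update, here is a minimal sketch in plain Python. The toy data (points on the line y = 3x), the squared-error loss, and the names `sgd_epoch` and `data` are assumptions made for this example only; they are not part of the platform.

```python
import random

def sgd_epoch(w, data, learning_rate):
    """One pass over the data, updating w after every single example."""
    for x, y in data:
        # Gradient of the squared error (w * x - y) ** 2 for this one example.
        grad = 2 * (w * x - y) * x
        w -= learning_rate * grad
    return w

# Hypothetical toy data generated from y = 3x.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]

random.seed(0)
w = 0.0
for epoch in range(50):
    random.shuffle(data)  # visit the examples in a new order each epoch
    w = sgd_epoch(w, data, learning_rate=0.05)

print(w)  # ends up very close to the true slope, 3.0
```

Each update uses only one example, so individual steps are noisy estimates of the true gradient, but over many steps the weight still moves toward the minimum.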

To tweak the SGD optimizer, you can adjust these parameters:

Learning rate

The learning rate controls the size of the update steps along the gradient. This parameter sets how much of the gradient you update with, where 1 = 100%. Normally you set a much smaller learning rate, e.g., 0.001.

In our rolling ball analogy, we’re calculating where the ball should roll next in discrete steps (not continuously). The length of these discrete steps is the learning rate.

Choosing a good learning rate is important when training a neural network. If the ball rolls carefully, with a small learning rate, we can expect to make consistent but very small progress. The risk, though, is that the ball gets stuck in a local minimum and never reaches the global minimum.

Figure 1. Learning rate

We could also choose to take long, confident steps in an attempt to descend faster and avoid local minima, but this may not pay off. At some point, recalculating the gradient too seldom gives a higher loss because we “overstep”: we overshoot the minimum.
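
To make the trade-off concrete, the sketch below runs plain gradient descent on the toy loss f(x) = x², whose gradient is 2x. The loss function, the starting point, and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(start, learning_rate, steps):
    """Minimize f(x) = x**2 by repeatedly stepping against its gradient, 2 * x."""
    x = start
    for _ in range(steps):
        grad = 2 * x
        x -= learning_rate * grad
    return x

too_small = gradient_descent(5.0, learning_rate=0.001, steps=100)
reasonable = gradient_descent(5.0, learning_rate=0.1, steps=100)
too_large = gradient_descent(5.0, learning_rate=1.1, steps=100)

print(too_small)   # still far from the minimum at 0
print(reasonable)  # very close to 0
print(too_large)   # every step overshoots, so the value grows instead of shrinking
```

With the smallest rate the ball has barely moved after 100 steps. With the largest rate, each step jumps past the minimum and lands farther away than where it started.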

Learning rate decay

The value here defines how the learning rate is gradually decreased during training. Taking smaller steps along the gradient later in training helps the optimizer settle into a minimum instead of overshooting it.
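
One common way to decrease the learning rate is time-based decay, sketched below. This formula is an assumption used for illustration; this article does not state which schedule the platform applies. The function name `decayed_learning_rate` is hypothetical.

```python
def decayed_learning_rate(initial_rate, decay, step):
    """Time-based decay (assumed schedule): the rate shrinks as the step count grows."""
    return initial_rate / (1.0 + decay * step)

# Learning rate at a few points during training, with decay = 0.01.
for step in [0, 100, 1000]:
    print(step, decayed_learning_rate(0.1, 0.01, step))
```

With a decay of 0 the learning rate stays constant. Larger decay values shrink the steps sooner.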


Momentum

Momentum is a method that helps accelerate the optimizer in the relevant direction and dampens oscillations. The momentum term increases updates for dimensions whose gradients keep pointing in the same direction, and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.

The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way. If we don’t use momentum, the ball gets no information about where it was before each discrete calculation step. Without momentum, each new step is based only on the current gradient, with no history.

In this way, momentum helps the optimizer not to get stuck in local minima.
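
The sketch below shows the classical momentum update on the toy loss f(x) = x² (gradient 2x). The velocity variable carries the history of earlier gradients. The loss, the learning rate of 0.01, and the momentum value of 0.9 are assumptions chosen for illustration.

```python
def descend(start, learning_rate, momentum, steps):
    """Gradient descent on f(x) = x**2 with a classical momentum term."""
    x = start
    velocity = 0.0
    for _ in range(steps):
        grad = 2 * x
        # Keep a fraction of the previous step, then add the new gradient step.
        velocity = momentum * velocity - learning_rate * grad
        x += velocity
    return x

without_momentum = descend(5.0, learning_rate=0.01, momentum=0.0, steps=100)
with_momentum = descend(5.0, learning_rate=0.01, momentum=0.9, steps=100)

print(abs(without_momentum))  # still noticeably away from the minimum at 0
print(abs(with_momentum))     # much closer to 0 after the same number of steps
```

Setting the momentum to 0 recovers plain gradient descent, where each step depends only on the current gradient.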

Figure 2. Momentum

For an in-depth walkthrough of momentum, see the excellent resource Why Momentum Really Works.

Nesterov momentum

Standard momentum blindly accelerates down slopes: it first computes the gradient, then makes a big jump. Nesterov momentum is a way to “look ahead”. It first makes a big jump in the direction of the previously accumulated gradient. Then it measures the gradient where it ends up and makes a correction, resulting in the complete update vector.
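
The look-ahead step can be sketched as follows, again on the toy loss f(x) = x² (gradient 2x). This is one common formulation of Nesterov momentum, not necessarily the exact one used by the platform; the function names and parameter values are assumptions for illustration.

```python
def gradient(x):
    """Gradient of the toy loss f(x) = x**2."""
    return 2 * x

def nesterov_descend(start, learning_rate, momentum, steps):
    x = start
    velocity = 0.0
    for _ in range(steps):
        # Big jump first: where would the accumulated velocity take us?
        look_ahead = x + momentum * velocity
        # Measure the gradient at that look-ahead point and correct the jump.
        velocity = momentum * velocity - learning_rate * gradient(look_ahead)
        x += velocity
    return x

result = nesterov_descend(5.0, learning_rate=0.01, momentum=0.9, steps=100)
print(abs(result))  # close to the minimum at 0
```

The only difference from standard momentum is where the gradient is evaluated: at the look-ahead point rather than at the current position.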

Figure 3. Nesterov momentum