Optimizers and compiler options

Deep learning is ultimately about making predictions that are as good as possible. We measure how well our network is performing by computing a loss function over our data. For example, with the loss function mean absolute error (MAE) we measure the absolute difference between our prediction and the actual value, averaged over the data.
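
As a rough illustration of how such a loss could be computed, here is a minimal sketch of MAE in plain Python. The function name and the sample values are made up for this example and are not part of the Platform.

```python
def mean_absolute_error(predictions, targets):
    """Average of the absolute differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    total = sum(abs(p - t) for p, t in zip(predictions, targets))
    return total / len(predictions)


# Hypothetical values, chosen only for illustration.
targets = [3.0, 5.0, 2.0]
good_predictions = [2.5, 5.0, 2.5]  # close to the targets
bad_predictions = [0.0, 9.0, 6.0]   # far from the targets

good_loss = mean_absolute_error(good_predictions, targets)
bad_loss = mean_absolute_error(bad_predictions, targets)

print(good_loss < bad_loss)  # True: better predictions give a lower loss
```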

The lower the value of the loss function, the better the model performs. If our predictions are totally off, the loss function will output a higher number. If they’re pretty good, it’ll output a lower one. The goal is to find a minimum of the loss function that generalizes well. Generalization is the ability to correctly predict new, previously unseen data points.

Our strategy to minimize the loss is to start with random weights and then iteratively refine them over time to get a lower loss. This is called training.

On the Platform you can choose from several different training strategies for getting a lower loss. These strategies are called optimizers, for example Adam, SGD, and RMSprop.

The optimizers can be tuned by using different parameters such as batch size and learning rate.

Ball rolling in an artificial landscape

The loss function, defined by the parameters and the data, can be viewed as an artificial landscape. At every point in the landscape, we achieve a particular loss, which is the height of the terrain. The landscape has valleys and hills, where the valleys are minima of the loss function.

The optimizer works like a ball rolling downhill, trying to reach the deepest valley where the loss is as low as possible. The trick is to find the best strategy without risking the ball getting stuck in a local minimum (a shallow valley) and never reaching the global minimum (the deepest valley).

Figure 1. Ball rolling in an artificial landscape

Gradient descent

The basic algorithm we will use to find the lowest loss is gradient descent.

The gradient tells us the slope of the loss function along every dimension and thus gives the direction of steepest ascent. We want to descend into the deepest valley, so we follow the gradient in the opposite direction. This is called gradient descent. In our rolling ball analogy, this corresponds to the direction in which the ball will roll.

Figure 2. Gradient descent
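
To make the idea concrete, the following is a minimal sketch of gradient descent on a hand-made, bowl-shaped loss with two parameters. The loss function, its minimum at (3, -1), the step size, and the number of steps are all assumptions chosen for illustration. A real network has many more parameters and a far more rugged landscape.

```python
def loss(x, y):
    """A hypothetical bowl-shaped landscape with its minimum at (3, -1)."""
    return (x - 3.0) ** 2 + 2.0 * (y + 1.0) ** 2


def gradient(x, y):
    """Slope of the loss along each dimension (direction of steepest ascent)."""
    return 2.0 * (x - 3.0), 4.0 * (y + 1.0)


def gradient_descent(x, y, step_size=0.1, steps=100):
    for _ in range(steps):
        grad_x, grad_y = gradient(x, y)
        # Step against the gradient to move downhill.
        x -= step_size * grad_x
        y -= step_size * grad_y
    return x, y


start_x, start_y = 0.0, 0.0
final_x, final_y = gradient_descent(start_x, start_y)

print(loss(start_x, start_y), loss(final_x, final_y))
```

The ball starts at (0, 0) and ends up very close to the bottom of the bowl, because every step moves it a little further downhill.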

Gradient descent in a neural network with backpropagation

To compute the gradient over all parameters in a neural network we use a method popularized under the name backpropagation.

Forward pass

Backpropagation first computes all activations for a batch of data and a set of parameters in a forward pass through the network.

Loss is computed

The loss for the batch is computed at the end of the network when the final output is available for that data and set of parameters.

Backward pass

The loss is then used to determine how to move the parameters in the last layer to achieve a lower loss.

The information on how to achieve a lower loss is then, in turn, propagated backward through the network and used in the same way for all earlier layers. This phase is called the backward pass, because this information is passed backward through the network.

Update parameters

After the backward pass, we have an "update step", where the parameters are actually updated to (hopefully) lower the loss.
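
The four phases above can be sketched for a deliberately tiny network with one hidden unit and one output unit. Everything here is an assumption made for illustration: the tanh activation, the mean squared error loss (used instead of MAE because its derivative is simple), the toy data, and the starting weights. It is not how the Platform implements training.

```python
import math


def train_step(params, batch, learning_rate=0.1):
    """One forward pass, loss computation, backward pass, and update."""
    w1, b1, w2, b2 = params
    n = len(batch)
    loss = 0.0
    grad_w1 = grad_b1 = grad_w2 = grad_b2 = 0.0

    for x, y in batch:
        # Forward pass: compute the activations layer by layer.
        h = math.tanh(w1 * x + b1)
        y_hat = w2 * h + b2

        # Loss: mean squared error, accumulated over the batch.
        loss += (y_hat - y) ** 2 / n

        # Backward pass: start at the last layer ...
        d_y_hat = 2.0 * (y_hat - y) / n
        grad_w2 += d_y_hat * h
        grad_b2 += d_y_hat
        # ... then propagate the signal back to the earlier layer.
        d_pre = d_y_hat * w2 * (1.0 - h ** 2)
        grad_w1 += d_pre * x
        grad_b1 += d_pre

    # Update step: move every parameter against its gradient.
    new_params = (
        w1 - learning_rate * grad_w1,
        b1 - learning_rate * grad_b1,
        w2 - learning_rate * grad_w2,
        b2 - learning_rate * grad_b2,
    )
    return new_params, loss


# Hypothetical toy data (input, target) and starting weights.
batch = [(-1.0, -0.5), (0.0, 0.0), (1.0, 0.5), (2.0, 0.8)]
params = (0.5, 0.0, 0.5, 0.0)

losses = []
for _ in range(200):
    params, loss = train_step(params, batch)
    losses.append(loss)

print(losses[0], losses[-1])  # the loss should shrink as training proceeds
```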

Batch size

Batch size is the number of samples that are processed together in one step. You can use anything from all samples at once (full batch) to only one sample at a time (online). On the Platform we’ve set the default batch size to 32 samples. This batch size has empirically proven to be a good choice.

Example: You have 2560 training samples and you want to set the batch size to 32. The experiment then takes the first 32 samples (1 to 32) from the training dataset and trains the network. Then it takes the next 32 samples (33 to 64). Then the next 32 …
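
The slicing in this example can be sketched in a few lines of Python. The helper name iterate_batches is made up for illustration, and plain sample numbers stand in for the real training samples.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of `samples`, each holding `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Sample numbers 1..2560 stand in for the real training samples.
samples = list(range(1, 2561))
batches = list(iterate_batches(samples, batch_size=32))

first, second = batches[0], batches[1]
print(len(batches))           # 80
print(first[0], first[-1])    # 1 32
print(second[0], second[-1])  # 33 64
```

With 2560 samples and a batch size of 32, one pass over the training data therefore consists of 80 batches.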

When backpropagation was first invented, people were trying to find the best batch size. The question was: is full batch best, or only one sample at a time? On one hand, full-batch gradient descent becomes unnecessarily slow as the number of samples grows. On the other hand, when the batch size is very small the gradient estimates become very noisy and we have to use a small learning rate, which in turn makes progress slow. Another drawback of a very small batch size is that the GPUs become underutilized. That said, generalization error is often best for a batch size of 1.

The larger the batch size, the better our estimate of the true gradient. However, the batch size is usually limited by GPU memory.

Now the most common thing is to use mini-batches, a kind of middle way between full batch and online.

Learning rate

The learning rate controls the size of the update steps along the gradient. This parameter sets how much of the gradient you update with, where 1 = 100%, but normally you set a much smaller learning rate, e.g., 0.001.

In our rolling ball analogy, we’re calculating where the ball should roll next in discrete steps (not continuously). The learning rate determines how long these discrete steps are.

Choosing a good learning rate is important when training a neural network. If the ball rolls carefully (this corresponds to having a small learning rate), we can expect to make consistent but very small progress. The risk, though, is that the ball gets stuck in a local minimum and never reaches the global minimum.

Figure 3. Learning rate

We could also choose to take long, confident discrete steps in an attempt to descend faster and avoid local minima, but this may not pay off. At some point, recalculating the direction too seldom gives a higher loss because we “overstep” and overshoot the minimum.
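
The trade-off can be seen in a small sketch that runs gradient descent on the one-dimensional loss x**2, whose minimum is at x = 0. The three learning rates are arbitrary values picked to show a cautious, a reasonable, and a too large step length on this particular loss. They are not recommendations for real networks.

```python
def descend(learning_rate, start=1.0, steps=20):
    """Run gradient descent on loss(x) = x**2, whose gradient is 2 * x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2.0 * x
    return x


cautious = descend(learning_rate=0.01)   # consistent but slow progress
moderate = descend(learning_rate=0.3)    # quickly approaches the minimum
too_large = descend(learning_rate=1.1)   # oversteps and moves away from 0

print(abs(cautious), abs(moderate), abs(too_large))
```

After the same number of steps, the cautious run is still far from the minimum, the moderate run is very close to it, and the run with the too large learning rate has ended up further away than where it started.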

For a deeper dive in optimizers, you should read Sebastian Ruder’s excellent blog post An overview of gradient descent optimization algorithms and the module Optimization: Stochastic Gradient Descent in CS231n Convolutional Neural Networks for Visual Recognition.


Momentum

Momentum is a method that helps accelerate the optimizer in the relevant direction and dampens oscillations. The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.

The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way. If we don’t use momentum, the ball gets no information on where it was before each discrete calculation step. Without momentum, each new calculation is based only on the current gradient, with no history.

In this way, momentum helps the optimizer not to get stuck in local minima.

Figure 4. Momentum
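
A minimal sketch of the classical momentum update is shown here, compared with plain gradient descent on a hypothetical ravine-shaped loss (a gentle slope in one direction, steep walls in the other). The loss, the starting point, the learning rate, and the momentum value 0.9 are assumptions chosen for illustration.

```python
def loss(x, y):
    """Hypothetical ravine: gentle slope along x, steep walls along y."""
    return 0.5 * (x ** 2 + 50.0 * y ** 2)


def gradient(x, y):
    return x, 50.0 * y


def descend(momentum, learning_rate=0.02, steps=100):
    x, y = 10.0, 1.0   # starting position
    vx, vy = 0.0, 0.0  # velocity: the accumulated history of past steps
    for _ in range(steps):
        gx, gy = gradient(x, y)
        # Keep a fraction of the previous velocity, then add the new step.
        vx = momentum * vx - learning_rate * gx
        vy = momentum * vy - learning_rate * gy
        x += vx
        y += vy
    return loss(x, y)


plain_loss = descend(momentum=0.0)     # no history: plain gradient descent
momentum_loss = descend(momentum=0.9)  # velocity builds up along the ravine

print(plain_loss, momentum_loss)
```

With the momentum set to 0 the update reduces to plain gradient descent, which crawls along the gentle slope. With momentum, the steps along that slope add up, so the same number of steps reaches a much lower loss in this example.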

For an in-depth walkthrough of momentum, Why Momentum Really Works is an excellent resource.

Nesterov momentum

Standard momentum blindly accelerates down slopes: it first computes the gradient and then makes a big jump. Nesterov momentum is a way to “look ahead”. It first makes a big jump in the direction of the previously accumulated gradient. Then it measures the gradient where it ends up and makes a correction, resulting in the complete update vector.

Figure 5. Nesterov momentum
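
The look-ahead step can be sketched as follows, using one common formulation of Nesterov momentum in which the gradient is evaluated at the position the accumulated velocity would carry us to. The ravine-shaped loss, the starting point, and the hyperparameter values are assumptions chosen for illustration. Libraries may implement an equivalent but differently arranged update.

```python
def loss(x, y):
    """Hypothetical ravine: gentle slope along x, steep walls along y."""
    return 0.5 * (x ** 2 + 50.0 * y ** 2)


def gradient(x, y):
    return x, 50.0 * y


def nesterov_descend(x, y, momentum=0.9, learning_rate=0.02, steps=100):
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        # Big jump first: where would the accumulated velocity take us?
        ahead_x = x + momentum * vx
        ahead_y = y + momentum * vy
        # Measure the gradient at the look-ahead point, not the current one.
        gx, gy = gradient(ahead_x, ahead_y)
        # Correction: combine the jump and the look-ahead gradient.
        vx = momentum * vx - learning_rate * gx
        vy = momentum * vy - learning_rate * gy
        x += vx
        y += vy
    return x, y


start = (10.0, 1.0)
end = nesterov_descend(*start)

print(loss(*start), loss(*end))
```

The only difference from standard momentum is where the gradient is measured: at the look-ahead point instead of at the current position.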