Optimization principles (in deep learning)
This chapter presents tools for various training strategies that serve to improve the quality of the training. We will also go in depth about how deep learning is achieved.
In the settings, the choice of optimizer determines how gradient descent is carried out. The aim is to improve the predictions by lowering the loss.
To be able to tweak the settings, it is important to understand gradient descent. If you’re not in the mood for some preparatory theory, go directly to the relevant heading!
Optimization with labeled training
Although there are several ways to train neural networks, models on the platform are trained using labeled training. This means that you need to provide examples of input features together with the correct answer that the model should predict for each of them, namely the label.
Models have weights, sometimes referred to as parameters, which are the coefficients of their mathematical functions.
During training, the model weights are optimized so that the output prediction is as close as possible to the provided label (on our platform, label is synonymous with target). Deep networks have many weights, from thousands to millions. Optimizing them is possible thanks to the gradient descent method. Here is the principle.
We measure how well our network is performing, that is, the quality of its predictions, by computing the model loss. The loss is a measure of how much error the model is committing, averaged over all training examples.
The loss is calculated from the difference between the model output and the provided label using a loss function. Different loss functions have different properties, which make them more or less appropriate for different use cases.
With every loss function, the loss is positive, and it is large when the model makes large errors. To optimize the model, we look at the loss and try to minimize it, i.e., bring it as close to 0 as possible.
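As a minimal sketch of what a loss function does, the example below uses mean squared error, one common choice and not necessarily the one used on the platform. The function name and the sample values are made up for illustration.

```python
def mean_squared_error(predictions, labels):
    """Average squared difference between predictions and labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return sum(squared_errors) / len(labels)


# Hypothetical labels and two sets of model outputs.
labels = [1.0, 2.0, 3.0]
good_predictions = [1.1, 1.9, 3.2]   # close to the labels
bad_predictions = [3.0, 0.0, 6.0]    # far from the labels

loss_good = mean_squared_error(good_predictions, labels)
loss_bad = mean_squared_error(bad_predictions, labels)
print(loss_good, loss_bad)
```

The predictions that are closer to the labels give the smaller loss, and perfect predictions would give a loss of exactly 0.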
Minimizing the loss
A very common analogy is that optimizing a model is like rolling a ball on an artificial landscape. In this analogy, the coordinates (x-position in the figure) represent different sets of model weights:
The altitude of the hill corresponds to the loss, as calculated by a particular loss function.
The ball is where the model is located at any given time.
The ball is moved around step by step in an iterative manner.
To compute the gradient of the loss with respect to every parameter in a neural network (i.e., how much the loss changes in every possible direction in the landscape), we use a method called backpropagation.
Backpropagation starts from the model loss, and calculates its derivative with respect to the weights that are in the last model layer. Derivatives are then calculated with respect to every other layer, working backward from the last to the first layer of the model.
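To make the backward pass concrete, here is a hand-written sketch for a toy network with two scalar weights. The architecture (a tanh hidden unit followed by a linear output and a squared-error loss) and all names are assumptions chosen for illustration; real frameworks compute these derivatives automatically.

```python
import math


def forward(w1, w2, x):
    """Forward pass: hidden activation, then output prediction."""
    h = math.tanh(w1 * x)   # first layer
    y_hat = w2 * h          # last layer
    return h, y_hat


def loss_fn(w1, w2, x, y):
    """Squared error between prediction and label."""
    _, y_hat = forward(w1, w2, x)
    return (y_hat - y) ** 2


def backward(w1, w2, x, y):
    """Chain rule, applied from the loss back to the first layer."""
    h, y_hat = forward(w1, w2, x)
    d_y_hat = 2.0 * (y_hat - y)          # d(loss)/d(y_hat)
    d_w2 = d_y_hat * h                   # last-layer weight first
    d_h = d_y_hat * w2                   # pass the gradient backward
    d_w1 = d_h * (1.0 - h ** 2) * x      # tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2


# Hypothetical weights, input and label.
w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
d_w1, d_w2 = backward(w1, w2, x, y)
print(d_w1, d_w2)
```

Note how the derivative for the last-layer weight is computed first, and the intermediate result is reused when moving back to the first layer.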
Once the gradient of the loss is known for every possible direction, i.e. variation of every weight, the model weights are updated in the direction that provides the largest decrease of the loss. In the analogy, this is pushing the ball in the direction of steepest descent.
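The update step can be sketched as follows. The two-weight loss landscape below is invented for illustration (a simple bowl with its lowest point at weights 3 and -1), and the gradient is written out by hand rather than obtained through backpropagation.

```python
def loss(weights):
    """Hypothetical bowl-shaped landscape with its minimum at (3, -1)."""
    w0, w1 = weights
    return (w0 - 3.0) ** 2 + 2.0 * (w1 + 1.0) ** 2


def gradient(weights):
    """Partial derivative of the loss with respect to each weight."""
    w0, w1 = weights
    return [2.0 * (w0 - 3.0), 4.0 * (w1 + 1.0)]


def gradient_descent(weights, learning_rate=0.1, steps=100):
    """Repeatedly push the 'ball' in the direction of steepest descent."""
    for _ in range(steps):
        grad = gradient(weights)
        # Move each weight against its gradient component.
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights


final_weights = gradient_descent([0.0, 0.0])
print(final_weights, loss(final_weights))
```

Starting from weights (0, 0), the ball ends up very close to the bottom of the bowl.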
Although this principle is always the same, how gradient descent is implemented numerically depends on the optimizer. Different optimizers may behave differently on different types of problems, leading to a faster or better decrease of the loss, but we find that the Adam optimizer is a good all-round optimizer.
Batch size determines how many samples are processed at the same time. The larger the batch size, the better the estimate we get of the true gradient. However, the batch size is usually limited by GPU memory.
When backpropagation was first invented, people tried to find the best batch size. The question was: is the full batch best, or only one sample at a time?
The problem was that, on the one hand, full-batch gradient descent becomes unnecessarily slow when the number of samples grows. On the other hand, when the batch size is very small, the gradient estimates become very noisy and we have to use a small learning rate, which in turn makes progress slow.
Another drawback of a very small batch size is that the GPUs become under-utilized. Small batches are not all bad, though: generalization error is often best for a batch size of 1.
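The effect of batch size on gradient noise can be illustrated with a small experiment. This is a sketch under stated assumptions: the dataset is synthetic, the model is a single weight `w` predicting `w * x`, and the helper names are hypothetical.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic dataset (an assumption): y is roughly 2 * x plus noise.
data = []
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))


def batch_gradient(w, batch):
    """Gradient of the mean squared error of the model w * x over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def gradient_spread(w, batch_size, trials=200):
    """Standard deviation of the gradient estimate across random batches."""
    estimates = [
        batch_gradient(w, rng.sample(data, batch_size)) for _ in range(trials)
    ]
    return statistics.pstdev(estimates)


spread_small = gradient_spread(0.0, batch_size=1)
spread_large = gradient_spread(0.0, batch_size=64)
print(spread_small, spread_large)
```

The estimates from batches of 64 samples vary much less from batch to batch than those from single samples, which is the sense in which a larger batch gives a better estimate of the true gradient.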
The learning rate controls the size of the update steps along the gradient. This parameter sets how much of the gradient you update with, where 1 = 100%, but normally you set a much smaller learning rate, e.g., 0.001.
In our rolling ball analogy, we’re calculating where the ball should roll next in discrete steps (not continuous). How long these discrete steps are is the learning rate.
Choosing a good learning rate is important when training a neural network. If the ball rolls carefully, which corresponds to a small learning rate, we can expect to make consistent but very small progress. The risk, though, is that the ball gets stuck in a local minimum and never reaches the global minimum.
Larger steps mean that the weights are changed more every iteration, so that they may reach their optimal value faster, but may also miss the exact optimum.
Smaller steps mean that the weights are changed less every iteration, so it may take more epochs to reach their optimal value, but they are less likely to miss optima of the loss function.
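The trade-off between small and large steps can be seen on a toy problem. The sketch below assumes a one-weight loss, (w - 3)^2, and three hand-picked learning rates; the function name and the values are illustrative only.

```python
def distance_after(learning_rate, steps=20, start=0.0, optimum=3.0):
    """Run gradient descent on (w - optimum)^2 and report how far w ends up."""
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - optimum)   # derivative of (w - optimum)^2
        w -= learning_rate * grad
    return abs(w - optimum)


small = distance_after(0.01)      # careful steps: steady but slow
moderate = distance_after(0.4)    # larger steps: fast progress
too_large = distance_after(1.1)   # steps so large they overshoot every time
print(small, moderate, too_large)
```

With the same number of steps, the small learning rate is still far from the optimum, the moderate one has essentially arrived, and the overly large one ends up farther away than where it started.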
Learning rate scheduling allows you to use large steps during the first few epochs, then progressively reduce the step size as the weights come closer to their optimal value.
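One simple way to realize such a schedule is exponential decay, sketched below. This is just one possible schedule, and the function name, initial rate and decay factor are assumptions for illustration.

```python
def exponential_decay(initial_rate, decay, epoch):
    """Learning rate for a given epoch: shrinks by a constant factor per epoch."""
    return initial_rate * decay ** epoch


# Hypothetical settings: start at 0.01 and keep 90% of the rate each epoch.
schedule = [exponential_decay(0.01, 0.9, epoch) for epoch in range(5)]
for epoch, rate in enumerate(schedule):
    print(f"epoch {epoch}: learning rate {rate:.5f}")
```

The first epochs use the largest steps, and every later epoch uses a slightly smaller one.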
β1 and β2 rate
ε (epsilon) is called the fuzz factor: a small constant that guards against division by zero in the update.
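Assuming these settings refer to the hyperparameters of the Adam optimizer mentioned earlier (two decay rates for its running averages and a small epsilon constant), the sketch below shows where each of them enters a single-weight Adam update. The parameter names and default values follow common convention and are not taken from the platform.

```python
import math


def adam_minimize(gradient, w, learning_rate=0.1, beta_1=0.9, beta_2=0.999,
                  epsilon=1e-8, steps=500):
    """Minimal single-weight Adam loop (illustrative, not a library API)."""
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta_1 * m + (1.0 - beta_1) * g        # beta_1: decay of the average
        v = beta_2 * v + (1.0 - beta_2) * g * g    # beta_2: decay of the squares
        m_hat = m / (1.0 - beta_1 ** t)            # correct the bias toward zero
        v_hat = v / (1.0 - beta_2 ** t)
        # epsilon keeps the denominator away from zero.
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


# Hypothetical loss (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

On this toy loss the weight settles near 3, the position of the minimum.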
Momentum is a method that helps accelerate the optimizer in the relevant direction and dampens oscillations. The momentum term increases for dimensions whose gradients keep pointing in the same direction, and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way. If we don’t use momentum, the ball gets no information on where it was before each discrete calculation step. Without momentum, each new calculation is based only on the current gradient, with no history.
In this way, momentum helps the optimizer avoid getting stuck in local minima.
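A minimal sketch of the momentum update, assuming the classic formulation in which a velocity accumulates past gradients. The elongated two-weight loss, the learning rate and the momentum value of 0.9 are made up for illustration.

```python
def loss(w):
    """Hypothetical elongated bowl: shallow along w[0], steep along w[1]."""
    return 0.5 * (w[0] ** 2 + 50.0 * w[1] ** 2)


def gradient(w):
    return [w[0], 50.0 * w[1]]


def train(momentum, learning_rate=0.02, steps=100):
    """Gradient descent with a velocity term; momentum=0.0 gives plain descent."""
    w = [1.0, 1.0]
    velocity = [0.0, 0.0]
    for _ in range(steps):
        grad = gradient(w)
        # Keep a fraction of the previous step and add the new gradient step.
        velocity = [momentum * v - learning_rate * g
                    for v, g in zip(velocity, grad)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
    return loss(w)


loss_plain = train(momentum=0.0)
loss_momentum = train(momentum=0.9)
print(loss_plain, loss_momentum)
```

In the shallow direction the gradients keep pointing the same way, so the velocity builds up and the run with momentum reaches a lower loss in the same number of steps.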
For an in-depth walkthrough of momentum, an excellent resource is Why Momentum Really Works.
Standard momentum blindly accelerates down slopes: it first computes the gradient, then makes a big jump. Nesterov momentum is a way to “look ahead”. It first makes a big jump in the direction of the previously accumulated gradient. Then it measures the gradient where it ends up and makes a correction, resulting in the complete update vector.
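The look-ahead idea can be sketched as follows, assuming one common formulation of Nesterov momentum in which the gradient is evaluated at the position the accumulated velocity would carry the weight to. The toy loss and the hyperparameter values are illustrative.

```python
def nesterov_minimize(gradient, w, learning_rate=0.1, momentum=0.9, steps=200):
    """Single-weight Nesterov momentum loop (illustrative sketch)."""
    velocity = 0.0
    for _ in range(steps):
        # Big jump first: where would the accumulated velocity take us?
        lookahead = w + momentum * velocity
        # Measure the gradient at that look-ahead position...
        g = gradient(lookahead)
        # ...and use it as the correction in the complete update.
        velocity = momentum * velocity - learning_rate * g
        w += velocity
    return w


# Hypothetical loss (w - 3)^2, whose gradient is 2 * (w - 3).
w_nesterov = nesterov_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_nesterov)
```

The only difference from standard momentum is the point at which the gradient is measured: at the look-ahead position instead of at the current weight.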
For a deeper dive into optimizers, you should read Sebastian Ruder’s excellent blog post, An overview of gradient descent optimization algorithms, and the module Optimization: Stochastic Gradient Descent in CS231n Convolutional Neural Networks for Visual Recognition.