Stochastic gradient descent
Stochastic gradient descent (SGD) is a variant of gradient descent that approximates the true gradient of the loss function.
Rather than computing the gradient over all training examples at once, SGD approximates it one training example at a time: it calculates the gradient for a single example, updates the weights, and repeats this until it has gone through all training examples.
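As a rough illustration of the one-example-at-a-time idea, here is a minimal sketch that fits a straight line with SGD. The function name sgd_epoch, the toy data, and the squared-error loss are assumptions made for this example only; they are not part of any particular library or platform.

```python
import random

def sgd_epoch(w, b, data, learning_rate):
    """One pass over the data, updating the weights after every single example."""
    random.shuffle(data)  # visit the examples in a random order
    for x, y in data:
        error = (w * x + b) - y
        # Gradient of 0.5 * error**2 with respect to w and b, for this one example.
        w -= learning_rate * error * x
        b -= learning_rate * error
    return w, b

random.seed(0)
# Hypothetical toy data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

w, b = 0.0, 0.0
for _ in range(200):  # 200 epochs, each going through all training examples once
    w, b = sgd_epoch(w, b, data, learning_rate=0.1)

print(w, b)  # should end up close to the true values 2 and 1
```

Each update only sees one example, so a single step is a noisy estimate of the true gradient, but over many steps the weights still move toward the values that fit all examples.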
When to change optimizer & optimizer parameters
Once you have settled on the overall model structure but want to achieve an even better model, it can be appropriate to test another optimizer.
This is classic hyperparameter fine-tuning, where you try different options and see what works best. Any of these optimizers may achieve superior results, though getting there can sometimes require a lot of tuning of other Run settings parameters, for example, the learning rate.
To tweak the SGD optimizer, you can adjust these parameters:
The learning rate controls the size of the update steps along the gradient. This parameter sets how much of the gradient you update with, where 1 = 100%, but normally you set a much smaller learning rate, e.g., 0.001.
In our rolling ball analogy, we’re calculating where the ball should roll next in discrete steps (not continuous). The length of these discrete steps is the learning rate.
Choosing a good learning rate is important when training a neural network. The best learning rate depends on the individual problem and model. If the ball rolls carefully with a small learning rate, we can expect to make consistent but very small progress. The risk, though, is that the ball gets stuck in a local minimum and never reaches the global minimum.
A larger learning rate means that the weights are changed more every iteration, so they may reach their optimal value faster, but they may also miss the exact optimum.
A smaller learning rate means that the weights are changed less every iteration, so it may take longer to reach their optimal value, but they are less likely to miss optima of the loss function.
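The trade-off can be seen on a deliberately simple, made-up loss function, loss(w) = w**2, whose optimum is at w = 0. The sketch below is only an illustration; the function name descend and the chosen learning rates are assumptions picked to make the three behaviors visible.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Run plain gradient descent on loss(w) = w**2 and return the final weight."""
    for _ in range(steps):
        gradient = 2.0 * w  # derivative of w**2
        w -= learning_rate * gradient
    return w

small = descend(0.01)  # tiny steps: moves toward 0, but slowly
medium = descend(0.4)  # larger steps: gets very close to 0 in the same number of steps
large = descend(1.1)   # too large: every step overshoots 0 and the weight grows

print(small, medium, large)
```

With the same number of steps, the small rate is still far from the optimum, the medium rate has essentially reached it, and the overly large rate jumps back and forth across the optimum and moves further away each time.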
Learning rate scheduling allows you to use large steps during the first few epochs, then progressively reduce the step size as the weights come closer to their optimal value.
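One common way to do this is a step decay, where the learning rate is multiplied by a fixed factor every few epochs. The following is a minimal sketch; the function name step_decay and its parameters drop and epochs_per_drop are hypothetical names for this example, and other schedules (for example exponential or cosine decay) follow the same idea.

```python
def step_decay(initial_rate, epoch, drop=0.5, epochs_per_drop=10):
    """Return the learning rate to use for a given (zero-based) epoch."""
    # The rate is multiplied by `drop` once for every completed block of epochs.
    return initial_rate * drop ** (epoch // epochs_per_drop)

for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(0.1, epoch))
```

Here the first ten epochs use the full initial rate, and every ten epochs after that the rate is halved.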
Momentum is a method that helps accelerate the optimizer in the relevant direction and dampens oscillations. The momentum term increases the updates for dimensions whose gradients keep pointing in the same direction, and reduces the updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way. If we don’t use momentum, the ball gets no information on where it was before each discrete calculation step. Without momentum, each new calculation is based only on the current gradient, with no history.
In this way, momentum can help the optimizer avoid getting stuck in local minima.
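To make the accumulated history concrete, here is a minimal sketch of the classic momentum update on a made-up one-dimensional loss, loss(w) = 0.5 * w**2. The function name minimize and the values used are assumptions for illustration; setting momentum to 0 gives plain gradient descent with no history.

```python
def minimize(momentum, learning_rate=0.01, steps=100):
    """Gradient descent with momentum on loss(w) = 0.5 * w**2, starting at w = 1."""
    w, velocity = 1.0, 0.0
    for _ in range(steps):
        gradient = w  # derivative of 0.5 * w**2
        # The velocity keeps a decaying memory of all earlier gradients.
        velocity = momentum * velocity - learning_rate * gradient
        w += velocity
    return w

without_momentum = minimize(momentum=0.0)
with_momentum = minimize(momentum=0.9)

print(abs(without_momentum), abs(with_momentum))
```

Because the gradient keeps pointing the same way in this example, the velocity builds up, and the run with momentum ends much closer to the optimum at 0 than the run without it for the same learning rate and number of steps.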
For an in-depth walkthrough of momentum, Why Momentum Really Works is an excellent resource.