Adam is the go-to optimizer. It is a computationally efficient stochastic gradient descent method that adapts the step size for each parameter.

Adam can be viewed as a combination of RMSprop and momentum.

Momentum can be seen as a ball rolling down a slope. Adam behaves like a heavy ball with friction, which therefore prefers flat minima in the loss landscape. The authors of the original paper show empirically that Adam works well in practice and compares favorably to other adaptive learning-rate methods.

Figure 1. Momentum

RMSprop contributes the exponentially decaying average of past squared gradients, while momentum accounts for the exponentially decaying average of past gradients.
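To make the combination concrete, here is a minimal sketch of a single Adam update for one scalar parameter. The function names (adam_step, minimize_quadratic), the toy loss, and the default values are assumptions chosen for illustration; they are not taken from any particular library.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative only)."""
    # Momentum part: exponentially decaying average of past gradients.
    m = beta1 * m + (1 - beta1) * grad
    # RMSprop part: exponentially decaying average of past squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The step is scaled by the learning rate and normalized by sqrt(v_hat).
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

def minimize_quadratic(steps=2000, lr=0.01):
    """Minimize the toy loss f(x) = (x - 3)^2, starting from x = 0."""
    x, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2 * (x - 3)  # derivative of (x - 3)^2
        x, m, v = adam_step(x, grad, m, v, t, lr=lr)
    return x
```

Calling minimize_quadratic() should return a value close to 3, the minimum of the toy loss. Note that the very first step has a size of roughly lr regardless of the gradient's magnitude, because the gradient is divided by the square root of its own squared average.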

Adjust parameters

To tweak the Adam optimizer, you can adjust these parameters:

Learning rate

The learning rate controls the size of the update steps along the gradient. It sets what fraction of the gradient is applied in each update: 1 means 100%, but you normally use a much smaller learning rate, e.g., 0.001.

In our rolling ball analogy, we calculate where the ball should roll next in discrete steps, not continuously. The learning rate is the length of these steps.

Choosing a good learning rate is important when training a neural network. If the ball rolls carefully, with a small learning rate, we can expect consistent but very small progress. The risk, though, is that the ball gets stuck in a local minimum and never reaches the global minimum.

Figure 2. Learning rate

Larger steps mean that the weights are changed more every iteration, so that they may reach their optimal value faster, but may also miss the exact optimum.
Smaller steps mean that the weights are changed less every iteration, so it may take more epochs to reach their optimal value, but they are less likely to miss optima of the loss function.
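
The trade-off between larger and smaller steps can be sketched on a toy problem. The example below uses plain gradient descent rather than Adam, on the assumed loss f(x) = x^2, so that the effect of the learning rate alone is visible. The function name and the three learning rates are illustrative choices.

```python
def gradient_descent(lr, steps=20, start=1.0):
    """Run plain gradient descent on the toy loss f(x) = x^2."""
    x = start
    for _ in range(steps):
        grad = 2 * x       # derivative of x^2
        x = x - lr * grad  # the learning rate scales the step
    return x

small = gradient_descent(lr=0.01)  # slow but steady progress toward 0
medium = gradient_descent(lr=0.4)  # gets very close to the optimum quickly
large = gradient_descent(lr=1.1)   # overshoots further on every step
```

With the same number of steps, the small learning rate is still far from the optimum at 0, the medium one is very close to it, and the large one has jumped past the optimum so often that it ends up further away than where it started.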

Learning rate scheduling allows you to use large steps during the first few epochs, then progressively reduce the step size as the weights come closer to their optimal value.
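
One common way to schedule the learning rate is exponential decay, sketched below. The function name exponential_decay and the values 0.01 and 0.5 are assumptions for illustration; other schedules (step decay, cosine decay, and so on) follow the same idea.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Hypothetical schedule: multiply the learning rate by decay_rate every epoch."""
    return initial_lr * decay_rate ** epoch

# Learning rate for the first four epochs: large steps first, smaller steps later.
schedule = [exponential_decay(0.01, 0.5, epoch) for epoch in range(4)]
```

Here the learning rate starts at 0.01 and is halved after every epoch.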

β1 and β2 rate

These settings adjust the exponential decay rates of the moving averages that the optimizer uses to gain momentum: β1 applies to the average of past gradients and β2 to the average of past squared gradients.

These settings apply to the Adam, Adamax, AMSgrad, and Nadam optimizers.
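
To get a feel for what a decay rate does, the sketch below computes a bias-corrected exponentially decaying average of a signal that suddenly drops from 1 to 0. The function name decaying_average, the signal, and the two decay rates are assumptions for illustration.

```python
def decaying_average(values, beta):
    """Exponentially decaying average with bias correction."""
    avg = 0.0
    history = []
    for t, value in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * value
        history.append(avg / (1 - beta ** t))  # bias-corrected estimate
    return history

# A gradient-like signal that jumps from 1.0 to 0.0 after 10 steps.
signal = [1.0] * 10 + [0.0] * 10
fast = decaying_average(signal, beta=0.5)  # short memory, reacts quickly
slow = decaying_average(signal, beta=0.9)  # long memory, reacts slowly
```

A decay rate closer to 1 gives the average a longer memory: after the signal drops, the slow average still remembers the old values, while the fast average has almost forgotten them.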

Nesterov momentum

Standard momentum blindly accelerates down slopes: it first computes the gradient, then makes a big jump. Nesterov momentum is a way to “look ahead”. It first makes a big jump in the direction of the previously accumulated gradient. Then it measures where it ends up and makes a correction, resulting in the complete update vector.

Figure 3. Nesterov momentum
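
The difference between the two update rules can be sketched as follows. This is a minimal illustration on the assumed toy loss f(x) = x^2, using plain momentum without Adam's adaptive scaling. The function names and the values of lr and mu are illustrative choices.

```python
def grad(x):
    """Gradient of the toy loss f(x) = x^2."""
    return 2 * x

def momentum_step(x, velocity, lr=0.1, mu=0.9):
    # Standard momentum: measure the gradient at the current position.
    velocity = mu * velocity - lr * grad(x)
    return x + velocity, velocity

def nesterov_step(x, velocity, lr=0.1, mu=0.9):
    # Nesterov: first jump along the accumulated velocity ("look ahead"),
    # then measure the gradient there and use it as the correction.
    lookahead = x + mu * velocity
    velocity = mu * velocity - lr * grad(lookahead)
    return x + velocity, velocity

def run(step_fn, steps=50, start=1.0):
    """Apply the given update rule repeatedly and return the final position."""
    x, velocity = start, 0.0
    for _ in range(steps):
        x, velocity = step_fn(x, velocity)
    return x
```

The only difference between the two functions is where the gradient is evaluated: at the current position, or at the look-ahead position reached by following the accumulated velocity. Calling run(momentum_step) and run(nesterov_step) lets you compare how the two rules approach the minimum at 0.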