We recommend that you use Adam when you start to experiment. This optimizer is usually forgiving and allows you to get reasonable results without too much tuning.
Once you have settled on the overall model structure but want to improve the model further, it can be appropriate to test another optimizer. A good candidate is regular stochastic gradient descent (SGD) with momentum (Nesterov or standard). Such an optimizer may achieve superior results, though getting there can sometimes require a lot of tuning.
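To make the difference between standard and Nesterov momentum concrete, here is a minimal sketch of both update rules in plain Python. The function name sgd_momentum, the toy objective f(x) = (x - 3)^2, and the hyperparameter values are assumptions chosen for illustration; they are not taken from any particular library.

```python
def sgd_momentum(grad, x0, lr=0.1, momentum=0.9, nesterov=False, steps=300):
    """Minimize a 1-D function, given its gradient, using SGD with momentum."""
    x, v = x0, 0.0
    for _ in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point x + momentum * v;
        # standard momentum evaluates it at the current point x.
        g = grad(x + momentum * v) if nesterov else grad(x)
        v = momentum * v - lr * g  # velocity accumulates past gradients
        x = x + v
    return x


# Hypothetical toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
def toy_grad(x):
    return 2.0 * (x - 3.0)


x_standard = sgd_momentum(toy_grad, x0=0.0)
x_nesterov = sgd_momentum(toy_grad, x0=0.0, nesterov=True)
print("standard:", x_standard)
print("nesterov:", x_nesterov)
```

Both variants should settle close to x = 3 on this easy problem. On real models the learning rate and the momentum coefficient interact, which is one reason this family of optimizers tends to need more tuning than Adam.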
This illustration of the Rosenbrock function shows that Adam takes a long time and does not make it all the way to the minimum. SGD with momentum overshoots a lot at the start, but with the help of the momentum it finds the minimum.
You can test how the different optimizers behave on the Rosenbrock function here: Peltarion’s Rosenbrock test
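If you prefer to experiment offline, the following is a rough sketch of Adam running on the Rosenbrock function in plain Python. It assumes the common form of the function with a = 1 and b = 100 (minimum at (1, 1)), the usual Adam defaults for beta1, beta2 and eps, and an arbitrarily chosen start point and learning rate. It is not the code behind the illustration above.

```python
import math


def rosenbrock(x, y):
    # Common form with a = 1, b = 100; global minimum f(1, 1) = 0.
    return (1.0 - x) ** 2 + 100.0 * (y - x * x) ** 2


def rosenbrock_grad(x, y):
    dx = -2.0 * (1.0 - x) - 400.0 * x * (y - x * x)
    dy = 200.0 * (y - x * x)
    return dx, dy


def adam_losses(x, y, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Run Adam from (x, y); return the loss at the start and after every step."""
    m = [0.0, 0.0]  # running mean of gradients
    v = [0.0, 0.0]  # running mean of squared gradients
    losses = [rosenbrock(x, y)]
    for t in range(1, steps + 1):
        g = rosenbrock_grad(x, y)
        for i in range(2):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
        # Bias correction compensates for m and v starting at zero.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        x -= lr * m_hat[0] / (math.sqrt(v_hat[0]) + eps)
        y -= lr * m_hat[1] / (math.sqrt(v_hat[1]) + eps)
        losses.append(rosenbrock(x, y))
    return losses


# Start point (-1.5, 2.0) is an arbitrary choice for this sketch.
losses = adam_losses(-1.5, 2.0)
print(f"start: {losses[0]:.3f}  final: {losses[-1]:.3f}")
```

Printing a few intermediate entries of losses is a simple way to see how quickly, or slowly, the optimizer moves along the curved valley. Changing lr or steps shows how sensitive that progress is to the settings.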
In this visualization of a saddle point in the optimization landscape (made by Alec Radford), the curvature along different dimensions has different signs: one dimension curves up and another curves down.
SGD without momentum gets stuck at the saddle point, while SGD with momentum manages to break loose. Algorithms such as RMSprop see very small gradients in the saddle direction. Because the RMSprop update divides each step by the square root of a running average of squared gradients, these small gradients increase the effective learning rate along this direction, which helps RMSprop proceed.
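The effect of the denominator can be sketched on a toy saddle. The example below assumes the surface f(x, y) = x^2 - y^2, a start point very close to the ridge, and illustrative values for lr, rho and eps. It compares plain SGD with a basic RMSprop update and is not a reproduction of the visualization.

```python
import math


def saddle_grad(x, y):
    # Toy saddle f(x, y) = x**2 - y**2: curves up along x, down along y.
    return 2.0 * x, -2.0 * y


def run_sgd(x, y, lr=0.01, steps=50):
    for _ in range(steps):
        gx, gy = saddle_grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


def run_rmsprop(x, y, lr=0.01, rho=0.9, eps=1e-8, steps=50):
    vx = vy = 0.0  # running averages of squared gradients
    for _ in range(steps):
        gx, gy = saddle_grad(x, y)
        vx = rho * vx + (1 - rho) * gx * gx
        vy = rho * vy + (1 - rho) * gy * gy
        # Dividing by sqrt(v) rescales each coordinate's step, so a tiny
        # gradient along y still produces a step of roughly constant size.
        x -= lr * gx / (math.sqrt(vx) + eps)
        y -= lr * gy / (math.sqrt(vy) + eps)
    return x, y


start = (1.0, 1e-4)  # almost exactly on the ridge; y is the escape direction
sgd_x, sgd_y = run_sgd(*start)
rms_x, rms_y = run_rmsprop(*start)
print("SGD     y:", sgd_y)
print("RMSprop y:", rms_y)
```

With these settings, the very first RMSprop step along y has size of roughly lr / sqrt(1 - rho), independent of how small the raw gradient is. Plain SGD, by contrast, only multiplies y by (1 + 2 * lr) per step and therefore stays near the ridge for much longer.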