Adadelta is a more robust extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate based on a fixed moving window of gradient updates, instead of accumulating all past gradients.
Adadelta functions as a stochastic gradient descent-method. It has a float known as the Adadelta decay factor (p).
For the Adadelta optimizer, you can adjust the parameters below. Our advice is to leave the parameters at default.
The learning rate is controlling the size of the update steps along the gradient. This parameter sets how much of the gradient you update with, where 1 = 100% but normally you set much smaller learning rate, e.g., 0.001.
In our rolling ball analogy, we’re calculating where the ball should roll next in discrete steps (not continuous). How long these discrete steps are is the learning rate.
Choosing a good learning rate is important when training a neural network. If the ball rolls carefully with a small learning rate we can expect to make consistent but very small progress (this corresponds to having a small learning rate). The risk though is that the ball gets stuck in a local minima not reaching the global minima.
Larger steps mean that the weights are changed more every iteration, so that they may reach their optimal value faster, but may also miss the exact optimum.
Smaller steps mean that the weights are changed less every iteration, so it may take more epochs to reach their optimal value, but they are less likely to miss optima of the loss function.
Learning rate scheduling allows you to use large steps during the first few epochs, then progressively reduce the step size as the weights come closer to their optimal value.
p (epsilon) is called fuzz or decay factor.