Learning rate schedule

The Learning rate schedule lets you vary the learning rate as training progresses.

When you train a model, the optimizer moves the weights in the direction that decreases the loss function the fastest, that is, along the negative gradient.

The learning rate controls the size of the update steps along the gradient. This parameter sets how much of the gradient you update with:

  • A larger learning rate lets the model converge faster but may overstep the optimal point.

  • A smaller learning rate follows the loss function more closely but may require more epochs to converge and can get stuck in local minima.

The Learning rate schedule allows you to start training with larger (or smaller) steps and then change the learning rate according to a schedule.
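To make these effects concrete, here is a minimal Python sketch (illustrative only, not the Platform's internals; the toy loss and function names are assumptions) of how the learning rate scales each update step and how a per-epoch schedule plugs in:

    # Toy loss L(w) = (w - 3)^2, so its gradient is 2 * (w - 3).
    def gradient(w):
        return 2.0 * (w - 3.0)

    def train(initial_lr, schedule, epochs=10):
        w = 0.0  # start far from the optimum at w = 3
        for epoch in range(epochs):
            lr = schedule(initial_lr, epoch)  # learning rate for this epoch
            w -= lr * gradient(w)             # step along the negative gradient
            print(f"epoch {epoch}: lr={lr:.4f}, w={w:.4f}")
        return w

    # A constant "schedule" keeps the learning rate fixed; the schedules below vary it.
    train(initial_lr=0.1, schedule=lambda lr0, epoch: lr0)

In this toy example, a larger initial_lr makes w approach the optimum in fewer epochs, but past a certain point the updates overshoot and training diverges.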

Learning rate schedule can give better training performance

There is no go-to schedule for all models. In general, varying the learning rate during training has been shown to make training less sensitive to the exact learning rate value you pick. So using a learning rate schedule can give better training performance and make the model converge faster.

How to use learning rate scheduling

It’s almost always a good idea to use a schedule. For most models, try the exponential decay schedule first.

The best learning rate depends on the dataset and task that your model is learning. In many cases, it also depends on how close your model already is to the optimal solution.
The Learning rate schedule helps you with both aspects, since it sweeps over a range of learning rates, getting smaller as the model is expected to reach its optimal solution.

Possible to use a higher max learning rate

As a rule of thumb, compared to training without a schedule, you can use a slightly higher maximum learning rate. Since the learning rate changes over time, the training as a whole is less sensitive to the exact value you pick.

Exponential decay

For most models, it makes sense to try out the exponential decay schedule first.

The exponential schedule reduces the learning rate by the same factor (%) every epoch. This means that the learning rate decreases rapidly in the first few epochs and spends more epochs at lower values, but never reaches exactly zero.

\[\text{Learning rate}(epoch) = \text{Initial learning rate} * (1 - \frac{\text{Decay per epoch}}{100})^\text{epoch}\]

Example:

Figure 1. The learning rate decreases exponentially.

Parameters

Decay per epoch (%): Percentage by which the learning rate is decreased every epoch. Default: 5
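As a hedged sketch (the function name and signature are illustrative, not a Platform API), the exponential decay formula above can be written in Python as:

    def exponential_decay(initial_lr, epoch, decay_per_epoch_pct=5):
        # Learning rate(epoch) = Initial learning rate * (1 - decay / 100) ** epoch
        return initial_lr * (1 - decay_per_epoch_pct / 100) ** epoch

    # With the default 5% decay, an initial learning rate of 0.01 shrinks quickly
    # at first and approaches, but never reaches, zero.
    for epoch in range(5):
        print(epoch, round(exponential_decay(0.01, epoch), 6))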

Linear decay

The linear schedule decreases the learning rate by the same amount (decrement) every epoch. Depending on the Decrement per epoch, the learning rate can reach zero quite fast, so choose a value that matches the set Learning rate.

\[\text{Learning rate}(epoch) = \text{Initial learning rate} - (\text{Epoch} * \text{Decrement per epoch})\]

Example:

Figure 2. Linear decay with Learning rate=0.01 and Decrement per epoch=0.002.

Parameters

Decrement per epoch: Decrease of learning rate per epoch. Update depending on the set Learning rate. Default: 0.0001
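A corresponding Python sketch of the linear decay formula (illustrative names again; clamping the rate at zero rather than letting it go negative is an assumption):

    def linear_decay(initial_lr, epoch, decrement_per_epoch=0.0001):
        # Learning rate(epoch) = Initial learning rate - epoch * Decrement per epoch
        return max(initial_lr - epoch * decrement_per_epoch, 0.0)

    # The setting from Figure 2: Learning rate = 0.01, Decrement per epoch = 0.002,
    # which reaches zero after 5 epochs.
    for epoch in range(6):
        print(epoch, round(linear_decay(0.01, epoch, decrement_per_epoch=0.002), 4))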

Triangle schedule

The triangle schedule consists of two parts:

  • The first part is a linear learning rate increase during the number of epochs set by Warm-up epochs proportion, from 0 up to the set Learning rate.

  • The second part is a linear decay that decreases the learning rate by the same Decrement per epoch.

Triangle decay is recommended for text classification with BERT fine-tuning, but it can also be applied to other kinds of models.

Example:

Figure 3. Triangle decay with Warm-up proportion=0.5, which means that the learning rate peaks after 0.5 epochs, and Decrement per epoch=0.000004, which lets the learning rate reach 0 after 3 epochs.

Parameters

Warm-up epochs proportion: Length, in epochs, of the learning rate increase from 0 up to the set Learning rate. Default: 1

Decrement per epoch: Decrease of learning rate per epoch. Update depending on the set Learning rate. Default: 0.0001
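The two parts of the triangle schedule can be sketched in Python roughly as follows (illustrative names; the exact handling of the warm-up boundary and the clamp at zero are assumptions, not confirmed Platform behavior):

    def triangle_schedule(initial_lr, epoch, warmup_epochs=1.0, decrement_per_epoch=0.0001):
        if epoch < warmup_epochs:
            # Part 1: linear increase from 0 up to the set learning rate.
            return initial_lr * epoch / warmup_epochs
        # Part 2: linear decay by the same decrement every epoch, clamped at zero.
        return max(initial_lr - (epoch - warmup_epochs) * decrement_per_epoch, 0.0)

    # Roughly the shape of Figure 3: peak after 0.5 epochs, then a slow linear
    # decay that reaches 0 around epoch 3.
    for epoch in [0.0, 0.25, 0.5, 1.0, 2.0, 3.0]:
        print(epoch, round(triangle_schedule(1e-5, epoch, warmup_epochs=0.5,
                                             decrement_per_epoch=0.000004), 8))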