Learning rate schedule

A learning rate schedule lets you vary the learning rate as training progresses.

When you train a model, the optimizer updates the model's weights in the direction of steepest descent of the loss function.

The learning rate controls the size of the update steps along the gradient, that is, how much of the gradient is applied in each update:

  • A larger learning rate lets the model converge faster but may overstep the optimal point.

  • A smaller learning rate follows the loss surface more closely but may require more epochs to converge and may get stuck in local minima.
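To make the role of the learning rate concrete, here is a minimal sketch of a single plain gradient-descent step (written in NumPy for illustration only, not tied to any particular framework):

```python
import numpy as np

def sgd_step(weights, gradient, learning_rate):
    """One plain gradient-descent update: the learning rate scales the step taken along the gradient."""
    return weights - learning_rate * gradient

# The same gradient, two different step sizes:
w = np.array([0.5, -1.2])
g = np.array([0.1, 0.3])
print(sgd_step(w, g, learning_rate=0.1))   # small, cautious step
print(sgd_step(w, g, learning_rate=1.0))   # ten times larger step along the same direction
```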

A learning rate schedule allows you to start training with larger (or smaller) steps and then change the learning rate over the course of training according to a schedule.

Learning rate schedule can give better training performance

There is no go-to schedule for all models. In general, changing the learning rate over the course of training has been shown to make training less sensitive to the exact learning rate value you pick. So using a learning rate schedule can give better training performance and make the model converge faster.

How to use learning rate scheduling

It’s almost always a good idea to use a schedule. For most models, try the exponential decay schedule first.

The best learning rate depends on the dataset and task that your model is learning. In many cases, it also depends on how close your model already is to the optimal solution.
A learning rate schedule helps with both aspects, since it sweeps over a range of learning rates that get smaller as the model is expected to approach its optimal solution.

Possible to use a higher max learning rate

As a rule of thumb, you can use a slightly higher maximum learning rate than you would when training without a schedule. Since the learning rate changes over time, the training as a whole is less sensitive to the exact value picked.

Exponential decay

For most models, it makes sense to try out the exponential decay schedule first.

The exponential schedule reduces the learning rate by the same percentage every epoch. This means that the learning rate decreases rapidly in the first few epochs, then spends more epochs at lower values, but never reaches exactly zero.

\[\text{Learning rate}(\text{epoch}) = \text{Initial learning rate} \cdot \left(1 - \frac{\text{Decay per epoch}}{100}\right)^{\text{epoch}}\]

Example:

Exponential decay graph
Figure 1. The learning rate decreases exponentially.

Parameters

Decay per epoch (%): Percentage by which the learning rate is decreased each epoch. Default: 5
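As an illustration, the formula above can be written as a small helper function. This is a minimal sketch in plain Python (the function name is made up for this example; the default of 5 matches the Decay per epoch parameter above):

```python
def exponential_decay(initial_lr, epoch, decay_per_epoch_pct=5):
    """Learning rate after `epoch` epochs, reduced by the same percentage every epoch."""
    return initial_lr * (1 - decay_per_epoch_pct / 100) ** epoch

# The rate drops quickly at first, then more slowly, and never reaches exactly zero.
for epoch in range(5):
    print(epoch, round(exponential_decay(0.01, epoch), 6))
```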

Linear decay

The linear schedule decreases the learning rate by the same amount (the decrement) every epoch. Depending on the Decrement per epoch, the learning rate can reach zero quite quickly, so set the value in relation to the Learning rate.

\[\text{Learning rate}(epoch) = \text{Initial learning rate} - (\text{Epoch} * \text{Decrement per epoch})\]

Example:

Linear decay graph
Figure 2. Linear decay with Learning rate=0.01 and Decrement per epoch=0.002.

Parameters

Decrement per epoch: Amount by which the learning rate decreases each epoch. Adjust it depending on the set Learning rate. Default: 0.0001
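A minimal sketch of the linear formula, with the clamp at zero made explicit (the values match Figure 2; the function name is illustrative only):

```python
def linear_decay(initial_lr, epoch, decrement_per_epoch=0.002):
    """Learning rate after `epoch` epochs; held at 0 once the decrements have used it up."""
    return max(initial_lr - epoch * decrement_per_epoch, 0.0)

# As in Figure 2: Learning rate=0.01 and Decrement per epoch=0.002 reach zero after 5 epochs.
for epoch in range(7):
    print(epoch, linear_decay(0.01, epoch))
```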

Triangle schedule

The triangle schedule consists of two parts:

  • The first part is a linear increase of the learning rate, from 0 up to the set Learning rate, over the number of epochs given by Warm-up epochs proportion.

  • The second part is a linear decay that decreases the learning rate by the same Decrement per epoch every epoch.

Triangle decay is recommended for text classification with BERT fine-tuning, but it can also be applied to other kinds of models.

Example:

Triangle decay graph
Figure 3. Triangle decay with Warm-up proportion=0.5, which means that the learning rate peaks after 0.5 epochs, and Decrement per epoch=0.000004, which lets the learning rate reach 0 after 3 epochs.

Parameters

Warm-up epochs proportion: Length, in epochs, of the learning rate increase from 0 to the set Learning rate. Default: 1

Decrement per epoch: Decrease of learning rate per epoch. Update depending on the set Learning rate. Default: 0.0001
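A minimal sketch of the two-part schedule, assuming the warm-up length is expressed in epochs and the peak equals the set Learning rate (the function name is illustrative; the numbers roughly reproduce Figure 3):

```python
def triangle_schedule(max_lr, epoch, warmup_epochs=1.0, decrement_per_epoch=0.0001):
    """Linear warm-up from 0 to max_lr, then linear decay, never going below 0."""
    if epoch < warmup_epochs:
        return max_lr * epoch / warmup_epochs                                  # warm-up leg
    return max(max_lr - (epoch - warmup_epochs) * decrement_per_epoch, 0.0)    # decay leg

# Roughly matches Figure 3: peak after 0.5 epochs, back to 0 after 3 epochs.
for e in [0.0, 0.25, 0.5, 1.0, 2.0, 3.0]:
    print(e, triangle_schedule(0.00001, e, warmup_epochs=0.5, decrement_per_epoch=0.000004))
```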

Reduce on plateau

The reduce on plateau schedule decreases the learning rate by the specified Decay percentage when the given Metric is stagnating longer than the Patience allowed.
The given metric is considered to stagnate as long as its best value hasn’t improved by at least 10%.

Use when you don’t know how your model behaves with your data
Reducing the learning rate when the given metric flattens out can be useful when you don’t know how a certain model will behave with your data.
The largest possible learning rate is kept as long as possible to progress quickly, but reduction does occur, allowing finer optimization, when significant progress would no longer be achieved otherwise.

Set early stopping’s Patience several times larger than reduce on plateau’s Patience
When using this schedule with the early stopping feature, make sure that the early stopping’s Patience is several times larger than the Patience of the reduce on plateau schedule. This gives the model more opportunity to improve.

\[\text{Learning rate}(\text{epoch}) = \left\{ \begin{array}{ll} \left(1-\frac{\text{Decay}}{100}\right)\cdot\text{Learning rate}(\text{epoch}-1), & \text{if Metric} \geq 0.9\cdot\text{best Metric} \text{ and epochs since best Metric} \geq \text{Patience} \\[4pt] \text{Learning rate}(\text{epoch}-1), & \text{otherwise} \end{array} \right.\]

Example:

Reduce on plateau graph
Figure 4. Reduce on plateau schedule with Learning rate = 0.00001, Decay = 50%, and Warm-up = 0.5 epochs. The learning rate only decreases when the metric doesn’t improve by at least 10% over the given Patience.

Parameters

Decay (%): Proportional decrease of the learning rate when a plateau occurs. Default: 10.

Patience: Number of epochs to wait before decreasing the learning rate when a plateau occurs. It can be a decimal value if the Metric is a training-based value, but must be an integer otherwise. Default: 2.

Metric: The value to monitor for progress or plateau. Default: Validation Loss.

Warm-up epochs: Number of epochs over which the Learning rate is ramped up from 0 to its specified value. The reduce on plateau schedule isn’t active during this period. Default: 0.5.
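To make the piecewise rule above concrete, here is a minimal, framework-agnostic sketch of the plateau logic. It assumes the monitored metric is a loss (lower is better) and that "improved by at least 10%" means dropping below 90% of the best value seen so far; the class name and its interface are illustrative, not an existing library API:

```python
class ReduceOnPlateau:
    """Cut the learning rate by decay_pct when the monitored loss has not
    improved by at least 10% for `patience` consecutive epochs."""

    def __init__(self, initial_lr, decay_pct=10, patience=2):
        self.lr = initial_lr
        self.decay_pct = decay_pct
        self.patience = patience
        self.best = float("inf")
        self.epochs_since_best = 0

    def step(self, metric):
        """Return the learning rate to use after observing this epoch's metric."""
        if metric < 0.9 * self.best:        # improved by at least 10%: new best value
            self.best = metric
            self.epochs_since_best = 0
        else:                               # stagnating
            self.epochs_since_best += 1
            if self.epochs_since_best >= self.patience:
                # As in the formula above, keep cutting each epoch until the metric improves again.
                self.lr *= 1 - self.decay_pct / 100
        return self.lr

# Example: the loss stalls after epoch 2, so the rate is halved each epoch once Patience is exceeded.
schedule = ReduceOnPlateau(initial_lr=1e-5, decay_pct=50, patience=2)
for loss in [1.0, 0.5, 0.49, 0.48, 0.47]:
    print(schedule.step(loss))
```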
