
How to Use Cosine Decay Learning Rate Scheduler in Keras

In this article, we'll learn how to use cosine decay in Keras, providing you with code and interactive visualizations so you can try it for yourself.
In this brief article, we will see how you can correctly use the cosine decay learning rate scheduler in Keras via the tf.keras.optimizers.schedules.CosineDecay API.
Here's what we'll cover:

Table of Contents

  • Introduction to Cosine Decay
  • Experimental Details and Baseline Model
  • Experimenting With Cosine Decay
  • Observations
  • Experiments to Try

Try out the Colab notebook here →

Introduction to Cosine Decay

While training your ML model, it's important to pick the correct learning rate for optimizing its weights. If the learning rate is too high, optimization diverges; if it's too low, convergence takes much longer.
We can train a model with a constant learning rate, but models have been shown to converge better when the learning rate is correctly lowered (decayed) as training progresses. Loshchilov and Hutter (2016) observed that the learning rate should not be decreased too drastically in the beginning, and that the model should be "refined" with a small learning rate towards the end.
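For intuition, here is roughly what the schedule computes at each training step (a minimal sketch of the cosine decay formula, not the exact Keras implementation):

import math

def cosine_decay_lr(step, initial_learning_rate, decay_steps, alpha=0.0):
    # Clamp the step so the learning rate stays at its final value after decay_steps.
    step = min(step, decay_steps)
    cosine_decay = 0.5 * (1 + math.cos(math.pi * step / decay_steps))
    decayed = (1 - alpha) * cosine_decay + alpha
    return initial_learning_rate * decayed

The learning rate starts at initial_learning_rate, follows a cosine curve downwards, and settles at alpha * initial_learning_rate after decay_steps steps.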
We can use CosineDecay in Keras as shown below:
import tensorflow as tf

# Decays the learning rate from initial_learning_rate down to
# alpha * initial_learning_rate over decay_steps steps.
cosine_decay_scheduler = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate, decay_steps, alpha=0.0, name=None
)

optimizer = tf.keras.optimizers.Adam(learning_rate=cosine_decay_scheduler)

model.compile(
    optimizer=optimizer,
    loss=...,
    metrics=[...]
)
A few notes:
  • initial_learning_rate is the learning rate you want to start training with.
  • decay_steps is the number of steps over which the learning rate decays. If you use tf.data to create a dataloader, the total number of training steps is given by len(dataloader) * num_epochs. Note that decay_steps can be the total number of steps, a fraction of it, or any arbitrary number of steps.
  • alpha determines the final learning rate as a fraction of initial_learning_rate. For example, if the initial learning rate is 1e-3 and you want the final learning rate to be 1e-5, alpha should be 1e-2 (see the short sanity check after this list).
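
As a quick sanity check, you can call the schedule directly at a few steps (the values below are just for illustration):

import tensorflow as tf

initial_learning_rate = 1e-3
decay_steps = 1_000

# alpha=1e-2 means the schedule bottoms out at 1e-2 * 1e-3 = 1e-5.
scheduler = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate, decay_steps, alpha=1e-2
)

print(float(scheduler(0)))            # ~1e-3 at the start
print(float(scheduler(decay_steps)))  # ~1e-5 at (and after) decay_steps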

Experimental Details and Baseline Model

Try out the Colab notebook here →

Below, we're training a simple image classifier on the FashionMNIST dataset. The panels shown below are for the baseline model, which was trained three times to account for the variance in model training.
Note that no augmentation or specialized regularization technique was used. We're using the MobileNetV2 model, trained for 10 epochs in each run.
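For reference, a baseline along these lines could be set up as follows. This is a minimal sketch: the input resizing, grayscale-to-RGB handling, batch size, and other hyperparameters here are assumptions, and the Colab notebook is the source of truth.

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

inputs = tf.keras.Input(shape=(28, 28))
x = tf.keras.layers.Reshape((28, 28, 1))(inputs)
x = tf.keras.layers.Resizing(96, 96)(x)          # MobileNetV2 expects larger inputs
x = tf.keras.layers.Concatenate()([x, x, x])     # repeat the grayscale channel 3 times
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights=None, pooling="avg"
)
x = backbone(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10, batch_size=64)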

[W&B panels: baseline training runs — Run set (3 runs)]


Experimenting With Cosine Decay

In this section, we'll use the cosine decay scheduler to train our models. We'll experiment with different decay_steps values to control how quickly the initial learning rate decays to the final learning rate.
First, let's compute the total number of steps. The decay_steps is determined by num_steps, a fraction of the total number of steps. For example, if num_steps is 0.4, decay_steps will be equal to 40% of the total number of steps; since we train for 10 epochs, this means the learning rate decays over 4 epochs to reach the final learning rate.
# Decay over a fraction (num_steps) of the total number of training steps.
total_steps = len(trainloader) * configs["epochs"]
decay_steps = total_steps * configs["num_steps"]

cosine_decay_scheduler = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=configs["learning_rate"],
    decay_steps=decay_steps,
    alpha=0.1
)
In our experiments, num_steps ranges over 0.1, 0.2, 0.3, and so on up to 1.0. Let's look at the learning rate decay below:
Pass the WandbMetricsLogger callback to model.fit(callbacks=[...]) to automatically log the learning rate to Weights & Biases.
💡
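For example (a minimal sketch; the exact import path may vary with your wandb version, and trainloader/testloader are the dataloader names assumed above):

import wandb
from wandb.keras import WandbMetricsLogger

wandb.init(project="cosine-decay")  # project name is just an example

model.fit(
    trainloader,
    epochs=configs["epochs"],
    validation_data=testloader,
    # log_freq="batch" logs per training batch so the learning rate curve is visible.
    callbacks=[WandbMetricsLogger(log_freq="batch")],
)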

[W&B panels: learning rate schedules for different num_steps values — Run set (33 runs)]

When num_steps is 0.1, the learning rate reaches the final rate quickly. Increasing num_steps increases decay_steps, allowing the learning rate to decrease more slowly.
Let's look at the effect of varying decay_steps on training loss and final evaluation accuracy:

[W&B panels: training loss and final evaluation accuracy for varying decay_steps — Run set (33 runs)]


Observations

By decaying the learning rate, we were able to improve the evaluation accuracy by several percentage points.
  • The baseline eval_acc is 85.29% with a std deviation of ~0.02%. The model trained using cosine decay with decay_steps equal to 70% of the total steps (num_steps=0.7) gave an eval_acc of 91.26% with a std deviation of only ~0.002%. That's an improvement of ~6% from simply using a learning rate scheduler.
  • As we increased num_steps, the eval_acc improved, aligning with the observation of Loshchilov and Hutter (2016). The best result decayed the learning rate from 1e-3 to 1e-4 over 70% of the training steps and then refined the model with the small learning rate of 1e-4 for the remainder.
  • Also, observe that when the learning rate was decayed for the entire duration of training, the model achieved the second-best mean eval_acc of 91.20% (a difference of only 0.06%). Based on this experimentation, decaying over the full training run can be a good default for a custom dataset.
  • The standard deviation also improved when the learning rate was decayed during training, making the results more reproducible.

Try out the Colab notebook here →

Experiments to Try

This study was confined to cosine decay and the effect of decay_steps. Here are a few more experiments you can try:
  • Try CosineDecayRestarts and compare it with CosineDecay (see the sketch after this list).
  • See the effect of lowering the final learning rate while keeping decay_steps constant. A model trained with a small learning rate takes longer to converge. What's the sweet spot (final learning rate) to decay to in this experiment?
  • Try a bigger model that makes the baseline experiment overfit quickly. Will decaying the learning rate help with regularization, thus preventing the model from overfitting?
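
For the first suggestion, a minimal sketch of setting up CosineDecayRestarts might look like this (the hyperparameter values are assumptions chosen only for illustration):

import tensorflow as tf

# Restarts the cosine schedule every cycle: the first cycle lasts
# first_decay_steps steps, each following cycle is t_mul times longer,
# and each restart's peak learning rate is m_mul times the previous peak.
cosine_restarts_scheduler = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=1e-3,
    first_decay_steps=500,
    t_mul=2.0,
    m_mul=0.9,
    alpha=0.1,
)

optimizer = tf.keras.optimizers.Adam(learning_rate=cosine_restarts_scheduler)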
Give the Colab notebook a spin and try out the experiments. And if you'd like to show off your results, please share them in the form of a W&B Report like this one!