
Why Learning Rate Is Important in Deep Learning Pipelines

In this article, we look at a case study from a Kaggle Competition to explain why the learning rate is an important hyperparameter in deep learning pipelines.
Have you ever used the same codebase as someone else but achieved surprisingly different results? This happened to me in the recently launched Kaggle Competition "UW-Madison GI Tract Image Segmentation," and I learned something I wanted to share.
Context: @awsaf49 had made a great starter kernel for the competition. After going through their codebase, I upvoted the kernel (please upvote kernels you find useful!!!) and created my own pipeline using it as a template.
Now, I frequently participate in Kaggle Competitions and maintain a custom hyperparameter configuration for various modalities. After customizing the codebase to my needs, I started training various models and submitting results. Surprisingly, I noticed that my scores differed from the baseline by about 0.050–0.070, which, if you have ever participated in Kaggle Competitions, is A LOT!
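For context, that cross-competition configuration looks roughly like the sketch below. Every name and value here is illustrative (this is not the actual kernel's config), but note the lr field, carried over from an older pipeline:

```python
# Minimal sketch of a reusable competition config; all values are illustrative.
class CFG:
    seed = 42
    model_name = "Unet-efficientnet_b0"  # hypothetical segmentation backbone
    img_size = (224, 224)                # input resolution
    n_folds = 5                          # cross-validation folds
    epochs = 15
    batch_size = 32
    lr = 2e-5                            # carried over from an older pipeline -- the culprit!
    weight_decay = 1e-6
```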
I was about 100 positions lower than @awsaf49's inference kernel. I did what any other Kaggler would do: I started poring over the discussion forums looking for the "Single Best Model" posts. From past experience, and from listening to talks by various Kaggle Grandmasters on the Weights & Biases YouTube channel, I've learned that the most important things are a trustworthy cross-validation setup that correlates with the leaderboard (CV-LB) and strong single models.
So I suppressed my urge to fork the best-scoring kernel and submit it, and instead focused on figuring out why my codebase wasn't performing optimally. It turned out there was a key parameter that differed from everyone else's by two orders of magnitude. Surprise, surprise: it was the LEARNING RATE!
I was using a learning rate of 2e-5, whereas others were using 2e-3. This taught me a few things:
  • Deep Learning is extremely fragile 😅. Every now and then I'm reminded of this, and surprisingly often it happens during Kaggle Competitions. I previously learned that random seeds can have a huge impact on performance (a minimal seeding helper is sketched after this list). Coincidentally, it was around the same time that "Torch.manual_seed(3407) is all you need" exploded over Twitter. In fact, I won a Bronze medal in that competition by submitting a baseline model with a carefully chosen seed. If you want to learn more, kindly refer to this article (appropriately named "The Fluke"). PS: The best seed was 42! Douglas Adams FTW!
“'The Answer to the Great Question... Of Life, the Universe and Everything... Is... Forty-two,' said Deep Thought, with infinite majesty and calm.” ― Douglas Adams, The Hitchhiker's Guide to the Galaxy
  • Each competition and dataset is different. What works for some architectures and modalities might not work for others. For example, this was fairly well known during the early days of Vision Transformers (ViTs): ViTs need bigger datasets and meticulous training paradigms to converge well, and they fail in the small-data regime.
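On the seeding point above: the common Kaggle pattern is a single helper that pins every random number generator in one call. Here's a minimal sketch (the helper name and defaults are mine, not from the original kernel):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Pin every RNG the pipeline touches so runs are reproducible."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN kernels trade some speed for repeatability.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)  # Douglas Adams approved
```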

Results

Enough talk! Let's look at some metrics in a parallel coordinates plot to see why the learning rate is such an important hyperparameter to focus on.

[Parallel coordinates plot over a run set of 10 runs: initial learning rate mapped against final metric values]

As we can see, different initial learning rates lead to significantly different metric values!
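If you want to produce a plot like this yourself, the pattern is simply to put the learning rate in each run's wandb config so it shows up as an axis. A hedged sketch, assuming a placeholder train_one_epoch function standing in for your real training loop (the project name is hypothetical too):

```python
import wandb

def train_one_epoch(lr: float, epoch: int) -> float:
    """Placeholder for the real training step; should return a validation metric."""
    ...  # swap in your actual train/validate code and return e.g. a Dice score

for lr in [2e-5, 2e-4, 2e-3]:
    run = wandb.init(project="uwmgi-segmentation", config={"lr": lr, "epochs": 15})
    for epoch in range(run.config.epochs):
        val_dice = train_one_epoch(lr=lr, epoch=epoch)
        # Logged metrics plus the config's lr drive the parallel coordinates plot.
        wandb.log({"epoch": epoch, "val_dice": val_dice})
    run.finish()
```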
Let's also look at per-fold metrics to see how performance varies with the learning rate.

[Per-fold metric charts for the same run set of 10 runs]


Summary

In this article, you saw why the learning rate is important for your deep learning pipelines. To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering tips and tricks to boost model performance, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
