
Why Learning Rate Is Important in Deep Learning Pipelines

In this article, we look at a case study from a Kaggle Competition to explain why the learning rate is an important hyperparameter in deep learning pipelines.
Have you ever used the same codebase as someone else but achieved surprisingly different results? This happened to me in the recently launched Kaggle Competition "UW-Madison GI Tract Image Segmentation," and I learned something I wanted to share.
Context: @awsaf49 had made a great starter kernel for the competition. After going through their codebase, I upvoted the kernel (please upvote kernels you find useful!!!) and created my own pipeline using it as a template.
Now, I frequently participate in Kaggle Competitions and maintain a custom hyperparameter configuration for various modalities. After customizing the codebase to my needs, I started training various models and submitting results. Surprisingly, I noticed that my scores differed from the baseline by about 0.050–0.070, which, if you have ever participated in Kaggle Competitions, is A LOT!
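For context, that cross-competition configuration looks roughly like the sketch below. Every name and value here is illustrative (this is not the actual kernel's config), but note the lr field, carried over from an older pipeline:

```python
# Minimal sketch of a reusable competition config; all values are illustrative.
class CFG:
    seed = 42
    model_name = "Unet-efficientnet_b0"  # hypothetical segmentation backbone
    img_size = (224, 224)                # input resolution
    n_folds = 5                          # cross-validation folds
    epochs = 15
    batch_size = 32
    lr = 2e-5                            # carried over from an older pipeline -- the culprit!
    weight_decay = 1e-6
```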
I was about 100 positions lower than @awsaf49's inference kernel. I did what any other Kaggler would do: I started poring over the discussion forums looking for the "Single Best Model" posts. From past experience, and from listening to talks by various Kaggle Grandmasters on the Weights & Biases YouTube channel, I've learned that the most important things are a trustworthy cross-validation setup that correlates with the leaderboard (CV-LB) and strong single models.
So I suppressed my urge to fork the best-scoring kernel and submit it, and instead focused on figuring out why my codebase wasn't performing optimally. It turned out there was a key parameter that differed from everyone else's by two orders of magnitude. Surprise, surprise: it was the LEARNING RATE!
I was using a learning rate of 2e-5, whereas others were using 2e-3. This taught me a few things:
  • Deep Learning is extremely fragile 😅. Every now and then I'm reminded of this, and surprisingly often it happens during Kaggle Competitions. I previously learned that random seeds can have a huge impact on performance (a minimal seeding helper is sketched after this list). Coincidentally, it was around the same time that "Torch.manual_seed(3407) is all you need" exploded over Twitter. In fact, I won a Bronze medal in that competition by submitting a baseline model with a carefully chosen seed. If you want to learn more, kindly refer to this article (appropriately named "The Fluke"). PS: The best seed was 42! Douglas Adams FTW!
“'The Answer to the Great Question... Of Life, the Universe and Everything... Is... Forty-two,' said Deep Thought, with infinite majesty and calm.” ― Douglas Adams, The Hitchhiker's Guide to the Galaxy
  • Each competition and dataset is different. What works for some architectures and modalities might not work for others. For example, this was fairly well known during the early days of Vision Transformers (ViTs): ViTs need bigger datasets and meticulous training paradigms to converge well, and they fail in the small-data regime.
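On the seeding point above: the common Kaggle pattern is a single helper that pins every random number generator in one call. Here's a minimal sketch (the helper name and defaults are mine, not from the original kernel):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Pin every RNG the pipeline touches so runs are reproducible."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN kernels trade some speed for repeatability.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)  # Douglas Adams approved
```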

Results

Enough talk! Let's look at some metrics in a parallel coordinates plot to see why the learning rate is such an important hyperparameter to focus on.

[Parallel coordinates plot over a run set of 10 runs: initial learning rate mapped against final metric values]

As we can see, different initial learning rates lead to significantly different metric values!
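If you want to produce a plot like this yourself, the pattern is simply to put the learning rate in each run's wandb config so it shows up as an axis. A hedged sketch, assuming a placeholder train_one_epoch function standing in for your real training loop (the project name is hypothetical too):

```python
import wandb

def train_one_epoch(lr: float, epoch: int) -> float:
    """Placeholder for the real training step; should return a validation metric."""
    ...  # swap in your actual train/validate code and return e.g. a Dice score

for lr in [2e-5, 2e-4, 2e-3]:
    run = wandb.init(project="uwmgi-segmentation", config={"lr": lr, "epochs": 15})
    for epoch in range(run.config.epochs):
        val_dice = train_one_epoch(lr=lr, epoch=epoch)
        # Logged metrics plus the config's lr drive the parallel coordinates plot.
        wandb.log({"epoch": epoch, "val_dice": val_dice})
    run.finish()
```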
Let's also look at per-fold metrics to see how performance varies with the learning rate.

[Per-fold metric charts for the same run set of 10 runs]


Summary

In this article, you saw why the learning rate is important for your deep learning pipelines. To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering tips and tricks to boost model performance, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
