I recently took part in the SIIM-ISIC Melanoma Classification challenge hosted on Kaggle, joining during the last 2 weeks of the competition, and was able to secure my first silver🏅on the platform. This was only my 2nd Kaggle competition and my first in 4 years.
The point of mentioning this is merely to indicate what is possible even within such a short time frame. It was a tremendous learning opportunity for me, and I regret not being a part of this earlier. This post's goal is to walk you through my journey from the beginning of the competition to its final phase — my initial struggles, starting step-by-step, and arriving at my final submission — and to share everything that I learned in the process, especially about the Kaggle community.
I hope this blog post can convince you that irrespective of what you might think of yourself, you are always welcome in the Kaggle community, and the best way to start learning is by just diving right in and doing something.
The main reason that I had not participated in a Kaggle competition seriously before was that I was not sure what I could learn from it that I would not learn from my everyday work as an AI researcher. Hence, most of my time was spent entirely at work.
However, over time, I have realized that I have not been able to stay updated about the latest practical tips for improving model performance as most of my time at work is dedicated to data, compounded with the data size being relatively small. So, I needed a playground to test out my wild ideas and learn about the techniques that worked in practice.
This was a significant reason for me to explore Kaggle. I found the SIIM-ISIC Melanoma Classification challenge as the perfect starting point. The binary image classification task is simple enough for me to get a feel for the platform and have a realistic chance of performing well as I have been working on computer vision for many years.
Contrary to what I had expected, I felt extremely overwhelmed by the number of discussions, the knowledge being shared (in the form of data and code), and the current best leaderboard scores. I wanted to keep track of everything and understand it all in one day so I could be in the same place as everyone else. I ended up putting a lot of pressure on myself, and this made me lose my initial motivation.
Thankfully, I quickly realized that I could not get a hold of everything and that there would be things I was missing out on. I made peace with knowing that I would not know everything there is to know. I started slowly by creating the right cross-validation data for validating model performance. Creating the right split was crucial for this competition, and as I will mention later, the cross-validation score was critical in determining the final outcome. Then, I created a baseline model using ResNet18 with a basic input processing pipeline. This got me an AUC of 0.84 on the public leaderboard (LB) and allowed me to complete the pipeline of submitting a result.
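For this competition, getting the split right meant keeping all images of one patient inside a single fold, since the dataset has multiple images per patient. A minimal sketch of such a split with scikit-learn's `GroupKFold` (the `patient_id` and `target` column names are my assumptions about the dataframe layout, not necessarily the exact setup used):

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

def make_folds(df, n_splits=5):
    """Assign a fold id to every row such that all images from one
    patient land in the same fold (prevents patient-level leakage)."""
    df = df.copy()
    df["fold"] = -1
    gkf = GroupKFold(n_splits=n_splits)
    for fold, (_, val_idx) in enumerate(
        gkf.split(df, groups=df["patient_id"])
    ):
        df.iloc[val_idx, df.columns.get_loc("fold")] = fold
    return df
```

With a group-aware split like this, the validation score is an honest estimate of performance on unseen patients rather than unseen photos of already-seen patients.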
I usually breathe a sigh of relief once the end-to-end pipeline of anything that I am working on is complete. This allows me the flexibility to tune individual components and be sure that there are no unknown unknowns down the road that would need my attention. So, I was happy after doing this, and going forward, I kept adding new components to it one at a time, and I’ll describe what finally landed me in the top 5% out of 3000+ competitors.
Before describing the final solution, I want to share three philosophies that helped me iterate faster:
The first one is a piece of advice from Jeremy Howard of fast.ai: practically, you often do not need all of the data, or even the entire input, to arrive at decent performance. This could mean using a fraction of the dataset and a smaller resolution (say, 224 x 224) instead of the entire image, which could be 1024 x 1024 and take a long time to load and process.
The second philosophy is widely known to truly matter for improving model performance. However, I will still state it explicitly — feed the data correctly, remember that more data helps, use the right data augmentations, find the right optimization recipe, and identify the right model class. Focusing on just these things should get you pretty decent performance; it is all about finding the right mix. Once I found the right recipe on images with a smaller resolution, I used the same recipe with a larger image resolution, which is expected to improve performance.
The third: ensembling almost always helps. Model ensembling has a theoretical justification for improving over the current set of base models; this blog post is an excellent reference. Simple techniques like combining diverse models (models with low correlation in their predictions) or combining the same model trained on different input sizes can provide a significant lift.
Once I had a solid base, I played around with different aspects of the entire pipeline — model, optimization, image sizes, external data (data that is not part of the competition), data augmentation, and handling data imbalance. My best model was an ensemble (taking the mean of the predictions of the individual models) of 4 models.
3 models were trained on 512x512 images and 1 model on 384x384.
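Mean ensembling is as simple as it sounds — align the per-instance predictions of the models and average them. A minimal sketch:

```python
import numpy as np

def mean_ensemble(model_preds):
    """Average the per-instance predictions of several models.
    model_preds: list of 1-D arrays, one per model, aligned by instance."""
    return np.mean(np.stack(model_preds, axis=0), axis=0)
```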
2 models use the following augmentations (using kornia):

```yaml
- name: Rescale
  params:
    value: 255.
- name: RandomAffine
  params:
    degrees: 180
    translate:
      - 0.02
      - 0.02
- name: RandomHorizontalFlip
  params:
    p: 0.5
- name: ColorJitter
  params:
    saturation:
      - 0.7
      - 1.3
    contrast:
      - 0.8
      - 1.2
    brightness: 0.1
- name: Normalize
  params:
    mean: imagenet
    std: imagenet
```
The meaning of the different parameters can be found in kornia's documentation. The other 2 models additionally use Cutout:
```yaml
- name: RandomErasing
  params:
    p: 0.5
    ratio:
      - 0.3
      - 3.3
    scale:
      - 0.02
      - 0.1
```
To iterate faster and handle data imbalance, instead of upsampling the minority class, I downsample the majority class each epoch. To avoid wasting data, I sample a different subset of majority-class instances every epoch.
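A minimal sketch of that per-epoch sampling (the 1:1 class ratio and the 0/1 label convention are my assumptions, not necessarily the exact recipe used):

```python
import numpy as np

def epoch_indices(labels, rng):
    """Keep every minority (positive) sample and draw a fresh random
    subset of the majority class each epoch, so different majority
    instances are seen over the course of training."""
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    neg_sample = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, neg_sample])
    rng.shuffle(idx)
    return idx
```

Calling this at the start of every epoch (with the same long-lived `rng`) yields a balanced epoch while cycling through the majority class over time.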
The same network architecture is used for all 4 models — EfficientNet-B5 features followed by a Linear layer. I tried a vast range of models, but EfficientNet-B5 was the best single model.
One of the decisions I made for iterating faster was to restrict the number of epochs to 20. To achieve this, I used the OneCycle learning rate scheduler. This scheduler requires one to specify the minimum and maximum learning rates, which I found to be 5e-6 and 2e-4, respectively, using the LR-range test. Also, to address overfitting, I applied a weight decay of 0.1.
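In PyTorch this scheduler is `torch.optim.lr_scheduler.OneCycleLR`. The sketch below only illustrates the shape of the schedule between the two learning rates found by the LR-range test; the 30% warm-up fraction is an arbitrary assumption, and the real scheduler additionally anneals to a final LR below the initial one:

```python
import numpy as np

def one_cycle_lr(step, total_steps, lr_min=5e-6, lr_max=2e-4, pct_up=0.3):
    """Cosine one-cycle schedule: warm up from lr_min to lr_max over the
    first pct_up of training, then anneal back down to lr_min."""
    up_steps = int(total_steps * pct_up)
    if step < up_steps:
        t = step / max(1, up_steps)                                  # 0 -> 1
    else:
        t = 1 - (step - up_steps) / max(1, total_steps - up_steps)   # 1 -> 0
    # cosine interpolation between lr_min (t=0) and lr_max (t=1)
    return lr_min + (lr_max - lr_min) * (1 - np.cos(np.pi * t)) / 2
```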
Adam usually obtains the best results quickly. However, this paper shows that weight decay is not applied correctly in Adam and proposes a modification, AdamW. That is the final optimizer that I used.
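The difference is easiest to see in a single update step — in AdamW, the decay term is decoupled from the adaptive gradient scaling and applied directly to the weights. A minimal numpy sketch (not the library implementation):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=2e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update. Unlike Adam-with-L2, the weight_decay * w term
    is added outside the adaptive (sqrt(v_hat)) scaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

A quick way to see the decoupling: with a zero gradient, the weight still shrinks by `lr * weight_decay * w`, whereas in Adam the decay would be rescaled by the adaptive term.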
TTA (Test-Time Augmentation) was popularized by fast.ai as a way to improve model performance during inference. Contrary to the general wisdom of turning data augmentation off during inference, TTA keeps it on. Since there is randomness associated with augmentations, relying on a single inference run can lead to wrong conclusions and might even give worse performance. Hence, in TTA, we run inference N_TTA times and combine the predictions across the N_TTA runs to obtain the prediction for each instance. One simple way of combining is taking the mean of the predictions. For this competition, I used N_TTA = 15. Note that this is computationally very expensive, but it led to a clear performance improvement for this task.
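A sketch of the TTA loop — `predict_fn` and `augment_fn` are hypothetical stand-ins for the model forward pass and the random augmentation pipeline:

```python
import numpy as np

def tta_predict(predict_fn, augment_fn, images, n_tta=15, seed=0):
    """Run inference n_tta times with random augmentation left ON,
    then average the predictions per instance."""
    rng = np.random.default_rng(seed)
    preds = [predict_fn(augment_fn(images, rng)) for _ in range(n_tta)]
    return np.mean(np.stack(preds, axis=0), axis=0)
```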
That’s it. These components helped me land in the top 5% of the competition in just 2 weeks. If they seem too simple to you, then you are right!
However, there were still many ideas that I wanted to try but couldn't, given the short time frame. I'm listing them here in case there is something to learn from them:
There were many overly complicated ideas mentioned in the discussions, like generating images from the data and using them as additional training data. These might be good ideas, but I tend to prefer simple solutions, and I wanted to achieve the best possible results with them. I still feel that I could have gotten a higher score simply by using a better ensembling technique.
Ensembling is the technique of combining the predictions of several independent models/learners. I won't go into technical details here as there are many excellent articles that already do that (this one, for example). I noticed people reporting cross-validation (CV) scores much higher than what I was getting with a similar setup. It was only much later that I realized that the best CV score for a single model was still very close to mine. The winners used advanced ensembling techniques like stacking, ranking (this one is specific to the metric being optimized here), and model blending to dramatically improve their final performance. One of the top submissions was actually just a weighted average of 20 public submissions. Thus, one should focus on getting the right ensembling recipe.
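Since AUC depends only on the ordering of the scores, rank averaging is a simple version of that metric-specific trick: convert each submission's scores to ranks before averaging, which removes calibration differences between models. A sketch (ties are handled naively):

```python
import numpy as np

def rank_average(submissions):
    """Average the rank (not the raw score) each model assigns to each
    instance; only the ordering matters for AUC."""
    ranks = [np.argsort(np.argsort(s)) / (len(s) - 1) for s in submissions]
    return np.mean(np.stack(ranks, axis=0), axis=0)
```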
One of the dramatic moments of the competition was when the private leaderboard results were opened. There was a massive shake-up of the entire leaderboard, with many top submissions dropping significantly and many teams jumping up more than 1000 spots (I myself climbed around 800 spots). This left many people understandably disappointed, as they had overfitted to the public leaderboard. However, many of the solution overviews posted after the competition ended strongly emphasized focusing on the CV score: the public leaderboard can often lie, but they found a strong correlation between their CV scores and the private leaderboard ranking. So, the mantra became: "In CV, we trust".

I found this out very late: contrary to the general method of reporting cross-validation performance as the mean and standard deviation across all the folds, there is a better way that is typically used to report the CV score in Kaggle competitions. This notebook illustrates it very nicely. Essentially, for each fold, you save your Out-Of-Fold (OOF) predictions (predictions for the instances forming the validation set in that particular fold). At the end of 5 folds, you have a prediction for each sample in the train set, and you compute the metric on this full set of predictions to get your final CV score.
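A sketch of that OOF scoring loop — `train_fn` and `predict_fn` are hypothetical stand-ins, and the `fold`/`target` columns are assumptions about the dataframe layout:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def oof_cv_score(df, n_folds, train_fn, predict_fn):
    """Collect out-of-fold predictions across all folds, then compute
    the metric once on the full set -- the competition-style CV score."""
    oof = np.zeros(len(df))
    for fold in range(n_folds):
        train_df = df[df["fold"] != fold]
        val_df = df[df["fold"] == fold]
        model = train_fn(train_df)
        oof[val_df.index] = predict_fn(model, val_df)  # positional, default index
    return roc_auc_score(df["target"], oof)
```

Scoring all OOF predictions jointly (instead of averaging per-fold scores) matches how the leaderboard metric is computed over the whole test set.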
Many people shared their submission notebooks publicly, which truly helped clarify the doubts that often linger even after people have tried their best (or not) to explain their methodology. It also helps you learn about minor implementation details that often get left out of big-picture discussions. Additionally, reading other people's code is a great way of improving your own coding skills; I personally learned quite a bit. Finally, Kaggle Kernels offer both GPUs and TPUs within a limited quota. This is awesome, as it removes the need for on-prem infrastructure or the ability to spend a lot of money on cloud VMs.
I was truly surprised to see how willing people were to engage with each other, often sharing their code and data to provide a very good starting point for other people. People also shared their approaches while the competition was still ongoing while also discussing what worked for them and what didn’t. Considering the fact that this is a competition with a monetary prize associated with it, I was truly taken aback by how collaborative the nature of discussions was and how kind everyone was to any newcomer asking silly questions as well.
Finally, I just want to say that it was amazing to have found a community of doers — people who focus on actually getting the job done, with whom I could have deep, meaningful technical discussions (and jokes). I have struggled to find the right online community for myself and although I am yet to find my footing there, I definitely know that I am here to stay! :)