Tips for Kagglers From A Solo Silver Medal Winner

How I got started with the competition, my struggles, journey to the final solution and everything I learned in between. Made by Aman Dalmia using W&B

I recently took part in the SIIM-ISIC Melanoma Classification challenge hosted on Kaggle during the last 2 weeks of the competition and was able to secure my first silver 🏅 on the platform. This was only my 2nd Kaggle competition and the first one in 4 years.


The point of mentioning this is merely to indicate what is possible even within such a short time frame. It was a tremendous learning opportunity for me, and I regret not being a part of this earlier. The goal of this post is to walk you through my journey from starting the competition to its final phase (my initial struggles, making progress step by step, and arriving at my final submission) and to share everything I learned along the way, especially about the Kaggle community.

I hope this blog post can convince you that irrespective of what you might think of yourself, you are always welcome in the Kaggle community, and the best way to start learning is by just diving right in and doing something.

Participating in the Competition

The main reason that I had not participated in a Kaggle competition seriously before was that I was not sure what I could learn from it that I would not learn from my everyday work as an AI researcher. Hence, most of my time was spent entirely at work.

However, over time, I have realized that I have not been able to stay up to date with the latest practical tips for improving model performance, as most of my time at work is dedicated to data, compounded by the data size being relatively small. So, I needed a playground to test out my wild ideas and learn about the techniques that work in practice.

This was a significant reason for me to explore Kaggle, and I found the SIIM-ISIC Melanoma Classification challenge to be the perfect starting point. The binary image classification task was simple enough for me to get a feel for the platform while still giving me a realistic chance of performing well, since I have been working on computer vision for many years.


Initial Struggles 😔

Contrary to what I had expected, I felt extremely overwhelmed seeing the number of discussions, the amount of knowledge sharing (in the form of data and code), and the current best leaderboard scores. I wanted to keep track of everything and understand it all in one day to be at the same place as everyone else. I ended up putting a lot of pressure on myself, which made me lose my initial motivation.

Slowly Making Progress 🤞

Thankfully, I realized quickly that I could not get a hold of everything and that there would be things I was missing out on. I convinced myself to be comfortable knowing that I would not know everything there is to know. I started slowly by creating the right cross-validation data for validating model performance. Creating the right split was crucial for this competition, and as I will mention later, the cross-validation score was critical in determining the final outcome. Then, I created a baseline model using ResNet18 [1] with a basic input processing pipeline. This helped me get an AUC of 0.84 on the public leaderboard (LB) and allowed me to complete the pipeline of submitting a result.
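As a rough sketch (not my exact code), assuming the competition's train metadata CSV with its `patient_id` and `target` columns, a grouped and stratified split can be built with scikit-learn so that images from the same patient never end up in both the training and validation folds:

```python
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold  # scikit-learn >= 1.0

df = pd.read_csv("train.csv")  # competition train metadata

# Group by patient so all images of a patient land in the same fold,
# and stratify on the target to keep the class ratio similar across folds.
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
df["fold"] = -1
for fold, (_, val_idx) in enumerate(
    sgkf.split(df, y=df["target"], groups=df["patient_id"])
):
    df.loc[val_idx, "fold"] = fold

df.to_csv("train_folds.csv", index=False)
```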


I usually breathe a sigh of relief once the end-to-end pipeline of anything that I am working on is complete. This gives me the flexibility to tune individual components while being sure that there are no unknown unknowns down the road that would need my attention. So, I was happy after doing this. Going forward, I kept adding new components one at a time, and below I describe what finally landed me in the top 5% out of 3000+ competitors.

Before describing the final solution, I want to share a philosophy that helped me iterate faster:

Ensembling almost always helps. Model ensembling has a theoretical justification for improving on the current set of base models; this blog post is an excellent reference. Simple techniques, like combining diverse models (models with low correlation in their predictions) or combining the same model trained on different input sizes, can provide a significant lift.

Arriving at the final solution 🗻

Once I had a solid base, I played around with the different aspects of the entire pipeline: the model, optimization, image sizes, using external data (data that is not a part of the competition), data augmentation, and handling data imbalance. My best model was an ensemble of 4 models, obtained by taking the mean of the predictions of the individual models (a sketch of this follows).
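As a minimal sketch of what that ensembling step looks like (the file names are hypothetical), taking the mean of the per-model predictions is only a few lines of pandas/numpy:

```python
import numpy as np
import pandas as pd

# Hypothetical per-model prediction files, each with `image_name` and `target` columns.
files = ["effnet_b5_512_a.csv", "effnet_b5_512_b.csv",
         "effnet_b5_512_c.csv", "effnet_b5_384.csv"]
preds = [pd.read_csv(f).sort_values("image_name")["target"].to_numpy() for f in files]

submission = pd.read_csv(files[0]).sort_values("image_name")[["image_name"]]
submission["target"] = np.mean(preds, axis=0)  # simple mean over the 4 models
submission.to_csv("submission.csv", index=False)
```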

Image Sizes

3 models were trained on 512x512 images and 1 model on 384x384.

Augmentations

2 models use the following augmentations (using kornia):

- name: Rescale
  params:
      value: 255.
- name: RandomAffine
  params:
      degrees: 180
      translate:
        - 0.02
        - 0.02
- name: RandomHorizontalFlip
  params:
      p: 0.5
- name: ColorJitter
  params:
      saturation:
        - 0.7
        - 1.3
      contrast:
        - 0.8
        - 1.2
      brightness: 0.1
- name: Normalize
  params:
      mean: imagenet
      std: imagenet

The meaning of the different parameters can be found in kornia's documentation. The other 2 models additionally use Cutout [3]; a sketch of the full pipeline follows the config below:

- name: RandomErasing
  params:
      p: 0.5
      ratio:
        - 0.3
        - 3.3
      scale:
        - 0.02
        - 0.1
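
Below is a rough sketch (my reconstruction, not the exact training code) of how these configs might map to kornia modules: the `Rescale` step is written as a plain division by 255, and the `RandomErasing` line corresponds to the Cutout config used by 2 of the 4 models.

```python
import torch
import torch.nn as nn
import kornia.augmentation as K

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406])
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225])

# Kornia augmentation modules are nn.Modules, so they compose with nn.Sequential
# and operate on batched tensors of shape (B, C, H, W).
augment = nn.Sequential(
    K.RandomAffine(degrees=180.0, translate=(0.02, 0.02), p=1.0),
    K.RandomHorizontalFlip(p=0.5),
    K.ColorJitter(brightness=0.1, contrast=(0.8, 1.2), saturation=(0.7, 1.3)),
    # Cutout-style augmentation, used by 2 of the 4 models.
    K.RandomErasing(p=0.5, scale=(0.02, 0.1), ratio=(0.3, 3.3)),
    K.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
)

# images: uint8 batch in [0, 255]; dividing by 255 plays the role of `Rescale`.
images = torch.randint(0, 256, (4, 3, 512, 512), dtype=torch.uint8)
batch = augment(images.float() / 255.0)
```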

Data Sampling

To iterate faster and handle the data imbalance, instead of upsampling the minority class, I downsample the majority class every epoch. However, to avoid wasting data, I sample different instances from the majority class in each epoch (see the sketch below).
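Here is a minimal sketch of that idea, assuming a binary `labels` array aligned with the dataset: before each epoch, rebuild the index list with every minority-class sample and a freshly drawn subset of the majority class.

```python
import numpy as np
from torch.utils.data import DataLoader, SubsetRandomSampler

def epoch_indices(labels: np.ndarray, ratio: int = 1, seed: int = 0) -> np.ndarray:
    """All minority samples plus `ratio` times as many majority samples,
    drawing a different majority subset on every call."""
    rng = np.random.default_rng(seed)
    minority = np.where(labels == 1)[0]
    majority = np.where(labels == 0)[0]
    sampled = rng.choice(majority, size=ratio * len(minority), replace=False)
    return np.concatenate([minority, sampled])

# Hypothetical usage: `dataset` is a torch Dataset and `labels` holds its binary targets.
for epoch in range(20):
    idx = epoch_indices(labels, ratio=1, seed=epoch)  # new majority subset each epoch
    loader = DataLoader(dataset, batch_size=64,
                        sampler=SubsetRandomSampler(idx.tolist()))
    # ... train for one epoch over `loader` ...
```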

Network

The same network architecture is used for all 4 models: EfficientNet-B5 [2] features followed by a Linear layer. I tried a vast range of models, but EfficientNet-B5 was the best single model.
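A sketch of that architecture, assuming the timm library provides the EfficientNet-B5 backbone (the exact pretrained weights and head details here are assumptions):

```python
import timm
import torch
import torch.nn as nn

class MelanomaModel(nn.Module):
    def __init__(self):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.backbone = timm.create_model("efficientnet_b5", pretrained=True, num_classes=0)
        self.head = nn.Linear(self.backbone.num_features, 1)  # single melanoma logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))

model = MelanomaModel()
logits = model(torch.randn(2, 3, 512, 512))  # shape: (2, 1)
```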

Optimization — SuperConvergence

One of the decisions I made for iterating faster was to restrict the number of epochs to 20. To achieve this, I used the OneCycle learning rate scheduler [4]. This scheduler requires one to specify the minimum and maximum learning rates, which I found to be 5e-6 and 2e-4, respectively, using the LR range test. Also, to address overfitting, I apply a weight decay of 0.1. Adam [5] usually obtains good results quickly; however, this paper [6] clarifies how weight decay is not applied correctly in Adam and proposes a modification called AdamW. That is the final optimizer that I used.
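A rough sketch of this setup with PyTorch's built-in AdamW and OneCycleLR; note that OneCycleLR takes the maximum learning rate directly, and the starting (minimum) rate is set via `div_factor` (2e-4 / 5e-6 = 40). The model and step counts below are placeholders.

```python
import torch
import torch.nn as nn

EPOCHS = 20
STEPS_PER_EPOCH = 100            # number of batches per epoch (placeholder)
model = nn.Linear(10, 1)         # stand-in for the EfficientNet-B5 model above

# AdamW applies weight decay in the decoupled way proposed in [6].
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)

# Starting learning rate = max_lr / div_factor = 2e-4 / 40 = 5e-6.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=2e-4, epochs=EPOCHS,
    steps_per_epoch=STEPS_PER_EPOCH, div_factor=40,
)

for epoch in range(EPOCHS):
    for step in range(STEPS_PER_EPOCH):
        optimizer.zero_grad()
        # ... forward pass, compute loss, loss.backward() ...
        optimizer.step()
        scheduler.step()         # OneCycleLR is stepped once per batch
```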

Test Time Augmentation (TTA)

TTA was introduced in fast.ai to improve model performance during inference. Contrary to the general wisdom of turning data augmentation off during inference, TTA keeps it on. Since there is randomness associated with augmentations, relying on a single inference run can lead to wrong conclusions and might even give worse performance. Hence, in TTA, we run inference N_TTA times and combine the predictions across the N_TTA runs to obtain the final prediction for each instance. One simple way of combining is taking the mean of the predictions. For this competition, I used N_TTA = 15. Note that this is computationally very expensive, but it leads to a clear performance improvement for this task.
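A minimal sketch of TTA for this setup, where `model` and `augment` stand in for the trained model and the augmentation pipeline sketched earlier:

```python
import torch

N_TTA = 15

@torch.no_grad()
def predict_with_tta(model, images, augment, n_tta=N_TTA):
    """Run inference n_tta times with augmentation left on and average the probabilities."""
    model.eval()
    probs = [torch.sigmoid(model(augment(images))) for _ in range(n_tta)]
    return torch.stack(probs).mean(dim=0)  # mean over the N_TTA runs
```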


Things I Wanted to try but Could not

That’s it. These components helped me land in the top 5% of the competition in just 2 weeks. If they seem too simple to you, then you are right!

However, there were still many ideas that I wanted to try but couldn't, given the shortage of time; perhaps there is something to learn from them as well.

Some of the Things I Avoided

There were many overly complicated ideas mentioned in the discussions, like generating images from the data and using them as additional training data. These might be good ideas, but I tend to prefer simple solutions, and I wanted to achieve the best possible results using them. I still feel that I could have scored higher simply by using a better ensembling technique.

Learnings

1 - The power of ensembling

Ensembling is the technique of combining the predictions of several independent models/learners. I won’t go into the technical details here as there are many excellent articles that already do that (this one, for example). I noticed people reporting cross-validation (CV) scores much higher than what I was getting with a similar setup. It was only much later that I realized that their best CV score for a single model was still very close to mine. The winners used advanced ensembling techniques like stacking, ranking (this one is specific to the metric being optimized here), and model blending to improve their final performance dramatically. One of the top submissions was actually just a weighted average of 20 public submissions. Thus, one should focus on getting the ensembling recipe right.
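To make the simplest version of this concrete, here is a sketch of a weighted blend of submission files (file names and weights are hypothetical). Averaging ranks rather than raw probabilities is a common trick for an AUC metric, since AUC only depends on the ordering of the predictions:

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata

files = ["sub_a.csv", "sub_b.csv", "sub_c.csv"]   # hypothetical public submissions
weights = [0.5, 0.3, 0.2]

subs = [pd.read_csv(f).sort_values("image_name") for f in files]
blend = subs[0][["image_name"]].copy()

# Rank-average: convert each submission to ranks so that differently
# calibrated submissions live on a comparable scale before averaging.
ranked = [rankdata(s["target"]) / len(s) for s in subs]
blend["target"] = np.average(ranked, axis=0, weights=weights)
blend.to_csv("blend.csv", index=False)
```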

2 - Trust your CV

One of the dramatic moments of the competition was when the private leaderboard results were opened. There was a massive shake-up across the entire leaderboard, with many top submissions dropping significantly and many teams jumping up more than 1000 spots (I myself climbed around 800 spots). This left many people reasonably disappointed, as they had overfitted to the public leaderboard. However, many of the solution overviews posted after the competition ended strongly emphasized the need to focus on the CV score: the public leaderboard can often lie, but the winners found a strong correlation between their CV scores and the private leaderboard ranking. So, the mantra became: “In CV, we trust”. One thing I found out very late is that, contrary to the general method of reporting cross-validation performance by taking the mean and standard deviation across all the folds, there is a better way that is typically used to report the CV score in Kaggle competitions. This notebook illustrates it very nicely. Essentially, for each fold, you save your Out-Of-Fold (OOF) predictions, i.e., the predictions for the instances forming the validation set in that particular fold. At the end of 5 folds, you have a prediction for each sample in the train set, and you compute the metric once on this full set of predictions to get your final CV score.
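A sketch of that OOF recipe, building on the fold column from the earlier split; `train_one_fold` and `predict` are hypothetical helpers standing in for the actual training and inference code:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("train_folds.csv")          # from the CV-split sketch above
oof = np.zeros(len(df))

for fold in range(5):
    val_idx = df.index[df["fold"] == fold]
    # Hypothetical helpers: train on the other folds, predict on this fold.
    model = train_one_fold(df.drop(val_idx))
    oof[val_idx] = predict(model, df.loc[val_idx])

# The final CV score is the metric computed once over all OOF predictions.
cv_auc = roc_auc_score(df["target"], oof)
print(f"OOF CV AUC: {cv_auc:.4f}")
```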

3 - Code sharing through kernels

Many people shared their submission notebooks publicly, which truly helped clarify several doubts that often remain even after people have tried their best (or not) to explain their methodology. It also helps to learn about minor implementation details that often get left out while talking about the big picture. Additionally, reading other people’s code is a great way of improving your own coding skills, and I personally learned quite a bit as well. Finally, Kaggle Kernels offer both GPUs and TPUs for a limited quota. This is awesome, as it removes the need to depend on on-prem infrastructure or to spend a lot of money on cloud VMs.

4 - Discussions and knowledge sharing 💡

I was truly surprised to see how willing people were to engage with each other, often sharing their code and data to provide a very good starting point for others. People also shared their approaches while the competition was still ongoing, discussing what worked for them and what didn’t. Considering that this is a competition with a monetary prize attached, I was truly taken aback by how collaborative the discussions were and how kind everyone was to newcomers asking silly questions.

5 - A community of doers 👨‍💻👩‍💻

Finally, I just want to say that it was amazing to have found a community of doers — people who focus on actually getting the job done, with whom I could have deep, meaningful technical discussions (and jokes). I have struggled to find the right online community for myself and although I am yet to find my footing there, I definitely know that I am here to stay! :)

References

  1. Deep Residual Learning for Image Recognition
  2. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
  3. Improved Regularization of Convolutional Neural Networks with Cutout
  4. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
  5. Adam: A Method for Stochastic Optimization
  6. Decoupled Weight Decay Regularization

General Resources

Cross-validation

Optimization

Ensembling