I recently took part in the SIIM-ISIC Melanoma Classification challenge hosted on Kaggle, joining during the last 2 weeks of the competition, and was able to secure my first silver🏅on the platform. This was only my 2nd Kaggle competition and my first in 4 years.
The point of mentioning this is merely to indicate what is possible even within such a short time frame. It was a tremendous learning opportunity for me, and I regret not being a part of this earlier. This post's goal is to walk you through my journey from the beginning of the competition to its final phase — my initial struggles, starting step-by-step, and arriving at my final submission — and to share everything that I learned in the process, especially about the Kaggle community.
I hope this blog post can convince you that irrespective of what you might think of yourself, you are always welcome in the Kaggle community, and the best way to start learning is by just diving right in and doing something.
The main reason that I had not participated in a Kaggle competition seriously before was that I was not sure what I could learn from it that I would not learn from my everyday work as an AI researcher. Hence, most of my time was spent entirely at work.
However, over time, I have realized that I have not been able to stay updated about the latest practical tips for improving model performance as most of my time at work is dedicated to data, compounded with the data size being relatively small. So, I needed a playground to test out my wild ideas and learn about the techniques that worked in practice.
This was a significant reason for me to explore Kaggle. I found the SIIM-ISIC Melanoma Classification challenge as the perfect starting point. The binary image classification task is simple enough for me to get a feel for the platform and have a realistic chance of performing well as I have been working on computer vision for many years.
Contrary to what I had expected, I felt extremely overwhelmed by the number of discussions, the knowledge being shared (in the form of data and code), and the current best leaderboard scores. I wanted to keep track of everything and understand it all in one day so I could be in the same place as everyone else. I ended up putting a lot of pressure on myself, and this made me lose my initial motivation.
Thankfully, I quickly realized that I could not get a hold of everything and that there would be things I was missing out on. I made peace with knowing that I would not know everything there is to know. I started slowly by creating the right cross-validation data for validating model performance. Creating the right split was crucial for this competition, and as I will mention later, the cross-validation score was critical in determining the final outcome. Then, I created a baseline model using ResNet18 with a basic input processing pipeline. This got me an AUC of 0.84 on the public leaderboard (LB) and allowed me to complete the pipeline of submitting a result.
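For this competition, getting the split right meant keeping all images of one patient inside a single fold, since the dataset has multiple images per patient. A minimal sketch of such a split with scikit-learn's `GroupKFold` (the `patient_id` and `target` column names are my assumptions about the dataframe layout, not necessarily the exact setup used):

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

def make_folds(df, n_splits=5):
    """Assign a fold id to every row such that all images from one
    patient land in the same fold (prevents patient-level leakage)."""
    df = df.copy()
    df["fold"] = -1
    gkf = GroupKFold(n_splits=n_splits)
    for fold, (_, val_idx) in enumerate(
        gkf.split(df, groups=df["patient_id"])
    ):
        df.iloc[val_idx, df.columns.get_loc("fold")] = fold
    return df
```

With a group-aware split like this, the validation score is an honest estimate of performance on unseen patients rather than unseen photos of already-seen patients.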
I usually breathe a sigh of relief once the end-to-end pipeline of anything that I am working on is complete. This allows me the flexibility to tune individual components and be sure that there are no unknown unknowns down the road that would need my attention. So, I was happy after doing this, and going forward, I kept adding new components to it one at a time, and I’ll describe what finally landed me in the top 5% out of 3000+ competitors.
Before describing the final solution, I want to share three philosophies that helped me iterate faster:
The first one is a piece of advice from Jeremy Howard of fast.ai: practically, you often do not need all of the data, or even the entire input, to arrive at decent performance. This could mean using a fraction of the dataset and a smaller resolution (say, 224 x 224) instead of the entire image, which could be 1024 x 1024 and take a long time to load and process.
The second philosophy is widely known to truly matter for improving model performance. However, I will still state it explicitly — feed the data correctly, remember that more data helps, use the right data augmentations, find the right optimization recipe, and identify the right model class. Focusing on just these things should get you pretty decent performance; it is all about finding the right mix. Once I found the right recipe on images with a smaller resolution, I used the same recipe with a larger image resolution, which is expected to improve performance.
The third: ensembling almost always helps. Model ensembling has a theoretical justification for improving over the current set of base models; this blog post is an excellent reference. Simple techniques like combining diverse models (models with low correlation in their predictions) or combining the same model trained on different input sizes can provide a significant lift.
Once I had a solid base, I played around with different aspects of the entire pipeline — model, optimization, image sizes, external data (data that is not part of the competition), data augmentation, and handling data imbalance. My best model was an ensemble (taking the mean of the predictions of the individual models) of 4 models.
3 models were trained on 512x512 images and 1 model on 384x384.
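Mean ensembling is as simple as it sounds — align the per-instance predictions of the models and average them. A minimal sketch:

```python
import numpy as np

def mean_ensemble(model_preds):
    """Average the per-instance predictions of several models.
    model_preds: list of 1-D arrays, one per model, aligned by instance."""
    return np.mean(np.stack(model_preds, axis=0), axis=0)
```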
2 models use the following augmentations (using kornia):

```yaml
- name: Rescale
  params:
    value: 255.
- name: RandomAffine
  params:
    degrees: 180
    translate:
      - 0.02
      - 0.02
- name: RandomHorizontalFlip
  params:
    p: 0.5
- name: ColorJitter
  params:
    saturation:
      - 0.7
      - 1.3
    contrast:
      - 0.8
      - 1.2
    brightness: 0.1
- name: Normalize
  params:
    mean: imagenet
    std: imagenet
```
The meaning of the different parameters can be found in kornia's documentation. The other 2 models additionally use Cutout:
```yaml
- name: RandomErasing
  params:
    p: 0.5
    ratio:
      - 0.3
      - 3.3
    scale:
      - 0.02
      - 0.1
```
To iterate faster and handle data imbalance, instead of upsampling the minority class, I downsample the majority class each epoch. To avoid wasting data, I sample a different subset of majority-class instances every epoch.
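A minimal sketch of that per-epoch sampling (the 1:1 class ratio and the 0/1 label convention are my assumptions, not necessarily the exact recipe used):

```python
import numpy as np

def epoch_indices(labels, rng):
    """Keep every minority (positive) sample and draw a fresh random
    subset of the majority class each epoch, so different majority
    instances are seen over the course of training."""
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    neg_sample = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, neg_sample])
    rng.shuffle(idx)
    return idx
```

Calling this at the start of every epoch (with the same long-lived `rng`) yields a balanced epoch while cycling through the majority class over time.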
The same network architecture is used for all 4 models — EfficientNet-B5 features followed by a Linear layer. I tried a vast range of models, but EfficientNet-B5 was the best single model.
One of the decisions I made for iterating faster was to restrict the number of epochs to 20. To achieve this, I used the OneCycle learning rate scheduler. This scheduler requires one to specify the minimum and maximum learning rates, which I found to be 5e-6 and 2e-4, respectively, using the LR-range test. Also, to address overfitting, I applied a weight decay of 0.1.
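In PyTorch this scheduler is `torch.optim.lr_scheduler.OneCycleLR`. The sketch below only illustrates the shape of the schedule between the two learning rates found by the LR-range test; the 30% warm-up fraction is an arbitrary assumption, and the real scheduler additionally anneals to a final LR below the initial one:

```python
import numpy as np

def one_cycle_lr(step, total_steps, lr_min=5e-6, lr_max=2e-4, pct_up=0.3):
    """Cosine one-cycle schedule: warm up from lr_min to lr_max over the
    first pct_up of training, then anneal back down to lr_min."""
    up_steps = int(total_steps * pct_up)
    if step < up_steps:
        t = step / max(1, up_steps)                                  # 0 -> 1
    else:
        t = 1 - (step - up_steps) / max(1, total_steps - up_steps)   # 1 -> 0
    # cosine interpolation between lr_min (t=0) and lr_max (t=1)
    return lr_min + (lr_max - lr_min) * (1 - np.cos(np.pi * t)) / 2
```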
Adam usually obtains the best results quickly. However, this paper shows that weight decay is not applied correctly in Adam and proposes a modification, AdamW. That is the final optimizer that I used.
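The difference is easiest to see in a single update step — in AdamW, the decay term is decoupled from the adaptive gradient scaling and applied directly to the weights. A minimal numpy sketch (not the library implementation):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=2e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update. Unlike Adam-with-L2, the weight_decay * w term
    is added outside the adaptive (sqrt(v_hat)) scaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

A quick way to see the decoupling: with a zero gradient, the weight still shrinks by `lr * weight_decay * w`, whereas in Adam the decay would be rescaled by the adaptive term.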
TTA (Test-Time Augmentation) was popularized by fast.ai as a way to improve model performance during inference. Contrary to the general wisdom of turning data augmentation off during inference, TTA keeps it on. Since there is randomness associated with augmentations, relying on a single inference run can lead to wrong conclusions and might even give worse performance. Hence, in TTA, we run inference N_TTA times and combine the predictions across the N_TTA runs to obtain the prediction for each instance. One simple way of combining is taking the mean of the predictions. For this competition, I used N_TTA = 15. Note that this is computationally very expensive, but it led to a clear performance improvement for this task.
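A sketch of the TTA loop — `predict_fn` and `augment_fn` are hypothetical stand-ins for the model forward pass and the random augmentation pipeline:

```python
import numpy as np

def tta_predict(predict_fn, augment_fn, images, n_tta=15, seed=0):
    """Run inference n_tta times with random augmentation left ON,
    then average the predictions per instance."""
    rng = np.random.default_rng(seed)
    preds = [predict_fn(augment_fn(images, rng)) for _ in range(n_tta)]
    return np.mean(np.stack(preds, axis=0), axis=0)
```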
That’s it. These components helped me land in the top 5% of the competition in just 2 weeks. If they seem too simple to you, then you are right!
However, there were still many ideas that I wanted to try but couldn't, given the short time frame. I'm listing them here in case there is something to learn from them:
There were many overly complicated ideas mentioned in the discussions, like generating images from the data and using them as additional training data. These might be good ideas, but I tend to prefer simple solutions, and I wanted to achieve the best possible results with them. I still feel that I could have gotten a higher score simply by using a better ensembling technique.
Ensembling is the technique of combining the predictions of several independent models/learners. I won't go into technical details here as there are many excellent articles that already do that (this one, for example). I noticed people reporting cross-validation (CV) scores much higher than what I was getting with a similar setup. It was only much later that I realized that the best CV score for a single model was still very close to mine. The winners used advanced ensembling techniques like stacking, ranking (this one is specific to the metric being optimized here), and model blending to dramatically improve their final performance. One of the top submissions was actually just a weighted average of 20 public submissions. Thus, one should focus on getting the right ensembling recipe.
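Since AUC depends only on the ordering of the scores, rank averaging is a simple version of that metric-specific trick: convert each submission's scores to ranks before averaging, which removes calibration differences between models. A sketch (ties are handled naively):

```python
import numpy as np

def rank_average(submissions):
    """Average the rank (not the raw score) each model assigns to each
    instance; only the ordering matters for AUC."""
    ranks = [np.argsort(np.argsort(s)) / (len(s) - 1) for s in submissions]
    return np.mean(np.stack(ranks, axis=0), axis=0)
```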
One of the dramatic moments of the competition was when the private leaderboard results were opened. There was a massive shake-up of the entire leaderboard, with many top submissions dropping significantly and many teams jumping up more than 1000 spots (I myself climbed around 800 spots). This left many people understandably disappointed, as they had overfitted to the public leaderboard. However, many of the solution overviews posted after the competition ended strongly emphasized focusing on the CV score: the public leaderboard can often lie, but they found a strong correlation between their CV scores and the private leaderboard ranking. So, the mantra became: "In CV, we trust".

I found this out very late: contrary to the general method of reporting cross-validation performance as the mean and standard deviation across all the folds, there is a better way that is typically used to report the CV score in Kaggle competitions. This notebook illustrates it very nicely. Essentially, for each fold, you save your Out-Of-Fold (OOF) predictions (predictions for the instances forming the validation set in that particular fold). At the end of 5 folds, you have a prediction for each sample in the train set, and you compute the metric on this full set of predictions to get your final CV score.
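A sketch of that OOF scoring loop — `train_fn` and `predict_fn` are hypothetical stand-ins, and the `fold`/`target` columns are assumptions about the dataframe layout:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def oof_cv_score(df, n_folds, train_fn, predict_fn):
    """Collect out-of-fold predictions across all folds, then compute
    the metric once on the full set -- the competition-style CV score."""
    oof = np.zeros(len(df))
    for fold in range(n_folds):
        train_df = df[df["fold"] != fold]
        val_df = df[df["fold"] == fold]
        model = train_fn(train_df)
        oof[val_df.index] = predict_fn(model, val_df)  # positional, default index
    return roc_auc_score(df["target"], oof)
```

Scoring all OOF predictions jointly (instead of averaging per-fold scores) matches how the leaderboard metric is computed over the whole test set.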
Many people shared their submission notebooks publicly, which truly helped clarify the doubts that often linger even after people have tried their best (or not) to explain their methodology. It also helps you learn about minor implementation details that often get left out of big-picture discussions. Additionally, reading other people's code is a great way of improving your own coding skills; I personally learned quite a bit. Finally, Kaggle Kernels offer both GPUs and TPUs within a limited quota. This is awesome, as it removes the need for on-prem infrastructure or the ability to spend a lot of money on cloud VMs.
I was truly surprised to see how willing people were to engage with each other, often sharing their code and data to provide a very good starting point for other people. People also shared their approaches while the competition was still ongoing while also discussing what worked for them and what didn’t. Considering the fact that this is a competition with a monetary prize associated with it, I was truly taken aback by how collaborative the nature of discussions was and how kind everyone was to any newcomer asking silly questions as well.
Finally, I just want to say that it was amazing to have found a community of doers — people who focus on actually getting the job done, with whom I could have deep, meaningful technical discussions (and jokes). I have struggled to find the right online community for myself and although I am yet to find my footing there, I definitely know that I am here to stay! :)