How Well Can You Kaggle with Just One Hour a Day?
A whimsical challenge to see how far one can go in a Kaggle competition with limited time, limited resources and pretty much limited everything else.
Why Do This in the First Place?
I've been a Kaggle enthusiast for quite some time now. And while I've been sucked in by the sheer enjoyment and intensity a contest brings, I've really not had much luck leveling up or learning in the process. Why is that? Well, the answer is simple: I have had no time to dedicate to Kaggling and thus resorted to shortcuts.
I've often heard that participating in Kaggle competitions is like a full-time job in itself, and boy oh boy is that true. To give you a bit of perspective, I'm an average Joe with a full-time job and family commitments. In the little time that I occasionally get, I like to dabble in Kaggle competitions. My modus operandi until this point has been to find competitions that I find interesting, and then tweak the living daylights out of public kernels/notebooks in the time I have. I used to rinse and repeat this approach, and, safe to say, it got me nowhere, and rightly so!
At this point, I realized that I had to change my approach to Kaggle. There are several blogs and tutorials out there from generous Kaggle masters and grandmasters, full of quality advice, but what's often missing is the raw experience of a time- and resource-crunched newbie/enthusiast who's struggling through a competition and trying to learn something in the process. For example, what didn't work? What went wrong? How did you get past frustrations?
Which brings us to this blog post: I decided to see how far I could get in a competition by doing things the "right" way, albeit with some rules.
The Rules
Historically, the two excuses I used for not doing better in Kaggle competitions were that I had too little spare time and no dedicated GPUs. The question is: was that accurate?
To see if there was any truth to this, I set the following rules for myself while competing:
- Work just 1 hour a day for 60 days (This does not include the time it takes for models to train but only the active time spent coding, analyzing, etc.)
- If I missed a day, too bad. That one hour is lost to the ether and cannot be carried over.
- Log and document every step, collect references, and attribute credit where due
- No blindly modifying and reusing existing kernels (especially top scoring ones)
- Only use Kaggle kernels or Google Colab for compute resources
- Only use the public leaderboard to evaluate current position and strategy
- Strictly abide by the rules of the contest
- Use the discussion forums
- Rest adequately
So, 60 hours of active time to work on a contest using only publicly available compute.
The Competition
To avoid the pressure and the temptation to work longer, I chose a competition that had already finished: in this case, the Humpback Whale Identification challenge. Why? Well, this competition was challenging, but its dataset size would be manageable on publicly available compute resources. Plus, it gave me the opportunity to learn about few-shot classification.
The premise of the challenge is deceptively simple: identify a whale from an image of its "fluke" (basically, its tail). You'd be evaluated by your Mean Average Precision @ 5 score (mAP@5). Essentially, you'd be rewarded if your model's top 5 guesses contained the whale in question, and rewarded more the higher the correct whale appeared in that ranking. Therefore, the order of the predictions matters. Having chosen this challenge, I set about creating a journal to keep track of my daily progress as well as the things I had to do next.
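To make the metric concrete, here's a minimal sketch of mAP@5 for this competition's single-label setup (my own illustration, not the official scoring code): each image scores 1/(rank of the first correct guess) within the top 5, or 0 otherwise, and the leaderboard score is the mean over all images.

```python
def map_at_5(predictions, labels):
    """predictions: list of lists of up to 5 whale ids, ordered by confidence.
    labels: list of the corresponding ground-truth whale ids."""
    total = 0.0
    for preds, label in zip(predictions, labels):
        for rank, pred in enumerate(preds[:5], start=1):
            if pred == label:
                total += 1.0 / rank
                break  # only the first correct guess counts
    return total / len(labels)

# e.g. correct at rank 1 scores 1.0, at rank 3 scores 1/3, missing scores 0:
print(map_at_5([["w_1", "w_2"], ["w_9", "w_3", "w_4"]], ["w_1", "w_4"]))  # ≈ 0.667
```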
The Setup
Here is my setup in five quick steps:
- Compute resources: Google Colab + Kaggle kernels (~30 hours/week of GPU time (not coding time!))
- Journaling: Notion page (see gif above) where I kept an active to-do list, a daily journal section and a references section
- Software resources: PyTorch, FastAI (eventually) and Weights & Biases
- Lots of ☕
- One pomodoro-style timer to stop me at 60 minutes each day.

Insane productivity hack
The Process
In this section, I'll walk you through what I tried each week, along with the learnings, results, and anything that might be interesting to you as an observer.
Broadly: my personal challenge lasted nine weeks, although technically, the last week was all of 3 days. For each week, I had a to-do list which would be as concise as possible and a daily journal documenting my work for that day. The timer made sure that I stopped at 60 minutes each day.
At the end of each week, I'd check off the things I finished on my to-do list for that week, and carry over the incomplete ones to the next. Over the course of time, I tried a surprisingly large number of things and when I look back now, I can see how I could have used my time better (don't we all?).
Week 1 - Exploratory Data Analysis:
In the first week, I tried to keep things simple. My big goals were to ingest the data, log things on W&B and analyze the data to see what I could glean from it.
Initially, I wanted to write my pipeline from scratch using PyTorch, Pandas and the like. After all, wasn't the goal of this exercise to learn something? Buoyed by my initial optimism, I forged ahead and started putting the data loader and logging hooks in place. Additionally, I ran some experiments to get a better sense of the data.
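For a flavour of what that from-scratch start looked like, here's a hedged sketch of a pandas-backed PyTorch Dataset with W&B logging wired in. The column names (Image, Id) come from the competition's train.csv; the project name and transforms are placeholders, not my exact setup.

```python
import pandas as pd
import torchvision.transforms as T
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import wandb

class WhaleDataset(Dataset):
    """Pandas-backed dataset over the competition's train.csv (Image, Id columns)."""
    def __init__(self, csv_path, image_dir, transform=None):
        self.df = pd.read_csv(csv_path)
        self.image_dir = image_dir
        self.transform = transform
        # map whale ids (w_xxxx strings) to integer class indices
        self.classes = sorted(self.df["Id"].unique())
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(f"{self.image_dir}/{row['Image']}").convert("RGB")
        if self.transform:
            img = self.transform(img)
        return img, self.class_to_idx[row["Id"]]

wandb.init(project="humpback-whale")  # hypothetical project name
train_ds = WhaleDataset("train.csv", "train", transform=T.ToTensor())
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=2)
# ...and inside the training loop: wandb.log({"train_loss": loss.item()})
```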
For example, the panel above shows a random sample of training images. Immediately, you can see that the images are taken from various poses, are sometimes in grayscale, have text overlaid on them, and aren't always sharp. Additionally, I found that the dataset was extremely imbalanced: one of the 5005 classes had more than 9k images while some classes had only a single image (a quick sanity check for this is sketched right after the list below). This gave me some options to consider:
- Discard low sample classes (I tried this but it made things worse)
- Account for blur, pose, aspect ratio etc.
- Handle grayscale images differently
- Rebalance the dataset through augmentation (what I ended up doing)
- Remove the new_whale class and use other ways to predict this class (what I ended up doing)
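As referenced above, here's the kind of quick check that reveals the imbalance (a sketch assuming the competition's train.csv with its Image/Id columns):

```python
import pandas as pd

counts = pd.read_csv("train.csv")["Id"].value_counts()
print(counts.head(3))        # new_whale alone accounts for ~9k images
print(len(counts))           # ~5005 distinct whale ids
print((counts == 1).sum())   # how many classes have just a single image
```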
Weeks 2 & 3 - Pipeline Building and Dataset Cleaning:
Over the next couple of weeks, I hunkered down and wrote my training and inference pipelines using vanilla PyTorch. Given that my primary sources of compute were Kaggle and Colab, I had to write things out in Jupyter notebooks and then refactor cells into scripts.
There were benefits and drawbacks to this approach. For starters, I could thoroughly test everything and maintain changes neatly in a git repository. Additionally, some of these functions could be reused for future contests. However, given that I used only Kaggle kernels and Colab, it was very difficult to integrate git repositories with either of these and manage changes (at least at the time of my attempting the contest).
Another sticking point was that I had only one hour a day, and the time it took to move back and forth between scripts and notebooks kept compounding. To maximize my time, I begrudgingly bit the bullet and stuck to using Jupyter notebooks.
Three key insights here:
- A model trained on the whole training set almost always predicted new_whale as one of its guesses. Why? There were 9k examples of this class in the data, and this dwarfed the other classes. Thus, the model blindly began to predict this class for almost every example in the validation and test sets (public mAP@5 0.541). This appeared to be a common trend in the discussion forums too, so I used the wisdom of the crowd and removed all the new_whale images. At inference time, I would use a threshold on the confidence scores of the model to decide whether the given example was a new_whale (more on this later).
- If I wanted to use stratified K-fold cross validation (CV), I'd have to handle the low-sample classes, as some of them had just one image. For this, I tried dropping all classes with fewer than 5 images and then using K-fold CV. However, this deteriorated the score significantly (public mAP@5 0.452). In the discussion forums, there were several posts on using an object detector to detect and crop a tight area around the whale and omit the background. Through some digging, I found an absolute gem of a notebook from Martin Piotte which trained a model to do just this. Given that the competition was a couple of years old at the time, I updated this notebook to work correctly and trained the model to detect and crop the whale out of the image. This improved the score to 0.486.
- Instead of removing the low-sample classes, I tried using data augmentation with Albumentations to increase the sample count of the underrepresented classes. This yielded a significant improvement: with just 20 epochs of training, my model scored an mAP@5 of 0.611.
To sum up, my data preprocessing was:
- Remove all new_whale images from the training set.
- Crop all the remaining images using the trained bounding box detector model (Note: the test images were also cropped to make sure the model wasn't surprised by background scenery 😅 ).
- Augment all classes with fewer than 20 images so that each has at least 20 images.
- Stratified 5-fold CV.
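Here's a hedged sketch of that preprocessing condensed into one script. crop_whale() stands in for the bounding-box detector adapted from Martin Piotte's notebook, and the specific augmentations and counts are illustrative rather than my exact settings.

```python
import cv2
import pandas as pd
import albumentations as A
from sklearn.model_selection import StratifiedKFold

# 1. drop every new_whale row from the training set
df = pd.read_csv("train.csv")
df = df[df["Id"] != "new_whale"].reset_index(drop=True)

# 2. each image (train and test) is cropped with the trained detector first,
#    e.g. cropped = crop_whale(cv2.imread(path))  # placeholder function

# 3. random augmentations used to pad out the rare classes
augment = A.Compose([
    A.RandomBrightnessContrast(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=10,
                       border_mode=cv2.BORDER_REFLECT_101, p=0.7),
])

def oversample(frame, min_count=20):
    """Duplicate rows of rare classes until each has >= min_count images;
    every duplicate is pushed through `augment` when it is materialized."""
    extra = [g.sample(min_count - len(g), replace=True)
             for _, g in frame.groupby("Id") if len(g) < min_count]
    return pd.concat([frame] + extra, ignore_index=True)

df = oversample(df)

# 4. stratified 5-fold split on the whale ids
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(df["Image"], df["Id"]))
```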
Weeks 4 & 5 - Stunted Progress and Debugging:
Easy Pickings
With the data preprocessing kind of solidified, I next focused on trying out some quick experiments to see where the best gains were to be had. To conduct "fair" comparisons, I fixed the training to 20 epochs, and evaluated each setup across all the folds.
For instance, I tried out different architectures: Resnet-18, Resnet-34, Resnet-50, VGG-16 (all pre-trained on ImageNet, of course), and so on. You might be wondering why I didn't try the latest and greatest. Given finite compute resources and a time-out limit (~9 hours on Kaggle and ~12 hours on Colab), there were only certain models I could fit in memory with a reasonable batch size while still getting something meaningful out of the exercise. I also tried different loss functions (vanilla cross-entropy, focal loss, etc.) and image sizes (224, 384, 512). Here's what I found from these experiments:
- Image size was a big influencer, with mAP@5 increasing steadily as the image size increased. I decided to run all my experiments at the smallest size (224) and then, towards the end, retrain the models at the larger dimensions to push the score.
- Model architecture played a smaller but still significant role, with the larger models (Resnet-50, for example) doing better than the smaller ones.
- Interestingly, focal loss, which is typically used for imbalanced-class problems, didn't help at all.
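For reference, this is the standard multi-class focal loss formulation, roughly what I plugged in (gamma=2.0 is the usual default, not a value I tuned):

```python
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: down-weights easy, well-classified examples."""
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, reduction="none")            # per-sample CE
    pt = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)  # p of true class
    return ((1 - pt) ** gamma * ce).mean()
```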
Wish List
In addition to these experiments, I wanted to try some tricks such as mixup, label smoothing and progressive resizing to see if these helped further. Given that a lot of top scores are a result of ensembling, I wanted to have a variety of models to ensemble at a later stage. There was a lot of chatter in the forums about metric learning and contrastive learning and this was another direction I wanted to try if time permitted.
Unintended Time Sink(s)
Despite some promising directions to explore, I was hampered by two things.
First: my implementation of the contest metric was way too optimistic. For example, for a public score of 0.54, my predicted CV score was 0.8. This made me question both my code and, possibly, my CV split.
The second issue was that, no matter what, the public score did not seem to cross 0.7, while a lot of posts in the discussion forum indicated that this was quite possible with the combinations I had tried so far. For the former issue, I found two better implementations of the metric in the public kernels and on GitHub (thanks, Radek!) and used them, which made my CV scores more meaningful. For the second, I plotted my images after augmentation and was shocked to find several artifacts caused by the choice of padding. Below are two sets of images before and after I corrected the artifacts:
This really set me back, because now I had to recreate my dataset and, on top of that, retry all the previous experiments to see if my findings were still valid. For the rest of these two weeks, that's all I had time to do: correct the dataset and re-validate my findings 😣.
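For context, the padding behaviour lives in the augmentation's border_mode argument. The snippet below is only an illustration of the kind of difference a padding mode can make (not my exact settings): constant zero-padding leaves flat black regions wherever a shift or rotation exposes the border, while reflection padding fills them with mirrored image content.

```python
import cv2
import albumentations as A

# Constant (zero) padding: exposed borders become flat black bands
artifact_prone = A.ShiftScaleRotate(shift_limit=0.2, rotate_limit=15,
                                    border_mode=cv2.BORDER_CONSTANT, value=0, p=1.0)
# Reflection padding: exposed borders are filled with mirrored image content
safer = A.ShiftScaleRotate(shift_limit=0.2, rotate_limit=15,
                           border_mode=cv2.BORDER_REFLECT_101, p=1.0)
```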
Weeks 6 & 7 - FastAI to the Rescue:
Having gotten past the setbacks, I was faced with the daunting task of trying out some of the more useful tricks mentioned above and also retraining my final models with larger-resolution images. At this point, I had to make some hard calls. I decided to abandon the pursuit of metric and contrastive learning and focus only on improving my pure classification-based models.
Additionally, I figured that the time it would take for me to implement, test and use some of the fancier tricks would be prohibitive. I thus decided to turn to FastAI and use their wonderful framework for the remainder of the challenge. For the 1e-7% of you who have not heard of the FastAI library, it is a framework built on top of PyTorch, akin to Keras for TensorFlow. In addition to being a higher-level abstraction that simplifies training and inference, it has implementations of a lot of techniques (mixup, label smoothing, progressive resizing, and freezing/unfreezing weights, to name a few) baked in, which you can use incredibly easily. Armed with this decision and the Deep Learning for Coders book, I went about feverishly trying to improve my models.
Within a couple of hours, I had replicated my vanilla pipeline in FastAI and integrated all the other bells and whistles I needed to debug my model. The new data was much better and yielded an improved score of 0.72-0.73 across all 5 folds after just 20 epochs of training. Here's a summary of what I tried over these two weeks:
- Label smoothing (didn't help as much as I thought it would)
- Mixup (made things worse at least in my case)
- Progressive resizing (what I ended up using in the end)
- Test time augmentation (didn't help)
- Discriminative learning rates (another trick I ended up using in the end)
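Here's a hedged sketch of what that FastAI setup might look like with those pieces wired in (fastai v2 API; train_df, the is_valid column, and all hyperparameters are illustrative placeholders rather than my exact configuration):

```python
from fastai.vision.all import *

# train_df: a prepared dataframe with Image/Id columns and an is_valid flag
# marking the current fold's validation rows (an assumption for this sketch).
dls = ImageDataLoaders.from_df(
    train_df, path="data", folder="train", fn_col="Image", label_col="Id",
    valid_col="is_valid", item_tfms=Resize(224),
    batch_tfms=aug_transforms(size=224), bs=32)

learn = cnn_learner(
    dls, resnet34, metrics=accuracy,
    loss_func=LabelSmoothingCrossEntropy(),  # label smoothing
    cbs=[MixUp()])                           # mixup callback (I later dropped this)

learn.fine_tune(20)  # frozen warm-up, then unfreeze and train

# Discriminative learning rates: small LR for early layers, larger for the head
learn.unfreeze()
learn.fit_one_cycle(5, lr_max=slice(1e-5, 1e-3))
```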
One point I haven't yet brought up is how I predicted the new_whale class. Let's do that now.
Since I had removed all instances of the new_whale class, the model could not predict it during inference. To re-introduce it into the predictions, I used the softmax scores of the predictions as a guide. Given the top 5 predictions and their softmax scores, I'd check whether any of these scores (and the ones that followed below them) fell below a threshold. If so, the new_whale class would be inserted as a prediction right before the first score that dropped under the threshold (sketched below). Note that this was a strategy proposed in the discussion forums and not my own invention. To find the optimal threshold, I ran a grid search for each fold and used that value for the final predictions. By the end of these two weeks, my best score was 0.8 after 20 epochs of training at a resolution of 384 x 384.
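As a rough illustration of that insertion rule (my own sketch of the forum strategy, not anyone's exact code):

```python
def insert_new_whale(top5_ids, top5_scores, threshold):
    """top5_ids / top5_scores: the model's 5 best classes and their softmax
    scores, sorted descending. Returns up to 5 predicted whale ids."""
    preds = []
    for whale_id, score in zip(top5_ids, top5_scores):
        if score < threshold and "new_whale" not in preds:
            preds.append("new_whale")  # slot it in just before the first low score
        preds.append(whale_id)
    return preds[:5]

# The threshold itself came from a per-fold grid search over candidate values,
# picking whichever maximised the validation mAP@5.
```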
Weeks 8 & 9 - The Final Countdown:
Over the course of these final two weeks, I focused on doing longer runs and progressively resizing the model from 224 x 224 up to 512 x 512. This took a lot of trial and error; eventually, the best approach in my case was a run of 65 epochs at 384 x 384 followed by fine-tuning at a resolution of 512 x 512. This involved a lot of painstaking checkpointing. To train a model at 384 x 384 for 65 epochs without time-out issues, I had to get a certain type of GPU on Colab, which was not always a given. So, depending on the GPU I was allotted, I had to choose how many epochs I could train and then checkpoint that model so as not to lose it. Then I would restore the weights and train for the remaining epochs. Clunky and cumbersome, I know. Now, repeat this process for all 5 folds, then fine-tune each fold for 5 epochs at 512 x 512, and you'll be where I am right now.
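The checkpoint-and-resume dance looked roughly like the sketch below (a hedged FastAI sketch of the pattern; checkpoint names, epoch counts, and dls_512, an assumed set of dataloaders built at the larger resolution, are illustrative):

```python
# Session 1: run part of the 384x384 schedule, then save before the time-out
learn.fit_one_cycle(30)
learn.save("fold0_part1")   # writes models/fold0_part1.pth
# (copy the .pth file to Google Drive before the runtime is reclaimed)

# Session 2, fresh runtime: rebuild dls and learn exactly as before, restore,
# and finish the remaining epochs
learn.load("fold0_part1")
learn.fit_one_cycle(35)

# Final step: progressive resizing, swap in 512x512 dataloaders and fine-tune
learn.dls = dls_512
learn.fine_tune(5)
```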
At the end of this process, my best model (based on the public leaderboard) was a Resnet-34 trained as described above with 5-fold stratified CV. I submitted this and the best single fold of this model as my final submissions to the contest. By this point, I was exhausted but chuffed that I had stuck with the challenge and finished it.
Results
- These were my best scores:

Best single fold result

Average of 5 folds
- Hypothetical position on the final leaderboard: 258/2120 -> top 12.16%, missing bronze by 0.012 points
What I Learned
I was thrilled to see how a consistent, small daily investment of time compounded and helped me move up the board. Although the contest had already finished when I attempted it, I still felt the competitive rush while participating. There were a lot of things I feel I could have done differently:
- I should have allowed myself the flexibility of 7 hours a week to be used as I saw fit, versus one hour per day. Creativity and motivation come in bursts, and there were many days when I wanted to go on for longer but couldn't because of the rules I had set for myself. Conversely, there were days when I was exhausted and didn't have the same drive to work on the competition.
- I should have verified my data processing much more thoroughly. Skipping this cost me several hours of debugging time, which in my case was very expensive. Had I caught it sooner, I would have had more time to try more things and perhaps push my score further.
- Libraries like FastAI are godsends in the sense that they accelerate your prototyping by orders of magnitude. However, they come at a cost: you need to invest time in learning them well to get the most out of them.
- I should have thought through how much code I wanted to write from scratch. The tradeoff between learning and scoring high is tricky, and given that I had an hour a day and used Colab and Kaggle kernels, I should not have gone down the path of creating a self-contained repository of scripts for the contest.
- Finally, I'd like to say that both Kaggle kernels and Google Colab make compute freely available to a huge population that doesn't have its own workstations and/or cloud setup. However, they are free for a reason: they are meant to encourage experimentation and research, and are not built for competitions, IMO. It was a laborious process to make sure I could train models and run inference without time-outs derailing me. Additionally, training larger models (EfficientNet and the like) was very difficult, since you usually can't fit them in memory and have to play with the image resolution and/or the batch size just to get an epoch or two of training done. If you want to win on Kaggle, yes, you can use these resources, but it depends on how plucky and determined you are. For those with limited time to spare, like me, this is definitely a dent.
Final Thoughts
This challenge really pushed me and it helped me learn and unlearn things as far as Kaggle is concerned. Going forward, I'd adopt a flexible version of the rules I set for myself here for future Kaggle contests. If you're interested in trying your hand at this challenge, I've made the dataset I created (thanks to innumerable Kagglers) public here. Good luck and may the force be with you!