
I trained on ImageNet for the "first time" - here's what I learnt

Created on April 21|Last edited on April 21
For a long time I have been planning to train my research paper implementations on ImageNet and try to replicate the results reported in the papers. Finally, I managed to train ResNet-RS on the ImageNet dataset.
I implemented the ResNet-RS architecture in TIMM (PyTorch), and the code for the implementation can be found here.
Here are some key points:

How long did it take to train the model?

It took a total of 47 hours on 4 x V100 GPUs! That's around $376 USD worth of compute if you're using GCP (like I did).
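To put those numbers in perspective, here is a quick back-of-the-envelope sketch of the rates they imply. The hourly figures are only derived from the totals above (47 hours, 4 GPUs, about $376); they are not an official GCP price quote, and the variable names are just for illustration.

```shell
# Totals taken from the post; everything below is derived from them.
hours=47        # wall-clock training time
gpus=4          # number of V100 GPUs
total_usd=376   # approximate total cost in USD

# Shell arithmetic is integer-only; these totals happen to divide evenly.
gpu_hours=$((hours * gpus))
usd_per_machine_hour=$((total_usd / hours))
usd_per_gpu_hour=$((total_usd / gpu_hours))

echo "GPU-hours used:       $gpu_hours"
echo "USD per machine-hour: $usd_per_machine_hour"
echo "USD per GPU-hour:     $usd_per_gpu_hour"
```

So the whole machine works out to roughly $8 per hour, or about $2 per GPU-hour, which is handy for estimating what a longer or shorter run would cost.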


What does the training curve look like?

As a first-timer and a newbie, I had no idea what an ImageNet training curve for a ResNet-RS model looks like. It's not something that's usually shared in research papers, so I was worried when my model was only at 2% Top-1 accuracy for the first 10 epochs!
Have a look at the eval-top-1 metric below for yourselves! :)
Apologies that the training is split into 5 runs instead of 1. Every time I lost the SSH connection to the machine, my run crashed. A fix for this is described in the next section.


[Chart: eval-top-1 accuracy during training, shown as a set of 5 runs]



Why did my runs crash, and how do I solve this?

I was using the typical process below:
  1. SSH into the cloud machine
  2. Kick off the training run
But hey, what would happen if you turn off your WiFi or take your laptop out of network range? The training process is tied to the SSH session, so when the connection drops, the run crashes!
As a workaround, use "tmux", which keeps the session alive on the cloud machine even after you disconnect, and follow this process instead:
  1. SSH into the cloud machine
  2. Start a tmux window
  3. Kick off the training run inside the tmux window
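The tmux steps above can be sketched roughly as follows, run on the cloud machine after you SSH in. This is a minimal sketch and not my exact commands: the session name and `TRAIN_CMD` are placeholders, so swap in your real training command. It starts the session detached in one go; the interactive equivalent is to type `tmux`, start training, and press Ctrl-b then d to detach.

```shell
SESSION="imagenet-train"   # hypothetical session name, pick your own
TRAIN_CMD="sleep 2"        # placeholder; replace with your real training command

if ! command -v tmux >/dev/null 2>&1; then
    echo "tmux not found; install it on the cloud machine first"
elif tmux has-session -t "$SESSION" 2>/dev/null; then
    echo "session '$SESSION' already exists; attach with: tmux attach -t $SESSION"
else
    # -d starts the session detached, so the command keeps running
    # on the machine even if the SSH connection drops later.
    tmux new-session -d -s "$SESSION" "$TRAIN_CMD"
    echo "started detached session '$SESSION'"
fi

# Later, after reconnecting over SSH:
#   tmux ls                        # list running sessions
#   tmux attach -t imagenet-train  # reattach and watch the logs
#   Ctrl-b then d                  # detach again, training keeps going
```

One thing to keep in mind: when the command passed to `tmux new-session` exits, its window closes too, so make sure your training script writes its logs to a file (or to W&B) if you want to inspect them afterwards.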