I trained on ImageNet for the "first time" - here's what I learnt
Created on April 21|Last edited on April 21
For a long time I had been planning to train my research paper implementations on ImageNet and try to replicate the results reported in the papers. Finally, I managed to train ResNet-RS on the ImageNet dataset.
I implemented the ResNet-RS architecture in TIMM (PyTorch), and the code for the implementation can be found here.
Here are some key points:
How long did it take to train the model?
It took a total of 47 hours on 4 x V100 GPUs! That's around $376 USD worth of compute if you're using GCP (like I did).
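To put those numbers in perspective, here is a small sketch that derives the hourly rates implied by the figures above. The rates are simply back-calculated from the 47 hours and $376 quoted in this post; they are not official GCP prices, and the function name is made up for illustration.

```python
# Back-of-the-envelope rates implied by the numbers in this post.
# These are derived values, not quoted GCP prices.

TOTAL_HOURS = 47        # wall-clock training time from this post
NUM_GPUS = 4            # 4 x V100
TOTAL_COST_USD = 376    # approximate total compute cost from this post


def implied_rates(total_hours, num_gpus, total_cost_usd):
    """Return (gpu_hours, cost per machine-hour, cost per GPU-hour)."""
    gpu_hours = total_hours * num_gpus
    cost_per_machine_hour = total_cost_usd / total_hours
    cost_per_gpu_hour = total_cost_usd / gpu_hours
    return gpu_hours, cost_per_machine_hour, cost_per_gpu_hour


if __name__ == "__main__":
    gpu_hours, per_hour, per_gpu_hour = implied_rates(
        TOTAL_HOURS, NUM_GPUS, TOTAL_COST_USD
    )
    print(f"GPU-hours: {gpu_hours}")
    print(f"Implied cost per machine-hour: ${per_hour:.2f}")
    print(f"Implied cost per GPU-hour: ${per_gpu_hour:.2f}")
```

Swapping in your own hours, GPU count, and bill gives a rough idea of what a similar run might cost you.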
What does the training curve look like?
As a first-timer, I had no idea what an ImageNet training curve for a ResNet-RS model looks like. It's not something that's usually shared in research papers, so I was worried when my model was only at 2% Top-1 accuracy for the first 10 epochs!
Apologies that the training is split into 5 runs instead of 1. Every time I lost the SSH connection to the machine, my run crashed. A fix for this is described in the next section.
Why did my runs crash? How to solve this?
I was using the typical process below:
- SSH into the cloud machine
- Kick off the training run
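But hey, what happens if you turn off your WiFi or take your laptop out of network range? The SSH connection drops and the run crashes! The fix is to run the training inside tmux, so it keeps going even when the connection is lost: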
- SSH into the cloud machine
- Start a tmux window
- Kick off the training run inside the tmux window
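The tmux steps above can also be scripted. Here is a minimal sketch in Python, assuming tmux is installed on the cloud machine. The session name `imagenet` and the command `python train.py` are placeholders, not the actual command I used, and the helper names are made up for illustration.

```python
import shlex
import subprocess


def tmux_launch_command(session_name, train_command):
    """Build a tmux command that starts train_command in a detached session.

    `tmux new-session -d -s NAME CMD` creates the session without attaching,
    so the process is not tied to the current SSH connection.
    """
    return ["tmux", "new-session", "-d", "-s", session_name, train_command]


def launch_in_tmux(session_name, train_command):
    """Actually start the detached session (requires tmux on the machine)."""
    subprocess.run(tmux_launch_command(session_name, train_command), check=True)


if __name__ == "__main__":
    # Placeholder session name and training command, for illustration only.
    cmd = tmux_launch_command("imagenet", "python train.py")
    # Print the equivalent shell command instead of running it.
    print(shlex.join(cmd))
```

Run this on the cloud machine after SSH-ing in. If the connection drops, SSH back in and reattach with `tmux attach -t imagenet` to see the training output again.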