ResNet Strikes Back: A Training Procedure in TIMM

Today, we're reevaluating the ResNet architecture with the help of the deep learning advances that have been published since ResNet's first moment in the sun.
Aman Arora


The mighty ResNet architecture was introduced in 2015 by He et al. Since then, a lot has changed in computer vision. Deep learning is an ever-evolving field, with new research papers and ideas coming out weekly. And yet, when we talk about the ResNet architecture, we still quote its top-1 ImageNet accuracy from 2015, and that number serves as the baseline for newer proposed architectures.
There has been significant progress on best practices for training neural networks since then. Novel optimization and data augmentation methods have increased the effectiveness of training recipes.
In this research paper, Wightman et al. re-evaluate the performance of the vanilla ResNet-50 when trained with a procedure that integrates such advances.
(As part of this report, we will not be covering what the ResNet architecture is. For an introduction to the architecture, please refer to the paper reading group hosted at Weights and Biases.)

Key contributions

At a high level, this paper:
  1. Proposes three training procedures intended to be strong baselines for a vanilla ResNet-50 used at an inference resolution of 224 × 224.
  2. Builds those procedures from recent advances in the literature as well as new proposals. Notably, instead of using the usual cross-entropy loss, the training solves a multi-label classification problem when using Mixup and CutMix.
  3. Measures the stability of the accuracy over a large number of runs with different seeds, and discusses overfitting by jointly comparing the performance on ImageNet-val with the one obtained on ImageNet-V2.
  4. Trains popular architectures with the proposed procedures and re-evaluates their performance.
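To make the second contribution more concrete, here is a minimal sketch of treating a mixed image as a multi-label problem. It assumes, following the paper's description, that the loss is a binary cross-entropy computed per class and that every class mixed into the image by Mixup or CutMix gets a target of 1. The function names are hypothetical, and plain Python stands in for PyTorch so the snippet runs without extra dependencies.

```python
import math

def multi_hot_target(num_classes, mixed_classes):
    """Target vector with 1.0 for every class present in the mixed image."""
    target = [0.0] * num_classes
    for c in mixed_classes:
        target[c] = 1.0
    return target

def bce_with_logits(logits, target):
    """Mean binary cross-entropy, treating each class as its own yes/no problem."""
    total = 0.0
    for z, t in zip(logits, target):
        # Numerically stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# An image mixed from classes 1 and 3 (out of 4 hypothetical classes).
target = multi_hot_target(4, [1, 3])
logits = [-2.0, 3.0, -1.0, 2.5]
print(target, bce_with_logits(logits, target))
```

Unlike softmax cross-entropy, the per-class targets here do not need to sum to 1, which is what lets both mixed classes count as fully present.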

Training procedures in the paper

As part of this research, the authors test three different training procedures with different costs and performance so as to cover different use-cases. These procedures target the best performance of ResNet-50 when tested at resolution 224 × 224.
Procedure A1 aims at providing the best performance for ResNet-50. It is therefore the longest in terms of epochs (600) and training time (4.6 days on one node with 4 V100 32GB GPUs).
Procedure A2 is a 300-epoch schedule comparable to several modern procedures like DeiT, except with a larger batch size of 2048 and other choices the authors introduce for all of their recipes.
Procedure A3 aims to outperform the original ResNet-50 procedure with a short schedule of 100 epochs and a batch size of 2048. It can be trained in 15 hours on 4 V100 16GB GPUs and could be a good setting for exploratory research or studies.
In the following table, you can compare all details related to the three training procedures above to some used in other research papers.
Table 1: Ingredients and hyper-parameters used for ResNet-50 training in different papers.
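To get a feel for how different the schedules are in practice, the epoch counts and batch size quoted above can be turned into a rough number of optimizer steps. This is only a back-of-the-envelope sketch: the ImageNet-1k training set size is an assumption not stated in this report, the helper name is hypothetical, and it assumes one optimizer step per batch with the last partial batch kept.

```python
import math

# Assumption: size of the ImageNet-1k training split (not stated in the text above).
IMAGENET_TRAIN_IMAGES = 1_281_167

def optimizer_steps(epochs, batch_size, num_images=IMAGENET_TRAIN_IMAGES):
    """Total optimizer steps, assuming one step per batch and a kept partial batch."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    return epochs * steps_per_epoch

# Epochs and batch size for A2 and A3 as quoted in the procedure descriptions.
for name, epochs in [("A2", 300), ("A3", 100)]:
    print(name, optimizer_steps(epochs, batch_size=2048))
```

Under these assumptions A3 takes a third of the steps of A2, which matches its role as the quick recipe for exploratory work.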

Logging configs to W&B

Weights and Biases can take care of these configs very easily for you! It is very simple to log configs to W&B. To learn how to log configs, check out this "Configs in W&B" notebook.
As part of this report, I re-ran training for the A2 configuration on ResNet-50 and logged the metrics and configs using Weights and Biases. The training run can be found here.
When using TIMM, it's really simple to log the config to W&B. After parsing args, all we need to do is run this simple line of code:
wandb.init(project=args.experiment, config=args)
W&B automatically logs the config from ArgumentParser. It looks something like this:
Figure-1: W&B Config
Put simply: Weights and Biases automatically logged the 107 hyperparameter values. This makes it really easy to reproduce experiments in the future! To learn more about logging configs using W&B, refer to the docs here.
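As a rough illustration of what ends up in that config, the sketch below parses a handful of flags and prints the resulting dictionary. The parser is a hypothetical, heavily reduced stand-in for a real training script (the actual run logged 107 values), and the default values are made up for the example. The W&B call itself is left as a comment so the snippet runs without an account.

```python
import argparse

def build_parser():
    # Hypothetical, heavily reduced subset of a training script's flags.
    parser = argparse.ArgumentParser(description="toy training config")
    parser.add_argument("--experiment", default="rsb-demo", type=str)
    parser.add_argument("--model", default="resnet50", type=str)
    parser.add_argument("--epochs", default=300, type=int)
    parser.add_argument("--batch-size", default=2048, type=int)
    return parser

# Parse an explicit list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(["--epochs", "100"])

# The Namespace's attribute dictionary is, in essence, what shows up as the run config.
config = vars(args)
print(config)

# With wandb installed and logged in, the line from the text would then be:
#   wandb.init(project=args.experiment, config=args)
```

Every flag, including those left at their defaults, becomes a key in the logged config, so a later reader can see the full recipe rather than only the overrides.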

Saving and using model artifacts with W&B for training

As part of the ResNet strikes back paper, model weights for ResNets for all three training procedures A1, A2, and A3 have been provided in the TIMM repository here. To use these weights, you need to download them from GitHub locally, which can be slightly cumbersome as compared to using W&B artifacts as shown below.
But first, what are Artifacts? From our docs:
W&B Artifacts was designed to make it effortless to version your datasets and models, regardless of whether you want to store your files with W&B or whether you already have a bucket you want W&B to track. Once you've tracked your datasets or model files, W&B will automatically log each and every modification, giving you a complete and auditable history of changes to your files. This lets you focus on the fun and important parts of evolving your datasets and training your models, while W&B handles the otherwise tedious process of tracking all the details.
This is another benefit we get from integrating with Weights and Biases. During training, all model files are automatically stored in W&B for future use if needed:
Figure-2: W&B Artifact
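Logging artifacts to Weights and Biases is painless! All it takes is 3 lines of code (the values in angle brackets are placeholders for your own artifact name and checkpoint path):
artifact = wandb.Artifact('<artifact-name>', type='model')
artifact.add_file('<path-to-checkpoint>')
wandb.log_artifact(artifact)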
And yes: that's really it! So now after every epoch, we can log the artifact to W&B. Code for the updated CheckpointSaver can be found here.
W&B Artifacts are also great for audit and compliance purposes as everything becomes reproducible! More about Artifacts for compliance and audit purposes can be found in this report here.
As part of this report, I logged ResNet-50 weights for each of the three procedures A1, A2, and A3 to W&B. You can find that here.
Figure-3: ResNet-50 weights for training procedures A1, A2, and A3 on W&B
You can now use these weights simply by downloading these artifacts:
import wandb
api = wandb.Api()
artifact = api.artifact('amanarora/rsb/resnet_50_a1:v0')
artifact_dir = artifact.download()
The last line downloads the model weights to a local directory, and we can then load them using PyTorch and run evaluation on them! By logging artifacts to W&B, it becomes easy for you (or anyone on your team) to download model weights.
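Once the artifact sits in a local directory, the remaining step is to locate the checkpoint file inside it. The helper below is a small sketch of that step. Its name, the list of file suffixes, and the example file names are assumptions made for illustration, since the exact layout of the logged artifacts is not described here.

```python
from pathlib import Path

# Assumption: checkpoints use one of these common PyTorch file suffixes.
CHECKPOINT_SUFFIXES = (".pth", ".pth.tar", ".pt")

def find_checkpoints(artifact_dir):
    """Return checkpoint-like files under artifact_dir, sorted by path."""
    root = Path(artifact_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.endswith(CHECKPOINT_SUFFIXES)
    )

# Typical follow-up, assuming torch and timm are installed and that the
# checkpoint holds a plain state dict (both are assumptions):
#   state = torch.load(find_checkpoints(artifact_dir)[0], map_location="cpu")
#   model = timm.create_model("resnet50")
#   model.load_state_dict(state)
```

Searching recursively keeps the helper working whether the checkpoint was added at the top level of the artifact or in a subfolder.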

Are the models trained using the three procedures any different?

Gradient-weighted Class Activation Mapping (Grad-CAM) serves as a great way for producing "visual explanations" for decisions from a large class of CNN-based models. In this section, we will use the model weights logged above to check if there are any differences between the models trained with these three training recipes.
To see whether the three training procedures A1, A2, and A3 produce noticeably different models, I plotted the Grad-CAM activations on a tiny ImageNette validation dataset in a W&B Table as shown below.
Are you able to spot the differences?
Generally, you will find that the Grad-CAMs for training procedures A1 & A2 are more focused. That is, models trained with these procedures focus on the exact object inside the image that corresponds to the image category, whereas for A3 the Grad-CAMs are more spread out. This is likely because training procedures A1 & A2 run for more epochs and produce more accurate models than A3.
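For reference, the maps compared above follow the standard Grad-CAM definition, restated here from the original Grad-CAM formulation rather than from this report. Here $y^c$ is the score for class $c$ before the softmax, $A^k$ is the $k$-th feature map of the chosen convolutional layer, and $Z$ is the number of spatial positions in that feature map.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_k \alpha_k^c A^k \right)
```

A "more focused" map in this sense is one where the large values of $L^c_{\text{Grad-CAM}}$ concentrate on the object of class $c$ instead of spreading over the background.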


As can be seen, integrating research with experiment tracking tools like Weights and Biases can be really beneficial! Every training run, no matter how long ago it was launched, stays reproducible with all of its configs logged. This means you are always in control and everything is always in one place.
From logging training configs & model artifacts to Grad-CAMs using W&B Tables, there's a lot that this tool can do - and integrating it in your pipelines can only make things easier!