
Finding the New ResNet18

In this article, we try to understand how to choose the right backbone for fine-tuning a model on downstream datasets in an attempt to find the new ResNet18.
With more than 500 pre-trained models on timm, choosing the right backbone is not a simple task.
It depends on what you want to achieve, of course. In this article, we'll limit ourselves to a specific task: fine-tune a model on downstream datasets.


This article is a companion to Jeremy's Kaggle notebook The best vision models for fine-tuning and a continuation of the analysis we did in Leveraging Pre-Trained Models for Image Classification.

TL;DR

Use convnext_tiny or convnext_nano when the dataset is similar to ImageNet. Use one of the small ViT or Swin transformer-based models when it isn't.

Which Model Should We Choose for Fine-Tuning?

How do we choose? Do some models transfer better than others? Are all models pre-trained on the same dataset? I won't pretend to give an exhaustive answer to all of these questions, but we'll try anyway. The idea here is to improve on the defaults that fastai uses for each model, though of course the best fine-tuning technique will likely be model-dependent, and no single recipe will work for every case.
As we saw in Leveraging Pre-Trained Models for Image Classification, newer models like ConvNext and RegNet fine-tune very well and produce accurate results in as few as 5 epochs.
A good starting point is a glance at this Kaggle notebook by Jeremy and Ross, which gives us a useful view of the tradeoff between inference speed and accuracy.
As we move to the right in the chart below, the models get heavier and slower, but accuracy increases. As Jeremy says, start small and iterate fast: ideally, pick something as far to the left as possible, and once your pipeline is solid, increase the model complexity to fit your needs.


Since we want a model to fine-tune on a different dataset, we'll check whether this chart is reproduced when fine-tuning on the Pets dataset. To find out, we'll need to train a bunch of models. The articles below, specifically the one on launching hyperparameter sweeps with fastai, are a great resource for understanding what we're doing here:


Searching for the “Good” Model

Our experiment will be to fine-tune on the Pets dataset for just 5 epochs. We won't try every single model, only the models from the groups above, and we'll filter out models that don't take an image size of 224. We'll perform a sweep using fastai's default fine_tune training strategy. Lastly, we'll explore two ways to attach a model's head as well as different ways to resize the images. Let's start with the head.
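As a side note, here's one way to filter timm models by their expected input size. This is a sketch, not the exact code used to build the sweep's candidate list (the actual list is in the YAML config below):

# A sketch: keep only pretrained timm models whose default config
# expects 224x224 inputs. Instantiating every model is slow, but it
# works across timm versions.
import timm

candidates = []
for name in timm.list_models(pretrained=True):
    model = timm.create_model(name, pretrained=False)
    if model.default_cfg["input_size"] == (3, 224, 224):
        candidates.append(name)
print(f"{len(candidates)} models accept 224x224 inputs")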

Different Ways To Attach a Head

In fastai land, we use the vision_learner method to create a Learner to train an image model. We pass it the DataLoaders and the name of the model we want to use. You can check the article above to get familiar with the fastai and timm integration.
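For context, here's a minimal sketch of this setup on Pets (the regex, batch size, and model choice are illustrative, not the exact sweep code):

from fastai.vision.all import *

# Pets: the breed is encoded in each filename
path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_re(
    path, get_image_files(path/"images"),
    pat=r"^(.*)_\d+\.jpg$",
    item_tfms=Resize(224), bs=64,
)

# Any timm model can be passed to vision_learner by name
learn = vision_learner(dls, "convnext_tiny", metrics=error_rate)
learn.fine_tune(5)   # fastai's default fine-tuning recipe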
Under the hood, vision_learner calls create_vision_model, which stacks a simple head on top of the timm backbone.
  • The body consists of the layers of the model. When using timm, getting it is as simple as calling the forward_features method of the corresponding model.
  • The head aggregates the features extracted by the body and outputs the logits used for classification.
# The standard fastai head from `create_head`
Sequential(
  (0): AdaptiveConcatPool2d(
    (ap): AdaptiveAvgPool2d(output_size=1)
    (mp): AdaptiveMaxPool2d(output_size=1)
  )
  (1): Flatten(full=False)
  (2): BatchNorm1d(768, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (3): Dropout(p=0.25, inplace=False)
  (4): Linear(in_features=768, out_features=512, bias=False)
  (5): ReLU(inplace=True)
  (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (7): Dropout(p=0.5, inplace=False)
  (8): Linear(in_features=512, out_features=37, bias=False)
)

NOTE: We have multiple choices for the pooling layer; we will experiment with the simpler AdaptiveAvgPool2d and fastai's AdaptiveConcatPool2d.
Adaptive Concat Pooling layer: instead of choosing between average and max pooling, this clever fastai layer uses both:

import torch
from torch import nn
from fastai.torch_core import Module  # fastai's nn.Module wrapper that auto-calls super().__init__

class AdaptiveConcatPool2d(Module):
    "Layer that concats `AdaptiveAvgPool2d` and `AdaptiveMaxPool2d`"
    def __init__(self, size=None):
        self.size = size or 1
        self.ap = nn.AdaptiveAvgPool2d(self.size)
        self.mp = nn.AdaptiveMaxPool2d(self.size)
    def forward(self, x): return torch.cat([self.mp(x), self.ap(x)], 1)
Note that this layer's output is twice as big as when using only one pooling layer.
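A quick check of that claim (the feature-map shape here is a hypothetical backbone output):

import torch

x = torch.randn(2, 768, 7, 7)   # a batch of 768-channel feature maps
pool = AdaptiveConcatPool2d()   # the layer defined above
print(pool(x).shape)            # torch.Size([2, 1536, 1, 1]) -- channels doubled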
Average Pooling: as most of the models were pre-trained with average pooling, we'll also explore attaching this type of layer before the head.
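In fastai, switching between the two is a single flag on vision_learner (reusing the dls from the earlier sketch):

# concat_pool=True (the default) uses AdaptiveConcatPool2d;
# concat_pool=False attaches plain average pooling instead
learn = vision_learner(dls, "convnext_tiny", concat_pool=False)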

Different Ways of Resizing the Images

The models we chose expect input images of 224 x 224 pixels. The dataset contains images of various sizes, so we need to resize them before passing them through the model. There are multiple ways of achieving this, but we'll explore two: crop and squish.

  • Crop: resize so that the shorter dimension matches the target size, then crop the longer one (randomly on the training set, center crop on the validation set).
  • Squish: resize each side independently to the target size by interpolating the pixels. This deforms the proportions of the image, but you keep all of the original image content.
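In fastai, these two strategies correspond to ResizeMethod.Crop and ResizeMethod.Squish, passed to the Resize item transform; here's a minimal sketch:

from fastai.vision.all import Resize, ResizeMethod

# The two resize strategies compared in the sweep
crop_tfm = Resize(224, method=ResizeMethod.Crop)      # random crop at train time, center crop at validation
squish_tfm = Resize(224, method=ResizeMethod.Squish)  # distorts aspect ratio but keeps all content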

The Sweeps Config File

This is one of the YAML files used to run the sweep. More on the actual configs can be found on the fastai team's page:
method: grid
parameters:
  concat_pool:
    values:
      - true
      - false
  model_name:
    values:
      - levit_128s
      - regnetx_002
      - regnety_002
      - levit_128
      - vit_small_patch32_224
      - levit_192
      - vit_tiny_r_s16_p8_224
      - regnetx_004
      - resnet18
      - levit_256
      - regnety_004
      - regnetx_006
      - resnet18d
      - regnety_006
      - efficientnet_lite0
      - regnetx_008
      - vit_tiny_patch16_224
      - efficientnet_b0
      - regnety_008
      - resnet34
      - levit_384
      - resnet34d
      - vit_base_patch32_224
      - vit_base_patch32_224_sam
      - resnet26
      - efficientnet_es_pruned
      - efficientnet_es
      - resnet26d
      - regnetx_016
      - regnety_016
      - vit_small_patch16_224
      - resnetv2_50
      - resnet50
      - resnet50d
      - convnext_tiny
      - resnetblur50
      - resnetrs50
      - regnetx_032
      - convnext_tiny_hnf
      - regnetx_040
      - vit_small_r26_s32_224
      - resnetv2_101
      - resnet101
      - resnetv2_50x1_bit_distilled
      - regnetx_080
      - regnetx_064
      - convnext_small
      - resnet50_gn
      - resnet152
      - vit_base_patch16_224_miil
      - vit_base_patch16_224_sam
      - vit_base_patch16_224
      - regnetx_120
      - beit_base_patch16_224
      - regnety_120
      - convnext_base
      - convnext_base_in22ft1k
      - regnetx_160
      - vit_large_r50_s32_224
      - convnext_large_in22ft1k
      - convnext_large
      - regnety_320
      - regnetx_320
      - convnext_xlarge_in22ft1k
      - vit_large_patch16_224
      - beit_large_patch16_224
      - resnetv2_152x2_bit_teacher
      - vit_base_patch8_224
  num_experiments:
    values:
      - 1
      - 2
      - 3
  resize_method:
    values:
      - crop
      - squish
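To launch the sweep, one possible flow with the wandb Python API looks like this (the filename, project name, and train function body are illustrative, not the exact setup used):

import yaml
import wandb

def train():
    run = wandb.init()   # each agent call gets one grid point
    cfg = wandb.config   # model_name, concat_pool, resize_method, num_experiments
    # ... build the DataLoaders and Learner from cfg, then call learn.fine_tune(5)
    run.finish()

with open("sweep.yaml") as f:   # the grid config above; filename is hypothetical
    sweep_config = yaml.safe_load(f)

sweep_id = wandb.sweep(sweep=sweep_config, project="fine-tune-timm")  # project name illustrative
wandb.agent(sweep_id, function=train)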

Results for Pets

Okay, with our setup wrapped up, let's compare performance for a couple of model families.

ResNets

For the old and trusty ResNets, it appears that the crop method coupled with concat pooling worked the best.



ViT

For the ViTs, it appears that the crop method coupled with concat pooling also worked best.



ConvNext

For the new ConvNext, it appears that average pooling works best, also coupled with crop.


Let's dig in a little more:

Aggregated Results

Here we explore the tradeoff between speed, GPU memory, and accuracy. If we filter to models that need less than 4 GB of GPU RAM, we can see why Jeremy liked convnext_tiny so much (it's also my favorite right now).
Citing Jeremy:
As you can see, the convnext, swin, and vit families are fairly dominant. The excellent showing of convnext_tiny matches my view that we should think of this as our default baseline for image recognition today. It's fast, accurate, and not too much of a memory hog. (And according to Ross Wightman, it could be even faster if NVIDIA and PyTorch make some changes to better optimise the operations it relies on!)
I would add that Karpathy's favorite RegNets (like regnetx_032) are also nice, but they are slower to train:



A Very Different Dataset (The Planet Competition Dataset)

This dataset consists of satellite images of the Amazon region, and our task is to classify which types of land cover are present in each image. From the Kaggle competition:
In this competition, Planet and its Brazilian partner SCCON are challenging Kagglers to label satellite image chips with atmospheric conditions and various land cover/land use classes. The resulting algorithms will help the global community better understand where, how, and why deforestation happens worldwide - and ultimately how to respond.
Here the task is a multi-label classification problem, where each image can belong to multiple classes. We'll load a sample of this dataset from fastai; some examples are shown in the wandb.Table below. For each image, the label is 1 if the corresponding class is present and 0 if not:
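Loading this sample follows the standard fastai multi-label pattern; here's a minimal sketch (the model choice and accuracy threshold are illustrative):

from fastai.vision.all import *

# fastai ships a small sample of the Planet dataset
path = untar_data(URLs.PLANET_SAMPLE)
dls = ImageDataLoaders.from_csv(
    path, folder="train", csv_fname="labels.csv",
    label_delim=" ", suff=".jpg",   # space-separated tags -> multi-label targets
    item_tfms=Resize(224),
)

# Multi-label setup: fastai picks BCEWithLogitsLoss automatically
learn = vision_learner(dls, "vit_small_patch16_224",
                       metrics=partial(accuracy_multi, thresh=0.5))
learn.fine_tune(5)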



What Is the Best Model on Planets?

Swin and ViT dominate here. As a hypothesis, I'd say this task is closer to learning from scratch, since transfer learning from ImageNet doesn't add much. The patch-based models are very efficient at learning this small dataset:



Conclusions

We've done quite a bit of experimenting here, and we found that we now have far better models available than the trusty ResNet18. We also explored how the way you attach the head and resize your images impacts the fine-tuning capabilities of the model.
Using concat pooling seems like the safe bet in general, but some models, like the ConvNext family, benefit from average pooling. If we trained longer, this difference would probably fade, as concat pooling "contains" average pooling. I think this is just the tip of the iceberg and serves to challenge the defaults we use.
A possible avenue for exploration is training for longer; maybe after 25 epochs with a lower learning rate, the famous EfficientNets strike back.
My personal favorite model right now is convnext_tiny; it's fast and accurate, ideal for iterating on and improving your machine learning pipeline.
😱 Update: the new convnext_nano arch dethrones convnext_tiny: it is slightly faster to train, uses less GPU memory, and actually fine-tunes to a higher accuracy!
Thanks for reading!