Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference (CVPR 2020)

A reproduction of the CVPR 2020 paper "Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference" by Thomas Verelst et al.
Sachin Malhotra

Reproducibility Summary

Scope of Reproducibility

This is a report for the reproducibility challenge of the CVPR 2020 paper "Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference" by Thomas Verelst et al. It covers each aspect of reproducing the results and claims put forth in the paper. The paper primarily proposes a method to dynamically apply convolutions conditioned on the input image, and introduces a residual block in which a small gating branch learns which spatial positions should be evaluated. The authors have released their code implementation, which can be found in their GitHub repository. This report aims to replicate their reported results, additionally using the wandb library to track various hyperparameters during the training and validation phases.

Methodology

To reproduce the paper, we used the original implementation provided by the authors in their GitHub repository. The paper describes the models and datasets used to train the pixel-wise gating masks. We made changes to their code to add support for experiment tracking via the Weights & Biases API. For our experiments, we used an NVIDIA Tesla P100 GPU with 16 GB RAM on the Google Cloud Platform (GCP). Moreover, we used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard. Due to limited time and compute resources, we reproduced the results mainly on the CIFAR-10 dataset.

Results

We were able to reproduce the main results reported in the paper on GPU. As suggested in the paper, we use the CIFAR-10 dataset, which includes 50k training and 10k test images of size 32 px × 32 px spanning 10 classes. As in the paper, we train for 350 epochs with a batch size of 256.

What was easy

The paper was well structured: it clearly motivated the need to reduce computation by executing convolutions conditionally in the spatial domain, and explained how the approach differs from existing designs. The proposed method was fairly easy and straightforward to integrate with existing architectures such as the Residual Networks (ResNets) used for the experiments in our reproducibility attempt. The references provided in the paper make it easier to reimplement, and the authors released their original implementation, which was useful for the various experiments.

What was difficult

We did not face any significant hurdles in our reproducibility attempt. The only bottleneck was the high compute requirement needed to reproduce all the experiments from the paper. The paper reports experiments on ImageNet, CIFAR, and MPII human pose estimation using deep architectures. The ImageNet dataset consists of 1.28 million training images and 50k validation images from 1000 classes, and MPII includes around 25k images containing over 40k people with annotated body joints; reproducing these would have required extensive compute clusters that were not available to us during the challenge.

Communication with original authors

We would like to thank the authors of the paper for their careful attention to detail and for providing concise code. We would also like to extend our gratitude to the authors for the ablation studies, which go beyond the main experiments of the paper; each component is explained clearly, which made it very convenient to understand its effect. Hence, we did not feel the need to seek any clarifications from the authors.

Introduction

In recent times, research on deep neural networks has focused heavily on improving accuracy without taking model complexity into account. From classification to detection and pose estimation, architectures grow in computational complexity and capacity. Modern convolutional neural networks apply the same operations to every pixel in an image. However, in most cases not all image regions are equally important. In many images, the subject we want to classify or detect is surrounded by background pixels, where the necessary features can be extracted using only a few operations; for example, flat regions such as a blue sky are easily identified. We call such images spatially sparse. To address this issue, the authors propose a method, trained end-to-end without explicit spatial supervision, that executes convolutional filters on important image locations only. For each residual block, a small gating network chooses the locations at which to apply dynamic convolutions (see figure). Gating decisions are trained end-to-end using the Gumbel-Softmax trick. These decisions progress throughout the network: the first stages extract features from complex regions of the image, while the last layers use higher-level information to focus on the region of interest only. A minimal sketch of such a gated residual block is given below.
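The following PyTorch snippet is our own illustration of the idea, not the authors' implementation: the module name `GatedBasicBlock`, the single 1×1 convolution used as the gating branch, and the temperature value are assumptions. The hard per-pixel decision is drawn with the straight-through Gumbel-Softmax (`torch.nn.functional.gumbel_softmax`), and for simplicity the mask is applied by multiplication rather than by actually skipping computation at masked positions, which is where the real speed-up in the paper comes from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedBasicBlock(nn.Module):
    """Residual block whose residual branch is gated per spatial position (illustrative sketch)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        # Lightweight gating branch: two logits ("skip" / "execute") per pixel.
        self.gate = nn.Conv2d(channels, 2, kernel_size=1)

    def forward(self, x):
        logits = self.gate(x)  # (N, 2, H, W)
        # Hard but differentiable decision via the straight-through Gumbel-Softmax.
        mask = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=1)[:, 1:2]  # (N, 1, H, W)
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Here the mask simply zeroes the residual update; the paper's implementation
        # instead gathers the selected positions so the convolutions are never
        # evaluated at masked-out locations.
        return F.relu(x + mask * out)
```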

Scope of Reproducibility

The scope of our reproducibility study is encapsulated in the following goals:
  1. Reproduce and validate the performance of the dynamic convolution model using the same training settings as those used in the paper.
  2. Verify whether the model achieves state-of-the-art results on classification tasks with ResNet, as claimed by the authors in the paper.

Methodology

For initial understanding and clarity, we investigated the original implementation provided by the authors at https://github.com/thomasverelst/dynconv. To get their code running, we installed the missing libraries at the versions they were using and made modifications to their codebase. We tried to use freely available compute resources such as Google Colaboratory for training the models but could not succeed, so we used the Google Cloud Platform (GCP). Moreover, we used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard, as sketched below.
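As a concrete example of the tracking changes, the snippet below shows roughly how we hooked Weights & Biases into the existing training loop. The project name, the config keys, and the `log_epoch` helper are our own illustrative choices, not part of the authors' codebase.

```python
import wandb

# Hypothetical run configuration mirroring the settings used in this report.
wandb.init(project="dynconv-cifar10",
           config={"epochs": 350, "batch_size": 256, "lr": 0.1, "budget": 0.5})

def log_epoch(epoch, train_loss, train_acc, val_loss, val_acc):
    """Send one epoch of metrics to the shared W&B dashboard."""
    wandb.log({"epoch": epoch,
               "train/loss": train_loss, "train/acc": train_acc,
               "val/loss": val_loss, "val/acc": val_acc})
```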

Datasets

In this reproduction we primarily used a single dataset: CIFAR-10. This dataset is publicly available and is a standard benchmark for convolutional neural networks. CIFAR-10 can be imported directly using TensorFlow or PyTorch (see the sketch below). Due to limited time, code, and compute resources, we were not able to reproduce the results on the ImageNet dataset.
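For reference, a minimal sketch of loading CIFAR-10 with PyTorch/torchvision is shown below, using the standard pad-and-crop plus horizontal-flip augmentation for 32 × 32 images. The normalization statistics are the commonly used CIFAR-10 mean and standard deviation, which we assume here rather than copy from the authors' code.

```python
import torch
import torchvision
import torchvision.transforms as T

mean, std = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)
train_tf = T.Compose([T.RandomCrop(32, padding=4),
                      T.RandomHorizontalFlip(),
                      T.ToTensor(),
                      T.Normalize(mean, std)])
test_tf = T.Compose([T.ToTensor(), T.Normalize(mean, std)])

train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
test_set = torchvision.datasets.CIFAR10("./data", train=False, download=True, transform=test_tf)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False, num_workers=4)
```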

Experimental Setup

For all the experiments, we trained the models on an NVIDIA Tesla P100 GPU with 16 GB RAM on the Google Cloud Platform. All experiments were conducted from our public reimplementation repository, available at repo link.

Results

We perform experiments with a vanilla ResNet-32 on the standard train/validation split of the CIFAR-10 dataset. We used the same hyperparameters and data augmentation as mentioned in the paper: an SGD optimizer with momentum 0.9, weight decay 5e-4, and a learning rate of 0.1 decayed by a factor of 0.1 at epochs 150 and 250, for a total of 350 epochs. For dynamic convolutions we used three different computational budgets: 0.15, 0.5, and 0.75. Each increase in the budget yields an increase in top-1 accuracy, as can be seen in the run set table. Moreover, with a budget of 0.75 the model also achieved the state-of-the-art ResNet results reported in the paper. Due to limited time, code, and compute resources, we were not able to reproduce the results on the ImageNet dataset, so we cannot validate the claims made in the paper for it. For transparency and fair comparison, we also provide the validation and training curves of the trained models in the plots shown below, along with graphs of the system statistics during training. A sketch of the training configuration is given after this paragraph.
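The snippet below is a minimal sketch of the optimizer and learning-rate schedule described above (SGD with momentum 0.9, weight decay 5e-4, learning rate 0.1 decayed by 0.1 at epochs 150 and 250 over 350 epochs). The `resnet18` model is only a stand-in for the ResNet-32 variants actually used, and the empty training-loop body is a placeholder for the existing code.

```python
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10)  # stand-in for the ResNet-32 used in the paper
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 250], gamma=0.1)

for epoch in range(350):
    # ... one pass over the CIFAR-10 training loader goes here ...
    scheduler.step()
```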