Dynamic Convolution: Attention over Convolution Kernels (CVPR 2020)

A reproduction of the CVPR 2020 paper 'Dynamic Convolution: Attention over Convolution Kernels' by Yinpeng Chen et al.
Sachin Malhotra

Reproducibility Summary

This is a report for the reproducibility challenge of the CVPR 2020 paper "Dynamic Convolution: Attention over Convolution Kernels" by Yinpeng Chen et al. It covers each aspect of reproducing the results and claims put forth in the paper. The paper proposes a novel operator design known as dynamic convolution, which increases model capacity without expanding the network depth or width. The authors have released their code implementation, which can be found in their Github repository. This report aims at replicating their reported results, and additionally uses the "wandb" library to track various hyper-parameters during the training and validation phases.

Scope of Reproducibility

This paper proposes a new operator design, known as dynamic convolution, which expands the representation capability of a network with only a minor increase in FLOPs. Instead of using a single convolution kernel per layer, dynamic convolution aggregates a set of K parallel convolution kernels with input-dependent attention, which makes the operator non-linear while remaining computationally efficient. The key insight is that, within a reasonable cost in model size, dynamic kernel aggregation provides an efficient way (low extra FLOPs) to boost representation capability.
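For reference, the aggregation described above can be written (following the paper's notation, where \pi_{k}(x) is the attention weight assigned to the k-th kernel for input x) as:

\tilde{W}(x) = \sum_{k=1}^{K} \pi_{k}(x)\,\tilde{W}_{k}, \qquad \tilde{b}(x) = \sum_{k=1}^{K} \pi_{k}(x)\,\tilde{b}_{k}, \qquad y = g\big(\tilde{W}(x)^{T}x + \tilde{b}(x)\big),

subject to 0 \le \pi_{k}(x) \le 1 and \sum_{k=1}^{K} \pi_{k}(x) = 1.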

Methodology

For reproducing the paper, we used the original implementation provided by the authors on their Github repository. The paper describes the models and datasets used for the implementation of DY-CNNs. We modified their code to add support for experiment tracking via the Weights & Biases API. For our experiments, we used an NVIDIA Tesla P100 GPU with 16 GB RAM on the Google Cloud Platform (GCP), and Weights & Biases as our default logging and tracking tool to store all results in a single dashboard. Due to limited time and resources, we reproduced the results mainly on the CIFAR-10 dataset.

Results

Most of the results we obtained in our experimentation support the paper's claim that DY-CNNs provide a significant performance improvement at a low computational overhead. Due to lack of time and resources, we could not explore the ablation studies and a few other secondary experiments reported in the paper.

What was easy

The structure of the paper was clear: it motivated the need for the DY-CNN design and explained how it differs from existing designs. The method proposed in the paper (DY-CNNs) was fairly easy and straightforward to integrate with existing architectures, such as the Residual Network (ResNet) family used in our reproducibility attempt. The references provided in the paper made it easier to reimplement, and the authors' original implementation was useful for several of the experiments.

What was difficult

We did not face any significant hurdles in our reproducibility attempt. The main bottleneck was the high compute requirement to reproduce all the experiments from the paper. The paper reports experiments on ImageNet classification and human pose estimation using deep architectures, which would have required extensive compute clusters that were not available to us during the challenge.

Communication with Original Authors

We did not have any communication with the original authors.

Introduction

Figure - A dynamic convolution layer.
In recent times, interest in building light-weight and efficient neural networks has accelerated, since on-device inference protects users' privacy by avoiding sending personal information to the cloud. However, when the computational constraint becomes extremely low, even state-of-the-art efficient CNNs (e.g. MobileNetV3) suffer significant performance degradation. This paper proposes a new operator design, dynamic convolution, to increase representation capability with negligible extra FLOPs. Dynamic convolution uses a set of K parallel convolution kernels \{\tilde{W}_{k}, \tilde{b}_{k}\} instead of a single convolution kernel per layer (see Figure 2). Dynamic convolutional neural networks (denoted DY-CNNs) are more difficult to train, as they require joint optimization of all convolution kernels and the attention across multiple layers. The authors found two keys for efficient joint optimization:
  1. constraining the attention output as \sum_{k}\pi_{k}(x) = 1 to facilitate the learning of the attention model \pi_{k}(x);
  2. flattening the attention (near-uniform) in early training epochs to smooth the learning of the convolution kernels.
The authors integrate these two keys simply by using softmax with a large temperature for the kernel attention.
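To make the mechanism concrete, below is a minimal PyTorch sketch of a dynamic convolution layer with K parallel kernels and a lightweight squeeze-and-excitation-style attention branch. This is our own illustrative simplification, not the authors' released implementation; the module and argument names (DynamicConv2d, temperature, etc.) are ours.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Illustrative dynamic convolution: attention-weighted sum of K kernels."""
    def __init__(self, in_ch, out_ch, kernel_size, K=4, temperature=30.0,
                 stride=1, padding=0):
        super().__init__()
        self.K, self.temperature = K, temperature
        self.stride, self.padding = stride, padding
        # K parallel convolution kernels and biases.
        self.weight = nn.Parameter(
            torch.randn(K, out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(K, out_ch))
        # Lightweight attention branch: global pool -> FC -> ReLU -> FC.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, max(in_ch // 4, 1)), nn.ReLU(inplace=True),
            nn.Linear(max(in_ch // 4, 1), K))

    def forward(self, x):
        B = x.size(0)
        # Softmax with a large temperature keeps the attention near-uniform early in training.
        pi = F.softmax(self.attn(x) / self.temperature, dim=1)      # (B, K)
        # Aggregate kernels per sample: W(x) = sum_k pi_k(x) * W_k.
        W = torch.einsum('bk,koihw->boihw', pi, self.weight)        # (B, O, I, kH, kW)
        b = torch.einsum('bk,ko->bo', pi, self.bias)                # (B, O)
        # Grouped-convolution trick to apply a different aggregated kernel to each sample.
        out = F.conv2d(x.reshape(1, -1, *x.shape[2:]),
                       W.reshape(-1, *W.shape[2:]),
                       b.reshape(-1),
                       stride=self.stride, padding=self.padding, groups=B)
        return out.reshape(B, -1, *out.shape[2:])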

Scope of reproducibility

Training deep DY-CNNs requires joint optimization of all convolution kernels and the attention model across several layers. The approaches that, as per the paper, improve the efficiency of this optimization are:
  1. Sum the Attention to One: The attention model is constrained so that 0 ≤ \pi_{k}(x) ≤ 1 and \sum_{k}\pi_{k}(x) = 1, which restricts the aggregated kernel to the convex hull of the K kernels. As illustrated in the paper (for K = 3, the kernel space is compressed from the region between two pyramids to a triangle), this normalization significantly simplifies the learning of the attention model.
  2. Near-uniform Attention in Early Training Epochs: Near-uniform attention allows more of the K kernels to be optimized simultaneously in early epochs. This is achieved by computing the attention with a softmax that uses a large temperature; a larger temperature flattens the attention and makes training of the dynamic convolution layers more efficient.
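A small numerical sketch of the temperature softmax that realizes both keys (the attention sums to one, and a large temperature keeps it near-uniform); the logits below are arbitrary illustrative values, not taken from the paper:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, -1.0, 0.0])   # hypothetical attention logits for K = 4 kernels

for tau in (1.0, 30.0):                        # tau = 1: peaked attention; large tau: near-uniform
    pi = F.softmax(logits / tau, dim=0)
    print(f"temperature {tau:>4}: {pi.tolist()}  sum = {pi.sum().item():.2f}")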

Methodology

For initial understanding and clarity, we investigated the original implementation provided by the authors on Github. To make their code run, we installed the missing libraries at the versions they were using and made modifications to their codebase. We tried to use freely available compute resources such as Google Colaboratory for training the models, but could not succeed, so we used the Google Cloud Platform (GCP) instead. Moreover, we used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard.
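As an illustration of the kind of tracking we added, here is a minimal sketch of how Weights & Biases logging can be wired into a training script. The project name, config keys, and dummy metrics are placeholders of ours, not part of the original codebase:

import wandb

# Offline mode avoids needing an account while testing; drop it to sync to the dashboard.
run = wandb.init(project="dynamic-convolution-repro", mode="offline",
                 config={"model": "dy_resnet18", "dataset": "cifar10",
                         "batch_size": 128, "lr": 0.01, "K": 4})

for epoch in range(3):                                      # stand-in for the real training loop
    train_loss, val_top1 = 1.0 / (epoch + 1), 80.0 + epoch  # dummy metrics for illustration
    wandb.log({"epoch": epoch, "train/loss": train_loss, "val/top1": val_top1})

run.finish()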

Datasets

In this reproduction we primarily used a single dataset: CIFAR-10. It is publicly available and is a standard benchmark for convolutional neural networks, and it can be loaded directly through TensorFlow or PyTorch. Due to the lack of time, code and computation resources, we were not able to reproduce the MobileNet results on ImageNet.
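For completeness, a minimal way to obtain CIFAR-10 in PyTorch; the normalization statistics and augmentations below are the commonly used CIFAR-10 defaults, not values specified by the paper:

import torch
import torchvision
import torchvision.transforms as T

# Standard CIFAR-10 preprocessing: random crop + flip for training, normalization for both splits.
normalize = T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
train_tf = T.Compose([T.RandomCrop(32, padding=4), T.RandomHorizontalFlip(), T.ToTensor(), normalize])
test_tf = T.Compose([T.ToTensor(), normalize])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=train_tf)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=test_tf)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False, num_workers=2)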

Experimental setup

For all the experiments, we trained the models on an NVIDIA Tesla P100 with 16 GB RAM on the Google Cloud Platform. All the experiments were conducted using our public reimplementation repository available at repo link.

Results

We evaluate dynamic convolution on a single architecture family (ResNet), using dynamic convolution for all convolution layers except the first layer; each layer has K = 4 convolution kernels. The batch size is 128 and the learning rate is 0.01. The table below summarizes the results of our runs. As shown, our reproduced Vanilla ResNet-18 accuracy is very close to that of Dynamic ResNet-18. However, the runtime of Vanilla ResNet-18 was 53 minutes, whereas Dynamic ResNet-18 required approximately 13 hours of training, and the parameter count also increased substantially for Dynamic ResNet-18 compared to Vanilla ResNet-18. The same holds for Vanilla ResNet-34 and Dynamic ResNet-34 (a sketch of the layer substitution follows the results table).
Model                Top-1 accuracy (%)
Vanilla ResNet-18    87.89
Dynamic ResNet-18    87.66
Vanilla ResNet-34    87.59
Dynamic ResNet-34    87.94
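The sketch below shows one way the substitution could be done on a torchvision ResNet-18, replacing every convolution except the stem with a dynamic variant. DynamicConv2d refers to the illustrative module sketched in the Introduction; this is our own simplification, not the authors' code.

import torch.nn as nn
import torchvision

def convert_to_dynamic(module, K=4, prefix=""):
    """Recursively replace Conv2d layers with the illustrative DynamicConv2d,
    keeping the first (stem) convolution as a regular convolution."""
    for name, child in module.named_children():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(child, nn.Conv2d) and full_name != "conv1":
            setattr(module, name, DynamicConv2d(
                child.in_channels, child.out_channels, child.kernel_size[0],
                K=K, stride=child.stride[0], padding=child.padding[0]))
        else:
            convert_to_dynamic(child, K=K, prefix=full_name)
    return module

# Hypothetical usage: a 10-class ResNet-18 with K = 4 kernels per dynamic layer.
dy_resnet18 = convert_to_dynamic(torchvision.models.resnet18(num_classes=10), K=4)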
Due to the lack of time, code and computation resources, we were not able to reproduce the results for Dynamic ResNet-50, so we cannot validate the paper's claims for that model. For transparency and fair comparison, we also provide the validation and training curves of the trained models in the plots shown below, along with graphs of the system statistics recorded during training.