Dynamic Group Convolution for Accelerating Convolutional Neural Networks (ECCV 2020)

A reproduction of the paper 'Dynamic Group Convolution for Accelerating Convolutional Neural Networks' by Zhuo Su et al., accepted to ECCV 2020.
Sachin Malhotra

Reproducibility Summary

This is a report for the reproducibility challenge of the ECCV 2020 paper "Dynamic Group Convolution for Accelerating Convolutional Neural Networks" by Zhuo Su et al. It covers each aspect of reproducing the results and claims put forth in the paper. The paper proposes dynamic group convolution (DGC), which adaptively selects, on the fly and for each individual sample, which input channels are connected within each group. DGC preserves the original network structure and has computational efficiency similar to conventional group convolution. The authors have released their code implementation on GitHub. This report aims to replicate their reported results, additionally using the wandb library to track various hyper-parameters and metrics during the training and validation phases.

Scope of Reproducibility

Deep convolutional neural networks (CNNs) have achieved significant success in a wide range of computer vision tasks, including object detection, image classification and semantic segmentation. Previous studies have found that deeper and wider networks obtain better performance, but they require large and complex models. Such models are very compute-intensive, making them impractical to deploy on edge devices (e.g. routers, routing switches) with limited computing resources. Existing group convolutions have two disadvantages:
  1. They weaken the representation capability of the normal convolution by introducing sparse neuron connections and suffer from degraded performance, especially on difficult samples.
  2. They use fixed neuron connection routines, regardless of the specific properties of individual inputs.
Motivated by the dynamic computation mechanism in dynamic networks, the authors propose dynamic group convolution (DGC).
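For reference, the short sketch below (written by us, not taken from the paper) shows a conventional group convolution in PyTorch: the assignment of input channels to groups is fixed at construction time and is identical for every input sample, which is exactly the rigidity DGC is designed to remove.

    import torch
    import torch.nn as nn

    # Conventional group convolution: with groups=4, each output group only
    # ever sees the same fixed quarter of the input channels, for every sample.
    conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3,
                     padding=1, groups=4, bias=False)

    x = torch.randn(8, 64, 32, 32)   # a batch of CIFAR-sized feature maps
    y = conv(x)                      # connection pattern is input-independent
    print(y.shape)                   # torch.Size([8, 64, 32, 32])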

Methodology

For reproducing the paper, we used the original implementation provided by the authors on their GitHub repository. The paper describes the various models and datasets used in the evaluation of DGC. We made changes to their code to add support for experiment tracking via the Weights & Biases API. For our experiments, we used an NVIDIA Tesla P100 GPU with 16 GB of memory on the Google Cloud Platform (GCP). We used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard. Due to limited time and resources, we reproduced the results mainly on the CIFAR-10 and CIFAR-100 datasets.

Results

We were able to reproduce the main results reported in the paper. As suggested in the paper, we use both the CIFAR-10 and CIFAR-100 datasets. Each of the two CIFAR datasets includes 50k training and 10k test images of size 32 px × 32 px, with 10 classes for CIFAR-10 and 100 classes for CIFAR-100. As in the paper, we train for 300 epochs on both datasets.

What was easy

The paper is well structured: it clearly motivates the need for DGC and explains how it differs from existing designs. The method proposed in the paper (DGC) was fairly easy and straightforward to integrate with the existing architectures used in our reproducibility attempt, such as ResNets, CondenseNet and MobileNetV2. The paper provides the references needed to make reimplementation easier, and the authors' original implementation was useful for the various experiments.

What was difficult

We did not face any significant hurdles in our reproducibility attempt. The only bottleneck was the high compute requirement to reproduce all the experiments from the paper. The paper reports experiments on ImageNet, CIFAR-10 and CIFAR-100 classification using deep architectures. The ImageNet dataset consists of 1.28 million training images and 50k validation images from 1000 classes, which would have required extensive compute clusters that were not available to us during the challenge.

Communication with original authors

We would like to thank the authors of the paper for their careful attention to detail in providing concise code. We would also like to extend our gratitude to the authors for the ablation studies, which go beyond the core experiments of the paper. They provide a well-written explanation of each component, which made it easy to understand its effect. Hence, we did not feel the need to request any clarifications from the authors.

Introduction

Figure: Overview of a DGC layer
Dynamic group convolution (DGC) adaptively selects, on the fly and for each individual sample, which input channels are connected within each group. Specifically, the authors equip each group with a small feature selector that automatically selects the most important input channels conditioned on the input image. Multiple groups can thus adaptively capture abundant and complementary visual/semantic features for each input image. DGC preserves the original network structure while retaining computational efficiency similar to conventional group convolution. Extensive experiments on multiple image classification benchmarks, including CIFAR-10, CIFAR-100 and ImageNet, demonstrate its superiority over existing group convolution techniques and dynamic execution methods.
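To make the mechanism concrete, below is a minimal PyTorch sketch of the idea, written by us for illustration; it is not the authors' implementation. Each head uses a small selector (global average pooling followed by a linear layer) to score the input channels per sample and keeps only the top-scoring ones. For simplicity the sketch masks the unselected channels rather than gathering only the selected ones, and it omits the differentiable gating the paper uses to train the selector end-to-end.

    import torch
    import torch.nn as nn

    class DGCSketch(nn.Module):
        """Rough sketch of a dynamic-group-convolution-style layer (not the authors' code).

        Each head owns a tiny selector (global pooling + linear) that scores the
        input channels per sample and keeps the top-k. Here "keeping" is emulated
        by zeroing the unselected channels; the real implementation gathers the
        selected channels so that the saved computation is actually realized.
        """
        def __init__(self, in_ch, out_ch, heads=4, keep_ratio=0.5):
            super().__init__()
            assert out_ch % heads == 0
            self.k = max(1, int(in_ch * keep_ratio))      # channels kept per head
            self.selectors = nn.ModuleList(
                [nn.Linear(in_ch, in_ch) for _ in range(heads)])
            self.convs = nn.ModuleList(
                [nn.Conv2d(in_ch, out_ch // heads, 3, padding=1, bias=False)
                 for _ in range(heads)])

        def forward(self, x):
            b, c, _, _ = x.shape
            summary = x.mean(dim=(2, 3))                   # (b, c) global context
            outputs = []
            for sel, conv in zip(self.selectors, self.convs):
                scores = sel(summary)                      # per-sample channel saliency
                topk = scores.topk(self.k, dim=1).indices  # indices of kept channels
                mask = torch.zeros(b, c, device=x.device).scatter_(1, topk, 1.0)
                outputs.append(conv(x * mask[:, :, None, None]))
            return torch.cat(outputs, dim=1)               # concatenate head outputs

    # Example: a CIFAR-sized feature map through a 4-head sketch layer
    layer = DGCSketch(in_ch=64, out_ch=64, heads=4, keep_ratio=0.5)
    out = layer(torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])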

Scope of Reproducibility

Dynamic group convolution (DGC) adaptively selects the most relevant input channels for each group while keeping the full structure of the original network. Specifically, the authors introduce a tiny auxiliary feature selector for each group that dynamically decides which part of the input channels to connect, based on the activations of all input channels.
The scope of our reproducibility effort is encapsulated in the following goals:
  1. Reproduce and validate the performance of the DGC model using the same training settings as those used in the paper.
  2. Verify whether the proposed DGC can be easily optimized with existing networks in an end-to-end manner.

Methodology

For initial understanding and clarity, we investigated the original implementation provided by the authors at https://github.com/zhuogege1943/dgc. To get their code running, we installed the missing libraries at the versions they were using and made modifications to their codebase. We first tried freely available compute resources such as Google Colaboratory to train the models but could not succeed, so we used the Google Cloud Platform (GCP) instead. Moreover, we used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard.
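The Weights & Biases integration amounts to a few calls added around the existing training loop. The sketch below shows the pattern we followed; the project name, config keys, and the train_one_epoch/validate helpers are illustrative placeholders, not names from the authors' code.

    import wandb

    def train_one_epoch():
        """Placeholder for the existing training loop (returns loss, top-1 acc)."""
        return 0.0, 0.0

    def validate():
        """Placeholder for the existing validation loop (returns top-1, top-5 acc)."""
        return 0.0, 0.0

    # Illustrative project/config names; the real config mirrors the training scripts.
    wandb.init(project="dgc-reproduction",
               config={"dataset": "cifar10", "epochs": 300, "batch_size": 64})

    for epoch in range(wandb.config.epochs):
        loss, train_top1 = train_one_epoch()
        val_top1, val_top5 = validate()
        wandb.log({"epoch": epoch, "train/loss": loss, "train/top1": train_top1,
                   "val/top1": val_top1, "val/top5": val_top5})

    wandb.finish()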

Datasets

In this work we primarily used two datasets: CIFAR-10 and CIFAR-100. These datasets are publicly available and serve as standard benchmarks for convolutional neural networks. Both can be loaded directly through TensorFlow or PyTorch.
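For completeness, a minimal PyTorch/torchvision loading setup is sketched below. The augmentation and normalization constants are the commonly used CIFAR values and are an assumption on our part; the authors' scripts should be consulted for the exact pipeline.

    import torch
    from torchvision import datasets, transforms

    # Standard CIFAR augmentation/normalization (assumed, not copied from the authors' code).
    normalize = transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                                     std=[0.2470, 0.2435, 0.2616])
    train_tf = transforms.Compose([transforms.RandomCrop(32, padding=4),
                                   transforms.RandomHorizontalFlip(),
                                   transforms.ToTensor(), normalize])
    test_tf = transforms.Compose([transforms.ToTensor(), normalize])

    # CIFAR-100 is loaded the same way via datasets.CIFAR100.
    train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                                 transform=train_tf)
    test_set = datasets.CIFAR10(root="./data", train=False, download=True,
                                transform=test_tf)

    train_loader = torch.utils.data.DataLoader(train_set, batch_size=64,
                                               shuffle=True, num_workers=4)
    test_loader = torch.utils.data.DataLoader(test_set, batch_size=64,
                                              shuffle=False, num_workers=4)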

Experimental Setup

For all the experiments, we trained the models on an NVIDIA Tesla P100 GPU with 16 GB of memory on the Google Cloud Platform. All experiments were conducted using our public reimplementation repository available at repo link.

Results

We evaluate DGC on a single architecture (ResNet-18) as the baseline model, replacing the two 3 × 3 convolution layers in each residual block with the proposed DGC. All models are optimized using stochastic gradient descent (SGD) with Nesterov momentum, with a momentum weight of 0.9 and a weight decay of 10^{-4}. The mini-batch size is set to 64 for the CIFAR datasets. By default we use 4 groups (heads) in each DGC layer. We ran dydensenet on both the CIFAR-10 and CIFAR-100 datasets. The validation Top-1 accuracy is 95.13% on CIFAR-10 and 76.15% on CIFAR-100, and the Top-5 accuracy is approximately 100% on CIFAR-10 and 93.5% on CIFAR-100. Due to the lack of time and computational resources, we were not able to reproduce the results for the ImageNet dataset, so we cannot validate the claims made in the paper for it. For transparency and fair comparison, we also provide the validation and training curves of the trained models in the plots shown below. Additionally, we provide graphs of the system statistics recorded during training.
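The optimizer setup corresponding to the hyper-parameters stated above looks roughly as follows. The initial learning rate and the cosine schedule are assumptions on our part, not values quoted from the paper, and the plain ResNet-18 stands in for the DGC-modified model.

    import torch
    from torchvision.models import resnet18

    # Stand-in backbone; in the actual experiments the two 3x3 convolutions in
    # each residual block are replaced by DGC layers with 4 heads.
    model = resnet18(num_classes=10)

    # SGD with Nesterov momentum 0.9 and weight decay 1e-4, as stated above.
    # The initial learning rate (0.1) and the cosine schedule over 300 epochs
    # are our assumptions.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                                weight_decay=1e-4, nesterov=True)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)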