Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets (CVPR 2020)

Submission to the Reproducibility Challenge 2021 for the paper "Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets" by Daniel Haase and Manuel Amthor.
Abhay Puri

Reproducibility Summary

This report evaluates the reproducibility of the CVPR 2020 paper "Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets" by Daniel Haase and Manuel Amthor. The paper analyzes depthwise separable convolutions (DSCs), the basic building block of many convolutional neural networks (CNNs), and proposes blueprint separable convolutions (BSConv) as an alternative. It reveals that DSC-based architectures such as MobileNets implicitly rely on cross-kernel correlations, whereas the BSConv formulation is based on intra-kernel correlations. An extension of BSConv further yields a regularization loss that improves the subspace transform implicitly performed by linear bottlenecks. In addition, replacing standard convolutions in existing architectures with BSConv improves both performance and model efficiency. The authors have released their code implementation, which can be found in this GitHub repository.

Scope of Reproducibility

The paper analyzes the correlations that emerge within the kernels of convolution layers trained through backpropagation. Based on this analysis, a new parameter-efficient version of the convolution layer is derived that better approximates the filters CNNs learn in practice. The main idea of BSConv is to exploit the observation that kernels of CNNs usually show high redundancies along their depth axis (intra-kernel correlations). Thus, BSConv represents each filter kernel using a single 2D blueprint which is distributed along the depth axis.

Methodology

The entire framework is provided by the authors in their GitHub repository. The paper describes the various models and datasets used in the evaluation of BSConv. We made changes to their codebase and added support for experiment tracking via the Weights & Biases API. For our experiments, we used an NVIDIA Tesla P100 GPU on the Google Cloud Platform (GCP). Moreover, we used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard.

Results

We were able to reproduce the main results reported in the paper using the GPUs mentioned above. As suggested in the paper, we used the CIFAR-100 dataset, which consists of 50k training and 10k test images of size 32 px \times 32 px and comprises 100 classes. As in the paper, we trained for 200 epochs on CIFAR-100 using SGD with momentum set to 0.9 and a weight decay of 10^{-4}. The initial learning rate is set to 0.1 and decayed by a factor of 0.1 at epochs 100, 150, and 180.
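For concreteness, the following is a minimal PyTorch sketch of this optimizer and learning-rate schedule; the model is a stand-in placeholder, and only the hyper-parameters are taken from the report.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Conv2d(3, 100, kernel_size=3)  # stand-in for the actual CNN

optimizer = optim.SGD(model.parameters(),
                      lr=0.1,             # initial learning rate
                      momentum=0.9,
                      weight_decay=1e-4)
# Decay the learning rate by a factor of 0.1 at epochs 100, 150, and 180.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer,
                                           milestones=[100, 150, 180],
                                           gamma=0.1)

for epoch in range(200):
    # ... one epoch of training on CIFAR-100 goes here ...
    scheduler.step()
```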

What was easy

The paper was understandable and quite fascinating to follow. Along with the theoretical concepts, the mathematical formulation made it straightforward to re-derive the method. The hyper-parameters and the references from which the concepts were derived were a major advantage, making reimplementation easier. In addition, the authors provided the original implementation, which was useful for several of the experiments.

What was difficult

The only bottleneck we faced was a lack of compute resources, which prevented us from experimenting with all of the MobileNet variants specified in the paper.

Communication with the Authors

We would like to thank the authors of the paper for their meticulous attention to detail in providing concise code. We would also like to extend our gratitude to the authors for the ablation studies, which go beyond the main experiments of the paper. They provided a well-written explanation of each component, which made its effect very easy to understand. Hence, we did not feel the need to seek any clarifications from the authors.

Introduction

Previously, improvements to CNNs were mainly driven by increasing model capacity while at the same time ensuring proper training behaviour. Recently, this has led to the development of models with half a billion parameters. In practical applications, however, computational capacity is often limited, especially in mobile and automotive contexts. Based on quantitative and qualitative analyses of trained CNNs, the authors propose blueprint separable convolutions (BSConv). The main idea behind BSConv is to exploit the fact that kernels of CNNs usually show high redundancies along their depth axis (intra-kernel correlations). Thus, BSConv represents each filter kernel using a single 2D blueprint which is distributed along the depth axis using a weight vector (see Figure). Moreover, the BSConv formulation directly implies an additional regularization loss that improves the subspace transform implicitly performed by linear bottlenecks.

Scope of Reproducibility

The paper analyzes the correlations that emerge within the kernels of convolution layers trained through backpropagation. Based on this analysis, a new parameter-efficient version of the convolution layer is derived that approximates the filters CNNs learn in practice. This can be stated mathematically as follows:
The authors define each filter kernel F^{(n)} \in R^{M \times K \times K} to be represented using a blueprint B^{(n)} \in R^{K \times K} and weights w_{n,1}, \ldots, w_{n,M} \in R via
F^{(n)}_{m,:,:} = w_{n,m} \cdot B^{(n)}
with m \in \{1, \ldots, M\} and n \in \{1, \ldots, N\}. However, in contrast to standard convolution layers, which have M \cdot N \cdot K^2 free parameters, the BSConv variant only has N \cdot K^2 parameters for the blueprints plus M \cdot N parameters for the weights.
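To make the savings concrete, consider an illustrative layer (values chosen by us, not taken from the paper) with M = 256 input channels, N = 512 filters, and kernel size K = 3: a standard convolution has M \cdot N \cdot K^2 = 256 \cdot 512 \cdot 9 = 1,179,648 parameters, whereas BSConv needs only N \cdot K^2 + M \cdot N = 4,608 + 131,072 = 135,680 parameters, a reduction of roughly 8.7\times.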
Besides this, the main focus points of the paper are:
  1. The variants derived from BSConv: BSConv-U (unconstrained) and BSConv-S (subspace).
  2. The relation of DSCs and linear inverted residual bottlenecks to the derived variants.

The variants derived from BSConv

The paper derives two variants. BSConv-U (unconstrained) implements the blueprint formulation directly as a pointwise (1 \times 1) convolution followed by a depthwise convolution, i.e., the reverse order of a standard DSC. BSConv-S (subspace) additionally factorizes the pointwise weight matrix into a low-rank subspace transform and regularizes its weight matrix towards orthonormality.
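The following is our own minimal PyTorch sketch of a BSConv-U layer for illustration; the class and argument names are ours, not those of the authors' bsconv package.

```python
import torch
import torch.nn as nn

class BSConvU(nn.Module):
    """Unconstrained BSConv: pointwise then depthwise convolution
    (the reverse order of a depthwise separable convolution)."""

    def __init__(self, in_channels, out_channels, kernel_size,
                 stride=1, padding=0):
        super().__init__()
        # Pointwise 1x1 convolution: learns the weight vectors w_{n,m}
        # that distribute each blueprint along the depth axis.
        self.pw = nn.Conv2d(in_channels, out_channels,
                            kernel_size=1, bias=False)
        # Depthwise KxK convolution: learns one 2D blueprint per
        # output channel.
        self.dw = nn.Conv2d(out_channels, out_channels, kernel_size,
                            stride=stride, padding=padding,
                            groups=out_channels, bias=False)

    def forward(self, x):
        return self.dw(self.pw(x))

# Example: drop-in replacement for a standard 3x3 convolution.
layer = BSConvU(256, 512, kernel_size=3, padding=1)
out = layer(torch.randn(1, 256, 32, 32))  # shape: (1, 512, 32, 32)
```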

The relation of DSC and linear inverted residual bottlenecks to the derived variants

The paper shows that DSCs (a depthwise convolution followed by a pointwise convolution) correspond to the reversed order of operations of BSConv-U and thus implicitly assume cross-kernel rather than intra-kernel correlations. Likewise, the linear inverted residual bottlenecks of MobileNetV2 can be interpreted as an approximation of the subspace transform of BSConv-S, which motivates the additional orthonormal regularization loss.

Methodology

For initial understanding and clarity, we investigated the original implementation provided by the authors on GitHub. We made modifications to their codebase and added support for experiment tracking via the Weights & Biases API. We used freely available compute resources such as Google Colaboratory for training models on small-scale datasets for the ablation studies. Along with Colab, we also used the Google Cloud Platform (GCP) to run the object detection and instance segmentation models.
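As a rough sketch, our tracking additions followed the standard Weights & Biases pattern shown below; the project name and config values are illustrative, and train_one_epoch() and evaluate() are hypothetical placeholder helpers.

```python
import wandb

# Start a tracked run; project name and config values are illustrative.
wandb.init(project="bsconv-reproducibility",
           config={"arch": "resnet110_bsconvu", "lr": 0.1, "epochs": 200})

for epoch in range(200):
    # train_one_epoch() and evaluate() are hypothetical helpers.
    train_loss, val_acc = train_one_epoch(), evaluate()
    wandb.log({"epoch": epoch, "train_loss": train_loss, "val_acc": val_acc})

wandb.finish()
```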

Datasets

The paper primarily uses a single dataset: CIFAR-100. This dataset is publicly available and is a standard benchmark for convolutional neural networks. CIFAR-100 can be loaded directly via TensorFlow or PyTorch (see the sketch below). Due to the lack of time, code, and computational resources, we were not able to reproduce the results for the MobileNet architectures.
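For reference, the dataset can be loaded with torchvision as follows. This is a minimal sketch; the augmentation and normalization statistics shown are the commonly used CIFAR-100 values, not necessarily those of the original codebase.

```python
import torchvision
import torchvision.transforms as transforms

# Standard CIFAR augmentation; the normalization statistics are the
# commonly used CIFAR-100 values, not necessarily those of the
# original codebase.
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5071, 0.4865, 0.4409),
                         (0.2673, 0.2564, 0.2762)),
])

train_set = torchvision.datasets.CIFAR100(
    root="./data", train=True, download=True, transform=transform_train)
test_set = torchvision.datasets.CIFAR100(
    root="./data", train=False, download=True,
    transform=transforms.ToTensor())
```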

Experimental setup

For all the experiments, we trained the models on an NVIDIA Tesla P100 with 16 GB of memory on the Google Cloud Platform. Moreover, we used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard.
All the experiments were conducted using our public reimplementation, available at repo link.

Results

We reproduced the results of BSConv specifically on the CIFAR-100 dataset. We consider PreResNets, ResNets, and WideResNets as three state-of-the-art model families for CIFAR-100, and we used both the BSConv-U and BSConv-S variants. The initial learning rate is set to 0.01. The run set table below provides the results of our runs. As shown, our reproduced ResNet-110 (BSConv-U) accuracy matches the value reported in the paper. Moreover, for the BSConv-S variants we added the regularization loss, weighted by a factor \alpha, to the classification loss, using the two values \alpha = 0.1 and \alpha = 0.25 (see the sketch below). Our reproduced results strongly support the reproducibility of the BSConv models. For transparency and fair comparison, we also provide the validation and training curves of the trained models in the plots shown below, along with graphs of the system statistics during training. Due to the lack of time, code, and computational resources, we were not able to reproduce the results for the MobileNet architectures, so we cannot validate the corresponding claims in the paper.
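To illustrate how such a regularizer can be combined with the classification loss, the following is our own simplified sketch of an orthonormality penalty weighted by \alpha; see the authors' code for the exact BSConv-S formulation.

```python
import torch

def ortho_reg(weight, alpha):
    """Our simplified orthonormality penalty: alpha * ||W W^T - I||_F^2,
    applied to the weight matrix of the subspace transform."""
    w = weight.flatten(1)                      # (out, in*k*k)
    gram = w @ w.t()
    eye = torch.eye(gram.size(0), device=w.device)
    return alpha * (gram - eye).pow(2).sum()   # squared Frobenius norm

# Usage inside the training loop (subspace_layer is hypothetical):
# total_loss = criterion(logits, targets) \
#     + ortho_reg(subspace_layer.weight, alpha=0.25)
```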