Reproducibility Summary

This report validates the reproducibility of the CVPR 2020 paper "ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks" by Wang et al. (2020). It covers each aspect of reproducing the results and claims put forth in the paper. The paper primarily presents a novel attention mechanism that can be plugged into standard convolutional neural networks (CNNs). Although the experiments were simple to reproduce, the paper had some notable flaws and a lack of clarity that proved to be a barrier throughout the reproduction process.

Scope of Reproducibility

The paper proposes a novel channel attention method named Efficient Channel Attention (ECA) that aims to achieve competitive performance at a minimal overhead cost. The central foundation of ECA is Local Cross Channel Interaction (Local CCI), which computes attention for a query channel with respect to the other channels in its local neighbourhood without any Dimensionality Reduction (DR), and which the paper demonstrates to be crucial to the attention mechanism's efficiency.

Methodology

For reproducing the paper, we initially used the original implementation provided by the authors. However, due to severe limitations and incompleteness of the original repository, we had to reimplement several parts of the paper and add more experiments for extra validation. We used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard. In terms of compute, we used Tesla T4, P4, and K80 GPUs provided by Google Colaboratory, along with a single Tesla V100 GPU on GCP.

Results

Due to high computational requirements, we were not able to replicate all of the experiments conducted and presented in the paper. However, for image classification tasks, we were able to obtain results that aligned with the efficacy of the mechanism proposed in the paper. Additionally, for object detection, we were able to reimplement the Mask RCNN model used in the paper, using the dependencies and hyper-parameters specified by the original authors. However, the results obtained did not match those provided in the paper and were 2-5% behind the reported values. We suspect this might be due to a high learning rate during the initial part of training, since the training setting was designed for an eight-GPU environment while we benchmarked on a single GPU. Overall, the results obtained from the reimplementation do support the claims on the efficiency of the attention mechanism made in the paper.

What was easy

Because of the simple design of the attention mechanism, it was fairly easy to implement it from scratch and validate its efficiency in small-scale experiments. The neural network architectures used in the paper were common baselines available in standard open source deep learning frameworks, making it easy to implement the whole structure. The hyper-parameters were clearly stated in the paper, which made it easy to replicate the runs.

What was difficult

The paper makes some incorrect claims, and the code provided by the authors contains discrepancies that contradict statements made in the paper. Additionally, the trained weights provided in the original repository did not match the keys required to run simple inference with the framework the authors reportedly used to obtain their results. Lastly, the authors did not provide training code for the object detection models used in their paper, which made replication difficult and required retraining from scratch.

Communication with original authors

The authors have so far been unresponsive to the issues opened on their repository regarding the incorrect claims made in the paper, along with requests to provide accurate training code to ensure reproducibility.

Introduction

This reproducibility submission is an effort to validate the CVPR 2020 paper by Wang et al. (2020), titled "ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks". This report evaluates the central claim of the paper, which proposes a novel high-performing and light-weight channel attention mechanism for deep neural networks used in the domain of computer vision. We provide strong evidence of contradictory claims made in the paper along with additional ablation results. We also provide a public dashboard to view all the reproduced results and experiments, including the codebase used to run those experiments.

Scope of Reproducibility

The paper proposes a novel light-weight channel attention mechanism called Efficient Channel Attention (ECA) that outperforms or matches other standard attention mechanisms used in computer vision tasks such as image classification, instance segmentation, and object detection. The paper takes inspiration from the structural design of the popular "Squeeze-and-Excitation Networks" (SENets) and makes subtle modifications to it in line with its central claim that Local Cross Channel Interaction should be prioritised within the attention framework.
The paper is built on the following central claims:

Dimensionality Reduction is not desirable

In most standard channel attention mechanisms such as Squeeze-and-Excitation Networks (SENets), the Convolutional Block Attention Module (CBAM), and Global Context Networks (GCNets), the channel attention weights are generated by passing the Global Average Pooled (GAP) tensor of shape C \times 1 \times 1 through a pair of fully connected layers with a bottleneck that reduces the number of channels by a reduction ratio (usually 16) relative to the original channel count. The paper defines this process as Dimensionality Reduction and states that it is an undesirable property of a channel attention mechanism. The paper conducts a series of ablation experiments to support this claim, which serves as the primary motivation for the next central claim.
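For reference, below is a minimal PyTorch sketch of an SE-style channel attention block with the bottleneck (dimensionality reduction) described above; the module name and the default reduction ratio of 16 follow the common SENet formulation rather than any specific implementation from the paper.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """SE-style channel attention with a bottleneck (dimensionality reduction) -- sketch."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # Global Average Pooling -> C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # reduce channels by ratio r
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # project back to C channels
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # rescale each channel by its attention weight
```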

Local Cross Channel Interaction is important and efficient

Based on results demonstrating that dimensionality reduction is not optimal, the authors state that cross channel interaction is crucial to the efficiency of a channel attention mechanism. Cross channel interaction is defined as the process of computing the attention weight for a query channel by taking all other channels in the tensor into consideration; this is called Global Cross Channel Interaction. Although optimal, it is extremely expensive, since the parametric complexity is of the order O(C^2). The authors propose an efficient version called Local Cross Channel Interaction (Local CCI), which computes the attention weight for a query channel by considering a dynamic set of channels in the vicinity of the query channel. The authors achieve this with a simple 1D convolution kernel whose weights are shared across all channels, reducing the number of parameters to the order of k, where k \ll C. This ensures that ECA is light-weight in terms of both the parameter overhead and the added floating point operations (FLOPs).
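Below is a minimal PyTorch sketch of Local CCI: a single shared 1D convolution applied over the globally average-pooled channel descriptor, with no dimensionality reduction. The kernel size k is fixed here for simplicity; the adaptive choice of k is covered in the next subsection.

```python
import torch
import torch.nn as nn

class ECAAttention(nn.Module):
    """Local cross-channel interaction via a shared 1D convolution (no dimensionality reduction) -- sketch."""
    def __init__(self, k_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        # One 1D conv whose k weights are shared across all channels -> O(k) parameters
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=(k_size - 1) // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        y = self.pool(x).view(b, 1, c)                   # treat channels as a 1D sequence
        y = self.sigmoid(self.conv(y)).view(b, c, 1, 1)  # attention from local channel neighbourhood
        return x * y                                     # per-channel rescaling
```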

Adaptive coverage of Local Cross Channel Interaction provides higher efficiency and robustness

Building on Local CCI, the authors propose a novel formulation for the coverage of the locality around the query channel, whereby the kernel size k of the 1D convolution is chosen adaptively based on the number of channels C, which can be mathematically defined as:
k = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{\textit{odd}}
where \gamma and b are two hyper-parameters set to 2 and 1 respectively for all experiments in the paper. The authors also compare the efficiency of this adaptive kernel size against manually tuned kernel sizes for different models on the ImageNet classification task.
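As an illustration, a small helper computing \psi(C) with \gamma = 2 and b = 1; interpreting |\cdot|_{odd} as taking the nearest odd value of the truncated result is an assumption based on the formula above.

```python
import math

def adaptive_kernel_size(channels, gamma=2, b=1):
    """Compute k = psi(C) = |log2(C)/gamma + b/gamma|_odd (sketch)."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1  # |.|_odd: bump even values to the next odd number

# e.g. adaptive_kernel_size(64) -> 3, adaptive_kernel_size(256) -> 5
```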

Methodology

For initial understanding and clarity, we investigated the original implementation provided by the authors at https://github.com/BangguWu/ECANet. To build the full experiment pipeline for the image classification experiments, we modified their codebase and added support for experiment tracking via Weights & Biases, along with a few additional open source packages such as Echo and PTFlops for ablation studies to verify the claims made in the paper.
We used freely available compute resources like Google Colaboratory for training models on small scale datasets for ablation studies. Along with Colab, we also used Google Cloud Platform (GCP) to run the object detection and instance segmentation models.

Model Descriptions

The paper conducted extensive experiments in the domain of computer vision, including image classification on ImageNet and object detection and instance segmentation on MS-COCO. For these experiments, the paper used standard deep convolutional neural network architectures such as ResNet-50, ResNet-101, ResNet-152, and MobileNet v2 for image classification. For the object detection experiments it used detectors such as Faster RCNN, Mask RCNN, and RetinaNet with the trained ResNet-50 and ResNet-101 backbones. For instance segmentation it used Mask RCNN with the previously trained ResNet-50 and ResNet-101 architectures. All the backbone architectures were modified to include the ECA module in the bottleneck block prior to training.
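To illustrate where the module sits in the backbone, here is a hedged sketch of a ResNet bottleneck block with the ECA module applied after the last convolution, before the residual addition; `ECAAttention` and `adaptive_kernel_size` refer to the sketches above, and the exact placement reflects our reading of the paper rather than the authors' released code.

```python
import torch.nn as nn

class ECABottleneck(nn.Module):
    """ResNet bottleneck block with an ECA module before the residual addition (sketch)."""
    expansion = 4

    def __init__(self, in_planes, planes, stride=1, downsample=None):
        super().__init__()
        out_planes = planes * self.expansion
        self.conv1 = nn.Conv2d(in_planes, planes, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, out_planes, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_planes)
        self.relu = nn.ReLU(inplace=True)
        self.eca = ECAAttention(adaptive_kernel_size(out_planes))  # channel attention on block output
        self.downsample = downsample

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out = self.eca(out)              # recalibrate channels before the skip connection
        return self.relu(out + identity)
```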

Datasets

The paper primarily used two datasets: ImageNet for image classification and MS-COCO for object detection and instance segmentation. Both datasets are publicly available and are used as standard benchmarks for convolutional neural networks; we accessed them from publicly available sources to run the experiments.
For small-scale experiments, we additionally used the CIFAR-10 dataset, which can be loaded directly through TensorFlow and PyTorch.

Hyper-parameters

Throughout all our experiments, we used the same hyper-parameters as in the paper. The authors clearly state all the hyper-parameters used to train the models in their experiments.

Experimental setup

For all the small-scale image classification experiments, we trained the models and conducted ablation studies using the free GPU resources provided by Google Colaboratory. The GPUs we used on Colab included an NVIDIA K80, an NVIDIA Tesla T4, and an NVIDIA P100.
For the large-scale object detection and instance segmentation experiments, we trained the model on an NVIDIA Tesla V100 on Google Cloud Platform.
All the experiments were conducted based on our public reimplementation repository available at https://github.com/digantamisra98/Reproducibilty-Challenge-ECANET.

Computational requirements

For the large-scale object detection and instance segmentation experiments, a single NVIDIA V100 GPU is capable of training the Mask RCNN-ECANet-50 model from the paper within 2 days and 11 hours. Additionally, this requires approximately 200 GB of storage for the MS-COCO dataset.
For the large-scale image classification experiments on ImageNet, 4 NVIDIA V100 GPUs would be capable of training the ResNet-based models in 4-5 days and the MobileNet-based models in 3-4 days.

Results

We reimplemented the Mask RCNN-ECANet-50 model used in the paper to assess the efficiency of the ECA module for object detection and instance segmentation on the MS-COCO dataset. We additionally conducted ablation experiments on the CIFAR-10 image classification task to validate the efficiency of ECA relative to other attention mechanisms.

Object detection and instance segmentation

We reproduced the Mask RCNN-ECANet-50 model for object detection and instance segmentation using the exact training settings provided in the paper. We used the ECANet-50 trained weights provided by the authors along with the open source MMDetection framework to train the Mask RCNN model in accordance with the information provided in the paper.

Reproduced Results

| Backbone | Detector | BBox_AP | BBox_AP<sub>50</sub> | BBox_AP<sub>75</sub> | BBox_AP<sub>S</sub> | BBox_AP<sub>M</sub> | BBox_AP<sub>L</sub> | Segm_AP | Segm_AP<sub>50</sub> | Segm_AP<sub>75</sub> | Segm_AP<sub>S</sub> | Segm_AP<sub>M</sub> | Segm_AP<sub>L</sub> |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ECANet-50 | Mask RCNN | 34.1 | 53.4 | 37.0 | 21.1 | 37.2 | 42.9 | 31.4 | 50.6 | 33.2 | 18.1 | 34.3 | 41.1 |

Original Results:

| Backbone | Detector | BBox_AP | BBox_AP<sub>50</sub> | BBox_AP<sub>75</sub> | BBox_AP<sub>S</sub> | BBox_AP<sub>M</sub> | BBox_AP<sub>L</sub> | Segm_AP | Segm_AP<sub>50</sub> | Segm_AP<sub>75</sub> | Segm_AP<sub>S</sub> | Segm_AP<sub>M</sub> | Segm_AP<sub>L</sub> |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ECANet-50 | Mask RCNN | 39.0 | 61.3 | 42.1 | 24.2 | 42.8 | 49.9 | 35.6 | 58.1 | 37.7 | 17.6 | 39.0 | 51.8 |

The results we obtained were far from those reported in the paper. They were even lower than the vanilla baseline and thus could not validate the claims made by the paper. We have provided the logs of the complete training run in our reimplementation repository.

CIFAR-10 Image classification

To validate on the CIFAR-10 dataset, we trained a ResNet-18 with various attention mechanisms, including Squeeze-and-Excitation, Triplet Attention, CBAM, and ECA. Each model was trained for 5 runs of 50 epochs each, with a batch size of 128 and a learning rate of 0.1, using the SGD optimiser.
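A minimal sketch of this training setup follows, assuming standard CIFAR-10 normalisation, SGD momentum of 0.9, and weight decay of 5e-4 (the latter two values are assumptions not stated above); the attention variants would be inserted into the ResNet-18 blocks as sketched earlier.

```python
import torch
import torchvision
import torchvision.transforms as T

# CIFAR-10 loader with standard normalisation (augmentation choices are an assumption)
transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)

# Plain ResNet-18; in practice the attention module is inserted into each block
model = torchvision.models.resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(50):  # 50 epochs per run, 5 runs per variant
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```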
The results demonstrate that ECANet obtained the second highest mean Top-1 Accuracy, after CBAM. Additionally, ECANet obtained the highest single-run Top-1 Accuracy among all variants. However, ECANet also had the highest standard deviation across runs. Although a 50-epoch CIFAR-10 classification run is not a strong indicator of performance, the results do suggest that ECANets obtain results competitive with other standard attention mechanisms while being much cheaper in parametric cost.

Hyper-parameter Sweeps

We also conducted hyper-parameter sweeps across various attention mechanisms to find the best hyper-parameters and to understand which attention mechanism is most robust. We swept over batch size, optimiser, and the choice of attention mechanism, with the goal of minimising loss. For the ResNet-20 model, ECA combined with SGD at batch sizes of 128 and 256 obtained the lowest loss in the 10-epoch runs. Based on these results, ECANets proved to be efficient and robust to hyper-parameter changes. Additionally, the parameter importance chart shows that the ECA attention mechanism had the highest importance, with the most negative correlation to loss compared to the other attention mechanisms, supporting the efficiency of ECA as reported in the paper.
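Below is a hedged sketch of how such a sweep can be configured with Weights & Biases; the parameter names, search method, and project name are illustrative rather than our exact sweep configuration.

```python
import wandb

# Hypothetical sweep over batch size, optimiser, and attention mechanism,
# minimising validation loss (parameter names are illustrative).
sweep_config = {
    "method": "random",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "batch_size": {"values": [128, 256]},
        "optimizer": {"values": ["sgd", "adam"]},
        "attention": {"values": ["none", "se", "cbam", "eca", "triplet"]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="eca-reproducibility")
# wandb.agent(sweep_id, function=train)  # `train` reads wandb.config and runs one trial
```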

Discussion

Based on the results we obtained, it is not clear if the claims made in the paper hold true. While the image classification results support and validate the efficiency of the novel attention mechanism ECA proposed in the paper, the object detection and instance segmentation results are much lower than reported in the paper. The additional hyper-parameter sweep experiment helped to verify the efficacy of the model in image classification tasks.
One critical point is the authors' failure to compare their results to "SRM: A Style-based Recalibration Module for Convolutional Neural Networks" by Lee et al. (2019), which proposes a very similar module and in fact generalises Local CCI. Moreover, SRM was published prior to the Wang et al. (2020) paper. The main difference between ECANets and SRM is that the latter uses a combination of GAP and Global Standard Deviation-based Pooling, while the former uses only GAP. Additionally, Lee et al. (2019) focus on SRMs for style manipulation in images, while Wang et al. (2020) propose ECANets as a general plug-in module for deep CNNs.
Due to both time and computational constraints, we were not able to reproduce all the large-scale experiments in the paper, which served as a roadblock in doing a complete assessment of the claims presented in the paper.

What was easy

Overall, the structure of the paper was very easy to follow. The paper required minimal mathematical background to work through its theoretical foundation. It also clearly stated all hyper-parameters used in the experiments and clearly defined every experimental setting, making them easier to reimplement. Additionally, the original implementation provided by the authors served as a helpful resource for reimplementing the ImageNet classification experiments.

What was difficult

The paper contains errors regarding its central claim on the coverage of Local Cross Channel Interaction. This issue has been described in detail and communicated to the authors in an open thread on their original repository, but has not been addressed yet.
The codebase provided by the authors used different hyper-parameters than the ones they claimed to use in their paper.
The authors did not provide the code for reproducing their object detection and instance segmentation experiments. Using the trained weights provided by the authors, simple inference produced faulty predictions. This has been detailed in an open issue on their original repository, but has not been addressed yet.

Communication with authors

We communicated our concerns and queries regarding the claims and experimental settings to the authors. However, to date, we have not received any response from the authors clarifying our concerns. Several other users have opened similar threads, which also have yet to be addressed.

Conclusion

Although there are major concerns surrounding the claims made by the authors of the paper, the experimental results do to an extent validate the efficiency of the proposed method and strongly suggest that ECANets can provide competitive results at a much lower computational overhead. However, we encourage the authors to provide more transparency and clarification regarding the claims and experimental settings in their paper.