Gated Channel Transformation for Visual Recognition (CVPR 2020)

A Reproducibility Challenge report for the CVPR 2020 paper Gated Channel Transformation for Visual Recognition by Zongxin Yang, Linchao Zhu, Yu Wu, and Yi Yang.
Disha
Paper | GitHub Repository

Summary

Scope of Reproducibility

Plug-in attention modules have risen in popularity in computer vision since the release of Squeeze-and-Excitation Networks. This paper proposes a new channel gating mechanism called Gated Channel Transformation (GCT), which models the relationships among channels using explicitly defined, trainable control variables.

Methodology

We use the publicly available GCT code provided by the authors in their repository [1]. For our experiments, we first integrate the module into classification models, specifically the Residual Network (ResNet) family of architectures. We then replicate a small model zoo on the smaller-scale CIFAR-10 and CIFAR-100 image classification datasets using NVIDIA T4, P100, and K80 GPUs.

Results

Most of the results we obtained in our experiments support the paper's claim that GCT provides a significant performance improvement at low computational overhead. However, a few results fell short of that expectation; we discuss these in more detail in the following sections.

What was easy

The method proposed in the paper (GCT) was straightforward to integrate with existing architectures, such as the Residual Network (ResNet) family used for the experiments in our reproducibility attempt.
Additionally, unlike many other attention mechanisms, GCT did not impose significant memory overhead, which allowed us to extensively evaluate and validate the efficiency of the method across a wide array of experiments.

What was difficult

We didn't face any significant hurdles in our reproducibility attempt. The only bottleneck was the compute required to reproduce every experiment from the paper: the original work reports ImageNet classification results with deep architectures, which would have required large compute clusters that weren't available to us during the challenge.

Communication with original authors

We would first like to thank the authors for their attention to detail. Their repository provides concise code for reproducing the experiments in the paper (and for extending them beyond it), along with well-written explanations of each component of the method, which made it easy to understand the effect and role of every part of GCT. As a result, we didn't need any clarifications from the authors; however, we would welcome open discussion with them on future research that builds on the intuition behind this paper.

Introduction

Channel gating mechanisms have become a staple in computer vision, with many neural network architectures shipping by default with popular modules such as Squeeze-and-Excitation (SE) or Gather-Excite (GE) blocks.
However, these methods have their share of limitations, especially Squeeze-and-Excitation. SE employs two fully-connected (FC) layers to process channel-wise embeddings, which has two distinct shortcomings. First, because of the computational cost of these FC layers, the number of SE blocks that can be applied in a network is limited, and the FC dimensionality has to be reduced to balance the performance boost against the computational overhead. Second, the channel relationships learned by the convolutions and FC layers are inherently implicit, making the behavior of the neuron outputs hard to interpret.
Motivated by these drawbacks, the authors derive a new form of lightweight channel gating for efficient and accurate context modeling, called Gated Channel Transformation (GCT). It takes inspiration from normalization, which can create competition among channels while accelerating learning and smoothing gradients. GCT is made up of three components, which we discuss in more detail in the subsequent sections:
  1. Global Context Embedding
  2. Channel Normalization
  3. Gating Adaptation

Scope of Reproducibility

Since the paper proposes a new channel gating mechanism with an empirical and intuitive motivation, our reproduction focused primarily on evaluating the impact of the GCT module on standard computer vision tasks, from the perspective of robustness, consistency, and parameter efficiency.

Methodology

The authors provide a publicly accessible codebase implementing the experiments reported in the paper. For our evaluation, we adapted this code to be compatible with our image classification suite, which we use to test on the CIFAR-10 and CIFAR-100 datasets.

GCT

The schematic representation of the GCT module above clearly shows the three components mentioned in the prior sections: Global Context Embedding, Channel Normalization, and Gating Adaptation. We discuss each of these modules in greater detail in the subsequent sections.

Global Context Embedding

Global Context Embedding (GCE) is responsible for aggregating global contextual information in the channels of the input tensor. Unlike SE and GE, which use Global Average Pooling (GAP) for global information aggregation, GCT uses an l_p norm, and based on the authors' experiments specifically the l_2 norm (more details on this can be found in the original text).
The reason for avoiding GAP is that, if SE is used after Instance Normalization (IN) layers (an integral component of style transfer models), the output of GAP is the same constant for any input, since IN fixes the mean of each feature channel.
GCE uses trainable parameters \alpha_c to control the weight of each channel: if \alpha_c is close to 0 for a particular channel, that channel is of little significance and is effectively excluded from the channel normalization.
GCE can be represented by the following mathematical formulation:
s_c = \alpha_c ||x_c||_2 = \alpha_c\{[\sum^{H}_{i=1} \sum^{W}_{j=1}(x_c^{i,j})^{2}] + \epsilon \}^\frac{1}{2}
where \epsilon is a small constant that avoids the derivative being undefined at zero.
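As an illustration, here is a minimal standalone sketch of this embedding step for an input of shape (N, C, H, W); it is our own simplification, not the authors' code (their full module appears later in this section).

```python
import torch

def global_context_embedding(x, alpha, epsilon=1e-5):
    # x: features of shape (N, C, H, W); alpha: trainable weights of shape (1, C, 1, 1)
    # Per-channel l2 norm over the spatial dimensions, with epsilon inside the square root
    l2_norm = (x.pow(2).sum(dim=(2, 3), keepdim=True) + epsilon).sqrt()
    return alpha * l2_norm  # s_c, shape (N, C, 1, 1)
```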

Channel Normalization

Inspired by Local Response Normalization (LRN), the Channel Normalization in GCT also uses an l_2 norm, this time computed across the channels. This can be represented by the following mathematical formulation:
\hat{s}_c = \frac{\sqrt{C}s_c}{||s||_2} = \frac{\sqrt{C}s_c}{[(\sum_{c=1}^{C}s_c^{2})+\epsilon]^\frac{1}{2}}
The scalar \sqrt{C} normalizes the scale of \hat{s}_c, preventing it from becoming too small when C is large. More importantly, channel normalization has a much lower complexity, O(C), compared with the O(C^2) complexity of the FC layers used in SE.
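A minimal sketch of this step, assuming an embedding s of shape (N, C, 1, 1) produced as above (again our own simplification):

```python
import math
import torch

def channel_normalization(s, epsilon=1e-5):
    # l2 norm of the embedding across the channel dimension, one value per sample
    C = s.shape[1]
    l2_norm = (s.pow(2).sum(dim=1, keepdim=True) + epsilon).sqrt()
    return math.sqrt(C) * s / l2_norm  # \hat{s}_c, still (N, C, 1, 1)
```

Note that the authors' implementation takes the mean (rather than the sum) over channels, which absorbs the \sqrt{C} factor into the same expression, up to where \epsilon is applied.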

Gating Adaptation

Finally, the gating weights \gamma and biases \beta are applied on the input channels using the following formula:
\hat{x}_c = x_c[1 + \tanh(\gamma_c\hat{s}_c + \beta_c)]
The reason for using 1 + tanh in place of a sigmoid gate is that, with the latter, vanishing gradients can easily occur during training when the gate values are close to 0 or 1, since the derivative of the sigmoid is near 0 at those points. The 1 + tanh gate alleviates this problem and also gives GCT the ability to model an identity mapping, which makes training more stable.
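A quick numerical check of the identity-mapping property (our own illustration, following directly from the formula): with \gamma_c and \beta_c initialized to zero, as in the implementation below, the gate equals 1 + tanh(0) = 1, so GCT starts out as an identity mapping.

```python
import torch

gamma, beta = torch.zeros(1), torch.zeros(1)   # zero initialization, as in the module below
s_hat = torch.randn(8)                         # arbitrary normalized embeddings
gate = 1.0 + torch.tanh(gamma * s_hat + beta)
print(torch.allclose(gate, torch.ones_like(gate)))  # True: the input passes through unchanged
```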
Overall, GCT can be defined using the following code snippet in PyTorch:
```python
import sys

import torch
import torch.nn as nn


class GCT(nn.Module):
    def __init__(self, num_channels, epsilon=1e-5, mode='l2', after_relu=False):
        super(GCT, self).__init__()
        self.alpha = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.gamma = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.epsilon = epsilon
        self.mode = mode
        self.after_relu = after_relu

    def forward(self, x):
        if self.mode == 'l2':
            # Global context embedding: per-channel l2 norm scaled by alpha
            embedding = (x.pow(2).sum((2, 3), keepdim=True) + self.epsilon).pow(0.5) * self.alpha
            # Channel normalization, folded together with the gating weight gamma
            norm = self.gamma / (embedding.pow(2).mean(dim=1, keepdim=True) + self.epsilon).pow(0.5)
        elif self.mode == 'l1':
            if not self.after_relu:
                _x = torch.abs(x)
            else:
                _x = x
            embedding = _x.sum((2, 3), keepdim=True) * self.alpha
            norm = self.gamma / (torch.abs(embedding).mean(dim=1, keepdim=True) + self.epsilon)
        else:
            print('Unknown mode!')
            sys.exit()

        # Gating adaptation: 1 + tanh keeps the module an identity mapping at initialization
        gate = 1. + torch.tanh(embedding * norm + self.beta)
        return x * gate
```
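As a quick usage sketch (our own example, assuming the class above): GCT preserves the input shape, so it can simply be placed in front of any convolution.

```python
import torch
import torch.nn as nn

# Gate the input of a 3x3 convolution with GCT
block = nn.Sequential(GCT(num_channels=64),
                      nn.Conv2d(64, 64, kernel_size=3, padding=1))
x = torch.randn(2, 64, 32, 32)
print(block(x).shape)  # torch.Size([2, 64, 32, 32]) -- channels and spatial size unchanged
```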

Experimental Settings

Models

For all the experiments conducted on the CIFAR-10 and CIFAR-100 datasets, we used the Residual Network family of architectures, namely ResNet-18 and ResNet-34. Note that, due to memory constraints, we only used ResNet-18 for CIFAR-100. For our ablation study on hyper-parameter optimization using the Weights & Biases Sweeps feature, we used only the ResNet-18-based architecture on the CIFAR-10 dataset. A sketch of how GCT slots into these networks is shown below.
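The sketch below is our own simplified residual block with GCT placed in front of each convolution, following the before-convolution placement used in the authors' ResNet code; stride and downsampling are omitted for brevity, so this is illustrative rather than the exact block we trained.

```python
import torch.nn as nn

class GCTBasicBlock(nn.Module):
    # Simplified ResNet BasicBlock with a GCT module in front of each convolution
    def __init__(self, channels):
        super().__init__()
        self.gct1 = GCT(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.gct2 = GCT(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(self.gct1(x))))
        out = self.bn2(self.conv2(self.gct2(out)))
        return self.relu(out + x)  # residual connection
```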

Datasets and Training

We conduct all experiments on the CIFAR-10 and CIFAR-100 datasets. On both datasets, we trained every model for a maximum of 200 epochs, optimized by SGD with momentum using a multi-step learning rate schedule that decays by a factor of 10 at epochs 100 and 150. We apply a weight decay of 5e-4 and start training with a base learning rate of 0.1. For the data loaders, we used a batch size of 128 for all benchmark experiments.
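A minimal sketch of this training configuration in PyTorch; `model` and `train_loader` stand in for our actual ResNet + CIFAR pipeline, and the momentum value of 0.9 is our assumption, since only the learning rate, decay schedule, and weight decay are stated above.

```python
import torch
import torch.nn.functional as F

# `model` and `train_loader` are assumed to be defined elsewhere
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    for images, labels in train_loader:
        optimizer.zero_grad()
        F.cross_entropy(model(images), labels).backward()
        optimizer.step()
    scheduler.step()  # drops the learning rate by a factor of 10 at epochs 100 and 150
```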

Hyper-parameter Optimization

For the hyper-parameter optimization experiments using Sweeps, we used the following two parameter sets, as defined in the config files below:
  1. Sweeps 1:
```yaml
program: train_cifar.py
method: bayes
metric:
  name: loss
  goal: minimize
parameters:
  att:
    values: ["Vanilla", "Strip Pool", "GCT"]
  optimizer:
    values: ["adam", "sgd"]
  batch_size:
    values: [64, 128, 256]
```
  2. Sweeps 2:
```yaml
program: train_cifar.py
method: bayes
metric:
  name: loss
  goal: minimize
parameters:
  att:
    values: ["Vanilla", "Triplet", "GCT"]
  optimizer:
    values: ["adam", "sgd"]
  batch_size:
    values: [64, 128, 256]
```
For all runs using Sweeps, we used the ResNet-18 architecture on the CIFAR-10 dataset only; a sketch of how such a sweep can be driven programmatically is shown below.
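For reference, a hedged sketch of how a sweep like this can be launched from Python with the W&B API; `train()` is a hypothetical entry point that reads `wandb.config`, and the project name is ours (in our actual runs the config files above were used with train_cifar.py).

```python
import wandb

# Mirror of the Sweep 1 parameter space above (program key omitted for this function-based variant)
sweep_config = {
    "method": "bayes",
    "metric": {"name": "loss", "goal": "minimize"},
    "parameters": {
        "att": {"values": ["Vanilla", "Strip Pool", "GCT"]},
        "optimizer": {"values": ["adam", "sgd"]},
        "batch_size": {"values": [64, 128, 256]},
    },
}

def train():
    # Hypothetical entry point: each agent run receives one sampled configuration
    with wandb.init() as run:
        cfg = run.config
        # ... build the ResNet-18 + attention variant from cfg, train on CIFAR-10,
        # and log the loss so the Bayesian search can minimize it:
        run.log({"loss": 1.0})  # placeholder value

sweep_id = wandb.sweep(sweep_config, project="gct-reproducibility")  # hypothetical project name
wandb.agent(sweep_id, function=train, count=10)                      # run 10 trials
```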

Baselines

We compare our results using the same networks defined in the previous two sections, but equipped with several commonly used attention mechanisms: Triplet Attention, Efficient Channel Attention (ECA), and Gated Channel Transformation (GCT). We discuss the results in the next section.

Computational Requirements

We ran all our experiments on three different clusters equipped with NVIDIA P100, T4, and K80 GPUs. For each group of runs, we used parallel nodes for faster turnaround. Additionally, we utilized free compute resources such as Google Colaboratory for extensive small-scale experimentation. Each group of runs took on average 9 to 17 hours to complete.

Results

Benchmarks

As shown in the results below, on both CIFAR-10 and CIFAR-100 and across all the ResNet variants used, GCT provided competitive results but was not the top performer in any category.
While it was significantly cheaper and faster than the other attention mechanisms, GCT could not beat the state-of-the-art attention modules on our CIFAR-10 and CIFAR-100 benchmarks. This might be because CIFAR-10 and CIFAR-100 are significantly smaller and less complex tasks than the ones used in the paper to evaluate GCT. Every model was run 5 times, and the mean and standard deviation for each model are plotted in the panels below.

Sweeps

Sweep-1

For our first sweep, we compared GCT against the baseline model and Strip Pool (CVPR 2020). We did this to keep Strip Pool and Triplet Attention (WACV 2021), which are quite similar at an abstract level, in separate sweeps. GCT performed strongly and obtained a significantly lower loss of 0.5152 compared with the other runs when used with a batch size of 256 and the SGD optimizer.

Sweep-2

For our second sweep, we compared three variants: the baseline, Triplet Attention, and GCT. Surprisingly, unlike in our first sweep, GCT obtained the highest value in the parameter importance chart. The lowest loss obtained by GCT was 0.5514, coupled with the SGD optimizer and a batch size of 128.