Strip Pooling: Rethinking Spatial Pooling for Scene Parsing (CVPR 2020)

A Reproducibility Challenge 2020 report for the CVPR 2020 paper titled "Strip Pooling: Rethinking Spatial Pooling for Scene Parsing" by Qibin Hou, Li Zhang, Ming-Ming Cheng and Jiashi Feng.
Disha
Paper | GitHub Repository

Summary

Scope of Reproducibility

Building on recent progress in attention mechanisms for computer vision, the proposed Strip Pooling module constructs attention from orientation-specific spatial features, motivated by how spatial context is represented in images for the task of scene parsing.

Methodology

We use the publicly accessible code for Strip Pooling provided by the authors in their repository [1]. For our experiments, we first adapt the module for classification models, specifically the residual network (ResNet) family of architectures. We then replicate a small model zoo on the smaller-scale CIFAR-10 and CIFAR-100 datasets for image classification, training on NVIDIA T4, P100, and K80 GPUs.

Results

Our results show that Strip Pooling based attention (denoted as SPNets) improves performance over baseline models (those with no attention mechanism) on the CIFAR-10 classification task, but falls short by a significant margin on the CIFAR-100 classification task.
SPNets also did not outperform cheaper alternatives such as Efficient Channel Attention (ECA). Note that the paper does not compare against ECA, since ECA and SPNets were published at the same venue (CVPR 2020).
These results may not fully reflect the efficacy of the module, since it was designed specifically for scene parsing (i.e., a segmentation task). Nevertheless, our experiments open a new avenue of discussion around the proposed Strip Pooling mechanism.

What was easy

The paper was well structured and easy to read, with clear details about experiment settings, hyper-parameters, and the ablation study. The proposed Strip Pooling module was easily compatible with the family of networks we used for our experiments without requiring significant changes.

What was difficult

Although the authors provide a publicly accessible implementation of their paper in their GitHub repository [1], the file structure was quite convoluted, making it difficult to locate specific modules: many acronyms and short forms are repeated throughout the code with no in-line comments explaining their meaning, and the code was fairly complex to become fully comfortable with. Additionally, because of the extensive compute requirements of SPNets and the limited compute at our disposal, we were unable to reproduce the segmentation results presented in the paper.

Communication with original authors

Thanks to the clear details provided in the paper, we did not face any substantial hurdles or require further clarification, so we did not feel the need to contact the original authors. The concerns we had about the complexity of the codebase were addressed by issues reported by other users in the repository, which the authors had responded to.

Introduction

Attention mechanisms have proven to be an efficient add-on module for convolutional neural network architectures across various computer vision tasks. Since the inception of Squeeze-and-Excitation Networks (SENets), several new spatial attention mechanisms have been developed. This paper similarly proposes a new form of attention that aggregates information along each of the spatial dimensions separately, which it calls Strip Pooling.
Whereas conventional attention mechanisms aggregate spatial attention over an N \times N region, the paper proposes bilateral aggregation of spatial information over N \times 1 and 1 \times N strips, decomposing the spatial input along each dimension via strip pooling. This allows the module to model long-range dependencies along each of the two spatial dimensions.
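To make this concrete, the short sketch below (our own illustration; the tensor sizes are arbitrary) applies the two strip pooling operators to a feature map and prints the resulting strip shapes:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)                     # (batch, C, H, W) feature map

horizontal_pool = nn.AdaptiveAvgPool2d((None, 1))  # average over the width axis -> H x 1 strips
vertical_pool = nn.AdaptiveAvgPool2d((1, None))    # average over the height axis -> 1 x W strips

x_h = horizontal_pool(x)   # shape: (1, 64, 32, 1)
x_w = vertical_pool(x)     # shape: (1, 64, 1, 32)
print(x_h.shape, x_w.shape)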
As part of the ML Reproducibility Challenge, we implement Strip Pool in our image classification framework and investigate the effect of the module introduced in various standard backbones and compare it to other standard attention mechanisms.

Scope of Reproducibility

We centered our study on the empirical evaluation and validation of Strip Pooling, focusing on the following two questions:
  1. Do the results justify the computational overhead introduced by Strip Pool when used in classical ResNet backbones?
  2. How well does it compare to other standard attention mechanisms?

Methodology

The authors provide a publicly accessible codebase implementing the experiments conducted and reported in the paper. For our evaluation, we ensure that the code is compatible with the image classification suite we use to test on the CIFAR-10 and CIFAR-100 datasets.

Strip Pool

The diagram above shows the structure of the Strip Pooling (SP) module. For a given input tensor X \in \mathbb{R}^{C \times H \times W}, where C is the channel dimension and H \times W the spatial dimensions, the input is passed through two parallel strip pooling operators which output tensors X' \in \mathbb{R}^{C \times H \times 1} and X'' \in \mathbb{R}^{C \times 1 \times W}. Both tensors are then passed through individual 1D convolution layers and expanded back to the size of the original input tensor using bilinear interpolation. After expansion, the two resulting tensors are added element-wise and passed through a point-wise convolution layer followed by a sigmoid, which produces the attention weights. These weights are finally multiplied element-wise with the original input to give the final output.
The Strip Pooling (SP) module in PyTorch is shown below:
import torch
import torch.nn as nn
import torch.nn.functional as F


class SPBlock(nn.Module):
    """Strip Pooling (SP) attention module."""

    def __init__(self, inplanes, outplanes):
        super(SPBlock, self).__init__()
        midplanes = outplanes
        # Horizontal branch: 1D convolution over the (C, H, 1) strip
        self.conv1 = nn.Conv2d(inplanes, midplanes, kernel_size=(3, 1), padding=(1, 0), bias=False)
        self.bn1 = nn.BatchNorm2d(midplanes)
        # Vertical branch: 1D convolution over the (C, 1, W) strip
        self.conv2 = nn.Conv2d(inplanes, midplanes, kernel_size=(1, 3), padding=(0, 1), bias=False)
        self.bn2 = nn.BatchNorm2d(midplanes)
        # Point-wise convolution producing the attention map
        self.conv3 = nn.Conv2d(midplanes, outplanes, kernel_size=1, bias=True)
        # Strip pooling operators: average over the width / height axis respectively
        self.pool1 = nn.AdaptiveAvgPool2d((None, 1))
        self.pool2 = nn.AdaptiveAvgPool2d((1, None))
        self.relu = nn.ReLU(inplace=False)

    def forward(self, x):
        _, _, h, w = x.size()
        # Horizontal strip: pool to (C, H, 1), convolve, expand back to (C, H, W)
        x1 = self.pool1(x)
        x1 = self.conv1(x1)
        x1 = self.bn1(x1)
        x1 = F.interpolate(x1, (h, w))  # note: F.interpolate defaults to nearest-neighbor here
        # Vertical strip: pool to (C, 1, W), convolve, expand back to (C, H, W)
        x2 = self.pool2(x)
        x2 = self.conv2(x2)
        x2 = self.bn2(x2)
        x2 = F.interpolate(x2, (h, w))
        # Fuse the strips, compute attention weights, and re-weight the input
        x3 = self.relu(x1 + x2)
        x3 = x * torch.sigmoid(self.conv3(x3))
        return x3
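As a quick sanity check (the input sizes are our own choice), the module preserves the spatial and channel dimensions of its input:

block = SPBlock(inplanes=64, outplanes=64)
x = torch.randn(2, 64, 32, 32)
y = block(x)
print(y.shape)  # torch.Size([2, 64, 32, 32])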
Since the convolutions used in the module are not depth-wise, they add a significant number of extra parameters, causing a large memory overhead that proved to be a bottleneck in our experimental pipeline.
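To get a rough sense of this overhead, the sketch below (our own illustration, using the SPBlock defined above with channel widths typical of ResNet stages) counts the extra parameters added by a single SP module:

def count_parameters(module):
    # Total number of trainable parameters in a module
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

for channels in (64, 128, 256, 512):
    block = SPBlock(inplanes=channels, outplanes=channels)
    print(f"{channels} channels: {count_parameters(block):,} extra parameters")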

Experimental Settings

Models

For all experiments on the CIFAR-10 and CIFAR-100 datasets, we used the residual network family of architectures, namely ResNet-18 and ResNet-34. Note that, due to memory constraints, we only used ResNet-18 for CIFAR-100. For our ablation study on hyper-parameter optimization using the Weights & Biases Sweeps feature, we used only the ResNet-18 based architecture on the CIFAR-10 dataset.
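The paper does not prescribe where to place the module in a classification backbone, so the sketch below (our own wrapper; class and attribute names are hypothetical) shows one way an SP module can be attached to a ResNet basic block, applied to the residual branch before the skip connection:

import torch.nn as nn

class SPBasicBlock(nn.Module):
    # A ResNet basic block with an SP module on the residual branch (illustrative wrapper)
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.sp = SPBlock(planes, planes)   # strip pooling attention on the residual branch
        self.downsample = downsample

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.sp(out)                  # re-weight features before the skip connection
        return self.relu(out + identity)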

Datasets and Training

We conduct all experiments on the CIFAR-10 and CIFAR-100 datasets. On both datasets, we trained all our models for a maximum of 200 epochs, optimized by SGD with momentum using a multi-step learning rate policy that decays the learning rate by a factor of 10 at the 100th and 150th epochs. We apply a weight decay of 5e-4 and start training with a base learning rate of 0.1. For the dataloaders, we used a batch size of 128 in all benchmark experiments.
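In PyTorch terms, this schedule corresponds roughly to the following setup (a sketch only; the momentum value of 0.9 is an assumption, as it is not stated above, and model / train_one_epoch are placeholders):

import torch

optimizer = torch.optim.SGD(
    model.parameters(),      # model is any of the networks described above
    lr=0.1,                  # base learning rate
    momentum=0.9,            # assumed value, not stated in this report
    weight_decay=5e-4,
)
# Decay the learning rate by a factor of 10 at epochs 100 and 150
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    train_one_epoch(model, optimizer)  # placeholder for the actual training loop
    scheduler.step()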

Hyper-parameter Optimization

For the hyper-parameter optimization experiments using Sweeps, we used the parameters defined in the following config file:
program: train_cifar.py
method: bayes
metric:
  name: loss
  goal: minimize
parameters:
  att:
    values: ["Vanilla", "Triplet", "Strip Pool", "ECA", "GCT"]
  optimizer:
    values: ["adam", "sgd"]
  batch_size:
    values: [64, 128, 256]
For all runs using Sweeps, we used the ResNet-18 architecture on the CIFAR-10 dataset only.
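With the config saved to a file (e.g. sweep.yaml, a filename of our choosing), such a sweep is typically launched through the wandb Python API; a minimal sketch:

import yaml
import wandb

with open("sweep.yaml") as f:
    sweep_config = yaml.safe_load(f)

# The project name below is illustrative, not the one we actually used
sweep_id = wandb.sweep(sweep_config, project="strip-pooling-repro")
wandb.agent(sweep_id)  # runs the program specified in the config (train_cifar.py)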

Baselines

We compare our results against the same networks defined in the previous two sections, but equipped with several commonly used attention mechanisms: Triplet Attention, Efficient Channel Attention (ECA), and Gated Channel Transformation (GCT). We discuss the results in the next section.

Computational Requirements

We ran all our experiments on three different clusters equipped with NVIDIA P100, T4, and K80 GPUs. For each group of runs, we used parallel nodes for faster completion. We also utilized free compute resources such as Google Colaboratory for extensive small-scale experimentation. Each group of runs took on average 9 to 17 hours to complete.

Results

Benchmarks

For the CIFAR-10 experiments shown below, Strip Pooling performs fairly well across the board compared to the baseline models, achieving the highest accuracy with ResNet-18 and the second highest with ResNet-34. Both SP-ResNet-18 and SP-ResNet-34 obtain the second lowest loss on CIFAR-10, behind only the ECA models. In the CIFAR-100 classification task, however, Strip Pooling lags behind the vanilla baseline, recording the second lowest Top-1 accuracy and the second highest loss. Combined with these unsatisfactory results, the high memory consumption does not make the method a strong candidate as a general-purpose attention mechanism. However, as stated earlier, the method was primarily designed for the scene parsing segmentation task, so these results may not conclusively reflect the efficacy of the module. Each model was run 5 times, and the mean and standard deviation for each model are plotted in the panels below.

Sweeps

We further evaluated the efficiency and efficacy of Strip Pooling using the Sweeps feature in Weights & Biases. For this experiment, we used a ResNet-18 model trained for 50 epochs per run. As shown in the plot below, the Vanilla model (no attention) with a batch size of 256 and the Adam optimizer obtains the lowest loss. This adds to the questions around the efficacy of the module and calls for further extensive experimentation, which we plan to do in the future.