Funnel Activation for Visual Recognition (CVPR 2020)

Reproducibility Challenge 2020 report for the CVPR 2020 paper titled 'Funnel Activation for Visual Recognition' by Ningning Ma, Xiangyu Zhang, and Jian Sun.
Disha
Paper | GitHub Repository

Summary

Scope of Reproducibility

Expanding the family of rectified activations, the proposed Funnel Activation demonstrates a novel way of constructing activation functions: it preserves more strongly activated features by feeding a bilevel input into a simple max function, improving expressivity at a small computational overhead.

Methodology

We used the open-source implementation released by the authors of Funnel Activation [1], first translating it from MegEngine to PyTorch, our framework of choice for all experiments. We replicated the model zoo used in the paper on the smaller-scale CIFAR-10 and CIFAR-100 datasets for the task of image classification, running on NVIDIA T4, P100, and K80 GPUs.

Results

We successfully reproduced the efficacy of the proposed activation function, Funnel, using different variants of the Residual Network (ResNet) family of architectures on the image classification task on the CIFAR-10 and CIFAR-100 datasets. In our experiments, Funnel consistently showed significant improvement over the other activation functions considered in the paper. The results support the central claim that introducing 2D spatial-condition-based pixel modeling into the activation function significantly improves performance.
Furthermore, we conducted hyper-parameter optimization as part of our ablation study, where we observed that the stability of Funnel across various hyper-parameter settings is consistent with the results and intuition reported in the paper.

What was Easy

The authors provided well-documented and concise supporting code for all the experiments in their paper, which made it easy to verify the correctness of our PyTorch reimplementation. The paper also accurately specified all experimental settings and hence posed no difficulties for our reimplementation on the CIFAR-10 and CIFAR-100 tasks.

What was Difficult

We did not face many hurdles in reproducing most of the baselines used to test the efficacy of the proposed activation. However, due to computational constraints, we were not able to evaluate the method on larger datasets like ImageNet.

Communication with Original Authors

Thanks to the clear details provided in the paper, we did not face any substantial hurdles or require further clarification, and hence did not feel the need to contact the original authors. We did, however, have meaningful discussions about the intuition behind the proposed Funnel activation with research colleagues and experts working in the same domain.

Introduction

Activation functions have recently seen an upsurge in research, as growing evidence points towards smoother activation functions being significantly superior to the longstanding default ReLU, which has the distinct disadvantage of thresholding negative pre-activations and thus limiting information propagation.
However, the reason ReLU has prevailed since the advent of deep neural networks is because of its simplicity in terms of computation complexity and its versatility on different tasks.
The authors of this work propose a new addition to the ReLU activation family: Funnel Activation. Funnel Activation is a simple yet efficient spatial-modeling-based activation function: at its core it retains the ReLU-style formulation max(x, ·) while replacing the second argument (0 in the case of ReLU) with a 2D spatial condition function \mathbb{T}(x), where x is the input to the activation function, i.e., the pre-activation. By introducing this spatial condition, Funnel is able to significantly improve the generalization of deep neural networks.
As part of the ML Reproducibility Challenge, we reimplemented Funnel in PyTorch, investigated the efficacy and benefits of the spatial condition that is central to the Funnel activation, and compared it to other standard, well-known activation functions.

Scope of Reproducibility

In our attempt to reproduce the results presented in the paper, we focus on the following target questions:
  1. Is pixel-wise spatial modeling useful as a condition when constructing the ReLU activation function?
  2. Does Funnel outperform more complex activation functions which are smoother alternatives of ReLU aimed at improving information propagation at depth?

Methodology

The authors provide publicly accessible code for Funnel activation based on the MegEngine framework. For ease of reproducibility, we reimplemented the code in PyTorch and further integrated the experiment tracking features provided by Weights & Biases for extensive evaluation and transparency of experiments.
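As a minimal sketch of what that integration looks like (the project name, config keys, and metric values below are purely illustrative, not our exact setup), each run initializes a Weights & Biases run and logs its metrics every epoch:

import wandb

# Illustrative project and config names; in the real runs the logged values
# come from the training and validation loops.
run = wandb.init(project="frelu-cifar-repro",
                 config={"arch": "resnet18", "act": "Funnel", "dataset": "CIFAR-10"})

for epoch in range(200):
    train_loss, val_acc = 0.0, 0.0  # placeholders for values computed during training
    wandb.log({"epoch": epoch, "train/loss": train_loss, "val/acc": val_acc})

run.finish()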

Funnel Activation

Figure: Comparison between ReLU, PReLU, and FReLU.
The diagram shown above paints a clear picture of the structure of Funnel Activation (FReLU) compared to its predecessors, ReLU and PReLU. Before discussing the formulation of FReLU, it is worth recalling the formulas of ReLU and PReLU. Arguably the simplest of the lot, ReLU returns the input value x for x \geq 0 and 0 for x < 0; in short, ReLU can be written as max(x, 0). PReLU is a simple extension of ReLU, defined as max(x, px), where p is a learnable parameter. Similarly, FReLU extends this formulation to max(x, \mathbb{T}(x)), where \mathbb{T}(x) is a 2D parametric spatial function. The intuition behind FReLU is that, when computing the activation for a pixel, instead of comparing the target pixel to 0 (as in ReLU) or to a parametrically modulated value of the same pixel (as in PReLU), it is more beneficial to compare it with parametrically aggregated information from the spatial neighbourhood centered at that pixel.
We define FReLU in code using the PyTorch code as shown below:
import torch
import torch.nn as nn


class FReLU(nn.Module):
    r"""FReLU formulation. The funnel condition has a window size of k x k (k = 3 by default)."""

    def __init__(self, in_channels):
        super().__init__()
        # Depthwise 3x3 convolution implementing the spatial condition T(x).
        self.conv_frelu = nn.Conv2d(in_channels, in_channels, 3, 1, 1, groups=in_channels)
        self.bn_frelu = nn.BatchNorm2d(in_channels)

    def forward(self, x):
        x1 = self.conv_frelu(x)
        x1 = self.bn_frelu(x1)
        # Funnel condition: max(x, T(x)).
        x = torch.max(x, x1)
        return x
As shown in the code snippet, FReLU essentially adds an extra convolution and batch-normalization layer and uses its output as the condition against which the input from the previous layer is compared. Compared to other activations, which are mostly non-parametric or involve a single parameter, FReLU adds a noticeable computational overhead, but it remains relatively small because the funnel condition uses depthwise convolution kernels.
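To make that overhead concrete, here is a quick sketch (ours, not from the paper) counting the extra learnable parameters the FReLU module above introduces for a given channel width; a k x k depthwise convolution plus BatchNorm contributes on the order of k·k·C parameters.

import torch.nn as nn

def frelu_extra_params(in_channels: int, k: int = 3) -> int:
    """Count the learnable parameters FReLU adds on top of a plain ReLU."""
    conv = nn.Conv2d(in_channels, in_channels, k, 1, k // 2, groups=in_channels)
    bn = nn.BatchNorm2d(in_channels)
    return sum(p.numel() for p in conv.parameters()) + sum(p.numel() for p in bn.parameters())

# Example: a 256-channel feature map with the default 3x3 funnel window.
# Depthwise conv: 3*3*256 weights + 256 biases; BatchNorm: 256 weights + 256 biases.
print(frelu_extra_params(256))  # 3072 -- tiny next to the ~590k weights of a full 3x3 256->256 conv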

Experimental Settings

Models

For all the experiments conducted on the CIFAR-10 and CIFAR-100 datasets, we used the Residual Network family of architectures, namely ResNet-18, ResNet-34, and ResNet-50, with the activation under test swapped in (see the sketch below). For our ablation study on hyper-parameter optimization using the Weights & Biases Sweeps feature, we used only the ResNet-18 architecture on the CIFAR-10 dataset.
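As a rough illustration of how such activation-swapped variants can be built (assuming the FReLU class defined earlier is in scope; this is a sketch, not necessarily the exact construction we used), one can replace the nn.ReLU modules of a torchvision ResNet-18:

import torch.nn as nn
from torchvision.models import resnet18

def swap_relu_for_frelu(model: nn.Module) -> nn.Module:
    """Replace each nn.ReLU with an FReLU sized by the preceding BatchNorm2d.

    This heuristic matches the BasicBlock layout of ResNet-18/34. Bottleneck
    blocks (ResNet-50) reuse a single ReLU at several channel widths, so they
    need a separate FReLU instance per call site instead.
    """
    for parent in model.modules():
        last_channels = None
        for name, child in parent.named_children():
            if isinstance(child, nn.BatchNorm2d):
                last_channels = child.num_features
            elif isinstance(child, nn.ReLU) and last_channels is not None:
                setattr(parent, name, FReLU(last_channels))
    return model

model = swap_relu_for_frelu(resnet18(num_classes=10))  # 10 classes for CIFAR-10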

Datasets and Training

We conducted all experiments on the CIFAR-10 and CIFAR-100 datasets. On both datasets, we trained all our models for a maximum of 200 epochs, optimized by SGD with momentum, using a multi-step learning-rate schedule that decays the learning rate by a factor of 10 at the 100th and 150th epochs. We applied a weight decay of 5e-4 and started training with a base learning rate of 0.1. For the dataloaders, we used a batch size of 128 for all benchmark experiments.
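For reference, these settings translate roughly into the PyTorch setup below; the momentum value (0.9) and the crop/flip augmentation with standard normalization statistics are common choices that we assume here rather than settings stated in the paper.

import torch
import torchvision
import torchvision.transforms as T

# Standard CIFAR-10 augmentation and normalization (assumed, not prescribed by the paper).
transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)

# Any of the activation-swapped ResNets from the previous snippet can be plugged in here.
model = torchvision.models.resnet18(num_classes=10)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Decay the learning rate by 10x at epochs 100 and 150; training runs for 200 epochs in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)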

Hyperparameter Optimization

For the hyperparameter optimization experiments using Sweeps, we used the parameters defined in the config file below:
program: train_cifar.py
method: bayes
metric:
  name: loss
  goal: minimize
parameters:
  act:
    values: ["ReLU", "Mish", "Swish", "Funnel", "DyReLUA", "DyReLUB"]
  optimizer:
    values: ["adam", "sgd"]
  batch_size:
    values: [64, 128, 256]
For all the runs using Sweeps we used the ResNet-18 architecture on the CIFAR-10 dataset only.
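For completeness, the same sweep can also be defined and launched through the wandb Python API; the sketch below mirrors the YAML config above (the project name and the body of train() are placeholders for our actual training script, train_cifar.py).

import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "loss", "goal": "minimize"},
    "parameters": {
        "act": {"values": ["ReLU", "Mish", "Swish", "Funnel", "DyReLUA", "DyReLUB"]},
        "optimizer": {"values": ["adam", "sgd"]},
        "batch_size": {"values": [64, 128, 256]},
    },
}

def train():
    run = wandb.init()
    cfg = run.config  # act, optimizer, and batch_size are filled in by the sweep controller
    # ... build the model/dataloaders as in the previous snippets, train, and wandb.log({"loss": ...}) ...

sweep_id = wandb.sweep(sweep_config, project="frelu-cifar-repro")  # illustrative project name
wandb.agent(sweep_id, function=train, count=20)  # run 20 trials of the Bayesian search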

Baselines

We compared our results using the same networks defined in the previous two sections, equipped with several commonly used activation functions: ReLU, Mish, Swish, DyReLU-A, and DyReLU-B. We discuss the results in the next section.
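For reference, Swish (x · sigmoid(x)) and Mish (x · tanh(softplus(x))) are simple element-wise functions; a minimal PyTorch sketch of these two baselines is shown below. DyReLU-A and DyReLU-B additionally require the coefficient-generating hyper-network from the Dynamic ReLU paper, so we omit them here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Swish(nn.Module):
    """Swish / SiLU: x * sigmoid(x) (also available as nn.SiLU in recent PyTorch)."""
    def forward(self, x):
        return x * torch.sigmoid(x)

class Mish(nn.Module):
    """Mish: x * tanh(softplus(x)) (also available as nn.Mish in recent PyTorch)."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))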

Computational Requirements

We ran all our experiments on three different clusters equipped with NVIDIA P100, T4, and K80 GPUs. For each group of runs, we used parallel nodes for faster turnaround. Additionally, we utilized free compute resources such as Google Colaboratory for extensive small-scale experimentation. Each group of runs usually took on average 7 - 11 hours to complete.

Results

Benchmarks

As shown in the plots below, Funnel clearly outperforms the other activation functions on both datasets, CIFAR-10 and CIFAR-100, by a significant margin across nearly all the models used in the experiments. Notably, Funnel beats every other activation function in the CIFAR-100 benchmarks when used with a ResNet-50 model. Overall, Funnel obtains the highest CIFAR-10 and CIFAR-100 accuracies across all models and the lowest CIFAR-100 loss. All models were run 5 times each, and the mean and standard deviation for each model are plotted in the panels below.

Sweeps

We further validated the efficiency and efficacy of the Funnel activation using the Sweeps feature in Weights & Biases. For this experiment, we used a ResNet-18 model trained for 50 epochs per run. As shown in the plot below, Funnel with a batch size of 256 and the Adam optimizer obtains the lowest loss. We also show a run comparison below, where one can observe the learnt spatial modeling at each layer where the Funnel activation is used.