Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training (ECCV 2020)

Reproducing the ECCV 2020 paper Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training by Hongkai Zhang, Hong Chang, Bingpeng Ma, Naiyan Wang, and Xilin Chen.
Sparsha
Paper | Code

Summary

Scope of Reproducibility

Two-stage object detection architectures have grown in prominence over the years in computer vision; however, the training regime of these detectors is not fully interpretable.
This paper builds on the inconsistency between fixed network settings and the dynamic training procedure, which greatly affects performance. It proposes a new detector called "Dynamic R-CNN" that automatically adjusts the label assignment criterion (the IoU threshold) and the shape of the regression loss function (the parameters of the SmoothL1 loss) based on the statistics of proposals during training.

Methodology

While the authors provide an official implementation on their GitHub repository, we decided to use the implementation available in the MMDetection framework, which has been acknowledged by the original authors. Our reasoning was that MMDetection provides pre-built, validated config files for Dynamic R-CNN along with all the utilities required for training, plus in-built support for Weights & Biases logging, which made it simple to train and track Dynamic R-CNN in our experiment. We ran the experiment on the MS-COCO dataset, performing object detection with a pre-trained ResNet-50 as the backbone. For our experiments, we used an NVIDIA Tesla V100 GPU on the Google Cloud Platform (GCP).

Results

We replicated Dynamic R-CNN with the 1x learning policy and a pre-trained ResNet-50 backbone (imported from torchvision) on the MS-COCO dataset. We obtained a validation AP of 38.9, 0.2 below the value reported in the paper (39.1), and a validation AP_{50} of 57.5, 0.5 below the reported value (58.0).

What was easy

Due to the pre-defined, verified training and model config files available in the MMDetection framework, along with its direct support for Weights & Biases, the whole experiment pipeline was simple and streamlined, requiring no further modification. We also implemented a simple hook, based on the documentation provided by Weights & Biases, to capture the bounding-box results on a few sample images.
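As an illustration, the sketch below shows the shape such a hook can take, assuming MMDetection 2.x's hook registry and the `wandb` client. The class name `WandbBBoxHook` and its arguments are our own, and how the bare detector is recovered from the runner (and whether its config is attached for inference) varies across MMDetection versions.

```python
import wandb
from mmcv.runner import HOOKS, Hook
from mmdet.apis import inference_detector


@HOOKS.register_module()
class WandbBBoxHook(Hook):
    """Hypothetical hook: after each epoch, run the detector on a few
    fixed sample images and log the predicted boxes to W&B."""

    def __init__(self, image_paths, score_thr=0.3):
        self.image_paths = image_paths
        self.score_thr = score_thr

    def after_train_epoch(self, runner):
        # runner.model is usually an MMDataParallel wrapper; the bare
        # detector sits at .module and needs its cfg attached for
        # inference_detector to work -- this detail varies by version.
        model = runner.model.module
        images = []
        for path in self.image_paths:
            # result is a per-class list of (n, 5) arrays: x1, y1, x2, y2, score
            result = inference_detector(model, path)
            box_data = []
            for cls_id, dets in enumerate(result):
                for x1, y1, x2, y2, score in dets:
                    if score < self.score_thr:
                        continue
                    box_data.append({
                        "position": {"minX": float(x1), "minY": float(y1),
                                     "maxX": float(x2), "maxY": float(y2)},
                        "domain": "pixel",
                        "class_id": int(cls_id),
                        "scores": {"conf": float(score)},
                    })
            images.append(wandb.Image(
                path, boxes={"predictions": {"box_data": box_data}}))
        wandb.log({"sample_predictions": images})
```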

What was difficult

The only bottleneck we faced was a lack of compute resources, which prevented us from experimenting with all the variants of Dynamic R-CNN specified in the paper.

Introduction

Compared to the image classification task, the annotations in object detection are ground-truth bounding boxes. This makes it unclear how to assign positive or negative labels to proposals when training the classifier, since their separation can be ambiguous.
The most widely used strategy is to set a threshold on the IoU between a proposal and its corresponding ground truth. One of the most popular architectures in the domain, Cascade R-CNN, states that training a classifier with a certain IoU threshold leads to performance degradation at other IoUs. However, one cannot directly set a high IoU threshold from the beginning of training due to the scarcity of positive samples. The solution Cascade R-CNN provides is to gradually refine the proposals over several stages, which is effective yet time-consuming. The problem is similar for the regressor: during training, the quality of proposals improves, yet the parameter of the SmoothL1 loss stays fixed, which leads to insufficient training for the high-quality proposals.
To solve the aforementioned issues, the authors propose "Dynamic R-CNN", which consists of two components:
  1. Dynamic Label Assignment (DLA)
  2. Dynamic SmoothL1 Loss (DSL)
This new approach not only solves the data-scarcity issue at the beginning of training but also harvests the benefit of high-IoU training. Additionally, the two modules target different parts of the detector and can therefore work collaboratively towards high-quality object detection.

Scope of Reproducibility

The scope of our reproducibility is twofold: verifying the paper's reported MS-COCO results for Dynamic R-CNN with a ResNet-50 backbone, and validating that the training procedure can be reproduced end-to-end with the publicly available MMDetection implementation.

Methodology

To better exploit the dynamic property of the training procedure, the authors propose Dynamic R-CNN. The key insight is adjusting the second-stage classifier and regressor to fit the distribution change of proposals. As described earlier, Dynamic R-CNN comprises two primary components: (1) Dynamic Label Assignment (DLA), structured for the classifier, and (2) Dynamic SmoothL1 Loss (DSL), structured for the regressor. We formulate these two components in detail in the subsequent sections.

Dynamic Label Assignment (DLA)

The DLA module can be represented by the following formula:
\text{label} = \begin{cases} 1, & \text{if } \max \mathrm{IoU}(b, G) \geq T_{now} \\ 0, & \text{if } \max \mathrm{IoU}(b, G) < T_{now}, \end{cases}
where b is a proposal box, G denotes the ground-truth boxes, and T_{now} refers to the current IoU threshold, which is updated during training based on the statistics of proposals.
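To make the DLA update concrete, here is a minimal sketch in PyTorch. The update rule, averaging the K_I-th largest proposal IoU over recent iterations (with K_I = 75 as the paper's reported default), follows the paper's description; the function names are ours.

```python
import torch


def dynamic_label_assignment(ious, t_now):
    """Assign binary labels to proposals given the current IoU threshold.

    ious:  (num_proposals,) tensor holding max IoU(b, G) for each proposal b
    t_now: the current dynamic IoU threshold T_now
    """
    return (ious >= t_now).long()


def update_iou_threshold(iou_history, k_i=75):
    """Periodically refresh T_now as the mean of the K_I-th largest proposal
    IoU recorded at each recent iteration (K_I = 75 per the paper)."""
    kth_largest = [
        # topk values are sorted descending, so [-1] is the k-th largest;
        # min() guards iterations with fewer than k_i proposals
        torch.topk(ious, min(k_i, ious.numel())).values[-1]
        for ious in iou_history
    ]
    return torch.stack(kth_largest).mean().item()
```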

Dynamic SmoothL1 Loss (DSL)

The DSL module can be represented by the following formula:
\mathrm{SmoothL1}(x, \beta) = \begin{cases} 0.5|x|^2/\beta, & \text{if } |x| < \beta \\ |x| - 0.5\beta, & \text{otherwise} \end{cases}
where x stands for the regression label and \beta is a hyper-parameter controlling the range in which a softer l_1-like loss is used instead of the original l_2 loss; it is set to a default value of 1.0.
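The loss above and the corresponding β update translate almost directly into PyTorch. In the sketch below, the update rule, taking the median of the K_β-th smallest regression errors over recent iterations (with K_β = 10 as the paper's reported default), follows the paper's description; the function names are ours.

```python
import torch


def smooth_l1(x, beta):
    """SmoothL1(x, beta) exactly as in the equation above."""
    x = x.abs()
    return torch.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)


def update_beta(error_history, k_beta=10):
    """Periodically refresh beta_now as the median of the K_beta-th smallest
    regression errors recorded at each recent iteration (K_beta = 10 per
    the paper)."""
    kth_smallest = [
        # min() guards iterations with fewer than k_beta positive samples
        torch.kthvalue(errs, min(k_beta, errs.numel())).values
        for errs in error_history
    ]
    return torch.stack(kth_smallest).median().item()
```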
Further details on DLA and DSL can be accessed in the paper.
The paper also provides pseudo-code for Dynamic R-CNN.
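As a rough reconstruction of that algorithm's control flow, the sketch below ties together the DLA and DSL updates from the previous sections. Every helper marked "hypothetical" stands in for machinery the real implementation provides, and the update interval c is illustrative.

```python
def train_dynamic_rcnn(model, data_loader, num_iters, k_i=75, k_beta=10, c=100):
    """Illustrative control flow only -- not the authors' implementation.

    Keeps running statistics of proposal IoUs and regression errors, and
    refreshes T_now / beta_now every c iterations using the DLA and DSL
    update rules sketched in the previous sections.
    """
    t_now, beta_now = 0.5, 1.0            # initial IoU threshold and beta
    iou_history, error_history = [], []

    for it, batch in zip(range(num_iters), data_loader):
        proposals = model.propose(batch)                 # hypothetical helper
        ious = max_iou_with_gt(proposals, batch)         # hypothetical helper
        labels = dynamic_label_assignment(ious, t_now)   # DLA (see above)
        errors = regression_errors(proposals, batch)     # hypothetical helper

        # DSL: the second-stage regression loss uses the current beta_now
        loss = model.second_stage_loss(batch, proposals, labels, beta=beta_now)
        loss.backward()                                  # optimizer step omitted

        iou_history.append(ious)
        error_history.append(errors)
        if (it + 1) % c == 0:                            # periodic refresh
            t_now = update_iou_threshold(iou_history, k_i)
            beta_now = update_beta(error_history, k_beta)
            iou_history, error_history = [], []
```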

Experimental Settings

Models

For our experiment, we used the Dynamic R-CNN model with a pre-trained ResNet-50 as the backbone, relying on the pre-built config provided in the PyTorch-based MMDetection framework.
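Concretely, loading and adapting that config looks like the snippet below. The config path is indicative and may differ across MMDetection releases, so treat it as an assumption and check the `configs/dynamic_rcnn/` directory of your checkout.

```python
from mmcv import Config

# Indicative path; the exact filename differs across MMDetection releases.
cfg = Config.fromfile('configs/dynamic_rcnn/dynamic_rcnn_r50_fpn_1x_coco.py')

# Enable the Weights & Biases logger alongside the default text logger.
cfg.log_config.hooks = [
    dict(type='TextLoggerHook'),
    dict(type='WandbLoggerHook',
         init_kwargs=dict(project='dynamic-rcnn-reproduction')),
]
```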

Datasets and Training

We used the MS-COCO 2017 dataset for the object detection experiment. Following the common practice stated in the paper, we used the COCO train split (∼118k images) for training and report results on the val split (5k images). COCO-style Average Precision (AP), which averages AP across IoU thresholds from 0.5 to 0.95 in steps of 0.05, was chosen as the main evaluation metric. We also used the 1x learning rate schedule for a setting consistent with the paper.
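For clarity, this metric amounts to averaging AP over ten fixed IoU thresholds, as the toy snippet below illustrates; the threshold grid matches pycocotools' default.

```python
import numpy as np

# COCO-style AP: mean of per-threshold AP over IoU in {0.50, 0.55, ..., 0.95}
iou_thrs = np.linspace(0.5, 0.95, 10)


def coco_ap(ap_at_thr):
    """ap_at_thr: array of AP values, one per threshold in iou_thrs."""
    return float(np.mean(ap_at_thr))
```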

Computational Requirements

For our experiment, we used an NVIDIA Tesla V100 machine on GCP with 8 vCPUs, 30 GB RAM, and 300 GB of disk storage. Our run took a total of 1 day, 18 hours, and 30 minutes.

Results

We retrained Dynamic R-CNN with a pre-trained ResNet-50 using the 1x learning rate schedule. The table below compares the results of our run with those reported in the paper. Our reproduced AP and AP_{50} trail the paper's values by 0.2 and 0.5 respectively, while our AP_{S}, AP_{M} and AP_{L} exceed the paper's values by 0.6, 1.0 and 0.8 respectively; our AP_{75} matches the reported value. Overall, our results validate the reproducibility of Dynamic R-CNN and hint that even higher performance may be attainable through different random seeds or stronger backbones.
|  | AP | AP<sub>50</sub> | AP<sub>75</sub> | AP<sub>S</sub> | AP<sub>M</sub> | AP<sub>L</sub> |
| --- | --- | --- | --- | --- | --- | --- |
| Original Paper Results | 39.1 | 58.0 | 42.8 | 21.3 | 40.9 | 50.3 |
| Our Reproduced Results | 38.9 | 57.5 | 42.8 | 21.9 | 41.9 | 51.1 |
For transparency and fair comparison, we provide the training and validation curves of the model in the plots shown below. We also provide analytical insight into the model's training via the bounding-box visualizer panel below, where one can view the model's bounding-box output on sample images at every epoch of training. Finally, we include graphs of the system statistics recorded during training for completeness.