DETR: Panoptic segmentation on Cityscapes dataset
Abstract
In May 2020 Facebook AI Research published the paper "End-to-End Object Detection with Transformers" [1], which views object detection as a direct set prediction problem.
The code is publicly available in the FAIR GitHub repository [2] and is designed to work with the COCO dataset, also providing the panoptic segmentation [3] feature.
This work compares different strategies for fine-tuning the pretrained transformer model on the Cityscapes dataset [4].
The implementation is available at https://github.com/DanieleVeri/fair-DETR.
1. Environment setup
The DETR repository contains a collection of Python scripts organized in three main packages: model, dataset and utils.
The model package provides the implementation of the DEtection TRansformer and the version with the mask head for panoptic segmentation.
The dataset package handles loading and parsing the COCO dataset and applies the data augmentation techniques.
Finally, the utils package implements miscellaneous functionalities such as handling execution on GPUs in a distributed environment.
In order to work with the Google Colab runtime, the code is refactored to run as a Jupyter notebook, disabling the distributed mode and removing all the unused features.
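As a concrete illustration, the distributed launcher can simply be bypassed by forcing single-process execution before building the model. The fields args.distributed and args.device mirror the ones used by the upstream main.py; the helper itself is a minimal sketch specific to this port, not upstream code.

```python
import torch

def setup_colab_runtime(args):
    # Skip the distributed initialization entirely: one process, one GPU.
    args.distributed = False
    args.device = "cuda" if torch.cuda.is_available() else "cpu"
    return torch.device(args.device)
```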
2. Dataset
A collection of scripts [5] is used to download the Cityscapes dataset and convert it to the COCO format, allowing the reuse of the functions responsible for loading and parsing the dataset.
The dataset used (leftImg8bit_trainvaltest with gtFine labels) consists of an archive of 5000 images, of which about 3000 form the train set, 500 the validation set and 1500 the test set (unused).
In particular, the COCO format [6] for panoptic segmentation consists of a PNG that stores the class-agnostic image segmentation and a JSON struct that stores the semantic information for each image segment. The unique id of the segment is used to retrieve the corresponding mask from the PNG, while the category_id gives the semantic category. The isthing attribute distinguishes stuff and thing categories.
While the COCO dataset expresses each segmentation as a list of vertices (which is then converted to the mask image), the Cityscapes script already outputs the PNG.
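To make the format concrete, the snippet below decodes the mask of a single segment from the panoptic PNG. The file name, segment ids and category ids are made up for illustration; the id-to-color encoding (id = R + G·256 + B·256²) is the one defined by the COCO panoptic format.

```python
import numpy as np
from PIL import Image

annotation = {                      # one entry of the panoptic JSON (illustrative values)
    "image_id": 42,
    "file_name": "frankfurt_000000_000294_gtFine_panoptic.png",
    "segments_info": [
        {"id": 8405120, "category_id": 26, "iscrowd": 0},   # e.g. a "thing" segment
        {"id": 4259840, "category_id": 21, "iscrowd": 0},   # e.g. a "stuff" segment
    ],
}

pan_png = np.array(Image.open(annotation["file_name"]), dtype=np.uint32)
# Decode the per-pixel segment id from the RGB channels.
pan_ids = pan_png[..., 0] + 256 * pan_png[..., 1] + 256 ** 2 * pan_png[..., 2]

for seg in annotation["segments_info"]:
    mask = pan_ids == seg["id"]     # boolean mask of this segment
    print(seg["category_id"], int(mask.sum()), "pixels")
```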
3. Training speedup
In the original work the model was trained for 3 days using a cluster of 16 Tesla V100 GPUs for a total of 500 epochs.
The environment provided by Colab Pro offers just one Tesla P100/V100, depending on availability, so it is necessary to make the training routine as efficient as possible in order to run a significant number of epochs.
Profiling the code shows that the data augmentation process, which consists of random scaling, flipping and cropping, consumes most of the time. Assuming that the pretrained model has already learned a robust image representation, this component can be turned off.
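One way to do so is to replace the random train-time pipeline with the deterministic resize and normalization used for validation. The transform classes (Compose, RandomResize, ToTensor, Normalize) come from the repository's datasets/transforms.py; the exact wiring below is a sketch of this work, not the upstream default.

```python
import datasets.transforms as T

def make_fast_transforms():
    normalize = T.Compose([
        T.ToTensor(),
        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    # Single fixed resize instead of random flip / scale / crop.
    return T.Compose([
        T.RandomResize([800], max_size=1333),
        normalize,
    ])
```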
Finally, the batch size is set to 1 to act as a regularizer during training.
4. Transfer learning
In order to fine-tune the pretrained model on another dataset, several strategies are employed.
All experiments are tracked with Weights and Biases, which allows monitoring metrics and gradients and handles model versioning.
The starting point is the model pretrained on the COCO dataset for the panoptic segmentation task, with the ResNet50 backbone.
NOTE: The convolutional backbone is always ResNet50 and is frozen during training. A step on the x-axis corresponds to 10 batches.
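The tracking setup follows the standard Weights and Biases workflow; in the sketch below the project and run names are illustrative, train_one_step is a hypothetical wrapper around the repository's training loop, and gradients are sampled every 10 batches to match the step granularity mentioned in the note above.

```python
import wandb

wandb.init(project="fair-DETR", name="freeze_transformer")   # illustrative names
wandb.watch(model, log="gradients", log_freq=10)              # sample gradients every 10 batches

for step, (samples, targets) in enumerate(data_loader_train):
    # model and data_loader_train are built earlier in the notebook.
    loss_dict = train_one_step(model, samples, targets)       # hypothetical helper
    wandb.log({k: v.item() for k, v in loss_dict.items()}, step=step)
```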
4.1 First step
Some labels, like car, person and sky, are already present in the COCO dataset but with a different category ID, so in order to map the image representation to the new categories the fine-tuning is performed keeping the transformer frozen as well, training only the classification, bbox and mask heads.
In fact, poorly initialized final layers could corrupt the image features learned by the network.
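A minimal sketch of this first step is given below: everything except the prediction heads is frozen. The parameter-name prefixes (class_embed, bbox_embed, bbox_attention, mask_head) are assumptions based on the module names in the DETR implementation, and the optimizer settings are illustrative.

```python
import torch

HEAD_PREFIXES = ("class_embed", "bbox_embed", "bbox_attention", "mask_head")

def freeze_all_but_heads(model):
    # Backbone and transformer stay frozen; only the heads receive gradients.
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in HEAD_PREFIXES)

freeze_all_but_heads(model)   # model is the pretrained DETR panoptic model
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-4, weight_decay=1e-4,
)
```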
This run (freeze_transformer) is compared with the fine_tune run, which performs plain fine-tuning without freezing the transformer, and the smaller_scratch run, which trains a smaller architecture from scratch.
Plots report panoptic quality metrics on the validation set and losses on the training set.
Keeping the transformer frozen outperformed the other models.
The performance on stuff categories is higher than on thing categories because the attention mechanism improves results on large objects.
Training a transformer from scratch would require too many epochs before it reaches output predictions with a confidence greater than the threshold, which is the reason for the 0 scores on the validation set.
The per-class scores are reported below:

4.2 Second step
Once the panoptic quality on the validation set reaches a plateau, more layers are unfrozen from the model with the highest score (freeze_transformer).
Several strategies are explored in this step as well:
- freeze_decoder: only the encoder is trained. The encoder is responsible for the latent image representation.
- freeze_first_layers: only the last 3 layers of both encoder and decoder are trained. This follows the usual principle of fine-tuning deep networks, where the first layers activate on general image features while the later ones activate on more problem-specific features.
- freeze_attention: only the attention layers are frozen (a sketch is given after this list). Compared to a linear layer, the attention mechanism acts with two levels of indirection: the weights are computed on the fly depending on the input and then applied. This allows a more flexible modeling of the problem as long as enough training data points are provided (Cityscapes is a relatively small dataset).
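The freeze_attention strategy can be sketched as follows; the sub-module names self_attn and multihead_attn are assumptions based on how DETR names the attention modules in its encoder and decoder layers.

```python
def freeze_attention(model):
    # Freeze every attention sub-module inside the transformer; the FFNs,
    # normalization layers and prediction heads keep training.
    for name, param in model.named_parameters():
        if "self_attn" in name or "multihead_attn" in name:
            param.requires_grad = False

freeze_attention(model)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```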
The model with attention layers frozen performs better than the others.
Segmentation quality is almost constant, while the recognition quality improves by about 10 points.
Validation losses show no overfitting, except for the freeze_decoder run, which was stopped early.
The per-class scores are plotted below:

Some gradients recorded during training of the freeze_attention run are reported below.
Gradients are sampled from the first linear layer of transformer layers 1, 4 and 6 (the behavior is consistent across all the other layers).
The left side shows the decoder layers and the right side the encoder ones: the gradient magnitude is greater in the encoder, which is learning the latent image representation, than in the decoder, which processes the object queries.
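For reference, the per-layer gradient norms can be sampled as below after a backward pass; the parameter names encoder.layers.{i}.linear1 and decoder.layers.{i}.linear1 reflect how DETR names its transformer layers, but should be treated as assumptions.

```python
def sample_linear1_grad_norms(model, layer_ids=(0, 3, 5)):   # layers 1, 4 and 6
    # Collect gradient norms of the first FFN linear layer of selected
    # encoder/decoder layers, to compare their magnitudes.
    norms = {}
    for name, param in model.named_parameters():
        for side in ("encoder", "decoder"):
            for i in layer_ids:
                if f"{side}.layers.{i}.linear1.weight" in name:
                    g = param.grad
                    norms[f"{side}_layer{i + 1}"] = 0.0 if g is None else g.norm().item()
    return norms
```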
5. Conclusion
This work shows that fine-tuning with a proper strategy can lead to a score of 43 PQ in about 50 epochs. Even if the results are not comparable with the Cityscapes benchmarks [7], they are nevertheless satisfactory:

Transformers are good at global reasoning but are computationally expensive with long inputs (high-resolution images), making it difficult to attain good results on small objects.