
Multiformer

A multi-task transformer for autonomous driving perception.
Created on November 30 | Last edited on December 13
This is the training report for the Multiformer model, which focuses on the training configurations and results. For an in-depth discussion of this project, its motivation, and the research behind it, please see the official blog post. The pretrained weights are available for use on the Hugging Face Hub. See the project GitHub for setup instructions and training/inference scripts.
Figure 1: Anatomy of the Multiformer model.


Table of Contents:
  • Overview
  • Losses
  • Training Configurations
  • Training Results
  • Conclusion

Overview

Global-Local Path Network (GLPN) and Segformer both use the PVTv2 backbone to achieve state-of-the-art results in their respective tasks of monocular depth estimation and semantic segmentation, in both cases requiring only the addition of lightweight, all-MLP decoding heads, which demonstrates the descriptive power of the features generated by this hierarchical transformer backbone. Deformable DETR is a 2D object detection network that improved on the Detection Transformer (DETR) by introducing the more computationally efficient deformable attention mechanism (inspired by Deformable Convolution). Published before PVTv2, Deformable DETR did not provide benchmarks using this backbone, but Panoptic Segformer, an adaptation of Deformable DETR for panoptic segmentation released a year later, gives a good look at just how powerful the features learned by these lightweight backbones are: its configuration backed by PVTv2-B5 nearly matched the top-performing Swin-L backbone configuration with less than half the parameters.
This work can be seen as an extension of Deformable DETR for multi-task training and inference. Adding the decoding head from GLPN provides accurate monocular depth, which can support downstream 3D detection and serves as an auxiliary supervision signal that brings the benefits of multi-task training. The lightweight semantic segmentation head from Segformer can easily be added for an additional auxiliary supervision signal and more holistic scene understanding at inference time. The comparison of training runs below supports existing research in Multi-Task Learning (MTL), demonstrating that simultaneous training on multiple tasks improves learning on the individual tasks and making a strong case for this strategy whenever training labels are available or a teacher network can be used for pseudo-labeling.
This work is also an example of transfer learning: the backbone and auxiliary task heads were pretrained on the Cityscapes semantic segmentation task, then trained as part of Multitask Segformer on depth and semantic segmentation using the synthetic SHIFT dataset, before being used to train the Deformable DETR module. Although resources did not allow for empirical comparisons against the same architecture trained without transfer learning, the fast convergence to acceptable metrics indicates that knowledge is transferring well between each round of task training.



Losses

Semantic segmentation was trained with cross-entropy loss. Depth was trained with a linear combination of SiLog and Mean Absolute Error (MAE), the latter included to enforce true-scale accuracy. The 2D detection module uses the loss paradigm of Deformable DETR: bipartite matching via the Hungarian algorithm, focal loss for the box classification head, and a combination of normalized center-coordinate L1 loss and generalized IoU (GIoU) loss for the box proposal head. The task losses were given empirically determined lambda values of 5.0 for semantic segmentation, 1.0 for depth, and 1.0 for 2D detection. The lambda values for the 2D detection loss components were taken from the DETR literature and are shown below. Written formally, the total loss is calculated as follows:
$$\ell_{depth} = \lambda_{SiLog}\,\ell_{SiLog} + \lambda_{MAE}\,\ell_{MAE}$$

$$\ell_{det2d} = \lambda_{focal}\,\ell_{focal} + \lambda_{L1_{center}}\,\ell_{L1_{center}} + \lambda_{GIoU}\,\ell_{GIoU}$$

$$\ell_{total} = \lambda_{semseg}\,\ell_{semseg} + \lambda_{depth}\,\ell_{depth} + \lambda_{det2d}\,\ell_{det2d}$$

$$\text{where}\quad \begin{cases}\lambda_{SiLog}=1.0\\ \lambda_{MAE}=0.1\\ \lambda_{L1_{center}}=5.0\\ \lambda_{focal}=1.0\\ \lambda_{GIoU}=2.0\\ \lambda_{semseg}=5.0\\ \lambda_{depth}=1.0\\ \lambda_{det2d}=1.0\end{cases}$$
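As a minimal illustration of how these weighted terms combine in code (the per-component losses are assumed to be precomputed scalar tensors, and the names here are illustrative rather than taken from the project code):

```python
import torch

# Empirically determined weights from the section above.
LAMBDAS = {
    "silog": 1.0, "mae": 0.1,                      # depth components
    "focal": 1.0, "l1_center": 5.0, "giou": 2.0,   # 2D detection components
    "semseg": 5.0, "depth": 1.0, "det2d": 1.0,     # task-level weights
}

def total_loss(losses: dict) -> torch.Tensor:
    """Combine per-component scalar losses into the total multi-task loss."""
    depth = LAMBDAS["silog"] * losses["silog"] + LAMBDAS["mae"] * losses["mae"]
    det2d = (LAMBDAS["focal"] * losses["focal"]
             + LAMBDAS["l1_center"] * losses["l1_center"]
             + LAMBDAS["giou"] * losses["giou"])
    return (LAMBDAS["semseg"] * losses["semseg"]
            + LAMBDAS["depth"] * depth
            + LAMBDAS["det2d"] * det2d)
```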




Training Configurations

All training runs began with pretrained weights in the auxiliary heads and backbone, but randomly initialized weights in the Deformable DETR module. The training runs span 150,000 training steps, or 8 epochs at batch size 8, over the 1 Hz front-facing camera data from the SHIFT synthetic driving dataset, chosen because it is large, diverse, and offers a full annotation suite ideal for multi-task training experiments. An AdamW optimizer with a cosine learning rate schedule decaying from 0.0002 to zero was used for all runs.
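A minimal sketch of this optimization setup in PyTorch is shown below; the model and loss are stand-ins, and the actual training scripts in the project repository may differ in details such as weight decay, warmup, or gradient clipping.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

TOTAL_STEPS = 150_000  # 8 epochs at batch size 8 over the 1 Hz front-camera data

model = torch.nn.Linear(8, 8)  # stand-in for the assembled Multiformer network

optimizer = AdamW(model.parameters(), lr=2e-4)
# Cosine schedule decaying the learning rate from 2e-4 to zero over training.
scheduler = CosineAnnealingLR(optimizer, T_max=TOTAL_STEPS, eta_min=0.0)

for step in range(TOTAL_STEPS):
    loss = model(torch.randn(8, 8)).sum()  # placeholder for the multi-task loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # stepped once per iteration so the decay spans all steps
```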

Multiformer-M0

Deformable DETR was released with 6 layers in both the encoder and decoder, using a hidden size of 1024 in these layers. Multiformer has been developed in resource-constrained environments, so, with added inspiration from the success of the lightweight heads in the Segformer and GLPN models built on the same backbone, these values were reduced to 3 layers in both the encoder and decoder with a hidden size of 256. This configuration, called Multiformer-M0, proves quite effective despite a parameter count of less than one third of the default Deformable DETR module configuration, likely another testament to the potency of the PVTv2 features. It keeps the convention from Deformable DETR of not taking the highest-resolution feature map from the backbone, as doing so leads to an unacceptably large sequence length in the attention modules. The variants of the M0 configuration are described below:
  • multiformer-m0-det2d-only - This configuration was trained on 2D detection only, without the auxiliary supervision signals from the depth and semantic heads. This comparison shows the advantage of multi-task training over transfer learning alone, even when only a single inference task is desired.
  • multiformer-m0-no-pos-embed - One of the interesting features of hierarchical transformers like PVTv2 is that they exploit the ability of convolutional layers to encode positional information from zero-padding, allowing them to eschew the traditional position embeddings used in transformers. This raises the question: does the Deformable DETR module need to add position embeddings to feature maps generated from PVTv2, or is this redundant? To answer it, this configuration switches off the sine position embeddings normally added to the feature maps.
  • multiformer-m0-fusion-a - This configuration concatenates the predicted semantic logits and log depth onto the input features of the Deformable DETR encoder. It is an attempt at prediction distillation, experimenting with the idea that the decoded semantic and geometric knowledge could benefit the detection task (a rough sketch of this fusion follows the list).
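As a rough illustration of the fusion-a idea, the sketch below resizes the dense predictions to a feature level's resolution and concatenates them along the channel dimension. The tensor names, shapes, and class counts are assumptions for illustration, not the project's actual implementation, and the step that projects the widened tensor back to the encoder's expected channel dimension is omitted.

```python
import torch
import torch.nn.functional as F

def fuse_predictions(features: torch.Tensor,
                     semantic_logits: torch.Tensor,
                     depth: torch.Tensor) -> torch.Tensor:
    """Concatenate decoded predictions onto one level of encoder input features.

    features:        (B, C, H, W) feature map for one encoder input level
    semantic_logits: (B, num_classes, H_full, W_full) semantic head output
    depth:           (B, 1, H_full, W_full) metric depth from the depth head
    """
    target_size = features.shape[-2:]
    # Resize dense predictions to the feature map's spatial resolution.
    sem = F.interpolate(semantic_logits, size=target_size,
                        mode="bilinear", align_corners=False)
    # Use log depth so the extra channel stays in a range comparable to features.
    log_depth = F.interpolate(torch.log(depth.clamp(min=1e-3)), size=target_size,
                              mode="bilinear", align_corners=False)
    return torch.cat([features, sem, log_depth], dim=1)

# Illustrative shapes only: 256 feature channels, 23 semantic classes (assumed).
feats = torch.randn(2, 256, 64, 96)
sem_logits = torch.randn(2, 23, 512, 768)
depth_pred = torch.rand(2, 1, 512, 768) * 80.0  # assumed depth range in meters
print(fuse_predictions(feats, sem_logits, depth_pred).shape)  # (2, 280, 64, 96)
```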

Multiformer-M1

Training the Multiformer-M0 model showed comparatively low performance on smaller boxes (similar to Deformable DETR benchmarks, and common to the object detection task at large). Since the high-resolution feature maps excluded by default in Deformable DETR are known to be informative for dense prediction tasks like semantic segmentation and depth, it is a reasonable hypothesis that incorporating this discarded information would benefit fine-grained detection of small objects. The M0 architecture was therefore adapted to include the largest feature map in the Deformable DETR encoder input by first passing it through a 2x2 channel-remapping convolution layer to reduce its spatial dimensions. In theory, if the output channel dimension is at least 4x the number of channels in the input feature map (as is the case here), the network can fold the original information depth-wise into the channels, so there isn't necessarily a compression occurring, while the reduced area keeps the attention complexity low. Although this leads to only a slight increase in parameter count, from 8.1M to 8.3M, the increase in memory footprint during training is much more noticeable due to the longer sequence length through the attention modules.
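Below is a minimal sketch of this downsampling step, assuming a stride-2, 2x2 convolution; the channel counts are illustrative rather than taken from the project configuration.

```python
import torch
import torch.nn as nn

# Illustrative channel counts: the highest-resolution PVTv2-B0 stage is assumed
# to output 32 channels, and the encoder input here is assumed to be 256-wide.
in_channels, out_channels = 32, 256

# A 2x2 convolution with stride 2 halves each spatial dimension. Because
# out_channels >= 4 * in_channels, the layer has enough capacity to fold each
# 2x2 spatial neighborhood depth-wise into the channels without discarding
# information, while the reduced area keeps attention sequence lengths low.
downsample = nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=2)

feature_map = torch.randn(1, in_channels, 160, 256)  # (B, C, H, W)
reduced = downsample(feature_map)
print(reduced.shape)  # torch.Size([1, 256, 80, 128])
```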



Training Results

2D Detection


[Chart panels: 2D detection metrics for the run set (5 runs)]

We can see that the larger M1 model performs noticeably better in all mAP metrics except for large boxes, which is somewhat to be expected given its higher parameter count. However, it saw these gains almost immediately, which indicates that this outcome is driven not only by model size, but by the quality of information the encoder receives thanks to the inclusion of the highest-resolution feature map.
While the position-embedding-free configuration multiformer-m0-no-pos-embed was still able to perform relatively well on the 2D detection task, there was a slight advantage to using the sine position embeddings. Considering they provide this benefit with no additional learned parameters, it seems reasonable to always use them.
The M0 configuration multiformer-m0-fusion-a, which attempted prediction distillation by concatenating the outputs of the semantic and depth heads onto the inputs of the Deformable DETR encoder, also showed a slight disadvantage compared to the default configuration. Although it still performs reasonably well, it appears that the naïve concatenation of the outputs into the feature space has confused learning somewhat, and that either a more sophisticated method of fusion or a longer training schedule may be required to unlock the potential benefit of this knowledge for the detection task. Future work into 3D object detection should explore how best to pass this information (particularly depth) into the 3D detection module, whatever form it takes.
The larger M1 configuration had a noticeable positive impact on box scores, with +1.25 overall mAP, +1.23 mAP medium, and +0.52 mAP small, which represent 3.19%, 1.92%, and 2.66% performance gains, respectively. Large box performance also saw a significant gain of +1.32 mAP in the M1 configuration, indicating that the information contained in the previously discarded feature map was beneficial to more than just fine-grained performance. Future work should investigate how this difference holds over longer training periods.



Auxiliary Tasks

We can see that model performance on the pretrained tasks of depth and semantic segmentation only increased during training, meaning that learning 2D detection did not sacrifice performance or introduce negative transfer elsewhere. Interestingly, the removal of positional encodings in the Deformable DETR module seems to have had a negative effect on the auxiliary task performance. Perhaps, in trying to extract positional information from the backbone, the Deformable DETR module sent backprop gradients through it that were more destructive to the other tasks.
The semantic and depth heads were pretrained along with the backbone as part of the Multitask Segformer project, the precursor to this work, where these tasks are explored in more detail. For more information, please refer to the blog post and the training report for that project.


[Chart panels: auxiliary task metrics for the run set (5 runs)]




Conclusion

The PVTv2 backbone has shown an adept ability to learn a common feature space useful to all three of the vision tasks trained, even at its smallest (B0) size. Further, similar to the findings of Segformer in semantic segmentation and GLPN in monocular depth estimation, 2D detection can be extracted from this feature space with a surprisingly small number of parameters, achieving compelling performance with a Deformable DETR configuration less than one third of the default size for this module. Isolating 2D detection as the target task, multi-task training with supervision signals from semantic segmentation and depth showed a clear advantage over training 2D detection with transfer learning only, with an increase of +3.25 mAP (an 8.7% relative gain in performance). This provides a strong case for using this training method whenever training labels are available, or a teacher network can be used to generate pseudo-labels, even when the final application only requires a single task: the representations learned will be more robust and general, as knowledge learned in other tasks can be leveraged to perform the target task.
The combination of multi-task prowess and parameter efficiency in computer vision demonstrated by a model like Multiformer is highly complementary to autonomous robotics applications, particularly at smaller scales, since it can perform inference at higher framerates and on smaller hardware. As the required footprint of the perception module decreases, the newfound memory surplus can be used to explore combinations with planning and control modules on consumer-sized hardware, democratizing research in full-stack autonomy. Further, since these weights have been trained on data captured in the CARLA simulator, they provide a great starting point for researching and developing autonomous vehicle stacks in that environment.