Lyft's High-Capacity End-to-End Camera-Lidar Fusion for 3D Detection

Learn how Lyft Level 5 combines multiple perception sensors in their self-driving vehicle research. Made by Weights & Biases Case Studies using W&B
This report is authored by Andrew Zhong & Qiangui (Jack) Huang, Lyft Level 5 (to be acquired by Woven Planet, a subsidiary of Toyota)

Introduction

Autonomous vehicle perception depends on complementary sensor modalities: camera, lidar, and radar.
For example, lidar provides very accurate depth but becomes sparse at long range. Cameras provide dense semantic signals even at long range but do not directly capture an object’s depth.
By fusing these sensor modalities, we can leverage their complementary strengths to achieve more accurate 3D detection of agents (e.g., cars, pedestrians, cyclists) than we could with any single sensor modality alone. To accelerate autonomous driving research, Lyft shares data from our autonomous fleet so others can tackle perception and prediction problems.
An example from the Lyft Level 5 dataset [9], which spans multiple modalities (lidar, radar, and camera)
There are many strategies for fusing cameras and lidar. Interested readers can find a comprehensive review of fusion strategies in references [1-6] below. Here, we consider only two strategies, both of which use deep neural networks.
One strategy is two-stage fusion, as in Frustum PointNets [1]: the first stage is an image detection model that takes camera images and outputs 2D detections. The second stage takes these 2D detection proposals, creates 3D frustums from them, and runs lidar-based detection within each frustum. We adopted this strategy in our early stack and achieved good results (see our blog post [7] for more details); the flow is sketched below.
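To make the two-stage flow concrete, here is a minimal sketch. The `image_detector`, `frustum_detector`, and `camera.project` callables are hypothetical stand-ins rather than our actual pipeline; the sketch only illustrates the structure (2D boxes, then frustum point selection, then per-frustum 3D detection).

```python
def two_stage_fusion(images, lidar_points, image_detector, frustum_detector, cameras):
    """Hypothetical sketch of two-stage fusion.
    Stage 1: run 2D detection on each camera image.
    Stage 2: run lidar-based 3D detection on the points inside each 2D box's frustum.
    Assumes lidar_points is an (N, 3+) numpy array and camera.project returns (N, 2) pixel coords.
    """
    detections_3d = []
    for image, camera in zip(images, cameras):
        boxes_2d = image_detector(image)      # stage 1: 2D detection proposals
        uv = camera.project(lidar_points)     # project all lidar points into this camera
        for x_min, y_min, x_max, y_max in boxes_2d:
            # Keep only the lidar points whose projection falls inside the 2D box.
            in_frustum = (
                (uv[:, 0] >= x_min) & (uv[:, 0] <= x_max)
                & (uv[:, 1] >= y_min) & (uv[:, 1] <= y_max)
            )
            # Stage 2: a lidar-only model predicts a 3D box from the frustum points.
            detections_3d.append(frustum_detector(lidar_points[in_frustum]))
    return detections_3d
```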
However, the performance of two-stage fusion is limited by either the vision model or the lidar model: each model is trained on a single modality and never gets a chance to learn how the two sensor modalities complement each other.
An alternative way of fusing camera and lidar is end-to-end fusion (EEF): a single high-capacity model that takes all raw camera images (six of them in our case) and the lidar point cloud as input and is trained end to end. The model runs separate feature extractors on each modality and performs detection on the fused feature map. In this report, we will discuss our take on end-to-end camera-lidar fusion and present experimental results produced with Jadoo, our internal ML framework, and W&B.
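For illustration, below is a minimal PyTorch sketch of this structure: one feature extractor per modality and a detection head on the fused feature map. The layer sizes, the BEV-rasterized lidar input, and the global image pooling are simplifying assumptions, not our production model; a real EEF model would project image features into the bird's-eye-view grid using camera calibration.

```python
import torch
import torch.nn as nn

class EndToEndFusion(nn.Module):
    """Illustrative end-to-end fusion sketch (not Lyft's production architecture)."""
    def __init__(self, num_cameras=6, bev_channels=64, img_channels=32,
                 num_anchors=2, num_classes=3):
        super().__init__()
        # Image branch: a small CNN applied to each camera image.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, img_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(img_channels, img_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Lidar branch: assumes the point cloud was rasterized into a 10-channel BEV grid.
        self.lidar_encoder = nn.Sequential(
            nn.Conv2d(10, bev_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(bev_channels, bev_channels, 3, padding=1), nn.ReLU(),
        )
        # Detection head runs on the fused (concatenated) feature map:
        # per-anchor class scores plus 7 box parameters (x, y, z, l, w, h, yaw).
        self.head = nn.Conv2d(bev_channels + img_channels,
                              num_anchors * (num_classes + 7), 1)

    def forward(self, images, lidar_bev):
        # images: (B, num_cameras, 3, H, W); lidar_bev: (B, 10, Hb, Wb)
        b, n, c, h, w = images.shape
        img_feats = self.image_encoder(images.view(b * n, c, h, w))
        # Pool image features per camera and broadcast over the BEV grid
        # (a real model would project image features into BEV via calibration).
        img_global = img_feats.view(b, n, img_feats.shape[1], -1).mean(dim=(1, 3))
        lidar_feats = self.lidar_encoder(lidar_bev)
        img_map = img_global[:, :, None, None].expand(-1, -1, *lidar_feats.shape[-2:])
        fused = torch.cat([lidar_feats, img_map], dim=1)
        return self.head(fused)
```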

End-to-End Fusion Performs Better Than Two-Stage Fusion

End-to-end trained EEF has several key advantages over two-stage fusion, most importantly that the whole model is trained on both modalities jointly and can learn directly how to combine their complementary strengths.
In the following figure, you can see that the end-to-end fusion model gives much higher AP@0.5 than the two-stage fusion model for pedestrians on our in-house dataset. We observed the same trend for the other classes (cars and cyclists) as well. For simplicity, we only compare AP for pedestrians in the remaining sections of the report.

End-to-End Fusion Performs Better Than a Lidar-Only Model

We ran this experiment to understand whether fusing image features increases detection accuracy. We compared two configurations: `Lidar-only (no fusion)` and `End-to-end fusion`.
In the following figures, `End-to-end fusion` performs better than `Lidar-only (no fusion)` on AP@0.5 for pedestrians within the 75m range.

Balancing Model Capacity & Overfitting

Higher Model Capacity = Boosted Accuracy

We increased the capacity of the feature extractors within the EEF model and trained three configurations of increasing capacity: EEF baseline, EEF higher capacity, and EEF highest capacity.
The following graphs show that higher model capacity helps: the highest-capacity configuration (measured by the number of point cloud encoder/decoder features) boosted AP@0.5 for pedestrians within the 75m range by ~5%.
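As a rough illustration, such a capacity sweep can be expressed as a small set of configurations; the feature widths below are made-up placeholders, not the values we used.

```python
# Illustrative capacity sweep for the EEF point cloud encoder/decoder.
# The feature widths are placeholders, not Lyft's actual settings.
capacity_configs = {
    "eef_baseline":         {"pc_encoder_features": 64,  "pc_decoder_features": 64},
    "eef_higher_capacity":  {"pc_encoder_features": 128, "pc_decoder_features": 128},
    "eef_highest_capacity": {"pc_encoder_features": 256, "pc_decoder_features": 256},
}
```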

Tackling Overfitting with Data Augmentations

The EEF model is a high-capacity model and is therefore more prone to overfitting. One sign of overfitting is that the training metric keeps improving while the validation metric does not (or the gap between the training and validation metrics widens).
We applied both image and lidar data augmentations (DA) and significantly reduced the overfitting. In the following graphs, the different types of image-based DA reduce the gap between training and validation AP@0.5 for pedestrians by up to 5% and improve model generalization. A sketch of representative augmentations is shown below.
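For concreteness, here is a sketch of the kinds of image and lidar augmentations that can be applied in such a setup. The specific transforms and parameter ranges are illustrative assumptions, not our exact recipe.

```python
import numpy as np
import torchvision.transforms as T

# Image-side augmentations, applied independently to each camera image.
image_augmentations = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    T.RandomGrayscale(p=0.1),
])

def augment_lidar(points):
    """Lidar-side augmentations on an (N, 4) point cloud of (x, y, z, intensity).
    Note: in a real pipeline, 3D box labels must be rotated/scaled consistently."""
    points = points.copy()
    # Random global rotation around the z (up) axis.
    theta = np.random.uniform(-np.pi / 8, np.pi / 8)
    c, s = np.cos(theta), np.sin(theta)
    points[:, :2] = points[:, :2] @ np.array([[c, -s], [s, c]]).T
    # Random global scaling and small per-point jitter.
    points[:, :3] *= np.random.uniform(0.95, 1.05)
    points[:, :3] += np.random.normal(scale=0.02, size=(points.shape[0], 3))
    return points
```

In this sketch, `image_augmentations` would be applied to each camera image and `augment_lidar` to the point cloud before it is fed to (or rasterized for) the lidar branch.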

Faster Training: Iteration Speed Matters

Faster Training from Image Dropout and Half-Res Images

Training the high-capacity EEF model can be slow because the baseline EEF trains on all 6 full-res images: it took 4.3 hours to train 100 epochs on 64 GPUs. In order to iterate faster on EEF, we needed to speed up training. We experimented with two changes: image dropout and half-res images.
With the help of both image dropout and half-res images, we sped up EEF training by 4.3x and also improved model accuracy. A sketch of these two changes is shown below.
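Here is a minimal sketch of how these two speed-ups could look, under assumed semantics: "image dropout" randomly drops a subset of the six camera images for each training sample, and "half-res" downsamples the images by 2x before the image encoder. The function below is illustrative, not our training code.

```python
import torch
import torch.nn.functional as F

def image_dropout_and_downsample(images, keep_prob=0.5, half_res=True, training=True):
    """images: (B, num_cameras, 3, H, W) batch of multi-camera inputs."""
    if training:
        # Image dropout: zero out dropped cameras so tensor shapes stay fixed.
        keep = (torch.rand(images.shape[:2], device=images.device) < keep_prob)
        images = images * keep[:, :, None, None, None].float()
    if half_res:
        # Half-res: bilinear 2x downsampling before the image feature extractor.
        b, n, c, h, w = images.shape
        images = F.interpolate(images.view(b * n, c, h, w), scale_factor=0.5,
                               mode="bilinear", align_corners=False)
        images = images.view(b, n, c, h // 2, w // 2)
    return images
```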

Mixed-precision Training

We used automatic mixed precision from PyTorch [10] and further sped up training by 1.2x and reduced the GPU memory footprint, without regressing model accuracy.
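For reference, a minimal training-step sketch with `torch.cuda.amp` looks like the following; `model`, `optimizer`, `batch`, and `loss_fn` are placeholders for the actual EEF training setup.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

def train_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    with autocast():                      # run the forward pass in mixed precision
        predictions = model(*batch["inputs"])
        loss = loss_fn(predictions, batch["targets"])
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                # unscale gradients and apply the optimizer step
    scaler.update()                       # adjust the loss scale for the next iteration
    return loss.detach()
```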

Impact

Learn More about Lyft Level 5

Acknowledgments

This report is based on the work done by the authors and Anastasia Dubrovina, Dmytro Korduban, Eric Vincent, and Linda Wang.

References

[1] Qi, Charles R., Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. "Frustum PointNets for 3D Object Detection from RGB-D Data." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 918–927. 2018.
[2] Xu, Danfei, Dragomir Anguelov, and Ashesh Jain. "PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 244–253. 2018.
[3] Sindagi, Vishwanath A., Yin Zhou, and Oncel Tuzel. "MVX-Net: Multimodal VoxelNet for 3D Object Detection." arXiv preprint arXiv:1904.01649 (2019).
[4] Vora, Sourabh, Alex H. Lang, Bassam Helou, and Oscar Beijbom. "PointPainting: Sequential Fusion for 3D Object Detection." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4604–4612. 2020.
[5] Liang, Ming, Bin Yang, Yun Chen, Rui Hu, and Raquel Urtasun. "Multi-Task Multi-Sensor Fusion for 3D Object Detection." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7345–7353. 2019.
[6] Liang, Ming, Bin Yang, Shenlong Wang, and Raquel Urtasun. "Deep Continuous Fusion for Multi-Sensor 3D Object Detection." In Proceedings of the European Conference on Computer Vision (ECCV), pp. 641–656. 2018.
[7] Vincent, Eric, Kanaad Parvate, Andrew Zhong, Lars Schnyder, Kiril Vidimce, Lei Zhang, and Ashesh Jain. "Leveraging Early Sensor Fusion for Safer Autonomous Vehicles." Lyft Level 5 blog post.
[8] Sidhu, Sammy, Qiangui (Jack) Huang, and Ray Gao. "How Lyft Uses PyTorch to Power Machine Learning for Their Self-Driving Cars." Blog post.
[9] Lyft Level 5 AV Dataset, 2020.
[10] "Automatic Mixed Precision package - torch.cuda.amp." PyTorch documentation.