Lyft’s High-Capacity End-to-End Camera-Lidar Fusion for 3D Detection
Autonomous vehicle perception depends on complementary sensor modalities: camera, lidar, and radar.
End-to-End Fusion Performs Better Than Two-Stage Fusion
- Image and lidar features are fused inside the model, and the fusion is learned end-to-end during training. This means fewer heuristics and hyperparameters, and it gives the network the capacity to learn more complex mappings from input to output.
- End-to-end fusion (EEF) models can be trained to be fault-tolerant against missing data in either modality through data dropout during training (see the sketch after this list). In comparison, it is harder for two-stage models to achieve the same level of fault tolerance.
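To make this concrete, here is a minimal PyTorch sketch of a learned fusion block with modality dropout. The module structure, feature dimensions, and dropout placement are illustrative assumptions, not our actual architecture.

```python
# A minimal sketch of learned camera-lidar fusion with modality dropout.
# Assumes per-point image features have already been gathered by projecting
# lidar points into the camera views; all names here are hypothetical.
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    """Fuses per-point image features into lidar point features.

    The fusion weights are ordinary learnable layers, so they are trained
    end-to-end with the detection loss, with no hand-tuned fusion heuristics.
    """

    def __init__(self, lidar_dim: int = 64, image_dim: int = 32):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(lidar_dim + image_dim, lidar_dim),
            nn.ReLU(inplace=True),
        )

    def forward(
        self,
        lidar_feats: torch.Tensor,  # (N_points, lidar_dim)
        image_feats: torch.Tensor,  # (N_points, image_dim)
        drop_image_prob: float = 0.0,
    ) -> torch.Tensor:
        # Modality dropout: occasionally zero the image branch during
        # training so the network learns to detect from lidar alone,
        # which makes it fault-tolerant to a missing camera at test time.
        if self.training and torch.rand(()).item() < drop_image_prob:
            image_feats = torch.zeros_like(image_feats)
        return self.fuse(torch.cat([lidar_feats, image_feats], dim=-1))
```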
In the following figure, you can see that an end-to-end fusion model gives a much higher AP@0.5 than a two-stage fusion model for pedestrians on our in-house dataset. We observed the same trend for the other classes (cars and cyclists) as well. For simplicity, we only compare AP for pedestrians in the remaining sections of this report.
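For reference, AP@0.5 denotes average precision with detections matched to ground truth at an IoU threshold of 0.5. The sketch below computes AP from precomputed matches using all-point interpolation; it is a generic illustration, not necessarily our in-house evaluator.

```python
# A generic sketch of average precision (AP) given detections already
# matched to ground truth at IoU >= 0.5; assumes all-point interpolation.
import numpy as np


def average_precision(scores: np.ndarray, is_tp: np.ndarray,
                      num_gt: int) -> float:
    """scores: detection confidences; is_tp: 1 if matched to a GT box."""
    order = np.argsort(-scores)                 # rank by confidence
    tp = np.cumsum(is_tp[order].astype(float))
    fp = np.cumsum(1.0 - is_tp[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # Precision envelope: make precision monotonically non-increasing.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Integrate precision over recall.
    deltas = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(deltas * precision))
```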
End-to-End Fusion Performs Better Than a Lidar-Only Model
- Lidar-only: a model trained on lidar input alone, with no fusion
- End-to-end fusion: end-to-end learned fusion of image and lidar inputs
Balancing Model Capacity & Overfitting
Higher Model Capacity = Boosted Accuracy
Tackling Overfitting with Data Augmentations
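As one concrete example, the sketch below applies common global lidar augmentations (random flip, rotation around the up axis, and scaling). These are standard techniques for 3D detection; the exact augmentation set and parameter ranges here are assumptions for illustration.

```python
# A minimal sketch of common global lidar augmentations; the coordinate
# convention (x forward, y left, z up) and the ranges are assumptions.
import numpy as np


def augment_points(points: np.ndarray) -> np.ndarray:
    """Randomly augment an (N, 3+) point cloud; extra columns pass through."""
    out = points.copy()

    # Random left/right mirror flip.
    if np.random.rand() < 0.5:
        out[:, 1] = -out[:, 1]

    # Random global rotation around the z (up) axis.
    theta = np.random.uniform(-np.pi / 4, np.pi / 4)
    c, s = np.cos(theta), np.sin(theta)
    out[:, :2] = out[:, :2] @ np.array([[c, -s], [s, c]]).T

    # Random global scaling.
    out[:, :3] *= np.random.uniform(0.95, 1.05)
    return out
```

Note that for camera-lidar fusion, geometric lidar augmentations must be kept consistent with the camera projection (or the image features re-gathered after augmentation), which constrains which augmentations are usable.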
Faster Training: Iteration Speed Matters
Faster Training with Image Dropout and Half-Res Images
- Compute cost is large due to the model's high capacity.
- The model's GPU memory footprint is large, so it is more prone to running out of GPU memory when using a large batch size during training. Being forced to use a smaller batch size means slower training.
- Data loading takes time given the large input (6x full-resolution camera images + a lidar spin; more details can be found in our Lyft Level 5 Open Dataset).
- 6 cam, full-res: Baseline EEF.
- 2 cam, full-res: We used random image dropout during training: we randomly dropped 4 of the 6 camera images. This reduced compute and data loading time and led to 2.3x faster training. In addition, the dropout helped regularization, and likely because of this, we observed better accuracy (shown below: +1.5% AP@0.5 for pedestrians).
- 2 cam, half-res: We further reduced compute by using half-resolution images. This gave an additional 2x training speedup over full-res images without much accuracy regression (see the sketch after this list).
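Here is a hedged sketch of these two speedups in the data pipeline: randomly keeping 2 of the 6 camera images per sample and downsampling them to half resolution. The function name, tensor layout, and interpolation mode are illustrative assumptions.

```python
# A sketch of random image dropout (keep 2 of 6 cameras) plus half-res
# downsampling; names and the bilinear interpolation choice are assumptions.
import random
import torch
import torch.nn.functional as F


def sample_camera_images(images, num_keep=2, half_res=True):
    """images: list of six (C, H, W) camera tensors for one training sample."""
    kept = random.sample(images, k=num_keep)  # drop 4 of the 6 cameras
    if half_res:
        # Halving H and W cuts per-image compute and loading cost roughly 4x.
        kept = [
            F.interpolate(img.unsqueeze(0), scale_factor=0.5,
                          mode="bilinear", align_corners=False).squeeze(0)
            for img in kept
        ]
    return kept
```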
- We demonstrated that high-capacity end-to-end fusion performs better than two-stage fusion or lidar-only models for 3D detection.
- We demonstrated our workflow for training high-capacity models: reducing overfitting while increasing model capacity, and maintaining fast iteration speed.