
3D Perception in Carla Simulator

Collecting data from Carla simulation and training 3D perception models on the gathered dataset.
Created on September 1 | Last edited on January 9
SFA3D model predictions on one of the samples from the test set.


Overview

Extracting precise 3D object information is one of the prime goals of comprehensive scene understanding. However, labeling errors are common in current open-source 3D perception datasets, and they can have serious consequences for the models trained on them. To tackle this issue, we use Carlafox to automatically generate an error-free synthetic dataset for 3D perception.
Deep 3D object detectors can be confused during training by the inherent ambiguity in ground-truth 3D bounding box annotations caused by occlusions, missing objects, or manual annotation errors, which lowers detection accuracy. Existing methods largely overlook these issues and treat the labels as deterministic. A virtual simulation with known labels makes it possible to create enormous datasets at negligible cost. Research has shown that combining simulated and real data helps models become more accurate, and our results show that simulated data can significantly reduce the amount of real training data required to reach satisfactory accuracy.

Key Takeaways

  • Simulated data is becoming more crucial than ever in autonomous driving applications, both for testing pre-trained models and for developing new models.
  • It is imperative that the underlying dataset contains a variety of driving scenarios and that the simulated sensor readings closely resemble real-world sensors for the neural network models to generalize to real-world applications.
  • Carlafox can export high-quality, synchronized LiDAR and camera data with object annotations, and offers configuration options to accurately reflect a real-life sensor array.
  • We use the Carlafox tool to generate a dataset of more than 10,000 samples and train SFA3D, a fast open-source 3D object detection neural network, on it.
  • For testing, we integrate the model back into Carlafox and visualize it against the ground truth data from the simulator.

Data Collection

Carlafox, a web-based CARLA visualizer, substantially simplifies the otherwise arduous task of synthetic dataset generation for 3D object detection. We use Carlafox to set up sensor configurations, create diverse weather conditions, and generate data from different maps in the KITTI format. One advantage of the dataset is that the open-source CARLA simulator is used to recreate the same LiDAR and camera configuration that was used to record the original KITTI data.
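To illustrate what such a configuration looks like under the hood, here is a hedged sketch using the CARLA Python API. The attribute values are assumptions meant to approximate the KITTI rig, and Carlafox sets all of this up for you through its UI.

```python
import carla

# Connect to a running CARLA server (default port); assumes a simulator is up.
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()
blueprints = world.get_blueprint_library()

# Spawn an ego vehicle at the first free spawn point.
vehicle_bp = blueprints.filter("vehicle.tesla.model3")[0]
spawn_point = world.get_map().get_spawn_points()[0]
ego = world.spawn_actor(vehicle_bp, spawn_point)

# RGB camera roughly matching KITTI's image resolution (values are illustrative).
cam_bp = blueprints.find("sensor.camera.rgb")
cam_bp.set_attribute("image_size_x", "1242")
cam_bp.set_attribute("image_size_y", "375")
cam_bp.set_attribute("fov", "82")
camera = world.spawn_actor(cam_bp, carla.Transform(carla.Location(x=0.3, z=1.7)), attach_to=ego)

# 64-channel LiDAR approximating the HDL-64E used to record KITTI.
lidar_bp = blueprints.find("sensor.lidar.ray_cast")
lidar_bp.set_attribute("channels", "64")
lidar_bp.set_attribute("range", "120")
lidar_bp.set_attribute("rotation_frequency", "10")
lidar_bp.set_attribute("points_per_second", "1300000")
lidar_bp.set_attribute("upper_fov", "2.0")
lidar_bp.set_attribute("lower_fov", "-24.8")
lidar = world.spawn_actor(lidar_bp, carla.Transform(carla.Location(x=0.0, z=1.73)), attach_to=ego)

# Register callbacks that dump synchronized frames to a KITTI-like folder layout.
camera.listen(lambda image: image.save_to_disk("image_2/%06d.png" % image.frame))
lidar.listen(lambda scan: scan.save_to_disk("velodyne/%06d.ply" % scan.frame))
```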
Bounding boxes projected into LiDAR space with calibration data.
The objective is to offer a challenging dataset to assess and enhance approaches to complicated vision tasks such as 3D object detection. In total, the dataset contains 12,807 cars, 10,252 pedestrians, and 11,624 cyclists. It provides 2D and 3D bounding box annotations for the classes Car, Pedestrian, and Cyclist, both LiDAR and camera sensor data, and the corresponding sensor calibration matrices.
2D bounding box ground-truth annotations with labels and occlusion states. Occlusion state 0 (green) means the object is fully visible, 1 (yellow) partially occluded, and 2 (red) heavily occluded.
3D bounding box ground truth annotations.
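As a reference for how the exported calibration files are used, here is a minimal sketch (file paths and function names are our own; the matrix names follow the KITTI convention) that projects LiDAR points into the camera image with the P2, R0_rect, and Tr_velo_to_cam matrices.

```python
import numpy as np

def load_kitti_calib(path):
    """Parse a KITTI-style calib file into the three matrices needed for projection."""
    calib = {}
    with open(path) as f:
        for line in f:
            if ":" not in line:
                continue
            key, values = line.split(":", 1)
            calib[key] = np.array([float(v) for v in values.split()])
    P2 = calib["P2"].reshape(3, 4)                                   # camera projection matrix
    R0 = np.eye(4); R0[:3, :3] = calib["R0_rect"].reshape(3, 3)      # rectification
    Tr = np.eye(4); Tr[:3, :4] = calib["Tr_velo_to_cam"].reshape(3, 4)  # LiDAR -> camera
    return P2, R0, Tr

def project_velo_to_image(points_velo, P2, R0, Tr):
    """Project Nx3 LiDAR points into pixel coordinates (Mx2, points behind the camera dropped)."""
    pts = np.hstack([points_velo, np.ones((points_velo.shape[0], 1))])  # homogeneous coords
    cam = (P2 @ R0 @ Tr @ pts.T).T        # Nx3 image-plane homogeneous coordinates
    cam = cam[cam[:, 2] > 0]              # keep points in front of the camera
    return cam[:, :2] / cam[:, 2:3]       # perspective divide -> pixel (u, v)
```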

Training 3D Perception models on the CARLA dataset

Due to its numerous applications across various industries, including robotics and autonomous driving, 3D object detection has been gaining attention from both industry and academia. LiDAR sensors are commonly used in robotics and autonomous vehicles to capture 3D scene data as sparse, irregular point clouds, which have been shown to serve as helpful cues for 3D scene perception and comprehension.
We trained several LiDAR-based networks, namely PointRCNN, PV-RCNN, and SFA3D, as well as a multimodal (RGB + LiDAR) 3D object detection model, MVXNet, on the CARLA synthetic dataset, but fine-tuned only SFA3D, mainly because it is faster and uses less memory without a large loss in performance. Any of the other models could have performed better if it had been optimized and tuned beyond a baseline training run, as shown in the following panel.
Below is a comparison of all the models we trained on the CARLA synthetic dataset. Since the data was collected in the KITTI format, evaluation was done with the KITTI evaluation scripts. The KITTI difficulty levels are defined as follows (a small sketch of how these thresholds map to a difficulty label follows the list):
  • Easy: Min. bounding box height: 40 px, Max. occlusion level: Fully visible, Max. truncation: 15 %
  • Moderate: Min. bounding box height: 25 px, Max. occlusion level: Partly occluded, Max. truncation: 30 %
  • Hard: Min. bounding box height: 25 px, Max. occlusion level: Difficult to see, Max. truncation: 50 %
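
For illustration only, not the official KITTI devkit code, the sketch below shows how a single annotation could be bucketed into these difficulty levels from its 2D box height, occlusion flag, and truncation:

```python
def kitti_difficulty(bbox_height_px: float, occlusion: int, truncation: float) -> str:
    """Map one annotation to a KITTI difficulty bucket.

    occlusion: 0 = fully visible, 1 = partly occluded, 2 = difficult to see.
    truncation: fraction of the object outside the image, in [0, 1].
    Returns "easy", "moderate", "hard", or "ignored" if even the hard thresholds fail.
    """
    if bbox_height_px >= 40 and occlusion <= 0 and truncation <= 0.15:
        return "easy"
    if bbox_height_px >= 25 and occlusion <= 1 and truncation <= 0.30:
        return "moderate"
    if bbox_height_px >= 25 and occlusion <= 2 and truncation <= 0.50:
        return "hard"
    return "ignored"
```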

[W&B panel: KITTI evaluation comparison of the trained models (run set of 5)]


A closer look at SFA3D

SFA3D (Super Fast and Accurate 3D object detection) operates on 3D LiDAR point clouds. The backbone of the detector is a ResNet-based Keypoint Feature Pyramid Network (KFPN), originally proposed in RTM3D.
The model takes a bird's-eye-view (BEV) map as input, encoded from the height, intensity, and density of the 3D LiDAR point cloud. As output, it predicts a heatmap for the main center, the center offset, the heading angle, the object dimensions, and the z coordinate.
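Here is a minimal sketch of this kind of BEV encoding; the grid size, ranges, and normalization are assumptions for illustration and SFA3D's own preprocessing may differ in details.

```python
import numpy as np

def make_bev_map(points, x_range=(0, 50), y_range=(-25, 25), z_range=(-2.73, 1.27), size=608):
    """Encode an Nx4 point cloud (x, y, z, intensity) into a 3-channel BEV map.

    Channel 0: normalized max height, channel 1: intensity of the highest point,
    channel 2: log-normalized point density per cell.
    """
    # Keep only points inside the region of interest.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts = points[mask]

    # Discretize x/y into grid cells.
    res_x = (x_range[1] - x_range[0]) / size
    res_y = (y_range[1] - y_range[0]) / size
    xi = ((pts[:, 0] - x_range[0]) / res_x).astype(np.int32).clip(0, size - 1)
    yi = ((pts[:, 1] - y_range[0]) / res_y).astype(np.int32).clip(0, size - 1)

    bev = np.zeros((3, size, size), dtype=np.float32)
    counts = np.zeros((size, size), dtype=np.float32)
    z_span = z_range[1] - z_range[0]
    for x, y, p in zip(xi, yi, pts):
        height = (p[2] - z_range[0]) / z_span
        if height > bev[0, y, x]:           # keep the highest point per cell
            bev[0, y, x] = height
            bev[1, y, x] = p[3]             # intensity of that highest point
        counts[y, x] += 1
    bev[2] = np.minimum(1.0, np.log(counts + 1) / np.log(64))  # density channel
    return bev
```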
SFA3D test prediction projected on RGB image (above) and projected in LiDAR space (below)
As for the loss functions, focal loss is used for the main center heatmap and an L1 loss for the heading angle (yaw). A balanced L1 loss is used for the z coordinate and the three dimensions (height, width, and length). We trained the model for 300 epochs with equal weights for these loss components, using a cosine LR scheduler with an initial learning rate of 0.001 and a batch size of 32 (on two RTX 2080 Ti GPUs). Refer to the wandb panels below for results of the SFA3D experiments on all towns and on Town01, respectively.
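As a minimal sketch of the training setup described above (the loss functions and the training loop body are stand-ins, not SFA3D's actual implementations):

```python
import torch
import torch.nn as nn

def total_loss(pred, target, focal_loss, l1_loss, balanced_l1_loss):
    """Equal-weighted sum of the loss components described above (all weights set to 1)."""
    return (focal_loss(pred["heatmap"], target["heatmap"])      # main center heatmap
            + l1_loss(pred["offset"], target["offset"])         # center offset
            + l1_loss(pred["yaw"], target["yaw"])               # heading angle
            + balanced_l1_loss(pred["z"], target["z"])          # z coordinate
            + balanced_l1_loss(pred["dim"], target["dim"]))     # height, width, length

# Stand-in model so the snippet is self-contained; the SFA3D network goes here.
model = nn.Conv2d(3, 8, kernel_size=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Cosine learning-rate schedule over 300 epochs, starting at 1e-3.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... one training pass over the dataset would go here ...
    scheduler.step()
```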
  • SFA3D results on the CARLA dataset with samples from all CARLA towns.

[W&B panel: SFA3D results on all CARLA towns (run set of 2)]


  • SFA3D results on samples only from CARLA Town01.

[W&B panel: SFA3D results on CARLA Town01 (run set of 2)]



Optimising with TensorRT

TensorRT enables developers to optimize inference by leveraging CUDA libraries. It supports both INT8 and FP16 post-training quantization, which greatly reduces application latency, a requirement for many real-time services as well as autonomous and embedded applications.
As a first step, we convert the SFA3D PyTorch model to ONNX and use the ONNX parser to convert the ONNX model to TensorRT. We could also bypass the parser and convert directly from PyTorch to TensorRT, but doing so would require rewriting the SFA3D network with the TensorRT network-definition API, which would be time-intensive and yield a negligible speed benefit, although it could be more efficient on an embedded device such as a Jetson Nano.
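A minimal sketch of this conversion path, assuming TensorRT 8.x, a loaded SFA3D model in `model`, and an illustrative single-input/single-output signature (the real network has several output heads):

```python
import torch
import tensorrt as trt

# 1) Export the PyTorch model to ONNX. The dummy input mimics one 3-channel 608x608 BEV map.
model.eval()
dummy_input = torch.randn(1, 3, 608, 608)
torch.onnx.export(model, dummy_input, "sfa3d.onnx", opset_version=11,
                  input_names=["bev_map"], output_names=["outputs"])

# 2) Parse the ONNX file and build a TensorRT engine with FP16 enabled.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("sfa3d.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)            # post-training FP16 quantization
engine_bytes = builder.build_serialized_network(network, config)
with open("sfa3d_fp16.engine", "wb") as f:
    f.write(engine_bytes)
```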
In addition, we examined benchmarks across multiple frameworks, such as TVM and ONNX, to confirm that TensorRT is the best performing option for our setup. The results show that TensorRT yields higher throughput on the same hardware, and quantization to FP16 boosts performance even further. On an RTX 2080 Ti, TensorRT may be the most efficient solution for SFA3D, but another framework, such as Apache TVM, may perform better on a different device with the same or another network; results may therefore vary depending on the hardware.

[W&B panel: inference benchmark across frameworks (run set of 5)]


Testing

To make it easier to compare the model's predictions with CARLA's ground truth, we integrated the model into Carlafox and made its predictions available in a separate Foxglove image panel. For more details on the Carlafox visualizer, please refer to this dedicated blogpost. Try it out for yourself by connecting to a live simulation: we provide a live demo environment with the trained model integrated so you can test the setup quickly.

Outlook

Numerous open-source resources paved the way for us to accomplish this work. In the future, we plan to fine-tune the trained models on the official KITTI dataset. Because of the expense of acquiring real-world data, the use of synthetic data for training machine learning models has grown in popularity in recent years. This is especially true for autonomous driving, given the rigorous requirement of generalizing to a wide range of driving conditions, and we hope our findings help others in research and development.
If you have questions or ideas on how to leverage synthetic data for 3D perception, join us on our Gitter #lounge channel or leave a comment in the comment section.
Comments

Mahmoud: Really great work. I was searching for something like this for a while. Could you share the testing script you used to integrate Carla, Carlafox, SFA3D, and wandb.ai for metrics display? The blogpost link above is broken. Thanks!

Vineet Suryan: Thanks @Mahmoud. It is open-source here => https://github.com/collabora/carlafox