Training Semantic Segmentation Models for Autonomous Vehicles (A Step-by-Step Guide)
A short tutorial on leveraging Weights & Biases to train a semantic segmentation model for autonomous vehicles.
In this report, we'll use our full suite of tooling to train, test, and tune a semantic segmentation model for autonomous vehicles. We'll first select a baseline model and then improve it with Sweeps before a final fine-tuning. We'll also look briefly at using semantic segmentation for depth estimation.
Here's what we'll be covering:
- Dataset Overview
- Learning Objective
- Baseline Experiments
- Edge Cases
- Hyperparameter Tuning
- Key Insights from our Sweep
- Final Training
- Next Steps
  - Improving the Semantic Segmentation Model
  - Semantic Segmentation along with Depth Estimation
- Similar Reports
Alright, let's jump in.
Dataset Overview
We are using the Cambridge-driving Labeled Video Database (CamVid) to train our model. It contains a collection of videos with object-class semantic labels, complete with metadata. The database provides ground-truth labels that associate each pixel with one of 32 semantic classes.
CamVid Class Frequency
We're using Weights & Biases Artifacts, which make it easy and convenient to store and version our datasets. Creating a new version of the dataset and fetching a particular version takes only a couple of lines of code:
# Create an Artifact
with wandb.init() as run:
    artifact = wandb.Artifact('camvid-dataset', type='dataset')
    artifact.add_dir(dataset_path)  # version the whole dataset directory
    run.log_artifact(artifact)

# Fetch a specific Artifact version
with wandb.init() as run:
    artifact = run.use_artifact('camvid-dataset:v0', type='dataset')
    artifact_dir = artifact.download()
We can use Tables in our Weights & Biases workspace to visualize and explore our images and segmentation labels. The table records the number of pixels of each class present in each image, which is useful for filtering images that contain a given class.
Visualization of the CamVid Dataset using Weights & Biases Table
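Here is a minimal sketch of how a table like the one above can be logged. The project name, the samples iterable, and the truncated CLASS_LABELS mapping are illustrative placeholders, not the exact code from our experiments:

import numpy as np
import wandb

# Illustrative subset of the 32 CamVid classes; extend as needed.
CLASS_LABELS = {0: "Sky", 1: "Building", 2: "Road", 3: "Pedestrian"}

with wandb.init(project="camvid-segmentation") as run:
    table = wandb.Table(columns=["image"] + list(CLASS_LABELS.values()))
    # `samples` is assumed to be an iterable of (image, mask) NumPy arrays.
    for image, mask in samples:
        overlay = wandb.Image(image, masks={
            "ground_truth": {"mask_data": mask, "class_labels": CLASS_LABELS},
        })
        # One pixel-count column per class lets us filter for images containing it.
        pixel_counts = [int((mask == c).sum()) for c in CLASS_LABELS]
        table.add_data(overlay, *pixel_counts)
    run.log({"camvid_dataset": table})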
Learning Objective
Our model needs to learn a per-pixel annotation of a scene captured from the point of view of the autonomous agent, categorizing each pixel of a given scene into one of 32 relevant categories such as road, pedestrian, sidewalk, and car, as shown in the gif below. You can click on any of the segmented images in the table shown above to interact with it; click the small arrow next to the classes to expose the entire list.

Baseline Experiments
For our baseline experiments, we decided to use a simple UNet-inspired architecture with ResNet50, VGG19, and MobileNetV2 backbones, which is easy to implement yet quite robust in terms of performance.
We also incorporated the Chained Residual Pooling layer proposed by the creators of the RefineNet architecture, so that our model can capture background context from a large image region by efficiently pooling features with multiple window sizes and fusing them together with residual connections and learnable weights. We performed the baseline experiments with focal loss. Below is a brief summary of our experiments with the baseline models and loss functions.
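For reference, here is a rough PyTorch sketch of the chained residual pooling idea, assuming stride-1 max pooling with 5x5 windows and 3x3 convolutions as the learnable fusion weights; the exact layer sizes in our model may differ:

import torch
import torch.nn as nn

class ChainedResidualPooling(nn.Module):
    """Sketch of RefineNet-style chained residual pooling: a chain of
    stride-1 pooling + conv blocks whose outputs are fused residually."""

    def __init__(self, channels, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                # A large stride-1 pooling window gathers background context
                # without shrinking the feature map.
                nn.MaxPool2d(kernel_size=5, stride=1, padding=2),
                # The conv acts as a learnable weight for this pooled branch.
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            )
            for _ in range(n_blocks)
        ])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(x)
        out, path = x, x
        for block in self.blocks:
            path = block(path)  # pool then convolve the previous branch
            out = out + path    # residual fusion of the pooled features
        return out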
Results from Baseline Experiments
Edge Cases
For safety reasons, these classes will be our priority in training:
- Pedestrian 🚶‍♂️
- Bicyclist 🚴‍♂️
- Child 👶
- Car 🚗
- Heavy Vehicles 🚌
- Traffic Light 🚥
We'll use wandb.Table to log our models' inputs and predictions along with per-class metrics. We log the Dice coefficient per class: it is 1 when a class is segmented perfectly and approaches 0 when the class is missed. We can also filter and sort the tables dynamically to visualize where our models fail.
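For clarity, here is a minimal sketch of how a per-class Dice coefficient can be computed from predicted and ground-truth masks; the class count and epsilon are illustrative:

import numpy as np

def dice_per_class(pred, target, n_classes=32):
    """Per-class Dice: 2|A ∩ B| / (|A| + |B|).
    1 means a perfect match for that class, 0 means it was missed."""
    scores = []
    for c in range(n_classes):
        pred_c, target_c = (pred == c), (target == c)
        intersection = np.logical_and(pred_c, target_c).sum()
        denom = pred_c.sum() + target_c.sum()
        # The score is undefined when the class is absent from both masks.
        scores.append(2.0 * intersection / denom if denom > 0 else float("nan"))
    return scores

Each table row then stores one score per class, so we can sort by, say, the Pedestrian column to surface the worst failures first.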
Let's examine the performance of the baseline models on these classes. We've sorted the table so the worst-performing examples are visible first. Click on any image and toggle through the relevant class predictions.
Analyzing Edge Cases for Baseline Experiments
Note that the models fail to detect the high-priority classes in many of the images, resulting in a large number of edge cases. We added a conditional filter to the tables to show only the cases where the model detected all of the high-priority classes.
Hyperparameter Tuning
To improve on our baseline, we need to select not only the best model but also the best set of hyperparameters to train it with. Although this can be quite a daunting task, Sweeps made it easy for us.
Sweeps make it extremely easy to run a Bayesian hyperparameter search with the goal of minimizing the model's loss on the validation dataset.
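A sweep like ours can be configured in a few lines. The parameter names, ranges, and run count below are illustrative, not our exact search space:

import wandb

sweep_config = {
    "method": "bayes",  # Bayesian search over the space below
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "backbone": {"values": ["resnet34", "resnet50", "vgg19", "mobilenetv2"]},
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-2},
        "weight_decay": {"distribution": "log_uniform_values", "min": 1e-6, "max": 1e-2},
        "batch_size": {"values": [4, 8, 16]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="camvid-segmentation")
# `train` is assumed to be your training function, which reads wandb.config.
wandb.agent(sweep_id, function=train, count=40)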
From the experiments run by the sweep, we can compare the performance of models with various backbones and different sets of hyperparameters, and identify which model performs best on our pre-determined metrics.
Hyperparameter Search using Weights & Biases Sweep
Key Insights from our Sweep
- A lower learning rate and lower weight decay result in better foreground accuracy and Dice scores.
- Batch size and image resize factor correlate strongly and positively with the metrics.
- VGG-based backbones might not be a good option for our final model, since they are prone to vanishing gradients.
- A ResNet34 or ResNet50 backbone should be chosen for the final model, given their strong metrics and faster inference than the other backbones.
Results of Best Experiments from Sweep
Baseline Experiments
Final Training
To finalize the model for this iteration, we decided to both fine-tune and fit-one-cycle UNets with ResNet50 and ResNet34 backbones, using the best set of hyperparameters the sweep found for each backbone.
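Since fit-one-cycle is the fastai training policy, here is a hedged fastai-style sketch of this final stage; `dls` (the CamVid DataLoaders) and the hyperparameter values stand in for the sweep's actual picks:

from fastai.vision.all import *
from fastai.callback.wandb import WandbCallback
import wandb

with wandb.init(project="camvid-segmentation", job_type="final-training"):
    # UNet with a ResNet34 backbone, focal loss, and metrics streamed to W&B.
    learn = unet_learner(dls, resnet34, loss_func=FocalLossFlat(axis=1),
                         cbs=WandbCallback())
    # fine_tune: one frozen epoch, then unfreeze and train with discriminative LRs.
    learn.fine_tune(10, base_lr=1e-4, wd=1e-5)
    # Alternatively, train the whole network with the one-cycle policy:
    # learn.fit_one_cycle(10, lr_max=1e-4, wd=1e-5)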
Final Training Experiments

Final Graph View for the Dataset Artifact
All the code used in our experiments is available on GitHub (with a hat tip to Thomas Capelle for the collaboration).
Next Steps
Improving the Semantic Segmentation Model
- Collect more data, especially for the highest-priority classes, to improve the model's performance on edge cases.
- Experiment with weighted cross-entropy loss to tackle the imbalanced distribution of high-priority classes in our dataset.
- Evaluate the model with a multi-objective loss function: a weighted sum of cross-entropy, focal, and Dice losses (see the sketch after this list).
- Experiment with more recent architectures such as DeepLabV3+, Bilateral Segmentation Network (BiSeNet), and Swin Transformer.
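As a starting point for the multi-objective loss above, here is a hedged PyTorch sketch; the loss weights and focal gamma are illustrative defaults, not tuned values:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiObjectiveLoss(nn.Module):
    """Sketch of a weighted sum of cross-entropy, focal, and Dice losses."""

    def __init__(self, w_ce=1.0, w_focal=1.0, w_dice=1.0, gamma=2.0, eps=1e-7):
        super().__init__()
        self.w_ce, self.w_focal, self.w_dice = w_ce, w_focal, w_dice
        self.gamma, self.eps = gamma, eps

    def forward(self, logits, target):
        # logits: (N, C, H, W) raw scores; target: (N, H, W) class indices.
        ce = F.cross_entropy(logits, target, reduction="none")
        # Focal term down-weights easy pixels: (1 - p_t)^gamma * CE.
        p_t = torch.exp(-ce)
        focal = ((1 - p_t) ** self.gamma * ce).mean()
        # Soft Dice over per-class probabilities and one-hot targets.
        probs = logits.softmax(dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        denom = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = 1 - ((2 * inter + self.eps) / (denom + self.eps)).mean()
        return self.w_ce * ce.mean() + self.w_focal * focal + self.w_dice * dice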
Semantic Segmentation along with Depth Estimation
Our autonomous vehicle agent needs a complete 3D perception of the world surrounding it: besides segmentation, we need a model that can estimate the depth of a given scene. This could be achieved with two separate models for semantic segmentation and depth estimation, but the models are expected to run in real time in production, possibly on an onboard computer with limited computational resources. To overcome this, we chose a model that performs semantic segmentation and depth estimation simultaneously using a single shared backbone. We can use the model in this open-source project as our baseline for this task.
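Conceptually, such a "hydranet" boils down to one encoder feeding several cheap task heads. Below is a hedged PyTorch sketch of the structure; the encoder, channel counts, and head designs are placeholders, not the architecture of the linked project:

import torch
import torch.nn as nn

class SharedBackboneHydraNet(nn.Module):
    """Sketch of a multi-task network: one shared encoder feeds two light
    task-specific heads, one for segmentation and one for depth."""

    def __init__(self, encoder, feat_channels, n_classes=32):
        super().__init__()
        self.encoder = encoder  # shared feature extractor, run once per frame
        self.seg_head = nn.Sequential(
            nn.Conv2d(feat_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, n_classes, 1),  # per-pixel class logits
        )
        self.depth_head = nn.Sequential(
            nn.Conv2d(feat_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1),          # per-pixel depth estimate
        )

    def forward(self, x):
        features = self.encoder(x)  # the expensive computation is shared
        # Head outputs are at feature resolution; upsample to input size as needed.
        return self.seg_head(features), self.depth_head(features)

Because the backbone is computed once per frame, the marginal cost of the extra head is small, which is what makes this design attractive for resource-constrained onboard hardware.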
Hydranet Inference Results on KITTI

Sample 3D Point Clouds reconstructed from the predicted depth maps using Weights & Biases Rich Media Format
Similar Reports
Object Detection for Autonomous Vehicles (A Step-by-Step Guide)
Digging into object detection and perception for autonomous vehicles using YOLOv5 and Weights & Biases
Lyft's High-Capacity End-to-End Camera-Lidar Fusion for 3D Detection
Learn how Lyft Level 5 combines multiple perception sensors in their self-driving automobile research
Scaling Out Motion Prediction for Autonomous Vehicles with L5Kit, Ray, and W&B
In this tutorial, we'll show you how we easily organized and instrumented a prediction model for autonomous vehicle motion with W&B and scaled it out with Ray.
The ML Tasks Of Autonomous Vehicle Development
This report goes through the different tasks in the autonomous vehicle development lifecycle and the various machine learning techniques associated with them.