Modeling Drivable Areas for Autonomous Vehicles with Real and Synthetic Data
Autonomous vehicles need to understand where they can and can't drive. In this article, we'll dig into modeling for this use case with the help of Weights & Biases.
One of the main components of autonomous vehicles is understanding the world and its surroundings. We accomplish this with a combination of perception tasks like object detection, motion detection, semantic segmentation, and identifying drivable areas.
Below, you'll learn how you can use W&B to help train these drivable area models, specifically by leveraging real and simulated data to improve model performance. If you want to dive straight into the code, just click the link below to check out our GitHub repository.
Dive into the GitHub code
Problem Formulation
We'll use semantic segmentation to label each pixel in the image as drivable or not (basically: is this pixel part of a drivable road?).
- We'll start by training a simple image segmentation model on a natural dataset. This will be our baseline.
- However, gathering real-world data is extremely hard and expensive, so we'll give our model a head start by leveraging simulated/synthetic data. We'll pretrain on synthetic data and then domain adapt to the real-world data.
Through these experiments, we want to see whether synthetic datasets can be helpful for drivable-area segmentation (or any AV task).
Dataset Overview
Using the BDD100K Dataset
Since we don't have a fleet of autonomous vehicles at our disposal, we'll be using BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning to get us started. This is a large, diverse dataset that contains 100,000 driving videos collected from more than 50,000 rides in a variety of weather and lighting conditions. Each video is 40 seconds long at 30 fps, which amounts to more than 100 million frames in total!
For our project, we'll use the 100K still images taken at the 10th second of each video. The train, validation, and test splits are the same as for the full video set. As shown below, the labels (masks) for drivable segmentation only come with train and validation splits, so we've made our own split.
- bdd100k
  - labels
    - drivable
      - masks
        - train
        - val
For the experiments in this report, this is what we will do (see the sketch after this list):
- Randomly sample 30k images from the 70k train images. This is our train set.
- Randomly sample 10k images from the remaining 40k images. This is our validation set.
- The official validation set with 10k images is our hold-out test set.
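In code, making this split might look something like the sketch below. The image paths and random seed are assumptions for illustration, not the repo's exact implementation:

import glob
import random

random.seed(42)

# ~70k official BDD100K train images (path layout is an assumption)
all_train = sorted(glob.glob("bdd100k/images/100k/train/*.jpg"))
random.shuffle(all_train)

our_train = all_train[:30000]       # 30k images: our train set
our_val = all_train[30000:40000]    # 10k images: our validation set
# The official 10k validation split is kept aside as our hold-out test set.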
Shown below is our 30K training set logged as a W&B Table. Our training script will automatically download the dataset (all splits) from these W&B Tables and prepare the dataloaders.
This dataset contains three types of annotations:
- Road: Regions of the image that are drivable.
- Alternate: Regions of the image where there is a road, but it can't be driven on (like a lane with traffic in the opposite direction).
- Background: Regions where there is no road and hence are not suitable for driving.
Let's quickly investigate the class imbalance in our split of the dataset.
As seen in the panel below, we have a lot of pixels corresponding to "background" while only a handful of pixels are associated with "drivable" road. This is expected as there's generally going to be a lot more background in a given image than an actual road.
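If you'd like to reproduce this check yourself outside of the W&B panel, a quick way is to count label values over a sample of masks. The mask path, and the assumption that the masks are single-channel label images with three label values, are ours, so adjust as needed:

import glob

import numpy as np
from PIL import Image

counts = np.zeros(3, dtype=np.int64)
# Count pixels over a sample of masks for speed; drop the slice to scan everything.
for path in sorted(glob.glob("bdd100k/labels/drivable/masks/train/*.png"))[:1000]:
    mask = np.array(Image.open(path))
    counts += np.bincount(mask.ravel(), minlength=3)[:3]

print("Pixel fraction per class:", counts / counts.sum())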
Synthetic Data
Collecting real-world data and annotating it is, to put it mildly, rather expensive. To help with this, we can leverage synthetic data to pre-train the model to learn useful concepts like "where is the road?" and, conversely, "what can't we drive on?"
For this project's synthetic data, we'll use still images annotated for semantic segmentation (19 classes) from the popular game Grand Theft Auto 5. The GTA5 dataset contains 24,966 synthetic images captured from the car's perspective in the streets of American-style virtual cities.
Since we're only interested in where the road is in a given image, we reduced the multi-class mask (19 classes) to a binary mask, where "1" marks road pixels and "0" everything else. Since we're pretraining with supervision, we randomly sampled the data to get 20k training and 5k validation images.
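The mask reduction itself is a one-liner. Here's a sketch, assuming a Cityscapes-style 19-class label map where "road" is class 0 (double-check the class index against the annotation files you download):

import numpy as np

ROAD_CLASS_ID = 0  # assumption: index of "road" in the 19-class label map

def to_binary_mask(mask: np.ndarray) -> np.ndarray:
    """Return a mask that is 1 where the road is and 0 otherwise."""
    return (mask == ROAD_CLASS_ID).astype(np.uint8)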
NOTE 1: Pretrain using the multi-class annotations and see if it improves the downstream domain adaptation.
NOTE 2: Random sampling is not a great strategy. Try stratified sampling to see if it helps the downstream domain adaptation.
Shown below is the GTA5 dataset as a W&B Table. Click on an image, then use the ">" button to step through the samples and toggle the masks.
Model
Now that we have reviewed our dataset(s), what model are we using to predict the drivable areas?
For simplicity's sake, we'll use a vanilla UNET model from the Keras tutorial "Image segmentation with a U-Net-like architecture" by François Chollet.
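For reference, here's a compact U-Net-style model in Keras, in the spirit of that tutorial. It's a sketch rather than the exact architecture from our repo; the input size and the three output classes (background, drivable, alternate) are assumptions:

import tensorflow as tf
from tensorflow.keras import layers

def build_unet(img_size=(160, 160), num_classes=3):
    inputs = tf.keras.Input(shape=img_size + (3,))

    # Downsampling path: keep the feature maps for skip connections.
    skips = []
    x = inputs
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)

    # Bottleneck
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)

    # Upsampling path with skip connections.
    for filters, skip in zip((128, 64, 32), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    # Per-pixel class probabilities.
    outputs = layers.Conv2D(num_classes, 1, padding="same", activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_unet()
model.summary()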
In your wandb.init call, pass sync_tensorboard=True to host your TensorBoard on the Weights & Biases run page. This way, you get the goodness of both W&B and TensorBoard.
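For example (the project name here is illustrative):

import wandb

wandb.init(project="drivable-segmentation", sync_tensorboard=True)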
Here's a quick screenshot of our TensorBoard. If you'd like to view the TensorBoard hosted in the baseline W&B run, go to our run page to visualize the model graph.

Setup
Let's quickly visit our GitHub repo and set it up for our experiments.
# Install the repo
%cd av-segmentation
!pip install -e .

# Install the dependencies
!pip install -r requirements.txt

# Authenticate your machine to use Weights & Biases
# This will ask for your W&B authorization key. Visit wandb.ai/authorize.
!wandb login
In our experiments, we are using:
- A single V100 GPU
- TensorFlow 2.9.x
Baseline
Train
We will train our baseline UNET model by running this line of code:
!python train_drivable.py --config configs/baseline.py --wandb --log_model --log_eval
- The training script will automatically fetch the data from the W&B tables.
- Everything is stitched together using config files located in the configs directory.
- Use --wandb to log the metrics to Weights & Biases. Check out the callback at drivable/callbacks/metric_logger.py.
- The --log_model uses tf.keras.callbacks.ModelCheckpoint for model checkpointing. The callback is modified to log the models to W&B Artifacts for version control. Check out the callback at drivable/callbacks/model_checkpoint.py.
- The --log_eval callback logs the model prediction per epoch as W&B Tables. Use this to debug your model better. The callback can be found at drivable/callbacks/wandb_eval_callback.py.
We'll soon see the usefulness of using these flags.
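As a rough idea of what the metric-logging piece does, here's a minimal sketch of a Keras callback that pushes epoch-level metrics to W&B. This is an illustration under our own assumptions, not the actual code in drivable/callbacks/metric_logger.py:

import tensorflow as tf
import wandb

class WandbMetricLogger(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Push every Keras metric (loss, accuracy, IoU, ...) to the W&B run.
        wandb.log({**(logs or {}), "epoch": epoch})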
These features are instrumented in the GitHub repository, and you can use it as inspiration for instrumenting your own AV pipelines.
We'll use cross-entropy as our loss function and evaluate with accuracy and IoU. (It's worth noting that accuracy is not the best metric for segmentation tasks.)
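A sketch of this loss/metric setup is shown below, assuming model is the U-Net from the earlier sketch (or any Keras segmentation model), integer-encoded masks, and an SGD optimizer with illustrative hyperparameters; the repo's exact configuration may differ:

import tensorflow as tf

class SparseMeanIoU(tf.keras.metrics.MeanIoU):
    """MeanIoU that accepts per-pixel class probabilities as predictions."""

    def update_state(self, y_true, y_pred, sample_weight=None):
        return super().update_state(y_true, tf.argmax(y_pred, axis=-1), sample_weight)

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy", SparseMeanIoU(num_classes=3, name="mean_iou")],
)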
The --wandb flag results in the metrics logged as shown below.
Our model checkpointing callback uses save_best=True and monitors the val_loss metric. You can control the arguments from the configs/baseline.py file.
Below are all the model checkpoints saved as a W&B artifact in the SavedModel format. We can select the best model from the versions based on a metric of our choice.
Test
Model version 7 has the lowest val_loss, so we'll use it to test on our hold-out test set. To do so, run the code snippet as shown below:
!python test_drivable.py --config configs/baseline.py --model_artifact av-team/drivable-segmentation/run_1krthafi_model:v7 --log_eval --wandb
- The --model_artifact flag takes the full name of the model artifact. In our case, it is av-team/drivable-segmentation/run_1krthafi_model:v7. You can find it in the overview section of the logged model artifact (Full Name), as shown below.
- The --log_eval flag logs the model predictions on the hold-out test set. Pass the --wandb flag as well, since it's required for --log_eval.
The scalar charts shown below are the metrics generated by our evaluation job.
Metrics are all well and good, but they don't easily translate into anything physically meaningful. We need better tooling for visualizing our model's predictions, so let's visualize them on the hold-out test set. Again, click the ">" button to check the predictions on different samples.
Observations
- Generally speaking, the model can predict the drivable areas, but there's plenty of room for improvement.
- The areas are not well defined: the boundary between drivable and alternate roads is unclear in some samples. The low IoU reflects this.
- The model separates drivable from alternate roads well on empty roads; it has a harder time when more concepts (buildings, vehicles, etc.) appear in the frame.
- The model struggles with two-way roads where the lanes are not marked with lane markings or dividers.
- The model generally handles both day and night conditions, though there's room for improvement; in some cases it fails badly. Augmentations could help here.
To summarize, it's a decent baseline. A few ways to improve the model would be:
- Improve the architecture: UNET is no longer the state of the art for segmentation tasks.
- Use a better loss function, like focal loss, to down-weight easy-to-segment classes.
- Use augmentations.
Baseline with Sample Weights
Train
The obvious next step is to use sample weights to help the model deal with class imbalance. For a classification task, you can simply pass class weights to the class_weight argument of Model.fit. This is useful for telling the model to "pay more attention" to samples from an under-represented class.
Similarly, a segmentation problem can be treated as a per-pixel classification problem, and you can deal with the imbalance by weighting the loss function to account for it.
However, the class_weight argument does not support targets with three or more dimensions, so you have to implement the weights yourself. In addition to (image, mask) pairs, Model.fit also accepts (image, mask, sample_weight) triples, which means you can modify the dataloader to compute the sample_weight and return it along with the image and mask. In our code, all you need to do is set use_sample_weight=True in the baseline.py config file.
Check out the drivable/data/drivable_dataloader.py to learn more about the weighting strategy used.
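Here's a minimal sketch of the (image, mask, sample_weight) pattern with tf.data. The class weights below are made up for illustration; see the dataloader linked above for the weighting strategy the repo actually uses:

import tensorflow as tf

def add_sample_weights(image, mask):
    # One weight per class (background, drivable, alternate); hypothetical values.
    class_weights = tf.constant([0.5, 2.0, 2.0])
    class_weights = class_weights / tf.reduce_sum(class_weights)
    # Look up the weight of each pixel from its class index.
    sample_weight = tf.gather(class_weights, indices=tf.cast(mask, tf.int32))
    return image, mask, sample_weight

# train_ds yields (image, mask) pairs; Model.fit treats the third element
# as per-pixel sample weights.
# train_ds = train_ds.map(add_sample_weights)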
Let's compare metrics with the baseline.
Test
We can see in the panels below that training with the baseline configuration plus sample weights gives better IoU. The slightly lower accuracy is to be expected, given the class imbalance discussed above.
Hyperparameter Search
According to the plots above, our model is performing reasonably well, but let's try to improve it further with better hyperparameters. How? Sweeps! W&B Sweeps provide a great way to perform a hyperparameter search.
In this case, we will be using it to minimize the loss on the validation set while training a UNet model from scratch on the BDD100K dataset. Note that we are not using sample weights while running this sweep.
We ran the sweep to select the best dropout rate and initial learning rate and, since we used SGD as our optimizer, to find the best momentum as well.
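A sweep configuration for this search might look roughly like the sketch below. The search method, parameter names, and ranges here are assumptions for illustration:

import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "dropout_rate": {"values": [0.2, 0.4, 0.6, 0.8]},
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-4, "max": 1e-1},
        "momentum": {"values": [0.5, 0.8, 0.9, 0.99]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="drivable-segmentation")
# wandb.agent(sweep_id, function=train_fn)  # train_fn wraps a single training run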
The Parallel Coordinates plot below shows how these hyperparameters affect the validation loss.
Observations
- Dropout has little effect on the validation loss, though lower dropout rates (<0.6) were still favorable.
- Interestingly, a larger learning rate gave the lowest validation loss; the parameter importance plot, where learning rate and val_loss are negatively correlated, points to the same conclusion. This is counter-intuitive and calls for a better learning rate decay schedule.
- Larger momentum worked better, which is to be expected.
Pretraining on GTA5
In our story, we don't have access to more BDD100K samples, and acquiring and annotating more data is expensive. We thus turn to the synthetic dataset and set up a pretraining task, with the aim that the model will learn concepts like "where is the road?" and "what is not a road?".
The GTA 5 dataset we used for pretraining has annotations for 19 classes, but we preprocessed it to a binary mask where "0" is the background while "1" is the road.
Run the command below to automatically download the dataset and pretrain the UNET model on it.
!python pretrain_drivable.py --config configs/baseline.py --wandb --log_model --log_eval
Let's quickly look at our metrics.
The trained model is saved as a W&B artifact, which we can use for the downstream domain adaptation. Before that, let's look at the model's performance.
Observations
The model predicts the roads and background well; however, it sometimes fails to distinguish sidewalks from roads.
It will be interesting to see whether our baseline performance improves when we domain adapt this model to the BDD100K dataset.
Domain Adaptation
Domain adaptation is a standard practice in machine learning, where you fine-tune a pretrained model on your dataset. In our case, we call it domain adaptation because synthetic/simulated data and real-world data are different domains.
Train
We select the best model from the pretraining stage and pass the artifact URL to our training script like this:
!python train_drivable.py --config configs/baseline.py --finetune av-team/drivable-segmentation/run_13va9unr_model:v12 --wandb --log_model --log_eval
Let's see if we can leverage our pretrained model to domain adapt to the BDD100K dataset (our split). While fine-tuning, we usually use a different set of hyperparameters, as shown below (with a sketch of what the fine-tuning step does under the hood after the list):
- The batch size was changed to ensure the model fits in the memory.
- We are fine-tuning for only five epochs since it should converge quicker. We will look at the results soon.
- We have lowered the learning rate to 0.0001.
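Under the hood, fine-tuning roughly amounts to pulling the pretrained checkpoint from W&B Artifacts and continuing training with the lower learning rate. Here's a sketch of that flow; the artifact name is the one passed above, while the optimizer and loss choices are assumptions rather than the repo's exact code:

import tensorflow as tf
import wandb

run = wandb.init(project="drivable-segmentation", job_type="finetune")
artifact = run.use_artifact("av-team/drivable-segmentation/run_13va9unr_model:v12")
model_dir = artifact.download()

model = tf.keras.models.load_model(model_dir)  # checkpoints are stored in SavedModel format
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=val_ds, epochs=5)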
Let's look at our results:
Test
So how does the performance compare to the baseline models?
Looking at the panels above, we can see that the model pretrained on synthetic data and then domain adapted learned faster and ended up with better performance than the model trained directly on the BDD100K dataset. Let's now look at the segmentation masks this model predicts on our test set.
Conclusion
This report looked at how you can tackle the drivable area segmentation problem while working on autonomous vehicles. We formulated the problem as a specific case of the popular semantic segmentation problem.
We took two datasets: one composed of real-world images and the other containing synthetic images from the popular game Grand Theft Auto 5. We performed EDA on these datasets and visualized the ground-truth masks using W&B Tables.
Our starting point was training a simple model on the real-world images, followed by a search for better hyperparameters using W&B Sweeps to minimize the validation loss. We saw that the model performed reasonably well.
The next step was to pretrain a model with the same architecture on the synthetic data and track its evolution using W&B Artifacts. For the domain adaptation step, we pulled the best model from the previous stage, fine-tuned it on the real-world data, and found that it learned faster, saved training cost, and actually performed better! To train these models yourself, check out this repository on GitHub.
Related Reading
The Berkeley Deep Drive (BDD100K) Dataset
The BDD100K dataset is the largest and most diverse driving video dataset with 100,000 videos annotated for 10 different perception tasks in autonomous driving.
The ML Tasks Of Autonomous Vehicle Development
This report goes through the different tasks in the autonomous vehicle development lifecycle and the various machine learning techniques associated with them.
The Many Datasets of Autonomous Driving
Below we'll explore the datasets used to train autonomous driving systems to perform the various tasks required of them.
A System of Record for Autonomous Driving Machine Learning Models
A look at the most useful Weights & Biases features for autonomous vehicle companies