Semantic Segmentation for Self-Driving Cars

Boris Dayma

Self-driving cars require a deep understanding of their surroundings. To support this, camera frames are used to recognize the road, pedestrians, cars, and sidewalks with pixel-level accuracy. In this project, we develop a neural network and optimize it to perform semantic segmentation on a dataset from Berkeley Deep Drive.

U-net networks are highly efficient for semantic segmentation. A high-level deep learning library is appealing here because it offers built-in features to quickly create a U-net network and replace the contracting path (the left half of the figure below) with a pre-trained network, taking advantage of transfer learning. Several helper functions also conveniently implement advanced policies and training methodologies. This lets us iterate quickly on variants of the architecture and focus on optimization.

Typical U-net architecture

The Berkeley Deep Drive dataset is highly diverse and presents labeled segmentation data from various cars, in multiple cities and in different weather conditions. This diversity is essential for self-driving cars: it helps the model perform well across conditions and reduces overfitting, so less regularization is needed.

When we have too few samples, or when images are too similar to each other, a neural network tends to focus too much on the data it has been trained on and does not generalize well to unseen data (leading to a high validation loss). This can be limited through regularization, but regularization also has a negative impact on the training loss. A richer and more diverse dataset avoids this situation and improves both training and validation losses.

Note that this dataset is not perfect: this type of data is difficult to label.

The segmented data is an image where each class is represented by a different integer. The labeled pixels go from 0 to 18, representing unique classes (road, car, pedestrian, sign, train…) while the unlabeled pixels are set to 255. We pre-process the data so that unlabeled pixels are set to 19, leading to 20 unique classes represented by consecutive integers, which is a more standard representation of segmented data.
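The remapping described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the project's actual pre-processing code; the function name `remap_unlabeled` and the two constants are hypothetical names chosen here.

```python
import numpy as np

UNLABELED_RAW = 255    # value marking unlabeled pixels in the raw masks
UNLABELED_CLASS = 19   # index assigned after remapping, giving classes 0..19

def remap_unlabeled(mask: np.ndarray) -> np.ndarray:
    """Return a copy of the mask where unlabeled pixels (255) become class 19."""
    remapped = mask.copy()
    remapped[remapped == UNLABELED_RAW] = UNLABELED_CLASS
    return remapped

# Tiny 2x3 mask standing in for a real label image.
raw_mask = np.array([[0, 13, 255],
                     [255, 18, 0]], dtype=np.uint8)
clean_mask = remap_unlabeled(raw_mask)
```

Working on a copy keeps the raw mask untouched, which makes it easy to check the remapping against the original file.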

When calculating our accuracy, we ignore the unlabeled pixels (class “19”).
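As a rough sketch of such a metric, assuming the model outputs one score per class for every pixel, the accuracy can be computed only over labeled pixels. The function name `masked_accuracy` and the array shapes are assumptions made for this example, not the project's actual metric code.

```python
import numpy as np

VOID_CLASS = 19  # unlabeled pixels after pre-processing

def masked_accuracy(scores: np.ndarray, target: np.ndarray,
                    void_class: int = VOID_CLASS) -> float:
    """Pixel accuracy over labeled pixels only.

    scores: (num_classes, H, W) per-class scores for one image.
    target: (H, W) integer labels, with `void_class` marking unlabeled pixels.
    """
    pred = scores.argmax(axis=0)      # predicted class per pixel
    labeled = target != void_class    # boolean mask of pixels that count
    if not labeled.any():
        return float("nan")           # nothing to evaluate on this image
    return float((pred[labeled] == target[labeled]).mean())

# Toy example: 3 labeled pixels, 2 of them predicted correctly.
target = np.array([[0, 1], [19, 2]])
pred = np.array([[0, 3], [5, 2]])
scores = np.eye(20)[pred].transpose(2, 0, 1)  # one-hot scores, shape (20, 2, 2)
acc = masked_accuracy(scores, target)
```

The prediction on the unlabeled pixel is simply dropped, so it neither helps nor hurts the score.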

We start with a smaller input resolution of 256x256 and test many variants to quickly get insights into what is effective.

A few observations from this first batch of runs:

Finally, it seems reasonable to aim for 90% accuracy when using larger images.

When respecting the original dimension ratio of the pictures, 320x180 is not enough to get good predictions, so we end up using 640x360.

While we quickly achieve 89% accuracy, it is surprisingly difficult to reach 90%. We finally succeed with longer training and learning rate adjustments.

We also try to train in 2 phases:

While this improves the training loss, the validation accuracy doesn’t change, suggesting that we are overfitting and may need more data or regularization.

Finally, it is interesting to observe our predictions and how they evolve during training. The network quickly identifies car hoods and the road, as they cover a large portion of the image, which quickly leads to a high accuracy. It then slowly refines the contours and learns to identify the other classes (pedestrians, cars…).

This suggests that some improvements are required on the loss:

My first approach would be to define an F-beta score for each class (the same as the Dice coefficient, except that we include a “beta” factor). The generalized formula is defined as:
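In its standard form, written here with $\beta$ for the “beta” factor (the notation is an assumption of this note), the F-beta score reads:

```latex
F_\beta = \frac{(1 + \beta^2)\,\cdot\,\text{precision}\,\cdot\,\text{recall}}
               {\beta^2 \cdot \text{precision} + \text{recall}}
```

With $\beta = 1$ this reduces to the F1 score, which is equivalent to the Dice coefficient.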

Recall & precision are defined below:
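Using the usual counts of true positives (TP), false positives (FP) and false negatives (FN) for a given class, the standard definitions are:

```latex
\text{precision} = \frac{TP}{TP + FP}
\qquad
\text{recall} = \frac{TP}{TP + FN}
```

Precision drops when the network predicts the class where it is absent; recall drops when the network misses pixels that belong to the class.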

Adjusting beta lets us give more weight to either precision or recall. For example, I would use a high beta (>1) for the class corresponding to pedestrians, giving more importance to recall so that we do not miss any. On the other hand, I would give a low beta (<1) to a class where precision matters more than recall.

The loss needs to be differentiable so that we can calculate its gradient and minimize it. We approximate the true positives as the sum of the products of the labels with the logits. We also treat the union of two sets as their sum, even though this counts shared elements twice. After a few simplifications, this leads to the following expression:
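One form consistent with these approximations, writing $y_i \in \{0, 1\}$ for the label of pixel $i$ for the class considered and $p_i$ for the network output on that pixel (symbols assumed here, with $p_i$ taken to be a probability in $[0, 1]$), is:

```latex
% TP \approx \sum_i y_i p_i, \quad FP \approx \sum_i p_i - TP, \quad FN \approx \sum_i y_i - TP
F_\beta \approx \frac{(1 + \beta^2) \sum_i y_i\, p_i}
                     {\beta^2 \sum_i y_i + \sum_i p_i}
```

For $\beta = 1$ this is the familiar soft Dice coefficient, where the denominator is the plain sum of the two sets.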

We add a small term to the numerator and denominator for numerical stability and to handle cases where no detected pixel or ground truth pixel is present. Note that we want to maximize this value, so the loss can be defined as its negative.
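A minimal NumPy sketch of such a per-class score is shown below. It assumes the network outputs have already been turned into probabilities and uses the soft approximation of true positives described above; the function name `soft_fbeta` and the default `eps` value are choices made for this illustration. For actual training, the same arithmetic would be written with the tensors of an automatic differentiation framework.

```python
import numpy as np

def soft_fbeta(probs, labels, beta: float = 1.0, eps: float = 1e-7) -> float:
    """Soft F-beta score for one class.

    probs:  predicted probabilities for the class, any shape.
    labels: binary ground truth (1 where the pixel belongs to the class).
    eps:    small term added to numerator and denominator for stability.
    """
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    soft_tp = (probs * labels).sum()  # soft count of true positives
    numerator = (1 + beta**2) * soft_tp + eps
    denominator = beta**2 * labels.sum() + probs.sum() + eps
    return float(numerator / denominator)

labels = [1, 1, 0, 0]
half_missed = [1, 0, 0, 0]  # perfect precision, but half of the class is missed

score_recall_heavy = soft_fbeta(half_missed, labels, beta=2.0)     # about 0.56
score_precision_heavy = soft_fbeta(half_missed, labels, beta=0.5)  # about 0.83
loss = -score_recall_heavy  # we maximize the score, so the loss is its negative
```

The same prediction scores lower with a high beta because missed pixels (low recall) are penalized more heavily. When both the prediction and the ground truth are empty, the `eps` terms make the score equal to 1 instead of dividing by zero.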

Finally, to ensure the network does not focus too much on the classes covering the largest portion of the image (such as the road or the car hood), the total score is defined as a weighted sum of the individual class scores:
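With $C$ classes, $F^{(c)}_{\beta_c}$ the F-beta score of class $c$ and $\alpha_c$ its weight (notation assumed here), this can be written as:

```latex
F_{\text{total}} = \sum_{c=1}^{C} \alpha_c \, F^{(c)}_{\beta_c}
```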

where the weights 𝛼 are adjusted to give more or less importance to each class.

As a first trial, I would keep all the weights equal, which means we attribute the same importance to each class, independently of the portion of the image it covers. Then I may want to use some intuition to define the importance of each class (for example, pedestrians are probably more important than banners) and adjust the weights based on the results observed.
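The weighting scheme can be sketched as follows, taking per-class scores that have already been computed. The function name `total_score` is hypothetical, and normalizing the equal weights so that they sum to 1 is an assumption of this example (it keeps the total in the same range as the individual scores).

```python
import numpy as np

def total_score(class_scores, weights=None) -> float:
    """Weighted sum of per-class scores.

    class_scores: one score per class (for example, soft F-beta values).
    weights:      one weight per class; defaults to equal weights summing to 1.
    """
    class_scores = np.asarray(class_scores, dtype=np.float64)
    if weights is None:
        # Baseline: every class counts the same, whatever area it covers.
        weights = np.full(class_scores.shape, 1.0 / class_scores.size)
    weights = np.asarray(weights, dtype=np.float64)
    return float((weights * class_scores).sum())

# Illustrative scores for three classes, e.g. road, car, pedestrian.
scores = [0.9, 0.5, 0.1]

baseline = total_score(scores)                        # equal weights
pedestrian_heavy = total_score(scores, [0.2, 0.2, 0.6])
loss = -baseline                                      # minimized during training
```

Giving the poorly segmented class a larger weight lowers the total score, so the optimizer has more incentive to improve that class.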

At an extreme level, those weights could even depend on other parameters:

The most successful models are often the ones with the simplest and most straightforward structure, so I would definitely use a loss with equal weights on each class as my baseline.

Feel free to experiment with the model! For further details, refer to:
