
Find Humans and Vehicles in Dashboard Scenes

Semantic segmentation on the Berkeley Deep Drive 100K dataset
Created on January 10 | Last edited on February 4

Goal: precisely understand a human driver's view





Semantic segmentation for scene parsing

A self-driving car must functionally understand the road and its environment the way a human would from the driver's seat. One promising computer vision approach is semantic segmentation: parsing the visual scene from a car's dashboard camera into relevant objects (cars, pedestrians, traffic signs), foreground (road, sidewalk), and background (sky, buildings). Semantic segmentation annotates an image by object type, labeling each meaningful subregion as a tree, bus, cyclist, and so on. For a given dashboard photo, this means assigning every pixel to one of these subregions.
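Concretely, the training target for each photo is a mask with the same height and width as the image, holding one class index per pixel. Here is a minimal sketch of that representation (the class names and indices are illustrative, not the actual BDD100K label map):

```python
import numpy as np

# Illustrative class ids; the real label map has 20 categories
CLASSES = {0: "road", 1: "sidewalk", 2: "building", 3: "car", 4: "person", 5: "sky"}

# Toy 4x6 segmentation mask: one class index per pixel.
# A real mask matches the resolution of the dashboard photo.
mask = np.array([
    [5, 5, 5, 5, 2, 2],   # sky and buildings along the top
    [2, 2, 3, 3, 2, 2],   # a car in front of the buildings
    [1, 0, 0, 0, 0, 1],   # road flanked by sidewalk
    [1, 0, 0, 0, 0, 1],
])

# Per-class pixel counts hint at how unbalanced the classes can be
for class_id, count in zip(*np.unique(mask, return_counts=True)):
    print(f"{CLASSES[int(class_id)]:>8}: {count} pixels")
```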

Example segmentation maps >>>

You can see examples in two columns on the right, showing raw images, the model's predictions, and the correct labels: buildings are orange, cars are pink, the road is cobalt blue, and pedestrians are beige. In the left column, the model can't differentiate between a pedestrian and a rider on a bicycle (magenta and cyan in the ground truth, beige in the prediction). Note how the hazy conditions in the right column blur the model's predictions around the boundaries between dashboard and road, or between vehicle and road.

Reproduce & extend existing work

Objective

Train a supervised model to annotate dashboard-camera scenes at the per-pixel level, assigning each pixel to one of 20 relevant categories such as "car", "road", "person", or "traffic light".

Code: U-Net in fast.ai

I follow an excellent existing W&B project on semantic segmentation, summarized in this blog post by Boris Dayma. The starting model is a U-Net trained using fast.ai.
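For reference, here is a minimal sketch of what that setup looks like in fastai (v2-style API; the local paths, label-file naming convention, image size, and W&B project name are my assumptions, not the original project's exact code):

```python
import wandb
from fastai.vision.all import *
from fastai.callback.wandb import WandbCallback

wandb.init(project="dashboard-segmentation")      # hypothetical W&B project name

path = Path("bdd100k/seg")                        # assumed local layout of the BDD100K seg subset
codes = np.loadtxt(path/"codes.txt", dtype=str)   # assumed file listing the 20 class names

def label_func(img_path):
    # assumed convention: each image has a matching mask of class ids under labels/
    return path/"labels"/f"{img_path.stem}_train_id.png"

dls = SegmentationDataLoaders.from_label_func(
    path, bs=8,
    fnames=get_image_files(path/"images"),
    label_func=label_func, codes=codes,
    item_tfms=Resize((360, 640)))                 # downscale 720x1280 frames to fit GPU memory

# U-Net with a pretrained ResNet-34 encoder, logged to W&B during training
learn = unet_learner(dls, resnet34, metrics=foreground_acc, cbs=WandbCallback())
learn.fine_tune(10)
```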

Dataset: Berkeley Deep Drive 100K

This model is trained on the Berkeley Deep Drive 100K dataset (BDD100K). Its semantic segmentation subset contains 7K training, 1K validation, and 2K test labeled photos.

Findings so far

  • The per-class accuracy of the existing model varies significantly: cars and traffic lights are detected with high confidence, while humans are detected very rarely. This is partly because humans are relatively rare in the data, especially when counted by pixels, since people take up far less visual space than buildings or road (see the per-class accuracy sketch after this list).
  • Choice of encoder matters: ResNet seems to overgeneralize, while AlexNet picks up on too many irrelevant details. The ideal encoder is somewhere in between.
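As a rough sketch of how those per-class numbers can be computed, here is one way to accumulate per-class pixel accuracy and pixel frequency over a validation set (plain PyTorch with hypothetical names; the panels in this report come from W&B logging rather than this exact code):

```python
import torch

def per_class_stats(preds, targets, n_classes=20):
    """Per-class pixel accuracy and pixel frequency.

    preds, targets: iterables of (H, W) tensors of class ids, e.g. the
    argmax of the model output and the ground-truth mask for each image.
    """
    correct = torch.zeros(n_classes)
    total = torch.zeros(n_classes)
    for pred, targ in zip(preds, targets):
        for c in range(n_classes):
            in_class = targ == c
            total[c] += in_class.sum()
            correct[c] += (pred[in_class] == c).sum()
    acc = correct / total.clamp(min=1)   # per-class pixel accuracy
    freq = total / total.sum()           # fraction of all pixels in each class
    return acc, freq
```

Rare but safety-critical classes like "person" can score poorly simply because they cover so few pixels, which is what the per-class panels below illustrate.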
Ground truth

[Panels: run set of 2 runs]


Which objects matter most? Comparing per-class accuracies




[Panels: run set of 3 runs]


Comparing encoder variants
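In fastai, swapping the encoder amounts to passing a different backbone to unet_learner. Below is a hedged sketch of how this comparison might be scripted; it assumes each backbone (including AlexNet) has a fastai model_meta entry telling the library where to cut it, reuses the dataloaders from the setup sketch above, and uses an illustrative epoch count:

```python
from fastai.vision.all import *
from torchvision.models import alexnet, resnet18, resnet34

# dls: the SegmentationDataLoaders built in the setup sketch above
results = {}
for arch in (alexnet, resnet18, resnet34):
    # Build a fresh U-Net around each pretrained encoder and fine-tune briefly
    learn = unet_learner(dls, arch, metrics=foreground_acc)
    learn.fine_tune(5)
    # validate() returns [valid_loss, *metrics]; keep the foreground accuracy
    results[arch.__name__] = learn.validate()[1]

print(results)
```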




[Panels: run set of 2 runs]


First Experiments: Increase Weight Decay, Decrease Learning Rate
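In fastai these two knobs map to the wd argument on the learner and the lr_max passed to the one-cycle schedule. Here is a minimal sketch of the kind of manual run this section refers to (the specific values are illustrative, not the exact settings behind these panels):

```python
from fastai.vision.all import *

# dls: the SegmentationDataLoaders from the setup sketch above.
# Heavier regularization (higher weight decay) and a lower peak learning rate.
learn = unet_learner(dls, resnet34, metrics=foreground_acc, wd=1e-1)
learn.fit_one_cycle(10, lr_max=1e-4)
```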




[Panels: all manual runs (107) · first manual sweep (10 runs)]


Hyperparameter Sweep Insights
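For context, a W&B sweep is driven by a small configuration naming the search method, the metric to optimize, and the parameter ranges; an agent then launches runs against it. Here is a hedged sketch (the metric name, parameter ranges, project name, and train stub are placeholders, not the actual sweep configuration behind these panels):

```python
import wandb

sweep_config = {
    "method": "random",                       # could also be "grid" or "bayes"
    "metric": {"name": "valid_acc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values",
                          "min": 1e-5, "max": 1e-2},
        "weight_decay":  {"values": [1e-4, 1e-3, 1e-2, 1e-1]},
        "encoder":       {"values": ["alexnet", "resnet18", "resnet34"]},
    },
}

def train():
    # Hypothetical training function: reads hyperparameters from wandb.config
    # and fine-tunes the U-Net (see the fastai sketch earlier in this report)
    with wandb.init() as run:
        cfg = run.config
        ...  # build the dataloaders and learner from cfg, then fit and log metrics

sweep_id = wandb.sweep(sweep_config, project="dashboard-segmentation")
wandb.agent(sweep_id, function=train, count=50)
```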




[Panels: run set of 398 runs]


Manual vs Automated Sweeps
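One way to make this comparison programmatically is to pull every run in the project through the W&B public API and split them by whether they belong to a sweep (the entity/project path and the metric key below are placeholders):

```python
import wandb

api = wandb.Api()
runs = api.runs("my-entity/dashboard-segmentation")   # placeholder entity/project path

manual, swept = [], []
for run in runs:
    # run.sweep is None for manually launched runs
    target = swept if run.sweep is not None else manual
    target.append(run.summary.get("valid_acc"))       # placeholder metric key

def best(values):
    values = [v for v in values if v is not None]
    return max(values) if values else None

print(f"manual runs: {len(manual)}, best valid_acc: {best(manual)}")
print(f"sweep runs:  {len(swept)}, best valid_acc: {best(swept)}")
```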

[Panels: runs grouped by sweep (398 runs)]