Find Humans and Vehicles in Dashboard Scenes
Goal: precisely understand a human driver's view
Semantic segmentation for scene parsing
A self-driving car must functionally understand the road and its environment the way a human would from the driver's seat. One promising computer vision approach is semantic segmentation: parsing visual scenes from a car dashboard camera into relevant objects (cars, pedestrians, traffic signs), foreground (road, sidewalk), and background (sky, buildings). Semantic segmentation annotates an image with object types, labeling meaningful subregions as a tree, bus, cyclist, etc. For a given dashboard photo, this means assigning every pixel the class label of the region it belongs to.
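As a toy illustration (the class indices and the tiny crop below are made up for this example, not taken from the dataset), a segmentation label is simply a class index stored at every pixel:

```python
import numpy as np

# Hypothetical class mapping for a 4x6 crop of a segmentation mask.
CLASSES = {0: "road", 1: "sidewalk", 2: "building", 3: "car", 4: "person", 5: "sky"}

mask = np.array([
    [5, 5, 5, 5, 5, 5],   # sky along the top of the frame
    [2, 2, 5, 5, 2, 2],   # buildings on either side
    [1, 3, 3, 0, 0, 1],   # a car on the road, sidewalk at the edges
    [0, 0, 0, 0, 0, 0],   # road across the bottom
])

# Per-pixel labels let us ask fine-grained questions, e.g. how much of the frame is road:
road_fraction = (mask == 0).mean()
print(f"road covers {road_fraction:.0%} of this crop")
```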
Example segmentation maps >>>
You can see examples in the two columns on the right, showing raw images, the model's predictions, and the correct labels: buildings are orange, cars are pink, road is cobalt blue, and pedestrians are beige. In the left column, the model can't differentiate between a pedestrian and a rider on a bicycle (magenta and cyan in the ground truth, both beige in the prediction). Note how the hazy conditions in the right column blur the model's predictions around the boundaries between dashboard and road, or between vehicle and road.
Reproduce & extend existing work
Objective
Train a supervised model to annotate dashboard-camera scenes at the per-pixel level into 20 relevant categories like "car", "road", "person", "traffic light".
Code: U-Net in fast.ai
I follow an excellent existing W&B project on semantic segmentation, summarized in this blog post by Boris Dayma. The starting model is a U-Net trained using fast.ai.
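As a rough sketch of how such a model can be set up with the current fastai API (the paths, mask-naming convention, class list, and hyperparameters below are illustrative assumptions, not the original project's configuration):

```python
from functools import partial
from fastai.vision.all import *
import wandb
from fastai.callback.wandb import WandbCallback

# Assumed local copy of the BDD100K segmentation subset (see the dataset section below).
path = Path("bdd100k/seg")

# 20 labels, assumed to follow the Cityscapes-style trainId scheme used by BDD100K.
codes = ["road", "sidewalk", "building", "wall", "fence", "pole",
         "traffic light", "traffic sign", "vegetation", "terrain", "sky",
         "person", "rider", "car", "truck", "bus", "train",
         "motorcycle", "bicycle", "void"]

dls = SegmentationDataLoaders.from_label_func(
    path, bs=8,
    fnames=get_image_files(path/"images"/"train"),
    # Assumed mask-naming convention; adjust to however your labels are stored.
    label_func=lambda f: path/"labels"/"train"/f"{f.stem}_train_id.png",
    codes=codes,
    item_tfms=Resize((360, 640)),   # downscale the 720x1280 dashboard frames
    batch_tfms=aug_transforms(),
)

wandb.init(project="semantic-segmentation")  # stream metrics and example predictions to W&B
learn = unet_learner(
    dls, resnet34,
    metrics=partial(foreground_acc, bkg_idx=19),  # pixel accuracy over non-void pixels
    cbs=WandbCallback(),
)
learn.fine_tune(10, 1e-3)  # 10 epochs at a max learning rate of 1e-3
```

Swapping the encoder amounts to passing a different backbone (e.g. another ResNet variant) as the second argument to unet_learner.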
Dataset: Berkeley Deep Drive 100K
This model is trained on the Berkeley Deep Drive 100K dataset (BDD100K). For semantic segmentation, it provides 7K training, 1K validation, and 2K test labeled images.
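A quick sanity check on a local copy might look like the following (the directory layout is an assumption about how the segmentation subset unpacks; adjust to your download):

```python
from pathlib import Path

root = Path("bdd100k/seg/images")         # assumed location of the segmentation images
for split in ("train", "val", "test"):
    n_images = len(list((root / split).glob("*.jpg")))
    print(f"{split}: {n_images} images")  # expect roughly 7K / 1K / 2K
```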
Findings so far
- The per-class accuracy of the existing model varies significantly: cars and traffic lights are detected with high confidence, while humans are detected very rarely. This is partly because humans are relatively rare in the data, especially when counted by pixels: a person takes up far less visual space than a building or the road. (A sketch for computing per-class pixel accuracy follows this list.)
- Choice of encoder matters: ResNet seems to overgeneralize, while AlexNet picks up on too many irrelevant details. The ideal encoder is somewhere in between.
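To make the per-class comparison concrete, here is a small self-contained sketch (not the project's actual metric code) for computing per-class pixel accuracy from predicted and ground-truth label maps:

```python
import numpy as np

def per_class_pixel_accuracy(pred, target, num_classes=20):
    """Fraction of each class's ground-truth pixels that the model labeled correctly.

    pred, target: integer arrays of the same shape holding class indices.
    Returns an array of length num_classes; NaN where a class has no pixels.
    """
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = target == c
        if mask.any():
            acc[c] = (pred[mask] == c).mean()
    return acc

# Toy usage with made-up indices: road = 0, car = 3, person = 4.
target = np.array([[3, 3, 0], [0, 4, 0]])
pred   = np.array([[3, 3, 0], [0, 0, 0]])
print(per_class_pixel_accuracy(pred, target, num_classes=5))
# -> [1.0, nan, nan, 1.0, 0.0]: road and car are perfect, the lone person pixel is missed
```

Averaging over all pixels can look healthy even when a pixel-rare class like "person" is missed entirely, which is why this per-class view matters.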
Ground truth