Find Humans and Vehicles in Dashboard Scenes
Goal: precisely understand a human driver's view
Semantic segmentation for scene parsing
A self-driving car must functionally understand the road and its environment the way a human would from the driver's seat. One promising computer vision approach is semantic segmentation: parsing visual scenes from a car dashboard camera into relevant objects (cars, pedestrians, traffic signs), foreground (road, sidewalk), and background (sky, buildings). Semantic segmentation annotates an image with object types, labeling meaningful subregions as a tree, bus, cyclist, etc. For a given dashboard photo, this means assigning every pixel the class label of the region it belongs to.
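As a toy illustration (the class indices and the tiny crop below are made up for this example, not taken from the dataset), a segmentation label is simply a class index stored at every pixel:

```python
import numpy as np

# Hypothetical class mapping for a 4x6 crop of a segmentation mask.
CLASSES = {0: "road", 1: "sidewalk", 2: "building", 3: "car", 4: "person", 5: "sky"}

mask = np.array([
    [5, 5, 5, 5, 5, 5],   # sky along the top of the frame
    [2, 2, 5, 5, 2, 2],   # buildings on either side
    [1, 3, 3, 0, 0, 1],   # a car on the road, sidewalk at the edges
    [0, 0, 0, 0, 0, 0],   # road across the bottom
])

# Per-pixel labels let us ask fine-grained questions, e.g. how much of the frame is road:
road_fraction = (mask == 0).mean()
print(f"road covers {road_fraction:.0%} of this crop")
```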
Example segmentation maps >>>
You can see examples in the two columns on the right, showing raw images, the model's predictions, and the correct labels: buildings are orange, cars are pink, road is cobalt blue, and pedestrians are beige. In the left column, the model can't differentiate between a pedestrian and a rider on a bicycle (magenta and cyan in the ground truth, both beige in the prediction). Note how the hazy conditions in the right column blur the model's predictions around the boundaries between dashboard and road, or between vehicle and road.
Reproduce & extend existing work
Objective
Train a supervised model to annotate dashboard-camera scenes at the per-pixel level into 20 relevant categories like "car", "road", "person", "traffic light".
Code: U-Net in fast.ai
I follow an excellent existing W&B project on semantic segmentation, summarized in this blog post by Boris Dayma. The starting model is a U-Net trained using fast.ai.
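As a rough sketch of how such a model can be set up with the current fastai API (the paths, mask-naming convention, class list, and hyperparameters below are illustrative assumptions, not the original project's configuration):

```python
from functools import partial
from fastai.vision.all import *
import wandb
from fastai.callback.wandb import WandbCallback

# Assumed local copy of the BDD100K segmentation subset (see the dataset section below).
path = Path("bdd100k/seg")

# 20 labels, assumed to follow the Cityscapes-style trainId scheme used by BDD100K.
codes = ["road", "sidewalk", "building", "wall", "fence", "pole",
         "traffic light", "traffic sign", "vegetation", "terrain", "sky",
         "person", "rider", "car", "truck", "bus", "train",
         "motorcycle", "bicycle", "void"]

dls = SegmentationDataLoaders.from_label_func(
    path, bs=8,
    fnames=get_image_files(path/"images"/"train"),
    # Assumed mask-naming convention; adjust to however your labels are stored.
    label_func=lambda f: path/"labels"/"train"/f"{f.stem}_train_id.png",
    codes=codes,
    item_tfms=Resize((360, 640)),   # downscale the 720x1280 dashboard frames
    batch_tfms=aug_transforms(),
)

wandb.init(project="semantic-segmentation")  # stream metrics and example predictions to W&B
learn = unet_learner(
    dls, resnet34,
    metrics=partial(foreground_acc, bkg_idx=19),  # pixel accuracy over non-void pixels
    cbs=WandbCallback(),
)
learn.fine_tune(10, 1e-3)  # 10 epochs at a max learning rate of 1e-3
```

Swapping the encoder amounts to passing a different backbone (e.g. another ResNet variant) as the second argument to unet_learner.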
Dataset: Berkeley Deep Drive 100K
This model is trained on the Berkeley Deep Drive 100K dataset (BDD100K). For semantic segmentation, it provides 7K training, 1K validation, and 2K test labeled images.
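A quick sanity check on a local copy might look like the following (the directory layout is an assumption about how the segmentation subset unpacks; adjust to your download):

```python
from pathlib import Path

root = Path("bdd100k/seg/images")         # assumed location of the segmentation images
for split in ("train", "val", "test"):
    n_images = len(list((root / split).glob("*.jpg")))
    print(f"{split}: {n_images} images")  # expect roughly 7K / 1K / 2K
```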
Findings so far
- The per-class accuracy of the existing model varies significantly: cars and traffic lights are detected with high confidence, while humans are detected very rarely. This is partly because humans are relatively rare in the data, especially when counted by pixels: a person takes up far less visual space than a building or the road. (A sketch for computing per-class pixel accuracy follows this list.)
- Choice of encoder matters: ResNet seems to overgeneralize, while AlexNet picks up on too many irrelevant details. The ideal encoder is somewhere in between.
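To make the per-class comparison concrete, here is a small self-contained sketch (not the project's actual metric code) for computing per-class pixel accuracy from predicted and ground-truth label maps:

```python
import numpy as np

def per_class_pixel_accuracy(pred, target, num_classes=20):
    """Fraction of each class's ground-truth pixels that the model labeled correctly.

    pred, target: integer arrays of the same shape holding class indices.
    Returns an array of length num_classes; NaN where a class has no pixels.
    """
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = target == c
        if mask.any():
            acc[c] = (pred[mask] == c).mean()
    return acc

# Toy usage with made-up indices: road = 0, car = 3, person = 4.
target = np.array([[3, 3, 0], [0, 4, 0]])
pred   = np.array([[3, 3, 0], [0, 0, 0]])
print(per_class_pixel_accuracy(pred, target, num_classes=5))
# -> [1.0, nan, nan, 1.0, 0.0]: road and car are perfect, the lone person pixel is missed
```

Averaging over all pixels can look healthy even when a pixel-rare class like "person" is missed entirely, which is why this per-class view matters.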
Ground truth