Semantic Segmentation: The View from the Driver's Seat
This article explores semantic segmentation for scene parsing on Berkeley Deep Drive 100K (BDD100K), including how to distinguish people from vehicles.
In this article, we take a look at the task of semantic segmentation for scene parsing on the Berkeley Deep Drive 100K dataset and explore how to distinguish people from vehicles.
Table of Contents
- Understand a dashboard scene with semantic segmentation
- Example segmentation maps
- Reproduce & extend existing work
- How do we compare vehicles to people?
- Comparing per-class accuracies
- Encoders: Resnet too broad, Alexnet too detailed or too blocky
- First Encoders: Resnet vs Naive Alexnet
- Improved Encoders: Alexnet vs Resnet with High Human IOU
- First experiments: increase weight decay, decrease learning rate
- Hyperparameter Sweep Insights
- Insights from Sweeps
- Comparing manual and automated sweeps
Understand a dashboard scene with semantic segmentation
A self-driving car must functionally understand the road and its environment the way a human would from the driver's seat. One promising computer vision approach is semantic segmentation: parse visual scenes from a car dashboard camera into relevant objects (cars, pedestrians, traffic signs), foreground (road, sidewalk), and background (sky, building). Semantic segmentation annotates an image with object types, labeling meaningful subregions as a tree, bus, cyclist, etc. For a given car dashboard photo, this means labeling every pixel as belonging to a subregion.
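To make the per-pixel framing concrete, here is a tiny sketch (plain NumPy, with made-up class indices and region shapes, not the project's actual data code) of what a segmentation label looks like: one integer class index per pixel.

```python
import numpy as np

# Hypothetical class indices for a few BDD100K-style categories
CLASSES = {0: "road", 2: "building", 11: "person", 13: "car"}

# A label for a 720x1280 dashboard frame is a 2D array of class indices
mask = np.full((720, 1280), 2, dtype=np.uint8)   # start with everything "building"
mask[500:, :] = 0                                # lower part of the frame is "road"
mask[350:520, 900:1150] = 13                     # a patch of "car"
mask[380:520, 200:230] = 11                      # a thin sliver of "person"

# Per-class pixel counts show why humans are hard: they cover very few pixels
values, counts = np.unique(mask, return_counts=True)
for v, n in zip(values, counts):
    print(f"{CLASSES.get(int(v), 'other'):>10}: {n} pixels")
```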
Below are example scenes in two columns; each column shows the raw image, the model's prediction, and the ground truth (correct labeling). Buildings are orange, cars are pink, road is cobalt blue, and pedestrians are beige. In the left column, the model can't differentiate between a pedestrian and a rider on a bicycle (magenta and cyan in the ground truth, beige in the prediction). Note how the hazy conditions in the right column blur the model's predictions around the boundaries between dashboard and road, or vehicle and road.
Example segmentation maps
[Run set: 2 runs]
Reproduce & extend existing work
Objective
Train a supervised model to annotate dashboard-camera scenes at the per-pixel level into 20 relevant categories like "car", "road", "person", "traffic light".
Code: U-Net in fast.ai
I follow an excellent existing W&B project on semantic segmentation, summarized in this blog post by Boris Dayma. The starting model is a U-Net trained using fast.ai.
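As a reference point, here is a minimal sketch of how a U-Net learner for this task could be set up with the fastai v2 API. The directory layout, label-file naming, image size, and training schedule are assumptions for illustration; the actual project follows Boris Dayma's repo.

```python
from fastai.vision.all import *

path = Path("bdd100k/seg")  # hypothetical local layout of the BDD100K segmentation subset

# 19 classes of interest plus "void" (20 total), Cityscapes-style
codes = ["road", "sidewalk", "building", "wall", "fence", "pole",
         "traffic light", "traffic sign", "vegetation", "terrain", "sky",
         "person", "rider", "car", "truck", "bus", "train",
         "motorcycle", "bicycle", "void"]

dls = SegmentationDataLoaders.from_label_func(
    path,
    fnames=get_image_files(path/"images"/"train"),
    # assumed filename pattern pairing each image with its label mask
    label_func=lambda f: path/"labels"/"train"/f"{f.stem}_train_id.png",
    codes=codes,
    bs=8,                        # batch sizes much above 8-9 hit CUDA OOM here
    item_tfms=Resize((360, 640)),
)

# U-Net with a Resnet-18 encoder as the starting point
learn = unet_learner(dls, resnet18, metrics=[foreground_acc])
learn.fine_tune(3, base_lr=1e-3, wd=1e-2)
```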
Dataset: Berkeley Deep Drive 100K
This model is trained on the Berkeley Deep Drive 100K dataset (BDD100K). For semantic segmentation, this contains 7K train, 1K validation, and 2K test labeled photos.
Findings so far
- The existing model's performance varies significantly by class: cars and traffic lights/posts are detected with near 90% accuracy, while humans are detected very rarely (in part because they are relatively rare in the data—especially if you count by pixels, humans take up less visual space than buildings or road)
- Encoder choice matters: Resnet misses some crucial details (like humans, or distinguishing bikers vs pedestrians), while Alexnet is bimodal, either hyperfocusing on small patches of color or parsing the images in large blocks. The ideal encoder is somewhere in between and requires more tuning.
- Mean IoU (intersection over union) may be a more helpful metric than accuracy for improving human detection
How do we compare vehicles to people?
Great on cars — but humans are incredibly hard to detect
Breaking out the accuracy per class (car, traffic sign/light, human) shows how well the model identifies different components of a driving scene.
While it performs well on cars and traffic signs/lights, it detects barely any humans, especially when measuring by mean accuracy (percentage of human-containing pixels correctly identified) as opposed to mean IoU. One thing to try next is filtering BDD100K to train/test only on examples that contain humans.
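For concreteness, a per-class accuracy like the one reported here can be computed directly from predicted and ground-truth masks. This is a plain PyTorch sketch (the "person" index is an assumption), not the report's exact metric code.

```python
import torch

def per_class_accuracy(pred: torch.Tensor, target: torch.Tensor, cls: int) -> float:
    """Fraction of pixels whose ground truth is `cls` that the model also labels `cls`.

    `pred` and `target` are integer class-index masks of shape (H, W) or (B, H, W).
    Returns NaN if the class never appears in the ground truth.
    """
    cls_pixels = target == cls
    if cls_pixels.sum() == 0:
        return float("nan")
    return (pred[cls_pixels] == cls).float().mean().item()

# e.g. human_acc = per_class_accuracy(pred_mask, gt_mask, cls=11)  # 11 = "person" (assumed)
```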
Intersection over Union
For all of the original model variants tried, the human (pedestrian or rider) accuracy is a flat line at 0. It's possible that the computed mean is extremely low because humans take up such a small fraction of the pixels in an image. To get more signal, I tried logging a metric common in semantic segmentation: intersection over union (here applied to arbitrary contiguous pixel subregions of the image, not strictly rectangular boxes). A perfect IoU is 1: the correct pixel subregion and the predicted pixel subregion match exactly, so their intersection equals their union. In the later model variants, IoU reveals which models are detecting humans, with the highest human IoU so far reaching 0.01758, compared to the highest mean IoU of 0.7997 and best overall accuracy of 0.8873.
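For comparison, here is a matching sketch of per-class IoU and mean IoU over the same integer masks; again this is an illustrative implementation, not the exact metric logged in the runs.

```python
import torch

def class_iou(pred: torch.Tensor, target: torch.Tensor, cls: int) -> float:
    """IoU for one class over integer masks of shape (H, W) or (B, H, W)."""
    pred_c, target_c = pred == cls, target == cls
    union = (pred_c | target_c).sum()
    if union == 0:
        return float("nan")           # class absent from both prediction and ground truth
    intersection = (pred_c & target_c).sum()
    return (intersection.float() / union.float()).item()

def mean_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int) -> float:
    """Average IoU over the classes that actually occur (NaNs ignored)."""
    ious = torch.tensor([class_iou(pred, target, c) for c in range(num_classes)])
    return ious[~torch.isnan(ious)].mean().item()
```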
IoU more helpful and perhaps less biased than accuracy
Below you can see that in four of the best models (one color each), car detection accuracy (all the solid lines) is generally better than traffic sign/light/pole accuracy (all the dashed lines). The overall average accuracy (all the dotted lines) measures every object class except "void", for 19 total. It is better than traffic accuracy but worse than car accuracy, likely because cars are some of the most frequent and largest objects while traffic poles and signs are much smaller and less frequent. Although the "best human iou so far" model is less impressive based on accuracy, optimizing for human iou appears to yield a more balanced model across classes—the variance across the car and traffic prediction metrics is half of what it is for the other three models.
Comparing per-class accuracies
[Run set: Four top models (4 runs)]
Encoders: Resnet too broad, Alexnet too detailed or too blocky
The panel below shows the difference between two early variants of the model, distinguished by the U-Net encoder: Resnet-18 (representative predictions in the left column of the example panel) and an Alexnet variant tried in a hyperparameter sweep (right column).
First encoders
The Alexnet model picks up on too many details, parsing the individual windows on the buildings and shadow segments on the car as separate object classes. The Resnet model is generally more accurate, but it makes mistakes in broader patches, such as merging a car and a truck into one identification in the bottom right, or hallucinating patches of car and building in the overpass (note that in the ground truth this pale blue region is labeled "void", i.e., not a class of interest like "wall" or "building"). Note that other differences between the models (namely learning rate and number of training stages) could also explain this discrepancy.
Alexnet after tuning: Finds humans but blocky
From this naive Alexnet, I ran a longer hyperparameter sweep using Bayesian optimization with IoU as the objective metric. I also tracked human-specific IoU in these runs. The top-performing runs by this metric are mostly Alexnets, generally with lower learning rates and more training than the first runs. While these have much higher IoU and accuracy than the initial Alexnets I tried, they parse the image in a blocky pattern (see the "Improved Encoders" section below). This yields unrealistic segmentation for most regions (straight lines where there should be curves). However, these blocks seem much better for actually finding humans, as illustrated below. On human IoU specifically, Alexnet outperforms Resnet by an order of magnitude, though of course at the expense of precision. The ideal encoder would balance human recall with precision (crisp outlines instead of big vague blocks) and requires further tuning.
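For reference, a Bayesian W&B sweep of this kind can be configured roughly as below. The parameter names, ranges, and the logged metric name ("iou") are assumptions for illustration rather than the project's exact sweep file.

```python
import wandb

sweep_config = {
    "method": "bayes",                              # Bayesian optimization
    "metric": {"name": "iou", "goal": "maximize"},  # optimize overall IoU
    "parameters": {
        "encoder":         {"values": ["resnet18", "alexnet"]},
        "learning_rate":   {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-2},
        "weight_decay":    {"distribution": "log_uniform_values", "min": 1e-4, "max": 1e-1},
        "training_stages": {"values": [1, 2, 3, 4]},
    },
}

# sweep_id = wandb.sweep(sweep_config, project="semantic-segmentation")
# wandb.agent(sweep_id, function=train)  # `train` reads wandb.config and trains one model
```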

First Encoders: Resnet vs Naive Alexnet
[Run set: Early sweep runs (2 runs)]
Improved Encoders: Alexnet vs Resnet with High Human IOU
[Run set: High human IoU models (10 runs)]
First experiments: increase weight decay, decrease learning rate
After cloning the repo and verifying that the code runs, I tried varying weight decay and learning rate. These are grouped as "First manual sweep".
- increasing the weight decay 5X improved the final accuracy by 9%
- decreasing the learning rate may also be promising—increasing it is not a good strategy
- increasing the batch size beyond 9 leads to a CUDA OOM error (reasonable batch sizes are very different for semantic segmentation compared to convolutional nets for image classification)
The initial experiments were running on a tiny fraction of the data (1%) and may not be representative. Select "All manual runs" below to see the effect of increasing the fraction to 20% and 100%. Note that the accuracy can vary by over 10% for a fixed pairing of learning rate and weight decay.
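The manual sweep amounts to a small grid over weight decay and learning rate; the sketch below shows how such runs could be logged with fastai's WandbCallback, reusing the `dls` and learner setup from the earlier sketch. The grid values are placeholders, not the exact ones used.

```python
import wandb
from fastai.vision.all import *
from fastai.callback.wandb import WandbCallback

# `dls` is the SegmentationDataLoaders object from the earlier sketch
for wd in [1e-2, 5e-2]:          # e.g. baseline weight decay and a 5X increase
    for lr in [1e-3, 3e-4]:      # e.g. baseline learning rate and a lower one
        with wandb.init(project="semantic-segmentation",
                        config={"wd": wd, "lr": lr, "encoder": "resnet18"}):
            learn = unet_learner(dls, resnet18, metrics=[foreground_acc],
                                 cbs=WandbCallback())
            learn.fine_tune(3, base_lr=lr, wd=wd)
```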
First Experiments: Increase Weight Decay, Decrease Learning Rate
[Run sets: First manual sweep (10 runs); All manual runs (107 runs)]
Hyperparameter Sweep Insights
Learning rate: decrease
Lower learning rates appear to correlate with higher accuracy—could investigate in more detail.
Training stages: keep low
Increasing beyond 2-3 doesn't seem to do much (check starter code).
Weight decay: inconclusive
Initially, increasing the weight decay 5X improved the accuracy by 9%, but increasing it 200X yields roughly the same improvement.
Hyperparameter Sweep Insights
[Run set: All runs (398 runs)]
Insights from Sweeps
I compared the average accuracy of my manual sweep (purple) with two random search sweeps (blue and red) and a Bayesian sweep trying to maximize human detection accuracy (gray). The automated sweeps have higher variance, finding many inferior combinations but also a few surprisingly superior ones.
Objective metric matters
I tried a sweep with Bayesian optimization to maximize IoU (green). This ran the longest and yielded some of the best new models and ideas to try. In particular, it found hyperparameter combinations that started to detect humans, and suggested that Alexnet was worth revisiting.
Note: dataset size varies substantially across sweeps (roughly 1,400 training images for the gray and green sweeps, roughly 100 otherwise).
Next steps
- increase validation data size (default to training on 20%)
- decrease learning rate
- decrease weight decay
- increase batch size as much as possible (8?)
- evaluate training stages
- filter dataset to images with humans only
- explore encoder variants that balance Alexnet advantages for finding humans with the better overall precision of Resnet
- custom loss prioritizing human detection (see the sketch below)
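For the last item, one simple starting point is a class-weighted cross-entropy that upweights human pixels. This is only a sketch of the idea; the class indices and weights are assumptions and have not been validated in these experiments.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 20           # 19 classes of interest plus "void"
PERSON, RIDER = 11, 12     # assumed indices for the human categories

# Upweight the rare human classes so their pixels contribute more to the loss
class_weights = torch.ones(NUM_CLASSES)
class_weights[PERSON] = 10.0
class_weights[RIDER] = 10.0

# U-Net logits have shape (B, NUM_CLASSES, H, W); targets have shape (B, H, W)
weighted_loss = nn.CrossEntropyLoss(weight=class_weights)

# In fastai this could be passed to the learner, e.g.:
# learn = unet_learner(dls, resnet18, loss_func=weighted_loss)
```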
Comparing manual and automated sweeps
[Run set: Runs by sweep (398 runs)]