
Object Detection for Autonomous Vehicles (A Step-by-Step Guide)

Digging into object detection and perception for autonomous vehicles using YOLOv5 and Weights & Biases
In this report, we’ll take you through an object detection workflow for autonomous vehicles in Weights & Biases. More specifically, you'll learn how to create a baseline object detection model with YOLOv5, improve it with continued experimentation (including selecting our highest-performing backbone architecture and tuning our hyperparameters), analyze it with some common metrics, and identify which candidate model is our best performer, all in W&B.
Spoiler alert: we assumed the best-performing model would be the largest, but we were wrong. Our smaller model beat the larger one by a significant margin, which also makes it the better choice for real-time inference.
Let’s get going.

Exploring our Dataset with Tables

Berkeley DeepDrive 100K (BDD100K) is a collection of video data for heterogeneous multitask learning. Unsurprisingly, it contains 100,000 videos, drawn from more than 50,000 individual rides. That variety is key: it gives us the diverse scenes we want our model to understand (city streets, rural backroads, and highways) as well as a variety of weather and lighting conditions. BDD100K can be used for a sizeable portion of typical AV modeling (think lane detection, instance segmentation, etc.), but today, we’ll be using it for object detection.


First, let’s get our data. To do this, we’ll use W&B Artifacts, which makes it really easy and convenient to store and version our autonomous driving datasets. Creating a new version of the dataset and fetching a particular version takes only a couple of lines of code:
# Create a new Artifact version
import wandb

with wandb.init(project="bdd100k-perception", entity="av-team") as run:
    artifact = wandb.Artifact('bdd100k-yolov5', type='dataset')
    artifact.add_dir(dataset_path)  # dataset_path: local folder containing the dataset
    run.log_artifact(artifact)

# Fetch a particular Artifact version
with wandb.init(project="bdd100k-perception", entity="av-team") as run:
    artifact = run.use_artifact('av-team/bdd100k-perception/bdd100k-yolov5:latest', type='dataset')
    artifact_dir = artifact.download()

We're hosting a subset of the BDD100K dataset with object-detection annotations converted to a format that is compatible with training using the YOLOv5 framework by Ultralytics. Here's how that dataset looks as an Artifact:

BDD100K Dataset Artifact

We can also use Tables in our W&B workspace to visualize and explore our images and segmentation labels. Specifically, we'll dig into our subset of data to find out exactly what's in there. We'll quickly analyze the frequency distribution of the annotation labels using Custom Plots by Weights & Biases. This can be valuable for all sorts of things: finding over- or under-represented classes, for example. In fact, below, you'll see our model won't do well with trains, as we have no examples in this particular selection.
These Tables are fully interactive, so feel free to explore. For example, you can click on any of the images in the Image-BBox column below and toggle both the bounding box and semantic segmentation overlays for each image (there's a quick example GIF at the end of this section as well).
Below, you'll find graphs of the class frequencies, followed by a few Tables. The first contains our dataset, with object detection and segmentation labels, as well as annotations for weather, time of day, and scene. Below that, you'll see BDD100K grouped by weather conditions, scene, and time of day.
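As a rough sketch of how such a Table and class-frequency plot can be logged (the load_annotations helper, image paths, and class indices below are hypothetical placeholders for however your annotations are stored):

# A rough sketch of logging an interactive Table of images with bounding boxes,
# plus a class-frequency bar chart via a Custom Plot. The label-loading helper
# and image paths are hypothetical placeholders for however your data is stored.
import collections
import wandb

CLASS_LABELS = {0: "bike", 1: "bus", 2: "car", 3: "motorcycle", 4: "person",
                5: "rider", 6: "traffic light", 7: "traffic sign", 8: "train", 9: "truck"}

with wandb.init(project="bdd100k-perception", entity="av-team", job_type="eda") as run:
    table = wandb.Table(columns=["file_name", "image"])
    class_counts = collections.Counter()

    for image_path in image_paths:            # image_paths: your list of sample frames
        boxes = load_annotations(image_path)  # hypothetical: [(class_id, x0, y0, x1, y1), ...] in [0, 1]
        class_counts.update(class_id for class_id, *_ in boxes)
        box_data = [{"position": {"minX": x0, "minY": y0, "maxX": x1, "maxY": y1},
                     "class_id": class_id,
                     "box_caption": CLASS_LABELS[class_id]}
                    for class_id, x0, y0, x1, y1 in boxes]
        table.add_data(image_path, wandb.Image(
            image_path,
            boxes={"ground_truth": {"box_data": box_data, "class_labels": CLASS_LABELS}}))

    # Frequency distribution of annotation labels as a bar chart
    freq_table = wandb.Table(data=[[CLASS_LABELS[c], n] for c, n in class_counts.items()],
                             columns=["label", "count"])
    run.log({"dataset_sample": table,
             "class_frequencies": wandb.plot.bar(freq_table, "label", "count",
                                                 title="Annotation label frequencies")})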





Analysis and Exploration of Segmentation and Detection labels from the BDD100k Dataset



Demo of Interactive Exploration of Segmentation and Bounding Box Annotations




Learning Objective

Now that we’ve examined our dataset and its distribution, let’s look into what our model is supposed to learn. Here, we're interested in bounding box annotations corresponding to all objects of interest (such as vehicles, pedestrians, obstacles, etc.) present in a given frame of a video or camera feed. Our model not only needs to predict all the bounding box coordinates but also label them as one of the following classes:
  • bike 🚲
  • bus 🚎
  • car 🚗
  • motorcycle 🏍
  • person 🧍
  • rider 🚴‍♀️
  • traffic light 🚦
  • traffic sign 🛑
  • train 🚝
  • truck 🚚
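
Since our Artifact stores annotations in the YOLOv5 format, each image comes with a plain-text label file containing one line per object: a class index followed by the box center, width, and height, all normalized to [0, 1]. As a minimal sketch (the class index and pixel values below are just illustrative, assuming the class ordering above), converting an absolute pixel box into that format looks like this:

# Minimal sketch: convert one absolute-pixel box (x_min, y_min, x_max, y_max)
# into a YOLO-format label line "class_id x_center y_center width height",
# with all coordinates normalized by the image size.
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example: a car (class index 2, assuming the class order above) in a 1280x720 BDD100K frame
print(to_yolo_line(2, 100, 200, 300, 400, 1280, 720))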

Baseline Experiments

For our baseline experiments, we decided to use the YOLOv5 family of object detection architectures developed by Ultralytics, with models pre-trained on the MS COCO dataset. We ran baselines with all the variations of the P5 and P6 YOLOv5 models introduced in release v5.0 of the framework, using the default set of hyperparameters for 5 epochs each.


We used Weights & Biases not only to track our experiments, but also to debug the correctness of our pipeline implementation and the annotation format of our dataset, and to check the improvement (or lack thereof) in the performance of our models on the validation set during training. Some relevant metrics follow:


YOLOv5 Family of Models

Training a YOLOv5 model using Weights & Biases is quite simple. We can just use the YOLOv5 training CLI:
# --img:     image size
# --batch:   batch size
# --epochs:  number of epochs
# --data:    dataset specification file (see https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data#11-create-datasetyaml)
# --weights: YOLOv5 weights pre-trained on MS COCO
# --project / --entity: Weights & Biases project and entity
# --name:    name of the experiment (optional)
# --cache:   cache images in RAM to speed up training
python train.py \
  --img 640 \
  --batch 32 \
  --epochs 5 \
  --data bdd.yaml \
  --weights weights/yolov5x6.pt \
  --project bdd100k-perception \
  --entity av-team \
  --name yolov5x6-baseline \
  --cache ram
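
The bdd.yaml file passed to --data above is the standard YOLOv5 dataset specification. As a minimal sketch (the train/val paths are placeholders for wherever the dataset Artifact was downloaded, assuming the ten classes listed earlier), it could be generated like this:

# Minimal sketch of generating the YOLOv5 dataset spec (bdd.yaml).
# The train/val paths are placeholders; artifact_dir comes from the Artifact download above.
import yaml

bdd_spec = {
    "path": artifact_dir,         # root folder of the downloaded dataset Artifact
    "train": "images/train",      # training images (relative to 'path')
    "val": "images/val",          # validation images (relative to 'path')
    "nc": 10,                     # number of classes
    "names": ["bike", "bus", "car", "motorcycle", "person",
              "rider", "traffic light", "traffic sign", "train", "truck"],
}

with open("bdd.yaml", "w") as f:
    yaml.safe_dump(bdd_spec, f, sort_keys=False)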


Comparing YOLO Flavors in W&B

Let's compare which of our YOLO models performs best. To do so, we'll look at a few common metrics (there's a small worked example after the list below):
  • Precision answers the question, "What proportion of positive identifications was actually correct?" A model that produces no false positives has a precision of 1.0.
  • Recall answers the question, "What proportion of actual positives was identified correctly?" A model that produces no false negatives has a recall of 1.0.
  • Average Precision (AP) summarizes the precision-recall curve into a single value: the average of the precision values across the curve.
  • Mean Average Precision (mAP) is the mean of the average precision over all classes.
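
Here's a small worked sketch of these metrics with illustrative numbers (not values from our runs):

# A small worked sketch of precision, recall, AP, and mAP (illustrative numbers).
import numpy as np

# Suppose a detector produced, for one class: 80 true positives, 20 false positives,
# and missed 40 ground-truth objects (false negatives).
tp, fp, fn = 80, 20, 40
precision = tp / (tp + fp)   # 0.80 -> "what fraction of detections were right?"
recall = tp / (tp + fn)      # ~0.67 -> "what fraction of objects did we find?"

# AP summarizes the whole precision-recall curve; here we take the area
# under a toy curve sampled at a few recall points.
recall_points = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
precision_points = np.array([1.0, 0.9, 0.8, 0.6, 0.4])
ap = np.trapz(precision_points, recall_points)

# mAP is simply the mean of the per-class APs.
per_class_ap = {"car": 0.62, "person": 0.48, "traffic light": 0.35}
map_value = sum(per_class_ap.values()) / len(per_class_ap)
print(precision, recall, ap, map_value)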


Baseline Experiments

Based on the results from the baseline experiments, we can see that the larger the model is, the better it learns. However, the high number of parameters and FLOPs of the larger YOLO models make them difficult and expensive to deploy in production for a real-life autonomous driving agent.
So, the question we ask ourselves is:
Is it possible to improve the performance of the smaller models by choosing an optimal set of hyperparameters?


Tuning Hyperparameters with W&B Sweeps

In order to improve the performance of the baseline model while keeping it small and cheap enough for real-time inference in a production environment, we need to select not only the best model but also the best set of hyperparameters to train it with. This, despite being quite a daunting task, was made easy for us by Sweeps, a scalable and customizable hyperparameter search and optimization engine by Weights & Biases.
Sweeps made it extremely easy for us to run a Bayesian hyperparameter search with the goal of maximizing the mean Average Precision (mAP) of the model on the validation dataset. Sweeps not only helped us arrive at an optimal set of hyperparameters for our final experiments, but also gave us additional insights into the optimization process, such as the correlation and relative importance of the different hyperparameters involved.
Note that we removed the largest models (yolov5l, yolov5l6, yolov5x, and yolov5x6) from the Sweep, since our objective is to improve the performance of the smaller YOLO models without making them too big and expensive for real-time inference in a production environment.
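
To make this concrete, here is a minimal sketch of what such a Bayesian Sweep configuration could look like. The hyperparameter names, ranges, and the train() entry point are illustrative assumptions rather than the exact configuration we ran; see the official guide referenced below for the YOLOv5-specific setup.

# Minimal sketch of a Bayesian W&B Sweep that maximizes validation mAP.
# Hyperparameter names, ranges, and the train() entry point are illustrative.
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "metrics/mAP_0.5", "goal": "maximize"},
    "parameters": {
        "backbone": {"values": ["yolov5s", "yolov5m", "yolov5s6", "yolov5m6"]},
        "lr0": {"min": 1e-4, "max": 1e-2},             # initial learning rate
        "warmup_momentum": {"min": 0.5, "max": 0.95},  # warmup initial momentum
        "mixup": {"min": 0.0, "max": 0.5},             # image mixup probability
        "fliplr": {"min": 0.0, "max": 0.5},            # horizontal flip probability
    },
}

def train():
    with wandb.init() as run:
        config = run.config
        # ... build and train a YOLOv5 model with these hyperparameters,
        # logging validation mAP as "metrics/mAP_0.5" ...

sweep_id = wandb.sweep(sweep_config, project="bdd100k-perception", entity="av-team")
wandb.agent(sweep_id, function=train, count=50)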


Hyperparameter Tuning using Weights & Biases Sweep

For guidance on how to run a Weights & Biases Sweep for YOLOv5 models, you can refer to this official guide.

Key Insights from the Sweep

  • The parameters Warmup Initial Momentum, Warmup Initial Learning Rate for Bias, and Image Mixup Probability (augmentation) have strong positive correlations with Mean Average Precision or mAP.
  • The parameters Image Horizontal Flip probability (augmentation), Image Scale (in terms of gain), and Image Shear (in terms of degrees) have strong negative correlations with Mean Average Precision or mAP.
  • There is a strong positive correlation between Mean Average Precision and the number of parameters in the model, which is evident from the fact that most of the runs with the best mAP values have a yolov5m or yolov5m6 backbone.
  • Comparing the bounding box predictions of yolov5m and yolov5m6:
    • yolov5m6 performed better than yolov5m in terms of detecting and correctly classifying the objects of interest.
    • In line with this, yolov5m tends to miss many of the objects in a scene even when a low class-score threshold (< 0.3) is used.
    • However, yolov5m6 has more parameters than yolov5m, which might make it more expensive in terms of memory in a production environment.
    • yolov5m also has a slightly lower FLOPs count than yolov5m6, which might make it more efficient for real-time inference in a production environment.



Final Experiments

Given these insights from our Sweeps, it's still a bit difficult to choose a final model to be deployed into production. But one way to finalize the candidate for production is to train both models for a longer period of time (we only trained for five epochs during our baseline experiments and hyperparameter optimizations) with the best set of hyperparameters for each backbone respectively:

Final Experiments


Some Observations from Our Final Experiments

  • yolov5m is a clear winner over yolov5m6 in terms of mAP. We use the Weights & Biases Model Registry as a central system to record our models with appropriate versioning and relevant tags (a minimal sketch of this step follows the list below).
  • yolov5m6 beats yolov5m in terms of precision by a very narrow margin.
  • yolov5m again beats yolov5m6 in terms of recall by a huge margin.
  • The superiority of yolov5m becomes even more evident when we analyze the Bounding Box Predictions. Using the same threshold for the class score, yolov5m6 not only has a tendency to miss out on objects entirely but also misclassifies detected objects.
  • yolov5m is a lot smaller than yolov5m6 in terms of the number of parameters and hence would be the perfect model to be deployed in the production environment for real-time inference.
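
As a rough sketch of that Model Registry step (the file path, artifact name, and registered-model name are assumptions for illustration, not necessarily the exact ones we used):

# Minimal sketch of logging trained weights as an Artifact and linking them
# to the W&B Model Registry. Paths, artifact names, and the registered-model
# name are illustrative.
import wandb

with wandb.init(project="bdd100k-perception", entity="av-team", job_type="register-model") as run:
    model_artifact = wandb.Artifact("YOLOv5-BDD100k", type="model")
    model_artifact.add_file("runs/train/yolov5m-final/weights/best.pt")
    run.log_artifact(model_artifact, aliases=["latest", "v5m"])
    # Link the version into the Model Registry and tag it for production
    run.link_artifact(model_artifact, "av-team/model-registry/YOLOv5-BDD100k", aliases=["prod"])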

YOLOv5-BDD100k: version overview
Full Name: av-team/bdd100k-perception/YOLOv5-BDD100k:v3
Aliases: latest, prod, v3, v5m
Digest: b918942be327797a25ab48baee91142a
Created At: September 9th, 2022, 10:25:56
Num Files: 1
Size: 42.2 MB



What Next

If we wanted to improve the performance of our model, here are a few steps we might take:
  • Collect more data for edge cases:
    • We observed that the models explored in this sprint tend to miss objects that appear small in the frame, for example, vehicles far away on the horizon.
    • They also tend to miss objects that are hard to distinguish from their surroundings, for example, traffic signs whose color is similar to the background and traffic lights that are partially occluded by vehicles from the camera's perspective.
    • We also observed that these models tend to misclassify objects that look similar, for example:
      • trucks that are further away from the camera are often classified as cars
      • trucks, buses, and trains are often confused with one another.
  • Experiment with Focal Loss in order to tackle the imbalance in the distribution of high-priority classes in our dataset (a minimal sketch of focal loss follows this list).
  • Experiment with a model that has a feature pyramid network to efficiently detect objects at multiple scales. This would be especially useful to detect objects which appear small in the frame.
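
For reference, here's a minimal sketch of focal loss on binary classification logits (Lin et al., 2017), as one might plug into a detection training loop; the alpha and gamma values are just the commonly used defaults, not settings we've validated on BDD100K.

# Minimal sketch of focal loss on binary classification logits (Lin et al., 2017).
# It down-weights easy, well-classified examples so training focuses on hard ones.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Standard binary cross-entropy, kept per element
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)             # probability of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()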



Conclusion

This report touched on an object detection pipeline for autonomous vehicles, walking through the major features of Weights & Biases. We used Artifacts to pull and store our dataset, Tables to visualize and explore our data, Experiments to compare our models and select our best performer, Sweeps to iterate and improve on our model's accuracy, and Reports to collate and share our findings (in fact, you're reading a Report right now).
For more exploration and research on Autonomous Vehicles and YOLO, feel free to check out any of the following. And thanks for reading!
