How are your YOLOv5 models doing?

Why is bounding box debugging a necessity? Made by Ayush Chaurasia using Weights & Biases

YOLOv5: An Introduction

On 9 June 2020, Glenn Jocher of Ultralytics released YOLOv5. Unlike previous flavors of YOLO, which used the Darknet framework, YOLOv5 is implemented in PyTorch. This version outperforms all the previous versions in terms of COCO AP and comes close to EfficientDet's AP while running at a higher FPS, as the graph below shows.

[Image: 1_MS_sC3rpdyOGSJF8rwoJxA.png]


Let's look at an example comparing Faster R-CNN and YOLOv5.

1_Rz3RJ4ymhKDEto4XhMF_rw.gif Source

1_ZfMbCrDiXlKfAdn2rST0vg.gif Source


Run speed of YOLOv5 small (end to end, including reading the video, running the model, and saving results to file): 52.8 FPS

Run speed of Faster R-CNN ResNet-50 (end to end, including reading the video, running the model, and saving results to file): 21.7 FPS

Training YOLOv5 on a custom dataset

In this experiment, we're working with a custom dataset of about 2,250 images, and the goal is to detect helmets and masks. Training YOLOv5 on custom data is quite simple.

1. Create Dataset.yaml

# train and val data as 1) directory: path/images/, 2) file: path/images.txt, or 3) list: [path1/images/, path2/images/]
train: data/custom/
val: data/custom
# number of classes
nc: 2
# class names
names: ['helmet', 'mask']
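As a rough illustration, a file like the one above could also be generated from Python, which keeps `nc` in sync with the list of class names. The helper `render_dataset_yaml` is hypothetical and not part of the YOLOv5 repository; the paths and class names are the ones from this experiment.

```python
# Hypothetical helper (not part of YOLOv5): render a dataset config as text
# so that the class count always matches the list of class names.
def render_dataset_yaml(train, val, names):
    """Return the text of a YOLOv5-style dataset YAML file."""
    if not names:
        raise ValueError("at least one class name is required")
    quoted = ", ".join(f"'{name}'" for name in names)
    lines = [
        f"train: {train}",
        f"val: {val}",
        "# number of classes",
        f"nc: {len(names)}",
        "# class names",
        f"names: [{quoted}]",
    ]
    return "\n".join(lines) + "\n"


yaml_text = render_dataset_yaml("data/custom/", "data/custom", ["helmet", "mask"])
print(yaml_text)
```

Writing `yaml_text` to `Dataset.yaml` would reproduce the configuration shown above.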

2. Train a model

The repository provides four pre-defined models to choose from: YOLOv5-small, YOLOv5-medium, YOLOv5-large, and YOLOv5-extraLarge.

For this experiment, all four models were trained on the same dataset for 100 epochs, with W&B logging enabled.
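The following is a minimal sketch of how the four training runs might be launched from the root of the YOLOv5 repository. The checkpoint file names, the image size of 640, and the batch size of 16 are assumptions for illustration only; the report states just the number of epochs. The helper `build_train_command` is hypothetical.

```python
# Assumed checkpoint names for the small, medium, large and extra-large models.
WEIGHTS = ["yolov5s.pt", "yolov5m.pt", "yolov5l.pt", "yolov5x.pt"]


def build_train_command(weights, data="Dataset.yaml", epochs=100,
                        img_size=640, batch_size=16):
    """Build an argument list for YOLOv5's train.py (hypothetical helper)."""
    return [
        "python", "train.py",
        "--img", str(img_size),           # assumed input resolution
        "--batch-size", str(batch_size),  # assumed batch size
        "--epochs", str(epochs),          # 100 epochs, as in this experiment
        "--data", data,
        "--weights", weights,
    ]


commands = [build_train_command(w) for w in WEIGHTS]
for cmd in commands:
    # Print each command; subprocess.run(cmd, check=True) could execute it.
    print(" ".join(cmd))
```

Building the commands as lists first makes it easy to inspect them before starting four long training runs.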

Metrics don't tell the full story

There are various metrics used to measure the accuracy and performance of an object detection model: precision, recall, and mAP at various IoU thresholds, to name a few. We can even combine these metrics into a new aggregate metric to judge performance. But none of these metrics captures the whole picture, because judging performance is sometimes a subjective task. Consider this scenario:

In a high-risk situation, such as a model deployed in a self-driving car, both of these models could prove fatal. But in a scenario such as face detection for applying filters on mobile devices, you can trade some accuracy for a smaller model that is easier to deploy. Another reason why metrics don't capture the whole story is that these evaluations are done at fixed thresholds for class probability, IoU, and objectness confidence, all of which can be tuned for each model to get very different results.
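To illustrate how much the reported numbers depend on those thresholds, here is a small self-contained sketch for a single class. It is not the evaluation code used by YOLOv5; the greedy matching rule and the toy boxes are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, truths, conf_thres, iou_thres):
    """preds: list of (box, confidence); truths: list of boxes (one class)."""
    # Drop low-confidence predictions, then match the most confident first.
    kept = sorted((p for p in preds if p[1] >= conf_thres), key=lambda p: -p[1])
    matched, tp = set(), 0
    for box, _ in kept:
        best, best_iou = None, iou_thres
        for i, truth in enumerate(truths):
            if i in matched:
                continue  # each ground-truth box can be matched only once
            overlap = iou(box, truth)
            if overlap >= best_iou:
                best, best_iou = i, overlap
        if best is not None:
            matched.add(best)
            tp += 1
    precision = tp / len(kept) if kept else 0.0  # convention chosen here
    recall = tp / len(truths) if truths else 0.0
    return precision, recall


# Toy data: two ground-truth boxes and three predictions of varying quality.
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 10), 0.9), ((21, 21, 31, 31), 0.4), ((50, 50, 60, 60), 0.3)]

for conf_thres, iou_thres in [(0.25, 0.5), (0.5, 0.5), (0.25, 0.75)]:
    p, r = precision_recall(preds, truths, conf_thres, iou_thres)
    print(f"conf>={conf_thres}, IoU>={iou_thres}: precision={p:.2f}, recall={r:.2f}")
```

The same set of predictions yields different precision and recall depending on which confidence and IoU thresholds are chosen, which is why a single reported number can hide a lot.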

Thus, these metrics are good for establishing a baseline, or for monitoring whether training (or optimization) is actually progressing. Beyond that, judging model performance is subjective and depends on the task at hand and on the environment in which the model will be deployed. To make this more concrete, let us look at the metrics and then at the results.

[Interactive W&B charts: metrics for the trained models]

Why is bounding box debugging a necessity?

The charts in the last section show how the models compare in terms of metrics. Now, let us look at the actual predictions captured iteratively throughout the training process by the W&B logger.

[Interactive W&B panel: bounding box predictions logged during training]
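For readers who want to log predictions like these themselves, the sketch below converts detections into the dictionary layout that, to the best of my understanding, `wandb.Image` accepts through its `boxes` argument. The helper `to_wandb_boxes` and the sample detection are hypothetical, and this is not necessarily how the YOLOv5 W&B logger does it internally.

```python
def to_wandb_boxes(detections, class_names):
    """Convert detections to a W&B-style bounding box dictionary.

    detections: list of (x1, y1, x2, y2, confidence, class_id) in pixels.
    class_names: dict mapping class id to a readable name.
    """
    box_data = []
    for x1, y1, x2, y2, conf, cls in detections:
        box_data.append({
            "position": {"minX": x1, "minY": y1, "maxX": x2, "maxY": y2},
            "domain": "pixel",  # coordinates are pixels, not fractions
            "class_id": int(cls),
            "box_caption": f"{class_names[int(cls)]} {conf:.2f}",
            "scores": {"confidence": float(conf)},
        })
    return {"predictions": {"box_data": box_data,
                            "class_labels": dict(class_names)}}


class_names = {0: "helmet", 1: "mask"}
# Made-up detection, for illustration only.
boxes = to_wandb_boxes([(34, 50, 120, 160, 0.87, 0)], class_names)
print(boxes)

# With wandb installed and a run initialised, the dictionary could be attached
# to an image roughly like this (not executed here):
#   wandb.log({"predictions": wandb.Image(image, boxes=boxes)})
```

Logging the same validation images at regular intervals makes it possible to step through training and see where each model's boxes improve or stay wrong.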