
Part II: Search and Rescue: Augmentation and preprocessing on drone-based water rescue images with YOLOv5

Achieving mAP scores of over 0.97 on large images with comparatively very small people in them

Mosaics Augmentation

Mosaics were conceived in 2020 and first released in YOLOv4, and they are an integral part of YOLOv5, so it was not something I had to implement myself. Mosaic augmentation works in a way similar to tiling: it takes four source images and combines them into one.
This achieves several augmentation techniques in a single step.
Specifically, it creates four random crops while maintaining the relative scale of the objects compared to the image, which helps the model handle occlusion (objects hidden behind other objects) and translation. It also creates combinations of classes that may not previously have appeared together in the same image; for instance, in the highly imbalanced AFO dataset, it can produce training images containing every class at once.
Finally, it creates variance in the number of objects per image. As shown earlier, the majority of the AFO images contain between 2 and 11 objects; with mosaic augmentation, a single training image could contain between 8 and 44.
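To make the idea concrete, here is a minimal sketch of the core mosaic operation, assuming four equally sized images loaded as NumPy arrays. YOLOv5's real implementation additionally picks a random mosaic center, rescales and pads each tile, and remaps the bounding-box labels; this toy version only stitches the images together.

import numpy as np

def simple_mosaic(img1, img2, img3, img4):
    """Combine four equally sized HxWx3 images into one 2Hx2W mosaic.

    Toy illustration only: YOLOv5 also jitters the mosaic center,
    rescales each tile, and adjusts the label coordinates.
    """
    top = np.concatenate([img1, img2], axis=1)     # left | right
    bottom = np.concatenate([img3, img4], axis=1)  # left | right
    return np.concatenate([top, bottom], axis=0)   # top over bottom

# Hypothetical usage with random "images"
imgs = [np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8) for _ in range(4)]
mosaic = simple_mosaic(*imgs)
print(mosaic.shape)  # (1280, 1280, 3)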
YOLOv5 creates these mosaics automatically as part of training, and here are a few of them:




Plots of Top Performing Models




Conclusions

YOLOv5 is an excellent model for use with SAR imagery in water

Using mAP@0.5 as a metric, which evaluates both localization and classification, the best medium model achieved a mAP score of 0.9727. The best small model, trained on a dataset whose images had their hues randomly adjusted, achieved a mAP of 0.9703, and even the nano model achieved a mAP of 0.9658 with the hue preprocessing.
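The 0.5 in mAP@0.5 is an intersection-over-union (IoU) threshold: a predicted box counts as a correct detection only if it overlaps a ground-truth box of the same class with an IoU of at least 0.5. Here is a minimal sketch of the IoU computation, assuming boxes in [x1, y1, x2, y2] format:

def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [5, 0, 15, 10]))  # ~0.33 -> not a match at IoU >= 0.5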

Image resolution is more important than model size



As I had initially expected, the 1280-input models outperform the 640-input models at every model size. This is understandable: at 640x640 resolution, many of the objects to detect are only a few pixels across.
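As a rough back-of-the-envelope check, downscaling a high-resolution aerial frame to 640 px can shrink a swimmer to just a handful of pixels. The numbers below are illustrative assumptions, not measurements from the AFO dataset:

# Illustrative only: assumed source width and object size, not AFO measurements.
source_width_px = 3840   # e.g. a 4K drone frame
object_width_px = 40     # a person occupying ~40 px across in the source frame

for model_input in (640, 1280):
    scale = model_input / source_width_px
    print(f"{model_input}px input -> object is ~{object_width_px * scale:.1f} px wide")

# 640px input  -> object is ~6.7 px wide
# 1280px input -> object is ~13.3 px wide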

YOLOv5 small (YOLOv5s6) and nano (YOLOv5n6) could work well in portable devices

On my MacBook Pro without a GPU, the nano model can run inference, i.e. detect people in an image, in about 535 ms, or roughly half a second. While this is about 16 times too slow to analyze every frame of 30 FPS video, drones aren't moving so quickly that a person in the water would be in and out of frame within half a second. The nano model is also tiny: only 7.3 MB. It's also very likely that a purpose-built small computing device could outperform my laptop. NVIDIA makes a small, powerful computer called the Jetson Nano that can run multiple neural networks in parallel while consuming only 5 watts.
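The frame-budget arithmetic behind the "16 times too slow" figure is straightforward; this small sketch reproduces it from the 535 ms measurement above:

inference_ms = 535                   # measured nano-model latency on my laptop CPU
video_fps = 30
frame_budget_ms = 1000 / video_fps   # ~33.3 ms available per frame at 30 FPS

print(f"Per-frame budget at {video_fps} FPS: {frame_budget_ms:.1f} ms")
print(f"Slowdown factor: {inference_ms / frame_budget_ms:.1f}x")      # ~16x
print(f"Achievable rate: {1000 / inference_ms:.2f} frames/second")    # ~1.9 FPS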
The small model, while performing better than the nano model, also takes 2.5-3 times longer to run inference, so it may not be ideal depending on the required trade-off between speed and result quality. As can be seen below, the medium model is significantly slower still.
I didn't have a Jetson Nano to test the speeds on, but I did test on my laptop and on Google Colab, both with and without the GPU enabled. My laptop is an M1X MacBook Pro with 64 GB of RAM; Google Colab runs on an Intel Xeon E5-2673 v4 with 13 GB of RAM; and the NVIDIA Tesla V100 is a monster of a GPU with 16 GB of RAM that barely registers on the chart below.
The YOLOv5 models are also significantly smaller than the previous iteration, YOLOv4. The YOLOv5n6, YOLOv5s6, and YOLOv5m6 models are 7.3 MB, 25.8 MB, and 69.0 MB, respectively.



Hue preprocessing improves the model for people in water

Another conclusion is that randomly adjusting hue in color images as a preprocessing step is a viable and valuable technique for improving the training of these models. My guess is that the uniform color of the water in the majority of the photos lets the model rely a bit too heavily on shades of blue and green as evidence that no person is present; by varying the hue during training, the model can't simply assume that large regions of similar texture contain no people.
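For reference, here is a minimal sketch of the kind of random hue shift I mean, using OpenCV's HSV representation. The exact ranges, library, and implementation I used for preprocessing may differ, so treat this as illustrative:

import cv2
import numpy as np

def random_hue_shift(image_bgr, max_shift=25, rng=np.random.default_rng()):
    """Randomly shift the hue channel of a BGR image.

    Illustrative sketch: OpenCV stores hue in [0, 179] for uint8 images,
    so the shift wraps around with modulo 180.
    """
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    shift = rng.integers(-max_shift, max_shift + 1)
    hsv[..., 0] = (hsv[..., 0].astype(np.int16) + shift) % 180
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

# Hypothetical usage on one training image ("example.jpg" is a placeholder path)
img = cv2.imread("example.jpg")
if img is not None:
    augmented = random_hue_shift(img)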



Appendix A: Additional Charts

These include additional loss plots, learning rates, and GPU utilization for the top 13 models I ended up using.











Appendix B: YOLOv5 Notebook

This is the core of the code I used to run the model in its many variations. It is largely adapted from the freely available demos that do this; I did not write the majority of this code.

Step 1: Installation

# Check whether a GPU is visible to TensorFlow and to the system
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

# Clone the YOLOv5 repository and install its dependencies
!git clone https://github.com/ultralytics/yolov5 # clone repo
%cd yolov5
%pip install -qr requirements.txt # install dependencies
%pip install -q roboflow

import torch
import os
from IPython.display import Image, clear_output # to display images

print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name \
if torch.cuda.is_available() else 'CPU'})")


Step 2: Connect to Google Drive

# Used for running in Google Colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Step 3: Train the YOLOv5 model

# Weights & Biases to track everything. I used v0.12.10 as the current version would crash for me
%pip install -q wandb==0.12.10
#%pip install -q wandb
import wandb
wandb.login()

!python train.py --img 1280 --batch 8 --epochs 150 --data /content/drive/MyDrive/datasets/data.yaml --weights yolov5m6.pt \
--cache --upload_dataset --bbox_interval 10
Here is a small example of the training output:


Step 4: Evaluate YOLOv5 Detector Performance

# Start tensorboard
# Launch after you have started training, logs save in the folder "runs"
%load_ext tensorboard
%tensorboard --logdir runs


Run Inference with Trained Weights

!python detect.py --weights runs/train/exp/weights/best.pt --img 1280 --conf 0.1 --source \
/content/drive/MyDrive/datasets/test/images
A sample of some of the inference output


Step 5: Display detections on images

# Display inference results on ALL test images

import glob
from IPython.display import Image, display

for imageName in glob.glob('/content/yolov5/runs/detect/exp/*.jpg'):  # assuming JPG
    display(Image(filename=imageName))
    print("\n")
A sample of the output


Appendix C: Hyperparameters



Appendix D: References

I used Weights & Biases https://wandb.ai/ extensively to track the model training and generate visualizations. I also used examples from the class notes and homework assignments for the preprocessing steps.
Hodge VJ, Hawkins R, Alexander R. Deep reinforcement learning for drone navigation using sensor data. Neural Computing and Applications. 2020; 1–19. https://link.springer.com/article/10.1007/s00521-020-05097-x
Lygouras E, Santavas N, Taitzoglou A, Tarchanidis K, Mitropoulos A, Gasteratos A. Unsupervised human detection with an embedded vision system on a fully autonomous UAV for search and rescue operations. Sensors. 2019 08; 19: 3542. https://www.mdpi.com/1424-8220/19/16/3542
Barret Zoph, Ekin D. Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, Quoc V. Le, Learning Data Augmentation Strategies for Object Detection https://arxiv.org/abs/1906.11172
Small Target Detection for Search and Rescue Operations using Distributed Deep Learning and Synthetic Data Generation (Patch and synthetic data reference): https://arxiv.org/abs/1904.11619
Search and Rescue with Airborne Optical Sectioning https://arxiv.org/abs/2009.08835
Rethinking Drone-Based Search and Rescue with Aerial Person Detection https://arxiv.org/abs/2111.09406
An Autonomous Drone for Search and Rescue in Forests using Airborne Optical Sectioning https://arxiv.org/abs/2105.04328
UAV-Based Search and Rescue in Avalanches using ARVA: An Extremum Seeking Approach https://arxiv.org/abs/2106.14514
Weights & Biases YOLOv5 Documentation: https://docs.wandb.ai/guides/integrations/yolov5