
Search and Rescue: Augmentation and Preprocessing on Drone-Based Water Rescue Images With YOLOv5

In this article, we look at achieving mAP scores of over 0.97 on large images with comparatively very small people, as used in drone-based water rescue.
Air and sea search and rescue (SAR) operations have been conducted the same way for decades. People use technology such as radar to get a general search area and then scan the terrain to try and find anything unusual.
More recently, unmanned aerial vehicles (UAVs) have been used for search and rescue operations. An important advantage of UAVs is that they make it possible to detect small objects over a wide area, even when the victim is far away, difficult to find, and often in a dark environment.

U.S. Coast Guard Search and Rescue statistics for the 10-year period from 2007 to 2017 show that they took part in between 15,951 and 27,163 SAR operations and still, despite great sacrifice and effort, lost between 605 and 740 people each year (source: https://www.bts.gov/content/us-coast-guard-search-and-rescue-statistics-fiscal-year).


Modern UAV video cameras have very high resolution, and it can be hard for the human eye to find a distant person in such high-definition video. Humans have the advantage of knowing the context of the image and of looking where victims are likely to be found based on prior experience. However, we can only focus on a small part of a given image at a time.
Humans also get tired when faced with such a repetitive, eye-straining task; try and complete several “Where’s Waldo” books back-to-back, and you’ll likely become increasingly weary, slower, and more error-prone. Computer vision doesn’t suffer this particular drawback, and a well-trained model that can spot humans in either water or wilderness will perform with the same accuracy in the first 5 minutes as in the next 50 hours.
Helicopters and search drones are now equipped with cameras operating in various spectra, reflectors, marine automatic identification system (AIS) transponder receivers, tracking and navigation systems, and even cell phone detectors (Hodge et al., 2020).
While neural networks have exploded in popularity in many other computer vision tasks, they remain relatively unpopular in SAR applications due to the lack of large datasets for training the models (Lygouras et al., 2019).
Another issue is the sheer size of the search space and the number of images that must be searched. Apart from issues of exhaustion, humans also cannot scale as computers do. Like many other applications of AI and computer vision, training models to spot anomalies that the human eye might miss could be hugely effective in SAR operations where vast stretches of empty ocean or wilderness must be scoured for signs of life.
Reducing the time for search and rescue is important. Lives depend on it. In many cases, the survivability of the victim decreases exponentially over time, and fast models that can run without internet access ("edge AI") are a priority. The problem I aim to address is improving the accuracy of spotting a missing person in large images while still using compact models that can run on a mobile device such as a laptop.




Results Summary

My full results are discussed at length at the end of the project. However, I am very pleased to report that YOLOv5 is an excellent model for SAR imagery in water. Using mAP 0.5 as a metric that evaluates both localization and classification, the optimal medium model achieved an mAP score of 0.9727. The small model, trained using a dataset where the images had their hues randomly adjusted, achieved an mAP of 0.9703, and even the nano model achieved an mAP of 0.9658 with the hue preprocessing.
This has led to two conclusions:

The first conclusion is that YOLOv5 is an excellent fit for this task: even the compact small and nano models, which can run on a laptop, reach mAP scores above 0.96, making them viable for the edge deployments SAR missions require.

The second conclusion is that randomly adjusting hue in color images as a preprocessing step is a viable and valuable technique for improving the training of these models. My guess is that the uniform color of the water in the majority of the photos allows the model to rely a bit too much on shades of blue and green indicating the absence of a person. By varying the hue during training, the model can't simply assume that large regions with similar textures contain no people.

An example of how challenging spotting a person in a typical SAR image can be. This is a real helicopter photo taken in British Columbia, Canada, in late 2021. The hiker was successfully rescued after 3 days in the wilderness and suffering from hypothermia.

Project Goal and Approach

My project goals changed over time as I began investigating the task. From the start, my intention was to train a model to detect people in overhead drone shots, where the people are much smaller relative to the size of the full image. I initially experimented with Faster R-CNN. However, I found that it was creating multiple bounding boxes for the same objects, and the implementation I was working with didn't make it easy to extract all the data I wanted. I ended up using YOLOv5 exclusively for my experimentation, as even its smallest, nano-sized variant was able to achieve an mAP score of 0.9658 running on my laptop.
As a result of the excellent, speedy results from YOLOv5 coupled with its incredible integration with Weights & Biases for monitoring every step of the training and evaluation, I decided to instead focus on experimentation with model variations, preprocessing, and augmentations. My testing varied model size, image size, tiling, hue, shearing, greyscale, contrast stretching (normalization), histogram equalization, adaptive equalization, and newer techniques such as bounding box brightness and rotation.
I had intended to experiment with vision transformers and synthetic data, but in the end neither turned out to be applicable or necessary. Vision transformers, used as classifiers, could only indicate whether a person or other manmade object was present in an image, while synthetic data is a more complex form of data augmentation that I found exceeded the scope of the project.
My plan includes the following to detect small objects better:
  • Using the largest resolution images that can still be run on a small device. Small objects may contain only a few pixels within the bounding box, so it's important to increase the richness of the features the detector can form from that small box. The trade-off is that this results in a larger model that takes longer to train and is slower at inference once complete.
  • Tiling images to effectively zoom in on small objects while still using smaller input images. As it turned out, I didn't have enough memory to run this with 1280x1280 images, as the dataset grew too large to fit into Colab's memory. The version I did run, a tiled 640x640 model, had inferior performance to the 1280x1280 model, since it had essentially the same data cut into four but could also miss objects cut in half by the tiling.
  • Filtering out unnecessary classes, which can also improve the classification of objects. As I note later, I started with the full six-class partition and moved to the two-class data.
  • I am not varying the hyperparameters of YOLOv5 other than experimenting with preprocessing, augmentation, model sizes, and input resolution. I've attached the list of hyperparameters used as an appendix at the end of this report.

Dataset: AFO - Aerial dataset of floating objects

I chose the freely available AFO - Aerial dataset of floating objects, which is on Kaggle. https://www.kaggle.com/datasets/jangsienicajzkowy/afo-aerial-dataset-of-floating-objects
Jan Gąsienica-Józkowy, Mateusz Knapik, and Bogusław Cyganek from the AGH University of Science and Technology in Krakow, Poland, released the AFO dataset in 2021. The notes from their repository read as follows:
• AFO dataset is the first free dataset for training machine learning and deep learning models for maritime search and rescue applications. It contains aerial-drone videos with 40,000 hand-annotated persons and objects floating in the water, many of small size, which makes them difficult to detect.
• The AFO dataset contains images taken from fifty video clips containing objects floating on the water surface, captured by the various drone-mounted cameras (from 1280 x 720 to 3840 x 2160 resolutions). From these videos, 3,647 images that contain 39,991 objects were extracted and manually annotated. These have been split into three parts: the training (67.4% of objects), the test (19.12% of objects), and the validation set (13.48% of objects). In order to prevent overfitting of the model to the given data, the test set contains selected frames from nine videos that were not used in either the training or validation sets.

Examples of annotated objects from the AFO dataset. This summary shows three samples per category, and was included with the dataset itself.
As stated above, the dataset contains large images ranging from 1280 x 720 to 3840 x 2160 in size. This is larger than any of the YOLOv5 models require, as they were trained on 640 x 640 images for the YOLOv5- models and 1280 x 1280 for the YOLOv5-6 models. While this sounds like a doubling in size, it is an increase from 409,600 pixels to 1,638,400, a quadrupling. The size of the inputs, mAP values, and speeds are listed in the table below.
Most importantly, it is worth noting that YOLOv5 is small. A weights file for YOLOv5s6, which natively accepts 1280 x 1280 images, is 25.8 megabytes. The weights file for YOLOv4, using the Darknet architecture, is much larger at 244 megabytes. This means YOLOv5 is 89.43% smaller than YOLOv4, making it ideal for use in embedded devices, tablets, phones, or laptops. For SAR imagery, which could be processed in the wilderness or on a helicopter, this is crucial.
The various YOLOv5 models. YOLOv5- models were trained on 640 x 640 images and YOLOv5-6 were trained on 1280 x 1280.

Dataset Properties

As mentioned, the AFO dataset contains large images with very small things to detect. Over 99% of the objects have a surface area smaller than 1% of the entire image area. It also contains a lot of crowded images, where more than 30% of the images contain more than 20 instances of objects. This is by design, as the dataset is intended for the development and verification of small object detection.
The dataset also has three class partitions, although I only made use of two of them for this project.

6 Class Partition

The first partition contains six different categories:
  • human
  • surfboard
  • boat
  • buoy
  • sailboat
  • kayak
The intention is to see how well detectors can detect actual humans vs. other floating objects. The issue here is that, just like in the real world, humans outnumber everything else, creating a data imbalance where humans represent 80% of the objects.
The human class (class 0) vastly outnumbers the boat and buoy classes (classes 1 to 5), and the image sizes are significantly larger than YOLOv5 can make use of natively.
The majority of objects are centered in the image, and nearly half of the images contain between 2 and 11 objects, more than any other count range.
Examples from the dataset with 6 classes as predicted by the YOLOv5s6 model. Raw photo is on the left, ground truth in the center, and predictions on the right.

2 Class Partition

As a way of correcting this imbalance, and based on the idea that most maritime SAR operations prioritize detecting an object over assigning it a specific category, there is also a two-class version of the dataset. Humans and buoys are categorized into a single "small objects" class, while surfboards, boats, sailboats, and kayaks are in a second "large objects" category. This made the dataset slightly more balanced and also prepared it for training models for typical SAR operations, where searchers are looking for either people or boats.
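To make the remapping concrete, below is a minimal sketch (my own illustration, not the dataset's official tooling) of how YOLO-format label files could be rewritten into the two-class scheme. The numeric IDs assumed for the non-human classes, the helper name, and the directory paths are all hypothetical:
import glob
import os

# Assumed class IDs: only human = 0 is confirmed above; adjust the rest to match the dataset's labels.
SMALL_OBJECT_IDS = {0, 3}        # human, buoy -> new class 0 ("small object")
LARGE_OBJECT_IDS = {1, 2, 4, 5}  # surfboard, boat, sailboat, kayak -> new class 1 ("large object")

def remap_to_two_classes(label_dir, out_dir):
    '''Rewrite YOLO label files (class x_center y_center width height) into the two-class scheme.'''
    os.makedirs(out_dir, exist_ok=True)
    for path in glob.glob(os.path.join(label_dir, '*.txt')):
        new_lines = []
        with open(path) as f:
            for line in f:
                parts = line.split()
                if not parts:
                    continue
                new_id = 0 if int(parts[0]) in SMALL_OBJECT_IDS else 1
                new_lines.append(' '.join([str(new_id)] + parts[1:]))
        with open(os.path.join(out_dir, os.path.basename(path)), 'w') as f:
            f.write('\n'.join(new_lines) + '\n')

# Hypothetical directory layout
remap_to_two_classes('labels/train_6class', 'labels/train_2class')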

Examples from the dataset with 2 classes as predicted by the YOLOv5m6 model. Raw photo is on the left, ground truth in the center, and predictions on the right.

Training Setup

I used Google Colab Pro+ and Google Drive for storage. I've used this service before for a large NLP project, and the available GPUs are frequently Tesla V100s. It also allows background processing for up to 24 hours. I have 2 terabytes of space available on Google Drive, which is more than enough for the approximately 10 gigabytes of data needed by the datasets.
After running 40 model runs on Weights & Biases, I've used 32 gigabytes of space and 4 full days of compute on an NVIDIA Tesla V100 GPU with 640 tensor cores. NVIDIA claims these cards are equivalent to 100 CPUs, so if it hadn't been for Google Colab I'd have been heating my house with a laptop for a month.

Plenty of room for more!


Baselines

Baseline 1280 x 1280 Images with no augmentation, stretched to a square aspect ratio

Resizing images is a critical preprocessing step for any computer vision task. Fundamentally, machine learning models train faster on smaller images. Input images with twice the dimensions require models to learn from four times as many pixels, which adds precious time to the prediction task. Furthermore, images come in varying sizes, and many models require all inputs to be the same size.
When converting the large, rectangular AFO images to squares – which YOLOv5 requires – we have to choose between maintaining the original aspect ratio and padding the newly resized image, or stretching the image, which distorts it but fills the entire square of the desired dimensions. For my purposes, I decided to stretch; however, this comes with the caveat that the model then requires similarly stretched images to give decent predictions. That is to say, because we're teaching the model with stretched little people, we have to stretch the people we want to detect in the same way.
It was my strong belief that I would need to use the largest images possible to detect tiny people, so I started with the 1280x1280 image model.
Example code for resizing images by stretching is shown below. First, we load and display the original image:
import matplotlib.pyplot as plt
from skimage import io

# Load the raw AFO image from disk
afo_image = io.imread('/Users/iankelk/Google Drive/Computer Vision/r4_100.jpg')

print('The image object is ' + str(type(afo_image)))
print('The pixel values are of type ' + str(type(afo_image[0,0,0])))
print('Shape of image object = ' + str(afo_image.shape))

# Display the image at a large figure size
fig, ax = plt.subplots(figsize=(12, 12))
_ = ax.imshow(afo_image)
Output of the above code snippet to load and display the image and dimensions.
Now we resize and stretch the image.
from skimage.transform import resize

# Stretch the rectangular image to a 1280 x 1280 square (distorting the aspect ratio)
image_resized = resize(afo_image, (1280, 1280), anti_aliasing=True)

print('The image object is ' + str(type(image_resized)))
print('The pixel values are of type ' + str(type(image_resized[0,0,0])))
print('Shape of image object = ' + str(image_resized.shape))
fig, ax = plt.subplots(figsize=(12, 12))
_ = ax.imshow(image_resized)

Output of the above code snippet to resize and reshape the image.
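For comparison, here is a minimal sketch of the padding ("letterbox") alternative mentioned above, which keeps the original aspect ratio and fills the rest of the square with a constant gray border. This is only an illustration of the option I decided against; the helper name and fill value are my own choices:
import numpy as np
from skimage.transform import resize

def letterbox(img, size=1280, fill=0.5):
    '''Resize while preserving the aspect ratio, then pad to a size x size square.'''
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = resize(img, (new_h, new_w), anti_aliasing=True)
    # Center the resized image on a constant-valued square canvas
    canvas = np.full((size, size, img.shape[2]), fill, dtype=resized.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

image_letterboxed = letterbox(afo_image, size=1280)
fig, ax = plt.subplots(figsize=(12, 12))
_ = ax.imshow(image_letterboxed)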
We apply the stretch-resizing to our entire dataset:






Baseline 1280 x 1280 Images with no augmentation, stretched to square aspect ratio

Due to the increased memory requirements of using larger images with four times the number of pixels, I was only able to run the YOLOv5-6 models up to the medium size.





Baseline Conclusions








Dataset Augmentations and Preprocessing

There are many ways to combine preprocessing and augmentations of datasets to improve results. I've mainly done them one at a time to isolate their effects, but of course, they can be combined as well. YOLOv5 uses several augmentation methods, including one called "mosaic," which I will address after the ones I did myself.
Preprocessing is the set of steps to prepare images before they are used by model training and inference and includes basic elements such as resizing, orientation, and color corrections. Similarly, image augmentation is a transformation applied to images to create different versions of similar content to expose the model to a wider range of training examples. Examples of augmentations are random rotations, contrast, and scale, which attempt to simulate other possible views of detected objects.
Image augmentation manipulations are also forms of image preprocessing with a twist. While image preprocessing steps are applied to training and test sets, image augmentation is only applied to the training data. Thus, a transformation that could be an augmentation in some situations may best be a preprocessing step in others.
The one preprocessing step that improved results over the baseline was hue randomization.

Contrast Adjustments: Histogram Equalization, Adaptive Equalization, and Normalization

This type of preprocessing boosts contrast based on the image's histogram to improve normalization and line detection in varying lighting conditions.

Histogram Equalization

This example adjusts an image's contrast using histogram equalization, which spreads out the most frequent intensity values in an image. An example from the dataset, and the code to reproduce it, is shown here.
import numpy as np
import matplotlib.pyplot as plt
from skimage import exposure

def plot_image(img, title='Equalized Image'):
    '''Plots the complete image with a default title of "Equalized Image".'''
    fig, ax = plt.subplots(figsize=(6, 6))
    _ = ax.imshow(img)
    ax.set_title(title, fontsize=16)
    plt.show()

def plot_image_distribution(img):
    '''Plots the histogram of an image along with its cumulative distribution.'''
    fig, ax = plt.subplots(1, 2, figsize=(12, 5))
    ax[0].hist(img.flatten(), bins=50, density=True, alpha=0.3)
    ax[0].set_title('Histogram of image')
    ax[0].set_xlabel('Pixel value')
    ax[0].set_ylabel('Density')
    ax[1].hist(img.flatten(), bins=50, density=True, cumulative=True, histtype='step')
    ax[1].set_title('Cumulative distribution of image')
    ax[1].set_xlabel('Pixel value')
    ax[1].set_ylabel('Cumulative density')
    plt.show()

def test_equalize(img):
    # Equalize each color channel separately, then restack them into an RGB image
    img_equalized = np.multiply(np.dstack((exposure.equalize_hist(img[:,:,0]),
                                           exposure.equalize_hist(img[:,:,1]),
                                           exposure.equalize_hist(img[:,:,2]))), 255).astype(np.uint8)
    plot_image(np.divide(img_equalized, 255.0))
    plot_image_distribution(np.divide(img_equalized, 255.0))
    return img_equalized

afo_image_equalized = test_equalize(image_resized)
Histogram equalization example, along with the corresponding density and cumulative density plots.
Applying this to our AFO dataset as a whole, we get the following:


We can conclude from this that histogram equalization is not a useful preprocessing step for the AFO data.

Adaptive Histogram Equalization

Similarly to histogram equalization, we can preprocess the images with adaptive histogram equalization as well. Adaptive histogram equalization is an algorithm for local contrast enhancement that uses histograms computed over different tile regions of the image. Local details can therefore be enhanced even in regions that are darker or lighter than most of the image (taken from the skimage documentation).
def test_equalize_adaptive(img):
    # Apply adaptive (tile-based) equalization to each color channel separately, then restack them
    img_equalized = np.multiply(np.dstack((exposure.equalize_adapthist(img[:,:,0]),
                                           exposure.equalize_adapthist(img[:,:,1]),
                                           exposure.equalize_adapthist(img[:,:,2]))), 255).astype(np.uint8)
    plot_image(np.divide(img_equalized, 255.0))
    plot_image_distribution(np.divide(img_equalized, 255.0))
    return img_equalized

afo_image_adapt_equalized = test_equalize_adaptive(image_resized)
Adaptive equalization example, along with the corresponding density and cumulative density plots



As with histogram equalization, we can conclude that adaptive equalization is also not a useful preprocessing step to take for the AFO data.

Contrast Normalization

In image processing, normalization is a process that changes the range of pixel intensity values. Applications include photographs with poor contrast due to glare, for example. Normalization is sometimes called contrast stretching or histogram stretching. In more general fields of data processing, such as digital signal processing, it is referred to as dynamic range expansion. 
As before, I try out contrast normalization as my final contrast adjustment preprocessing experiment.
import cv2 as cv

def test_normalize(img):
    '''Stretch pixel intensities to cover the full 0-255 range (contrast normalization).'''
    norm = np.zeros(img.shape)
    img_normalized = cv.normalize(img, norm, 0, 255, cv.NORM_MINMAX)
    plot_image(np.divide(img_normalized, 255.0))
    plot_image_distribution(np.divide(img_normalized, 255.0))
    return img_normalized

afo_image_normalized = test_normalize(image_resized)
Contrast normalization example, along with the corresponding density and cumulative density plots


As with histogram and adaptive equalization, we can conclude that contrast normalization is also not a useful preprocessing step for the AFO data. The only metric on which it appears marginally better is precision in the small model; otherwise, it hinders performance.

Tiling Preprocessing

Tiling is an interesting technique for small object detection. While it effectively zooms the detector in on tiny objects, it also carries the risk that objects will be cut in half by the technique, with half appearing in one image and half in the other. Thus, depending on the model, it could be detected twice or not at all.
It does provide the additional advantage of allowing us to keep the small input resolution we need in order to run fast inference. Of course, if the model is trained on tiled images, it will need to use tiled images for inference as well, which requires additional image processing time. It should be noted that these tiling experiments used the 6-class version of the dataset and should be compared only with other 6-class results.
Here is a sample of the code to tile the images:
import math
import matplotlib.pyplot as plt

img_shape = image_resized.shape
tile_size = (640, 640)
offset = (640, 640)

fig, ax = plt.subplots(2, 2, figsize=(12, 12))
plt.tight_layout()
ax = ax.flatten()
img_index = 0

# Slide over the image in tile-sized steps and crop out each tile
for i in range(int(math.ceil(img_shape[0] / (offset[1] * 1.0)))):
    for j in range(int(math.ceil(img_shape[1] / (offset[0] * 1.0)))):
        cropped_img = image_resized[offset[1]*i:min(offset[1]*i + tile_size[1], img_shape[0]),
                                    offset[0]*j:min(offset[0]*j + tile_size[0], img_shape[1])]
        ax[img_index].imshow(cropped_img)
        img_index += 1

The code sample has tiled the 1280 x 1280 square image into four 640 x 640 images. Note that part of the boat has been cut off in the lower images on the border between them, leading to somewhat poor predictions for this experiment.
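The code above only tiles the pixels; the YOLO-format annotations have to be tiled as well, which is exactly where cut-in-half objects cause trouble. Below is a minimal sketch of clipping normalized (class, x_center, y_center, width, height) boxes to a single tile. The helper and the minimum-visibility threshold are my own assumptions, not part of the AFO tooling or YOLOv5:
def clip_boxes_to_tile(boxes, tile_row, tile_col, n_rows=2, n_cols=2, min_visibility=0.2):
    '''Re-express normalized YOLO boxes (cls, xc, yc, w, h) in the coordinates of one tile.
    Boxes whose visible area in the tile falls below min_visibility of their full area are dropped.'''
    tile_boxes = []
    for cls, xc, yc, w, h in boxes:
        # Box corners in full-image normalized coordinates
        x1, y1, x2, y2 = xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2
        # Tile boundaries in the same coordinate system
        tx1, ty1 = tile_col / n_cols, tile_row / n_rows
        tx2, ty2 = (tile_col + 1) / n_cols, (tile_row + 1) / n_rows
        # Intersection of the box with the tile
        ix1, iy1 = max(x1, tx1), max(y1, ty1)
        ix2, iy2 = min(x2, tx2), min(y2, ty2)
        if ix2 <= ix1 or iy2 <= iy1:
            continue  # no overlap with this tile
        if (ix2 - ix1) * (iy2 - iy1) < min_visibility * w * h:
            continue  # too little of the object is visible in this tile
        # Convert back to center format, normalized to the tile's own size
        tile_boxes.append((cls,
                           ((ix1 + ix2) / 2 - tx1) * n_cols,
                           ((iy1 + iy2) / 2 - ty1) * n_rows,
                           (ix2 - ix1) * n_cols,
                           (iy2 - iy1) * n_rows))
    return tile_boxes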

The tiled dataset breakdown as created from the stretched 1280x1280 images. Due to the 2x2 tiling, the dataset has quadrupled in size.


The tiling experiment did not produce good results: the 640 x 640 tiles contain the same information at the same resolution as the 1280 x 1280 images, just cut into pieces. The trade-off in speed is also offset by the fact that the dataset is now four times as large. It's possible that a tiled model trained on four 1280 x 1280 tiles per image could beat the current best results. However, that was impossible here, as a dataset of 8,000 images of size 1280 x 1280 required too much memory to be runnable in Colab, and it certainly would not be viable for use during SAR missions.

Greyscaling

Since colored images are generated from mixes of red, green, and blue in various proportions, color can add unnecessary complexity to the neural model if it isn't needed for object detection. Instead of performing convolutions on just one array, the model must perform them on three distinct arrays. As greyscale images are more computationally efficient, it might be better to ask: when do we need color? This is a tricky question, and no single answer is always correct. Color might be needed when textures are uniform and removing color would remove features. As we see later, hue augmentation is beneficial to the model, so greyscale may not be the most useful step here.
from skimage.color import rgb2gray

# Convert the resized RGB image to a single-channel greyscale image
greyscale_image = rgb2gray(image_resized)
print('The image object is ' + str(type(greyscale_image)))
print('The pixel values are of type ' + str(type(greyscale_image[0,0])))
print('Shape of image object = ' + str(greyscale_image.shape))
fig, ax = plt.subplots(figsize=(12, 12))
_ = ax.imshow(greyscale_image, cmap=plt.get_cmap('gray'))
Our resized image rendered as grayscale.


Grayscale was a very poor choice for this use case. It's very likely that the color of the water is crucial for identifying objects floating in it, and this is evident from a fairly large drop in performance when grayscale images are used for training.

Shear Augmentation

Random shearing is a common data augmentation technique in which images undergo random affine transformations with only the shear component active, changing the position objects take in the frame. Any time such a transformation is applied, the accompanying labels have to be adjusted as well. Random shearing is particularly useful because it changes the angles that objects have relative to each other during training. It's possible that during the training data collection there was some bias in the position of the camera that doesn't reflect real-world uses and angles; with shearing augmentation, this can be addressed without collecting more data. It can also combat overfitting by increasing variation to prevent the model from memorizing the training data.
import numpy as np
import skimage.transform as transform

# Rotation is 0; only the shear component is applied
theta = 0
c, s = np.cos(theta), np.sin(theta)
shear = np.pi / 6

# Homogeneous affine matrix combining rotation (identity here) and shear
R = np.array(((c, -np.sin(theta + shear), 0),
              (s,  np.cos(theta + shear), 0),
              (0,  0,                     1)))

# warp() treats R as the inverse coordinate map from output to input
transformed_img = transform.warp(image_resized, R)
plot_image(transformed_img)
An application of a shear of π/6 applied to the same resized image as above
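When an image is sheared, its bounding boxes have to be sheared with it. Since warp() treats R as the inverse map (from output coordinates back to input coordinates), box corners move under the inverse of R. Below is a minimal sketch of that adjustment with a hypothetical box in pixel (x, y) coordinates; this is my own illustration rather than the augmentation pipeline's actual code:
import numpy as np

def shear_boxes(boxes_xyxy, R):
    '''Apply the same shear to boxes given as (x1, y1, x2, y2) pixel corners.'''
    forward = np.linalg.inv(R)  # warp() uses R as output -> input, so points move under inv(R)
    sheared = []
    for x1, y1, x2, y2 in boxes_xyxy:
        # All four corners, since a shear turns an axis-aligned box into a parallelogram
        corners = np.array([[x1, y1, 1], [x2, y1, 1], [x1, y2, 1], [x2, y2, 1]]).T
        warped = forward @ corners
        xs, ys = warped[0] / warped[2], warped[1] / warped[2]
        # Keep the new axis-aligned bounding box of the transformed corners
        sheared.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return sheared

# Hypothetical box on the 1280 x 1280 resized image
print(shear_boxes([(600, 500, 660, 560)], R))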
Since I used shearing as an augmentation, it doubled the size of the dataset:





As can be seen in the plots, shearing augmentation, at least at a random level of +/-15%, is not effective with this data.

Hue Augmentation



Hue is measured radially, with colors adjusted by some number of positive or negative degrees. Once in HSV space, I chose +/-100º to get a good variety of colors and really let those mind-altering substances fly. Thus, when computing the hue, I randomly choose a number between -100 and 100. The hue-augmented dataset is the same size as the previous shear-augmented dataset.
import cv2 as cv
import numpy as np

img_float32 = np.float32(image_resized)
# Convert the RGB image to HSV (for float32 images, OpenCV's hue channel spans 0-360 degrees)
hsv_image = cv.cvtColor(img_float32, cv.COLOR_RGB2HSV)
# Get a random delta between -100 and 100 degrees
delta = np.random.randint(low=-100, high=100, size=1)[0]
# Shift the hue by that quantity (wrapping around 360 degrees) and convert back to RGB
hsv_image[:,:,0] = np.mod(hsv_image[:,:,0] + delta, 360.)
rgb_image = cv.cvtColor(hsv_image, cv.COLOR_HSV2RGB)
plot_image(rgb_image)





Bounding Box Augmentation



Bounding box augmentation added brightness of +/- 25% and rotation of +/-10º to bounding box annotations
Three examples of bounding box augmentation from the augmented dataset using 3rd party tools at https://www.aquariumlearning.com/
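I used a third-party tool for this augmentation, but the core idea is simple enough to sketch: randomly brighten or darken only the pixels inside each annotated box. The snippet below is my own illustration (not the tool's implementation), assuming pixel-coordinate boxes and a float image in the [0, 1] range:
import numpy as np

def jitter_box_brightness(img, boxes_xyxy, max_delta=0.25, rng=None):
    '''Randomly brighten or darken the region inside each box by up to +/- max_delta (25%).'''
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    for x1, y1, x2, y2 in boxes_xyxy:
        delta = rng.uniform(-max_delta, max_delta)
        region = out[int(y1):int(y2), int(x1):int(x2)]
        out[int(y1):int(y2), int(x1):int(x2)] = np.clip(region * (1.0 + delta), 0.0, 1.0)
    return out

# Hypothetical box on the resized image from earlier
plot_image(jitter_box_brightness(image_resized, [(600, 500, 660, 560)]),
           title='Bounding box brightness augmentation')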


This method of augmentation delivered the worst results yet. It could be that the lack of texture variation against the water caused the brightness changes to hurt performance, or it could simply be a lack of experimentation with the many options available.


Continuation and Final Results

As this report has become very long and there are still a few things to showcase, I've placed the final results into a second report.
