Part II: Search and Rescue: Augmentation and preprocessing on drone-based water rescue images with YOLOv5
Achieving mAP scores of over 0.97 on large images with comparatively very small people in them
Mosaic Augmentation
Mosaics were conceived in 2020 and first released in YOLOv4, and they are an integral part of YOLOv5, so this was not something I had to implement myself. Mosaic augmentation works in a way similar to tiling: it takes four source images and combines them into a single image.
This achieves multiple augmentation techniques in a single step.
Specifically, it creates four random crops while maintaining the relative scale of the objects compared to the image, which improves the model in cases of occlusion (objects hidden behind other objects) or translation. It also produces combinations of classes that may not have previously appeared together in the same image. For instance, in the highly unbalanced AFO dataset, it could create training images containing every class at once.
Finally, it creates variance in the number of objects in the images. As shown earlier, the majority of the AFO images have between 2 and 11 objects in them. With mosaic augmentation, that could mean between 8 and 44 objects in a single training image.
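To make the idea concrete, here is a minimal sketch of a 2x2 mosaic. This is an illustration only, not YOLOv5's actual implementation, which also jitters the mosaic center point and remaps every bounding box label into the new coordinate frame:
import cv2
import numpy as np

def simple_mosaic(images, out_size=1280):
    # Illustrative 2x2 mosaic: resize four source images to quadrants and
    # paste them into one canvas.
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, corners):
        canvas[y:y + half, x:x + half] = cv2.resize(img, (half, half))
    return canvas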
YOLOv5 creates these mosaics automatically as part of its training; here are a few of them:
Plots of Top Performing Models
Conclusions
YOLOv5 is an excellent model for use with SAR imagery in water
Using mAP@0.5, a metric that evaluates both localization and classification, the best medium model achieved a score of 0.9727. The best small model, trained on a dataset where the images had their hues randomly adjusted, achieved a mAP of 0.9703, and even the nano model reached a mAP of 0.9658 with the hue preprocessing.
Image resolution is more important than model size
As I had initially expected, the 1280-input models outperform the 640-input models at every size. This is understandable, as at 640x640 resolution many of the objects to detect were only a few pixels in size.
YOLOv5 small (YOLOv5s6) and nano (YOLOv5n6) could work well in portable devices
On my MacBook Pro without a GPU, the nano model is able to run inference, i.e. detect people in images, in only 535 ms, or about half a second. While this is about 16 times too slow to analyze every frame of a video at 30 FPS, drones aren't moving so quickly that people in the water would be in and out of frame in half a second. The nano model is also tiny: it's only 7.3 MB. It's also very likely that a purpose-built small computing device could outperform my laptop. NVIDIA makes a small, powerful computer called the Jetson Nano that can run multiple neural networks in parallel while consuming only 5 watts.
The small model, while performing better than the nano model, also takes 2.5-3 times longer to perform inference, so it might not be as ideal depending on the required trade-off between speed and result quality. As can be seen below, the medium model is significantly slower still.
I didn't have a Jetson Nano to test the speeds on, but I did test on my laptop and on Google Colab with and without the GPU enabled. My laptop is an M1X MacBook Pro with 64 GB of RAM, Google Colab runs on an Intel Xeon E5-2673 v4 chip with 13 GB of RAM, and the NVIDIA Tesla V100 is a monster of a GPU with 16 GB of RAM that barely registers on the chart below.
The YOLOv5 models are also significantly smaller than the previous iteration, YOLOv4. The YOLOv5n6, YOLOv5s6, and YOLOv5m6 models are 7.3 MB, 25.8 MB, and 69.0 MB, respectively.
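For reference, single-image CPU latency can be timed through the public ultralytics/yolov5 torch.hub interface. This is a rough sketch only; the numbers above came from running detect.py, and 'best.pt' and 'sample.jpg' below are placeholders, not files from this project:
import time
import torch

# Load trained weights via the yolov5 hub interface (paths are placeholders)
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
model.cpu()  # force CPU to mimic a machine without a GPU

start = time.time()
results = model('sample.jpg', size=1280)
print(f"Inference took {(time.time() - start) * 1000:.0f} ms")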

Hue preprocessing improves the model for people in water
The second conclusion is that randomly adjusting the hue of color images as a preprocessing step is a viable and valuable technique for improving the training of these models. My guess is that the uniform color of the water in the majority of the photos allows the model to rely a bit too much on shades of blue and green as an indicator that no person is present; by varying the hue during training, it can't simply assume that large regions with similar texture contain no people.
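As an illustration, a random hue shift can be written in a few lines with OpenCV. This is a minimal sketch of the idea, not the exact preprocessing code used for these experiments:
import cv2
import numpy as np

def random_hue_shift(image_bgr, max_shift=25):
    # Shift the hue channel by a random amount; OpenCV stores hue in [0, 179],
    # so the shifted values wrap around modulo 180.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.int16)
    hsv[..., 0] = (hsv[..., 0] + np.random.randint(-max_shift, max_shift + 1)) % 180
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)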
Appendix A: Additional Charts
These include additional loss plots, learning rates, and GPU utilization for the top 13 models I ended up using.
Appendix B: YOLOv5 Notebook
This is the meat of the code I used to run the model in its many variations. It is largely adapted from the freely available demo notebooks that do this; I did not write the majority of this code.
Step 1: Installation
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

# clone YOLOv5 and install dependencies
!git clone https://github.com/ultralytics/yolov5  # clone repo
%cd yolov5
%pip install -qr requirements.txt  # install dependencies
%pip install -q roboflow

import torch
import os
from IPython.display import Image, clear_output  # to display images

print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")

Step 2: Connect to Google Drive
# Used for running in Google Colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Step 3: Train the YOLOv5 model
# Weights & Biases to track everything. I used v0.12.10 as the current version would crash for me%pip install -q wandb==0.12.10#%pip install -q wandbimport wandbwandb.login()

!python train.py --img 1280 --batch 8 --epochs 150 --data /content/drive/MyDrive/datasets/data.yaml --weights yolov5m6.pt \
    --cache --upload_dataset --bbox_interval 10
Here is a small example of the training output:

Step 4: Evaluate YOLOv5 Detector Performance
# Start tensorboard
# Launch after you have started training; logs save in the folder "runs"
%load_ext tensorboard
%tensorboard --logdir runs

Run Inference with Trained Weights
!python detect.py --weights runs/train/exp/weights/best.pt --img 1280 --conf 0.1 --source \
    /content/drive/MyDrive/datasets/test/images
A sample of some of the inference output

Step 5: Display detections on images
# display inference on ALL test images
import glob
from IPython.display import Image, display

for imageName in glob.glob('/content/yolov5/runs/detect/exp/*.jpg'):  # assuming JPG
    display(Image(filename=imageName))
    print("\n")
A sample of the output

Appendix C: Hyperparameters
Appendix D: References
I used Weights & Biases https://wandb.ai/ extensively to track the model training and generate visualizations. I also used examples from the class notes and homework assignments for the preprocessing steps.
Hodge VJ, Hawkins R, Alexander R. Deep reinforcement learning for drone navigation using sensor data. Neural Computing and Applications. 2020; 1–19. https://link.springer.com/article/10.1007/s00521-020-05097-x
Lygouras E, Santavas N, Taitzoglou A, Tarchanidis K, Mitropoulos A, Gasteratos A. Unsupervised human detection with an embedded vision system on a fully autonomous UAV for search and rescue operations. Sensors. 2019 08; 19: 3542. https://www.mdpi.com/1424-8220/19/16/3542
Barret Zoph, Ekin D. Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, Quoc V. Le, Learning Data Augmentation Strategies for Object Detection https://arxiv.org/abs/1906.11172
AFO dataset: https://www.kaggle.com/jangsienicajzkowy/afo-aerial-dataset-of-floating-objects
Small Target Detection for Search and Rescue Operations using Distributed Deep Learning and Synthetic Data Generation (Patch and synthetic data reference): https://arxiv.org/abs/1904.11619
Rethinking Drone-Based Search and Rescue with Aerial Person Detection https://arxiv.org/abs/2111.09406
An Autonomous Drone for Search and Rescue in Forests using Airborne Optical Sectioning https://arxiv.org/abs/2105.04328
UAV-Based Search and Rescue in Avalanches using ARVA: An Extremum Seeking Approach https://arxiv.org/abs/2106.14514