
Detecting oranges with TorchVision

Neural networks trained to detect fruits in a two-class annotated image dataset
Created on June 24 | Last edited on March 14


Introduction

This was part of a job interview I did a few years ago.
  • Framework used was TorchVision v0.10.0.
  • Two detection models were tested, Faster RCNN and RetinaNet. They were chosen because they are the best-maintained detection frameworks in TorchVision, achieve higher average precision than alternative detection heads (such as SSD), and handle high-resolution images well.
  • MobileNetV3 was chosen as the backbone because it fits in my GPU memory (an NVIDIA GTX 1070 with 8 GB of VRAM).
  • Feature pyramids were included in the backbone to improve performance.
  • The backbones came pretrained on MS COCO, possibly speeding up training.
  • The aspect ratios of the anchor boxes were customized after noticing that the object annotations were mostly square (see the model-construction sketch after this list).
  • Objects of the class labeled "Anomalia" (anomaly) were ignored.
  • The learning rate was decreased from $5 \cdot 10^{-4}$ to $10^{-5}$ by multiplying it by $0.9247$ over $50$ steps. It did not reach its minimum value, since training was stopped early once the model converged.
  • The dataset was split as follows:
    • Faster RCNN: 80 images for training, 6 for validation.
    • RetinaNet (large inputs) and Faster RCNN (large inputs): 78 images for training, 8 for validation.
  • Two data augmentation strategies were employed:
    • For the run named "Faster RCNN, large inputs, data aug.", training images (see the augmentation sketch after this list)
      • were horizontally flipped with probability $p = 0.5$;
      • were rotated in the range $[-45^\circ, 45^\circ]$;
      • had their brightness multiplied by a factor in the range $[0.65, 1.35]$.
    • In the other runs, only horizontal flipping with $p = 0.5$ was used.
  • Runs marked as "large inputs" were trained on images at a resolution of $1920 \times 2880$; otherwise, the resolution was $1024 \times 1536$.
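
As a reference for the points above, here is a minimal sketch of how the Faster RCNN configuration could be assembled with TorchVision v0.10.0. It is not the code from the original runs: the number of classes, the SGD optimizer, and how often the scheduler is stepped are assumptions.

```python
import torch
from torchvision.models.detection import fasterrcnn_mobilenet_v3_large_fpn
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 2  # assumption: background + fruit ("Anomalia" objects are ignored)

# COCO-pretrained detector with a MobileNetV3-Large + FPN backbone.
model = fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)

# Replace the box predictor so the classification head matches our classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# Swap the default 0.5 / 1.0 / 2.0 aspect ratios for the near-square ones chosen
# from the annotation histogram. Keeping three ratios per feature level leaves
# the number of anchors per location unchanged, so the RPN head still fits.
anchor_sizes = ((32, 64, 128, 256, 512),) * 3
aspect_ratios = ((0.85, 1.0, 1.15),) * len(anchor_sizes)
model.rpn.anchor_generator = AnchorGenerator(anchor_sizes, aspect_ratios)

# Learning-rate schedule from the list above: 5e-4 * 0.9247**50 ≈ 1e-5.
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=0.9)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9247)
```

The RetinaNet runs would need the MobileNetV3 backbone assembled by hand (for example via the helpers in `torchvision.models.detection.backbone_utils`), since TorchVision only ships a ResNet50-FPN RetinaNet builder.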
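
The report does not specify which library implemented the augmentations. The sketch below uses Albumentations, which keeps bounding boxes consistent under flips and rotations; treat the transform names and parameters as one possible reading of the list above, not the original pipeline.

```python
import albumentations as A

train_transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),          # flip with p = 0.5
        A.Rotate(limit=45, p=1.0),        # rotation in [-45°, 45°]
        A.ColorJitter(                    # brightness factor in [0.65, 1.35]
            brightness=0.35, contrast=0.0, saturation=0.0, hue=0.0, p=1.0
        ),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
```

Usage would look something like `augmented = train_transform(image=image, bboxes=boxes, labels=labels)`, with the image as a NumPy array and the boxes in (x1, y1, x2, y2) format.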

Anchor box aspect ratio

To select the aspect ratio of anchor boxes for Faster RCNN, let's take a look at the aspect ratios of the object annotations in the provided dataset.
The `Dataset` class implemented for this dataset already computes aspect ratios, so we'll plot them on a histogram.
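
A sketch of that step is below. The actual `Dataset` class is not reproduced here, so assume it yields (image, target) pairs in the standard TorchVision detection format, with `target["boxes"]` holding (x1, y1, x2, y2) coordinates; the ratios are recomputed from the boxes for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

def box_aspect_ratios(boxes):
    """Width / height for an (N, 4) array of (x1, y1, x2, y2) boxes."""
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    return widths / heights

# `dataset` stands in for an instance of the project's `Dataset` class.
ratios = np.concatenate(
    [box_aspect_ratios(target["boxes"].numpy()) for _, target in dataset]
)

plt.hist(ratios, bins=50)
plt.xlabel("aspect ratio (width / height)")
plt.ylabel("number of annotations")
plt.show()
```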

The dominant aspect ratio is 1:1, but to be safe, let's use anchor boxes of ratios 0.85, 1.0 and 1.15.

Annotation sizes

In case we need this information to customize our model, let's plot the actual sizes of the object annotations.
          w            h
count     2822         2822
mean      70.585796    70.769973
std       27.721963    28.791432
min       19.2288      19.152
25%       50.8191      49.248
50%       67.301       65.66405
75%       86.530138    87.552
max       234.8675     229.8239
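
The statistics above can be produced along the lines of the following sketch, under the same assumptions about the `Dataset` as in the aspect-ratio snippet.

```python
import numpy as np
import pandas as pd

# `dataset` stands in for an instance of the project's `Dataset` class.
widths, heights = [], []
for _, target in dataset:
    boxes = target["boxes"].numpy()
    widths.append(boxes[:, 2] - boxes[:, 0])
    heights.append(boxes[:, 3] - boxes[:, 1])

sizes = pd.DataFrame({"w": np.concatenate(widths), "h": np.concatenate(heights)})
print(sizes.describe())
```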



Training metrics




Evaluation metrics




Sample detections

Faster RCNN (large inputs)




RetinaNet (large inputs)




Faster RCNN, large inputs, data aug.