Detecting oranges with TorchVision
Neural networks trained to detect fruits in a two-class annotated image dataset
Introduction
This was part of a job interview I did a few years ago.
- The framework used was TorchVision v0.10.0.
- Two detection models were tested: Faster RCNN and RetinaNet. They were chosen because they are the best-maintained detection architectures in TorchVision, achieve higher average precision than alternative detection heads (such as SSD), and handle high-resolution images well.
- MobileNetV3 was chosen as the backbone since it fits in my GPU memory (an NVIDIA GTX 1070 with 8 GB of VRAM).
- Feature pyramids were included in the backbone to improve performance.
- The backbones came pretrained on MSCOCO, possibly speeding up training.
- The aspect ratios of the anchor boxes were customized after realizing that the object annotations were mostly square; a construction sketch appears after this list.
- Objects of classes labeled "Anomalia" ("anomaly") were ignored.
- The learning rate was decayed in steps by a constant multiplicative factor (a StepLR sketch also follows this list). It never reached its minimum value, since training was stopped early once the model converged.
- The dataset was split as follows:
  - Faster RCNN: 80 images for training, 6 for validation.
  - RetinaNet (large inputs) and Faster RCNN (large inputs): 78 images for training, 8 for validation.
- Two data augmentation strategies were employed (a detection-aware flip transform is sketched after this list):
  - For the run named "Faster RCNN, large inputs, data aug.", training images
    - were horizontally flipped with a fixed probability;
    - were rotated within a fixed angle range;
    - had their brightness multiplied by a factor drawn from a fixed range.
  - In the other runs, only horizontal flipping was used.
- Runs marked as "large inputs" were trained on larger input images than the other runs.
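
The report does not include the model-construction code, but the setup described above can be sketched in TorchVision v0.10. Everything below is an assumption reconstructed from the bullet points: the builder call, the two-class head (background plus fruit), and the anchor sizes are not from the original code. RetinaNet can be assembled similarly through `torchvision.models.detection.RetinaNet`.

```python
import torchvision
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# COCO-pretrained Faster RCNN with a MobileNetV3-Large + FPN backbone.
model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(
    pretrained=True
)

# Swap in near-square anchors (see the aspect ratio histogram below).
# Keeping 5 sizes x 3 ratios per location matches this builder's default
# layout, so the COCO-pretrained RPN head stays compatible.
anchor_sizes = ((32, 64, 128, 256, 512),) * 3
aspect_ratios = ((0.85, 1.0, 1.15),) * len(anchor_sizes)
model.rpn.anchor_generator = AnchorGenerator(anchor_sizes, aspect_ratios)

# Replace the box predictor with a two-class head: background + fruit.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)
```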
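The report's actual learning-rate values were lost, so the numbers below are placeholders, but the stepwise decay itself maps directly onto `torch.optim.lr_scheduler.StepLR` (reusing `model` from the sketch above):

```python
import torch

# Placeholder hyperparameters; the original initial LR, decay factor,
# and step size did not survive in the report.
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)

# Multiplies the learning rate by `gamma` every `step_size` epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(50):
    # ... one pass over the training dataloader goes here ...
    scheduler.step()  # stop early once validation metrics converge
```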
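TorchVision v0.10 does not ship detection-aware transforms, so augmentations like these are typically written as small callables that update the bounding boxes along with the image, as in the torchvision detection reference scripts. The sketch below covers flipping and brightness; rotation is omitted since it also requires recomputing the axis-aligned boxes. The probability and range are placeholders, since the report's values were lost.

```python
import random
import torchvision.transforms.functional as F

class RandomHorizontalFlip:
    """Flip the image tensor and its bounding boxes together."""

    def __init__(self, p=0.5):  # placeholder probability
        self.p = p

    def __call__(self, image, target):
        if random.random() < self.p:
            width = image.shape[-1]  # image is a C x H x W tensor
            image = image.flip(-1)
            boxes = target["boxes"]  # rows are [x_min, y_min, x_max, y_max]
            boxes[:, [0, 2]] = width - boxes[:, [2, 0]]
            target["boxes"] = boxes
        return image, target

class RandomBrightness:
    """Scale brightness by a factor drawn from a placeholder range."""

    def __init__(self, low=0.8, high=1.2):
        self.low, self.high = low, high

    def __call__(self, image, target):
        factor = random.uniform(self.low, self.high)
        return F.adjust_brightness(image, factor), target
```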
Anchor box aspect ratio
To choose anchor box aspect ratios for Faster RCNN, let's look at the aspect ratios of the object annotations in the provided dataset.
The `Dataset` class implemented for this dataset already computes aspect ratios, so we'll plot them on a histogram.
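The `Dataset` implementation is not reproduced in this report, so the sketch below recomputes the ratios by hand, assuming the usual TorchVision detection convention of `(image, target)` pairs with a `target["boxes"]` tensor:

```python
import matplotlib.pyplot as plt

# Hypothetical iteration over the report's `Dataset` instance; one
# width/height ratio is collected per annotation box.
ratios = []
for _, target in dataset:
    for x_min, y_min, x_max, y_max in target["boxes"].tolist():
        ratios.append((x_max - x_min) / (y_max - y_min))

plt.hist(ratios, bins=50)
plt.xlabel("aspect ratio (w / h)")
plt.ylabel("count")
plt.show()
```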

The dominant aspect ratio is 1:1, but to be safe, let's use anchor boxes with aspect ratios of 0.85, 1.0, and 1.15.
Annotation sizes
In case we need this information to customize our model, let's plot the actual sizes of the object annotations.
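The statistics in the table below look like pandas `describe()` output over annotation widths and heights; a sketch of how they could be gathered, under the same assumed `(image, target)` dataset convention as above:

```python
import pandas as pd

# Assumed convention: `target["boxes"]` rows are [x_min, y_min, x_max, y_max]
# in pixels, as in TorchVision's detection models.
rows = []
for _, target in dataset:
    for x_min, y_min, x_max, y_max in target["boxes"].tolist():
        rows.append({"w": x_max - x_min, "h": y_max - y_min})

print(pd.DataFrame(rows).describe())
```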
|       | w         | h         |
|-------|-----------|-----------|
| count | 2822      | 2822      |
| mean  | 70.585796 | 70.769973 |
| std   | 27.721963 | 28.791432 |
| min   | 19.2288   | 19.152    |
| 25%   | 50.8191   | 49.248    |
| 50%   | 67.301    | 65.66405  |
| 75%   | 86.530138 | 87.552    |
| max   | 234.8675  | 229.8239  |

Training metrics
Evaluation metrics
Sample detections
Faster RCNN (large inputs)
RetinaNet (large inputs)
Faster RCNN, large inputs, data aug.