
Detecting oranges with TorchVision

Neural networks trained to detect fruits in a two-class annotated image dataset
Created on June 24 | Last edited on March 14


Introduction

This was part of a job interview I did a few years ago.
  • Framework used was TorchVision v0.10.0.
  • Two detection models were tested, Faster RCNN and RetinaNet. They were chosen because they are the best-maintained detection frameworks in TorchVision, achieve higher average precision than alternative detection heads (such as SSD), and handle high-resolution images well.
  • MobileNetV3 was chosen as the backbone because it fits in my GPU memory (an NVIDIA GTX 1070 with 8 GB of VRAM).
  • Feature pyramids were included in the backbone to improve performance.
  • The backbones came pretrained on MS COCO, possibly speeding up training.
  • The aspect ratios of the anchor boxes were customized after noticing that the object annotations were mostly square (see the model-construction sketch after this list).
  • Objects of the class labeled "Anomalia" (anomaly) were ignored.
  • The learning rate was decreased from $5 \cdot 10^{-4}$ to $10^{-5}$ by multiplying it by $0.9247$ over $50$ steps. It did not reach its minimum value, since training was stopped early once the model converged.
  • The dataset was split as follows:
    • Faster RCNN: 80 images for training, 6 for validation.
    • RetinaNet (large inputs) and Faster RCNN (large inputs): 78 images for training, 8 for validation.
  • Two data augmentation strategies were employed:
    • For the run named "Faster RCNN, large inputs, data aug.", training images (see the augmentation sketch after this list)
      • were horizontally flipped with probability $p = 0.5$;
      • were rotated in the range $[-45^\circ, 45^\circ]$;
      • had their brightness multiplied by a factor in the range $[0.65, 1.35]$.
    • In the other runs, only horizontal flipping with $p = 0.5$ was used.
  • Runs marked as "large inputs" were trained on images at a resolution of $1920 \times 2880$; otherwise, the resolution was $1024 \times 1536$.
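
As a reference for the points above, here is a minimal sketch of how the Faster RCNN configuration could be assembled with TorchVision v0.10.0. It is not the code from the original runs: the number of classes, the SGD optimizer, and how often the scheduler is stepped are assumptions.

```python
import torch
from torchvision.models.detection import fasterrcnn_mobilenet_v3_large_fpn
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 2  # assumption: background + fruit ("Anomalia" objects are ignored)

# COCO-pretrained detector with a MobileNetV3-Large + FPN backbone.
model = fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)

# Replace the box predictor so the classification head matches our classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# Swap the default 0.5 / 1.0 / 2.0 aspect ratios for the near-square ones chosen
# from the annotation histogram. Keeping three ratios per feature level leaves
# the number of anchors per location unchanged, so the RPN head still fits.
anchor_sizes = ((32, 64, 128, 256, 512),) * 3
aspect_ratios = ((0.85, 1.0, 1.15),) * len(anchor_sizes)
model.rpn.anchor_generator = AnchorGenerator(anchor_sizes, aspect_ratios)

# Learning-rate schedule from the list above: 5e-4 * 0.9247**50 ≈ 1e-5.
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=0.9)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9247)
```

The RetinaNet runs would need the MobileNetV3 backbone assembled by hand (for example via the helpers in `torchvision.models.detection.backbone_utils`), since TorchVision only ships a ResNet50-FPN RetinaNet builder.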
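
The report does not specify which library implemented the augmentations. The sketch below uses Albumentations, which keeps bounding boxes consistent under flips and rotations; treat the transform names and parameters as one possible reading of the list above, not the original pipeline.

```python
import albumentations as A

train_transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),          # flip with p = 0.5
        A.Rotate(limit=45, p=1.0),        # rotation in [-45°, 45°]
        A.ColorJitter(                    # brightness factor in [0.65, 1.35]
            brightness=0.35, contrast=0.0, saturation=0.0, hue=0.0, p=1.0
        ),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
```

Usage would look something like `augmented = train_transform(image=image, bboxes=boxes, labels=labels)`, with the image as a NumPy array and the boxes in (x1, y1, x2, y2) format.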

Anchor box aspect ratio

To select the aspect ratio of anchor boxes for Faster RCNN, let's take a look at the aspect ratios of the object annotations in the provided dataset.
The `Dataset` class implemented for this dataset already computes aspect ratios, so we'll plot them on a histogram.
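
A sketch of that step is below. The actual `Dataset` class is not reproduced here, so assume it yields (image, target) pairs in the standard TorchVision detection format, with `target["boxes"]` holding (x1, y1, x2, y2) coordinates; the ratios are recomputed from the boxes for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

def box_aspect_ratios(boxes):
    """Width / height for an (N, 4) array of (x1, y1, x2, y2) boxes."""
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    return widths / heights

# `dataset` stands in for an instance of the project's `Dataset` class.
ratios = np.concatenate(
    [box_aspect_ratios(target["boxes"].numpy()) for _, target in dataset]
)

plt.hist(ratios, bins=50)
plt.xlabel("aspect ratio (width / height)")
plt.ylabel("number of annotations")
plt.show()
```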

The dominant aspect ratio is 1:1, but to be safe, let's use anchor boxes of ratios 0.85, 1.0 and 1.15.

Annotation sizes

In case we need this information to customize our model, let's plot the actual sizes of the object annotations.
          w            h
count     2822         2822
mean      70.585796    70.769973
std       27.721963    28.791432
min       19.2288      19.152
25%       50.8191      49.248
50%       67.301       65.66405
75%       86.530138    87.552
max       234.8675     229.8239
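
The statistics above can be produced along the lines of the following sketch, under the same assumptions about the `Dataset` as in the aspect-ratio snippet.

```python
import numpy as np
import pandas as pd

# `dataset` stands in for an instance of the project's `Dataset` class.
widths, heights = [], []
for _, target in dataset:
    boxes = target["boxes"].numpy()
    widths.append(boxes[:, 2] - boxes[:, 0])
    heights.append(boxes[:, 3] - boxes[:, 1])

sizes = pd.DataFrame({"w": np.concatenate(widths), "h": np.concatenate(heights)})
print(sizes.describe())
```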



Training metrics




Evaluation metrics




Sample detections

Faster RCNN (large inputs)




RetinaNet (large inputs)




Faster RCNN, large inputs, data aug.