Object Detection with RetinaNet

Anil Karaka

I created a fork of Keras RetinaNet for object detection on the COCO 2017 dataset.

RetinaNet consists of a backbone network and two subnets that make use of the backbone's feature maps. A classification subnet predicts the class of each object, and a regression subnet predicts its bounding box. Objects appear at different scales and input images vary in resolution, so RetinaNet uses feature maps at several resolutions. This makes training faster, and it’s less clumsy than feeding the network the same image at several resolutions.
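To make the two subnets concrete, here is a rough sketch of how many values each one predicts for a single feature map. The defaults of 80 classes (COCO) and 9 anchors per location follow the RetinaNet paper and are assumptions here; the function name is made up for illustration and is not part of Keras RetinaNet.

```python
def subnet_output_sizes(height, width, num_classes=80, num_anchors=9):
    """Count the values each subnet predicts for one feature map.

    Assumed defaults: 80 COCO classes and 9 anchors per location.
    """
    locations = height * width
    return {
        # One score per class for every anchor at every location.
        "classification": locations * num_anchors * num_classes,
        # Four box offsets for every anchor at every location.
        "regression": locations * num_anchors * 4,
    }


sizes = subnet_output_sizes(100, 167)
print(sizes)
```

The classification output is much larger than the regression output because it carries one score per class, while the box subnet only needs four numbers per anchor.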

Instead of using only the last feature map of the backbone, we use feature maps generated at several levels of the backbone network. This kind of architecture is called a feature pyramid network (FPN). If you’re interested in a more detailed look at the RetinaNet architecture, Nick Zeng wrote a good article.
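As a rough illustration of what "several levels" means, the sketch below computes the approximate spatial size of each pyramid level for a given input. It assumes the levels P3 to P7 with stride 2^level described in the RetinaNet paper; the 800x1333 input size is only an example, and exact sizes in a real model depend on padding.

```python
def pyramid_shapes(image_height, image_width, levels=(3, 4, 5, 6, 7)):
    """Approximate (height, width) of each pyramid level.

    Assumes level l has stride 2**l and sizes round up.
    """
    shapes = {}
    for level in levels:
        stride = 2 ** level
        # -(-a // b) is ceiling division with integers.
        shapes[f"P{level}"] = (
            -(-image_height // stride),
            -(-image_width // stride),
        )
    return shapes


for name, shape in pyramid_shapes(800, 1333).items():
    print(name, shape)
```

Each level halves the resolution of the one before it, so small objects are handled on the fine maps and large objects on the coarse ones.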

In the Weights & Biases dashboard, the Model tab shows how the feature maps are processed before they reach the subnets.

I used resnet50 and resnet101 as backbones while training my models. I tried 500, 5,000, and 10,000 steps per epoch. To speed up training, I evaluated on just 100 random images from the validation set at the end of each epoch.
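The validation subsampling can be sketched in a few lines. This is a minimal illustration rather than the code from my fork: the function name and the placeholder image IDs are hypothetical.

```python
import random


def sample_validation_ids(image_ids, sample_size=100, seed=None):
    """Pick a random subset of validation image IDs without replacement."""
    ids = list(image_ids)
    rng = random.Random(seed)
    # Never ask for more images than the validation set contains.
    return rng.sample(ids, min(sample_size, len(ids)))


all_ids = range(5000)  # placeholder IDs standing in for the validation set
subset = sample_validation_ids(all_ids, seed=0)
print(len(subset))
```

Evaluating on a small random subset makes each epoch's check cheap, at the cost of a noisier metric from one epoch to the next.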

My best-performing logged model used a resnet50 backbone with 5,000 steps per epoch and was trained for 91 epochs. If you want the model weights, you can download my checkpoints for resnet50 and resnet101. The total training time was 6 days; I resumed training from a model that had been trained for just 50 epochs. With an AP50 of 0.5924, this model would place around 20th on the official COCO leaderboard. This metric was calculated on the validation set, not the official test set, but I’m happy that I was able to get good results relatively quickly. I used EC2 p2.xlarge instances for training, and this project took a total of 39 days of compute time across 92 runs.
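For readers new to the metric, AP50 is average precision where a predicted box counts as correct if its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below shows only that matching criterion, not the full COCO evaluation; the function names and the (x1, y1, x2, y2) box format are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def is_match_at_ap50(predicted_box, ground_truth_box, threshold=0.5):
    """True when the prediction overlaps the ground truth enough for AP50."""
    return iou(predicted_box, ground_truth_box) >= threshold


print(is_match_at_ap50((0, 0, 10, 10), (0, 0, 10, 6)))
```

A full evaluation also ranks detections by confidence and averages precision over recall levels and classes, which is what the official COCO tooling handles.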

You can visualize and dig into the results from all 92 runs in this interactive report.
