
YOLACT: Real-time Instance Segmentation

A closer look at how this model architecture learns step by step.
Created on January 15 | Last edited on February 3


What is YOLACT?

(Link to original paper: https://arxiv.org/pdf/1904.02689.pdf)
YOLACT, short for "You Only Look At CoefficienTs", is a method for real-time instance segmentation that reaches 29.8 mAP on the MS COCO dataset at 33.5 frames per second when trained and evaluated on a single Titan Xp. At the time the paper was published (Oct 24, 2019), this was the state of the art in evaluation speed.
This report focuses on the training and evaluation process of this architecture, and on how it differs from earlier instance segmentation solutions, such as Mask R-CNN, which are very accurate but slow at inference time.
The paper describes how inference (and therefore each evaluation step) is sped up by removing a step from the segmentation process: the "feature localization" step used when producing masks. Solutions that rely on feature localization must "re-pool" the features inside each bounding box region and then feed those localized features to their mask predictor. Because this process is inherently sequential, it is difficult to accelerate. One-stage methods did exist at the time, but they required significant post-processing after the localization step and were not fast enough for real-time use.
YOLACT forgoes this localization step by generating a dictionary of non-local prototype masks over the entire image and predicting a set of linear combination coefficients per instance. These two steps happen in parallel, and the final masks are assembled by linearly combining the prototypes with each instance's coefficients.
By using this method for segmentation, the network learns how to localize the separate mask instances on its own.
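To make the assembly step concrete, here is a minimal sketch (assuming PyTorch, with illustrative names and toy dimensions, not the authors' code) of how prototypes and coefficients are combined into per-instance masks, following the paper's formulation M = sigmoid(P C^T):

```python
import torch

def assemble_masks(prototypes, coefficients):
    """Combine prototype masks with per-instance coefficients.

    prototypes:   (H, W, k) tensor of k prototype masks over the whole image
    coefficients: (n, k) tensor of mixing weights, one k-vector per detected instance
    returns:      (n, H, W) tensor of soft masks, one per instance
    """
    # Linear combination of the prototypes, then a sigmoid for per-pixel probabilities
    return torch.sigmoid(torch.einsum("hwk,nk->nhw", prototypes, coefficients))

# Toy example: 4 detections, 32 prototypes at 138x138 resolution
prototypes = torch.randn(138, 138, 32)
coefficients = torch.randn(4, 32)
masks = assemble_masks(prototypes, coefficients)
print(masks.shape)  # torch.Size([4, 138, 138])
```

In the full pipeline the assembled masks are then cropped with the predicted bounding boxes and thresholded, but the combination itself is a single matrix multiplication, which is why it adds so little overhead.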

Why is this important?

Because the prototype and coefficient branches run in parallel and the mask assembly is lightweight, the model is very fast at inference time: it adds only a small amount of computation on top of whichever backbone detector is being used. Even with ResNet-101 as the backbone, reaching 30 frames per second is possible.
Since the masks use the full extent of the image space, with no quality loss from a re-pooling step, this architecture shines at segmenting larger objects, an area where other methods at the time struggled.
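As a rough illustration of that parallel structure, the sketch below (assuming PyTorch; module names, layer counts, and channel sizes are hypothetical simplifications, not the authors' implementation) shows a prototype branch and a prediction head that both consume the same backbone/FPN features and can run independently of each other:

```python
import torch
import torch.nn as nn

class ProtoNet(nn.Module):
    """Illustrative prototype branch: a small conv stack producing k full-image prototypes."""
    def __init__(self, in_channels=256, num_prototypes=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, num_prototypes, 1),
        )

    def forward(self, features):
        return self.convs(features)  # (B, k, H, W) prototype masks

class PredictionHead(nn.Module):
    """Illustrative detection head: class scores, box offsets, and k mask coefficients per anchor."""
    def __init__(self, in_channels=256, num_anchors=3, num_classes=81, num_prototypes=32):
        super().__init__()
        self.cls = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.box = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)
        self.coef = nn.Conv2d(in_channels, num_anchors * num_prototypes, 3, padding=1)

    def forward(self, features):
        # tanh keeps coefficients in [-1, 1], so prototypes can be subtracted as well as added
        return self.cls(features), self.box(features), torch.tanh(self.coef(features))

# Both branches take the same shared features, so they can be evaluated in parallel
features = torch.randn(1, 256, 69, 69)      # e.g. one FPN level
prototypes = ProtoNet()(features)           # (1, 32, 69, 69)
scores, boxes, coefficients = PredictionHead()(features)
```

After non-maximum suppression keeps the top detections, each surviving instance's coefficient vector is combined with the prototypes exactly as in the earlier assembly sketch.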



[W&B panels: Run playful-fog-34 | Run set (34 runs)]