Researchers Present New SOTA in Unsupervised Object Detection
Computer vision researchers turn to unsupervised methods to improve object localization performance
In the recent paper “Cut and Learn for Unsupervised Object Detection and Instance Segmentation,” researchers from Meta, UC Berkeley, and the University of Michigan introduce their main contribution, CutLER (Cut-and-LEaRn), which improves existing unsupervised detection performance by over 2.7x on several benchmarks.
They train CutLER on strictly unlabeled ImageNet data and build on previous self-supervised vision transformer work such as DINO [1] and TokenCut [2].
DINO leverages self-supervised vision transformers, which automatically learn a degree of “perceptual grouping” across the patches of an image: patch features belonging to the same object tend to be similar.
TokenCut is built around Normalized Cuts (NCut), a classic method that reframes segmentation as a graph-partitioning problem. It combines DINO features with NCut to separate foreground patches from background patches in an image.
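To make the graph-partitioning idea concrete, here is a minimal NumPy/SciPy sketch of a TokenCut-style foreground split. It assumes DINO patch features are already available as a (num_patches, dim) array; the affinity threshold, the mean split, and the smaller-side heuristic are illustrative simplifications rather than the papers' exact recipe.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_foreground(feats, tau=0.2):
    """Toy Normalized Cut over patch features (e.g., from a DINO ViT).

    feats: (num_patches, dim) array of patch embeddings.
    Returns a boolean foreground mask over patches.
    """
    # Cosine-similarity affinity graph between patches,
    # binarized at tau (small epsilon keeps the graph connected).
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    W = feats @ feats.T
    W = np.where(W > tau, 1.0, 1e-5)

    # Relaxed NCut: the second-smallest generalized eigenvector of
    # (D - W) x = lambda * D x bipartitions the graph.
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D, subset_by_index=[0, 1])
    fiedler = vecs[:, 1]

    # Split patches at the mean and take the smaller side as foreground
    # (a common heuristic; the actual papers use other criteria).
    mask = fiedler > fiedler.mean()
    return mask if mask.sum() <= (~mask).sum() else ~mask
```

The returned mask is coarse (one value per ViT patch) and would be upsampled back to pixel resolution in practice.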
One drawback of TokenCut is that it can only find one object per image. The authors therefore introduce MaskCut, which extends TokenCut to detect multiple objects in a single image: it applies NCut repeatedly, masking out the patches of each discovered object before searching for the next.
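A rough sketch of that loop, reusing ncut_foreground from above (the real MaskCut masks the patch-affinity matrix directly rather than subsetting the features, so treat this as a simplified illustration):

```python
def maskcut(feats, num_objects=3):
    """Sketch of MaskCut's iterative idea: run NCut, record the
    discovered object, mask its patches out, and repeat."""
    masks, active = [], np.ones(len(feats), dtype=bool)
    for _ in range(num_objects):
        if active.sum() < 2:
            break
        # Run NCut only over patches not yet assigned to an object.
        fg_local = ncut_foreground(feats[active])
        fg = np.zeros(len(feats), dtype=bool)
        fg[np.flatnonzero(active)[fg_local]] = True
        masks.append(fg)
        active &= ~fg  # remove this object's patches and repeat
    return masks
```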
Next, the authors introduce DropLoss, a training strategy that ignores the loss on predicted regions that have low overlap with MaskCut's pseudo-labels. Since those regions are never penalized as false positives, the detector is free to explore more of the image and discover objects that MaskCut missed. The authors also highlight CutLER's modularity: it can be integrated into existing detection models, whereas previous works were limited to specific architectures.
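A hedged PyTorch sketch of the idea: compute each predicted region's best IoU against the MaskCut pseudo-boxes and zero out the loss below a small threshold (the paper drops regions whose maximum IoU falls under 0.01). The function name and the box-only formulation are assumptions for illustration; in CutLER this happens inside a full detector's training loop.

```python
import torch
from torchvision.ops import box_iou

def drop_loss(pred_boxes, pseudo_boxes, per_region_loss, tau=0.01):
    """DropLoss sketch: zero out the training loss for predicted
    regions that barely overlap any MaskCut pseudo-label.

    pred_boxes:      (N, 4) predicted boxes (x1, y1, x2, y2)
    pseudo_boxes:    (M, 4) MaskCut pseudo-ground-truth boxes
    per_region_loss: (N,) unreduced loss per predicted region
    """
    iou = box_iou(pred_boxes, pseudo_boxes)  # (N, M) pairwise IoU
    max_iou = iou.max(dim=1).values          # best match per prediction
    keep = (max_iou > tau).float()           # drop low-overlap regions
    return (keep * per_region_loss).sum() / keep.sum().clamp(min=1)
```

The effect is that low-overlap predictions are simply not penalized, so the detector keeps proposing objects the pseudo-labels missed.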
Additionally, they claim the method is robust to domain shift, unlike previous works. Finally, CutLER serves as an efficient pre-training step for detectors, which could reduce the amount of labeled data needed without sacrificing performance.
The Unsupervised Trend
It’s exciting to see progress in unsupervised computer vision. The rise of large language models in the text domain is perhaps a preview of what's coming in the field of computer vision. Given the large cost of labeled data, methods like CutLER will likely have a direct impact in areas like self-driving cars and robotics.
The Paper
“Cut and Learn for Unsupervised Object Detection and Instance Segmentation,” Xudong Wang, Rohit Girdhar, Stella X. Yu, and Ishan Misra, 2023. https://arxiv.org/abs/2301.11320
References
[1] Mathilde Caron et al., “Emerging Properties in Self-Supervised Vision Transformers” (DINO), ICCV 2021.
[2] Yangtao Wang et al., “Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut” (TokenCut), CVPR 2022.