
Segment Anything Model (SAM)

A general-purpose image segmentation model called Segment Anything Model (SAM) was just released by Meta AI. Here's my take.


TL;DR

A general-purpose image segmentation model called Segment Anything Model (SAM) was just released by Meta AI. Along with it, they released their newly created (and largest to date) image segmentation dataset, the Segment Anything 1-Billion mask dataset (SA-1B). Key specs on this dataset, from their blog post:
  • Total number of images: 11M
  • Total number of masks: 1.1B
  • Average masks per image: 100
  • Average image resolution: 1500×2250 pixels
NOTE: There are no class labels for the images or mask annotations.
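Since the masks ship without class labels, each SA-1B record is essentially an image plus a pile of binary masks. As a rough sketch of what inspecting one record could look like, assuming the per-image JSON layout with COCO run-length-encoded masks that Meta describes for the SA-1B download (the file name below is just a placeholder):

```python
import json

from pycocotools import mask as mask_utils  # COCO RLE decoding utilities

# Placeholder path: SA-1B ships one JSON annotation file alongside each JPEG.
with open("sa_000000.json") as f:
    record = json.load(f)

print(record["image"]["width"], record["image"]["height"])

for ann in record["annotations"]:
    # Each mask is stored as a COCO RLE dict; decode it to an HxW binary array.
    binary_mask = mask_utils.decode(ann["segmentation"])
    print(binary_mask.shape, int(binary_mask.sum()))  # spatial size, mask area in pixels
```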
What's interesting is that they generalized image segmentation: there are no classes attached to the masks. The model aims not to segment and identify objects, but only to segment them.
They describe the model as promptable (segmentation can be prompted with points, a box, or text) and as a strong zero-shot and few-shot model. The dataset they trained on, mentioned above, captures a wide range of objects. Because of the interactive nature of their data collection method, they needed the prompting path to be lightweight enough to run on a CPU in the browser, with the heavy image embedding computed once up front.

They have an image encoder (a ViT) that generates an image embedding, a prompt encoder that handles points, boxes, and text (text is embedded with an off-the-shelf CLIP text encoder), and a lightweight mask decoder (attention blocks with an MLP head) that combines the image embedding and the encoded prompt into output masks. The model is constructed to handle ambiguity in the prompt rather than committing to a single interpretation: it generates 3 valid masks as output (for example, whole object, part, and subpart).
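As a rough sketch of what this promptable interface looks like in practice, here is how a single point prompt could be run through Meta's released segment-anything package (the checkpoint path, image path, and point coordinates are placeholders):

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (placeholder path) and wrap it in the predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Compute the (expensive) image embedding once per image.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single foreground point; multimask_output=True returns the
# three candidate masks described above, each with a predicted quality score.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) pixel coordinates, placeholder
    point_labels=np.array([1]),           # 1 = foreground point, 0 = background
    multimask_output=True,
)
print(masks.shape, scores)  # (3, H, W) boolean masks and their predicted IoU scores
```

A box prompt works the same way through the `box=` argument, and subsequent prompts on the same image reuse the cached embedding, which is what makes the interactive, in-browser use case feasible.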
Their dataset is also nothing short of remarkable for its sheer scale: per Meta's blog, SA-1B contains roughly 400× more masks than any existing segmentation dataset.

Because it was not scalable to have human annotators label everything interactively (even though that process was quite fast), they had SAM generate pseudo-label masks for new images, which were in turn fed back into training. They describe the dataset generation process in 3 gears: SAM assists annotators; half SAM and half annotators; and fully automated with SAM.
Gear/Stage 1: Assisted-manual stage
  • SAM initially trained on public segmentation datasets
  • professional annotators sped through labeling with SAM's help in an interactive browser tool
  • collected 4.3M masks from 120k images in this stage
  • the model was retrained 6 times as annotations accumulated and was scaled up along the way
  • 20 masks/image grew to 44 masks/image
Gear/Stage 2: Semi-automatic stage
  • aimed to increase the diversity of masks: annotators were given images with confidently predicted masks already baked in and were asked to segment the remaining objects
  • collected 5.9M masks from 180k images (totaling 10.2M masks)
  • 44 masks/image to 72 masks/image
Gear/Stage 3: Fully automatic stage
  • since they found the model to be robust enough by this stage, they prompted it with a grid of points and generated the 1.1B masks on roughly 11M images fully automatically (a sketch of this automatic mode with the released library follows below)
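The released library exposes this fully automatic mode directly. A minimal sketch, assuming the same segment-anything package and placeholder paths as above; it prompts the model over a grid of points and returns every mask it finds:

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Same placeholder checkpoint as before.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")

# The generator prompts SAM with a regular grid of points (32x32 by default)
# and filters the resulting masks by predicted quality and stability.
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each entry is a dict with the binary mask plus metadata such as area,
# bounding box, and the model's predicted IoU.
print(len(masks), masks[0].keys())
```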

