Meta Launches SAM 3 and SAM 3D
Meta has unveiled two major releases in its Segment Anything ecosystem: Segment Anything Model 3 (SAM 3) and SAM 3D. SAM 3 is a unified model that can detect, segment, and track objects in images and videos using a variety of prompts, including text and visual cues. In parallel, SAM 3D brings new capabilities for reconstructing detailed 3D representations of objects and human bodies from a single image. These releases are part of Meta AI’s broader mission to push the boundaries of perception and interaction in artificial intelligence. Alongside these models, Meta introduced the Segment Anything Playground, a web-based platform for experimenting with visual AI, and made the research code, benchmarks, and datasets available for the AI community.
Promptable Segmentation with SAM 3
SAM 3 introduces open-vocabulary and exemplar-based segmentation, allowing it to identify and track objects not confined to a predefined set of labels. This enables users to segment more specific and complex concepts like “the striped red umbrella” rather than only generic ones like “person” or “car.” By accepting text, exemplar images, and visual prompts like masks or bounding boxes, SAM 3 allows for dynamic and nuanced interactions. This functionality is supported by the new Segment Anything with Concepts (SA-Co) benchmark, which evaluates models across a broader vocabulary than previous segmentation datasets.
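To make the prompt types concrete, the sketch below shows one way a text concept, exemplar crops, and box prompts could be bundled into a single segmentation request. It is a minimal plain-Python illustration; the class and function names (ConceptPrompt, SegmentationRequest, and so on) are assumptions for this example, not Meta's released SAM 3 API.

```python
# Illustrative sketch of SAM 3's prompt modalities combined in one request.
# All names here are hypothetical; consult Meta's released code for the real interface.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ConceptPrompt:
    """An open-vocabulary text concept, e.g. 'the striped red umbrella'."""
    text: str


@dataclass
class ExemplarPrompt:
    """A crop from an example image showing the target concept."""
    image_path: str
    box_xyxy: Tuple[float, float, float, float]


@dataclass
class BoxPrompt:
    """A visual prompt: a bounding box in the image being segmented."""
    box_xyxy: Tuple[float, float, float, float]


@dataclass
class SegmentationRequest:
    """One request mixing any combination of the prompt types above."""
    image_path: str
    concept: Optional[ConceptPrompt] = None
    exemplars: Optional[List[ExemplarPrompt]] = None
    boxes: Optional[List[BoxPrompt]] = None


def describe(request: SegmentationRequest) -> str:
    """Summarize which prompt modalities a request carries."""
    parts = []
    if request.concept:
        parts.append(f"text concept '{request.concept.text}'")
    if request.exemplars:
        parts.append(f"{len(request.exemplars)} exemplar crop(s)")
    if request.boxes:
        parts.append(f"{len(request.boxes)} box prompt(s)")
    return f"Segment {request.image_path} using " + ", ".join(parts)


if __name__ == "__main__":
    req = SegmentationRequest(
        image_path="street.jpg",
        concept=ConceptPrompt("the striped red umbrella"),
        boxes=[BoxPrompt((120.0, 40.0, 360.0, 300.0))],
    )
    print(describe(req))
```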
Reconstructing the 3D World with SAM 3D
SAM 3D is a separate but related advancement, offering two models that reconstruct objects and humans in 3D from a single 2D image. SAM 3D Objects reconstructs detailed geometry and texture of objects, while SAM 3D Body infers the shape and pose of people, even from partially visible figures. These reconstructions are suitable for applications like AR interaction, physical therapy, or robotics, and both models can place their outputs into a shared 3D scene. SAM 3D works on real-world, in-the-wild images, going beyond synthetic or curated datasets. This system powers new features like “View in Room” on Facebook Marketplace, allowing users to see how items like lamps or tables would appear in their homes.
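As a concrete illustration of what a downstream feature does with a single-image reconstruction, the sketch below places a reconstructed mesh onto a floor point in a room coordinate frame, the kind of step a View in Room experience needs. The mesh format (an N x 3 vertex array, y-up) and the helper names are assumptions for this example, not Meta's SAM 3D API; only the placement math is shown.

```python
# Minimal sketch: placing a reconstructed object mesh into a room frame.
# The mesh representation and function names are illustrative assumptions.
import numpy as np


def rotation_about_y(angle_rad: float) -> np.ndarray:
    """3x3 rotation matrix about the vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def place_in_room(vertices: np.ndarray,
                  floor_point: np.ndarray,
                  yaw_rad: float = 0.0,
                  scale: float = 1.0) -> np.ndarray:
    """Scale, rotate, and translate vertices so the mesh rests on the floor
    at `floor_point` (room coordinates, y-up)."""
    v = vertices * scale
    v = v @ rotation_about_y(yaw_rad).T
    v[:, 1] -= v[:, 1].min()      # drop the mesh so its lowest point touches y = 0
    return v + floor_point        # move it to the chosen spot on the floor


if __name__ == "__main__":
    # Stand-in for a SAM 3D reconstruction: the 8 corners of a unit cube.
    cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], float)
    placed = place_in_room(cube, floor_point=np.array([2.0, 0.0, -1.5]), yaw_rad=np.pi / 6)
    print(placed.round(2))
```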
A Hybrid Human-AI Data Engine
To train these advanced models, Meta built a scalable annotation engine that combines human oversight with AI automation. Models like Llama 3.2v support the verification and refinement of object masks and labels, helping annotate large datasets efficiently. The pipeline automatically parses captions, generates masks, and evaluates their quality, with human annotators focusing on edge cases where models underperform. This process has resulted in a diverse dataset of more than 4 million concepts, supporting better generalization across domains and improving the overall robustness of both SAM 3 and SAM 3D.
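The loop below sketches that division of labor in Python: an AI annotator proposes masks, an AI verifier scores them, and only low-confidence proposals are escalated to human annotators. The function names and acceptance threshold are illustrative assumptions, not Meta's actual pipeline code.

```python
# Conceptual sketch of the hybrid human-AI annotation loop described above.
from typing import Callable, Dict, List, Tuple

Proposal = Dict[str, object]  # e.g. {"image": ..., "concept": ..., "mask": ...}


def annotate_batch(
    images: List[str],
    propose: Callable[[str], List[Proposal]],    # AI annotator: image -> mask proposals
    verify: Callable[[Proposal], float],         # AI verifier: proposal -> quality score
    human_review: Callable[[Proposal], Proposal],
    accept_threshold: float = 0.85,              # illustrative cutoff, not Meta's value
) -> Tuple[List[Proposal], int]:
    """Return accepted annotations and how many required a human pass."""
    accepted, human_passes = [], 0
    for image in images:
        for proposal in propose(image):
            if verify(proposal) >= accept_threshold:
                accepted.append(proposal)                    # auto-accepted by the verifier
            else:
                human_passes += 1
                accepted.append(human_review(proposal))      # edge case: a human fixes it
    return accepted, human_passes


if __name__ == "__main__":
    # Stub callables so the loop runs end to end without any real models.
    propose = lambda img: [{"image": img, "concept": "umbrella", "mask": None}]
    verify = lambda p: 0.9 if p["concept"] == "umbrella" else 0.3
    fix = lambda p: {**p, "reviewed": True}
    done, escalated = annotate_batch(["a.jpg", "b.jpg"], propose, verify, fix)
    print(len(done), "annotations,", escalated, "sent to humans")
```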
Architecture Behind the Models
The technical underpinnings of both SAM 3 and SAM 3D reflect Meta AI’s work in building unified, modular vision systems. SAM 3 builds on Meta’s Perception Encoder and the DETR object detection framework, integrating memory components for tracking across video frames. It unifies various segmentation tasks, from simple object detection to promptable segmentation and tracking. SAM 3D, on the other hand, uses a two-stage Diffusion Transformer (DiT) model to handle shape, pose, and texture refinement. The SAM 3D Body model uses a transformer encoder-decoder design to regress mesh parameters and 3D pose directly from images. Both systems are designed to be interactive and extensible for practical applications.
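As a rough illustration of the encoder-decoder regression idea described for SAM 3D Body, the PyTorch sketch below encodes image patch features, decodes a set of learned queries against them, and regresses per-joint rotations plus body shape coefficients. All dimensions, query counts, and the 6D rotation parameterization are illustrative guesses rather than Meta's actual architecture.

```python
# Highly simplified encoder-decoder regressor in the spirit of SAM 3D Body.
# Sizes and heads are assumptions for illustration only.
import torch
import torch.nn as nn


class BodyParamRegressor(nn.Module):
    def __init__(self, dim=256, num_joints=24, shape_dims=10, num_tokens=196):
        super().__init__()
        self.patch_proj = nn.Linear(768, dim)                 # project ViT-style patch features
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.queries = nn.Parameter(torch.zeros(1, num_joints + 1, dim))  # joint queries + one shape query
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.pose_head = nn.Linear(dim, 6)                    # 6D rotation per joint
        self.shape_head = nn.Linear(dim, shape_dims)          # body shape coefficients

    def forward(self, patch_features):
        """patch_features: (batch, num_tokens, 768) image features."""
        tokens = self.patch_proj(patch_features) + self.pos_embed
        memory = self.encoder(tokens)
        out = self.decoder(self.queries.expand(patch_features.size(0), -1, -1), memory)
        pose = self.pose_head(out[:, :-1])                    # (batch, num_joints, 6)
        shape = self.shape_head(out[:, -1])                   # (batch, shape_dims)
        return pose, shape


if __name__ == "__main__":
    model = BodyParamRegressor()
    feats = torch.randn(2, 196, 768)
    pose, shape = model(feats)
    print(pose.shape, shape.shape)   # torch.Size([2, 24, 6]) torch.Size([2, 10])
```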
Benchmark Performance Across Tasks
SAM 3 sets a new standard in concept segmentation, roughly doubling accuracy scores on the SA-Co benchmark relative to existing models. It also outperforms specialist systems like OWLv2 and generalist vision models such as Gemini 2.5 Pro. In human preference tests, users chose SAM 3’s outputs over OWLv2’s by about three to one. For SAM 3D, Meta released the SAM 3D Artist Object Dataset, a diverse collection of 3D meshes grounded in real-world imagery, offering a more challenging benchmark for 3D reconstruction than synthetic datasets. Both SAM 3D models achieve state-of-the-art performance on their respective tasks, from object modeling to body pose estimation.
Real-World Applications from Shopping to Science
Meta is deploying SAM 3 and SAM 3D across its platforms and in research. On Facebook Marketplace, the View in Room feature uses SAM 3D to create AR overlays of furniture in real space. In scientific contexts, SAM 3 is used in wildlife conservation projects like SA-FARI, which includes annotated video footage of over 100 species. FathomNet is using the technology to improve underwater object segmentation for marine exploration. In the creative domain, SAM 3 powers new effects in the Instagram Edits app and the Meta AI app’s Vibes feature, enabling creators to apply targeted visual changes with minimal effort.
Limitations and Opportunities for Future Research
Despite strong performance, SAM 3 and SAM 3D have limitations. SAM 3 struggles with long or relational prompts like “the second book from the left,” and performance can drop in zero-shot settings involving niche domains like medical imagery. SAM 3D, while strong in reconstruction, is also challenged by rare shapes and partial occlusion in unfamiliar contexts. To close this domain gap, Meta is releasing fine-tuning tools and partnering with Roboflow to make it easier for developers to adapt the models to specific use cases. SAM 3 also processes each tracked object separately, which can be inefficient in scenes with many overlapping elements. Future work could focus on shared object context to improve speed and accuracy.
Exploring SAM 3 and SAM 3D Through the Playground
To make these models accessible, Meta launched the Segment Anything Playground, an interactive web platform that lets users upload media and experiment with segmentation and 3D reconstruction tools. Templates help users apply privacy blurring, visual effects, and object tracking without coding. Select videos from the Aria Gen 2 smart glasses are also available, showing SAM 3’s capability on wearable camera footage. The Playground supports both public testing and research exploration, giving users a hands-on understanding of Meta’s models.
Conclusion
With SAM 3 and SAM 3D, Meta is advancing the state of visual AI across 2D and 3D domains. These models support flexible prompts, generalize across tasks, and enable new experiences in both everyday tools and scientific research. Through open-source code, shared datasets, and platforms like the Playground, Meta is inviting the AI community to build, fine-tune, and explore the possibilities of open-vocabulary segmentation and 3D perception. Together, these tools represent a major step toward more interactive, adaptable, and intelligent visual systems.