Meta unveils Locate 3D: Self-Supervised Scene Understanding for Real-World Object Localization
Locate 3D is a self-supervised model from FAIR at Meta for localizing objects in 3D environments using natural language queries. The system interprets commands like “the small coffee table between the sofa and the lamp” by identifying the corresponding objects directly from sensor data, without requiring manually refined meshes or segmentation masks at inference time. Its architecture is designed to fuse RGB-D inputs with text instructions so it can operate in the kinds of real-world environments encountered by robots and AR systems.
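To make that input/output contract concrete, here is a minimal sketch of how such a system could be invoked. Everything in it (the `localize` function, the frame dictionary keys, and the return fields) is a hypothetical illustration of the description above, not Meta's released API.

```python
import numpy as np

# Hypothetical interface sketch (not Meta's released API): it only illustrates the
# contract described above -- RGB-D frames plus a text query in, 3D masks and boxes out.
def localize(rgbd_frames, query):
    """rgbd_frames: list of {'rgb': HxWx3 uint8, 'depth': HxW float, 'pose': 4x4 camera-to-world}.
    Returns candidates as {'point_mask': bool array over scene points,
                           'box': (center_xyz, size_xyz), 'score': float}."""
    # Placeholder result so the sketch runs; a real model would execute the full pipeline here.
    return [{"point_mask": np.zeros(0, dtype=bool),
             "box": (np.zeros(3), np.ones(3)),
             "score": 0.0}]

frames = [{"rgb": np.zeros((480, 640, 3), np.uint8),
           "depth": np.ones((480, 640), np.float32),
           "pose": np.eye(4)}]
detections = localize(frames, "the small coffee table between the sofa and the lamp")
print(detections[0]["box"])
```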
The Role of 3D-JEPA in Self-Supervised Representation Learning
Central to Locate 3D is the 3D-JEPA pretraining framework. JEPA (Joint Embedding Predictive Architecture), originally applied to 2D vision, is extended here to 3D, enabling self-supervised learning on point clouds enriched with features from 2D foundation models such as CLIP and DINO. Rather than learning from explicit object annotations, 3D-JEPA trains the model to predict the latent features of masked regions of a 3D point cloud from the surrounding context, producing richer and more contextualized scene representations. These features go beyond local patches and encode holistic spatial and semantic understanding of the scene.
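Conceptually, the training objective is a masked latent-prediction loss: a context encoder sees the unmasked portion of the featurized point cloud, a predictor estimates the latent features of the masked regions, and the targets come from a separate copy of the encoder (typically updated as an exponential moving average). The sketch below, with made-up module sizes and a simplified masking scheme, is meant only to illustrate that idea, not to reproduce the paper's architecture.

```python
import torch
import torch.nn as nn

D_IN, D_LAT = 768, 512  # lifted 2D feature dim -> latent dim (illustrative values)

class PointEncoder(nn.Module):
    """Stand-in for a transformer encoder over featurized 3D points."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_IN, D_LAT)
        layer = nn.TransformerEncoderLayer(D_LAT, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feats):                  # feats: (B, N, D_IN)
        return self.backbone(self.proj(feats))

encoder = PointEncoder()             # context encoder, trained by backprop
target_encoder = PointEncoder()      # target encoder, typically an EMA copy (frozen here)
predictor = nn.Linear(D_LAT, D_LAT)  # predicts latents of masked regions from context

def jepa_step(feats, mask):
    """feats: (B, N, D_IN) lifted 2D features per point; mask: (B, N) bool, True = masked."""
    context = feats * (~mask).unsqueeze(-1)   # hide the masked regions from the context branch
    pred = predictor(encoder(context))        # predict latent features for every point
    with torch.no_grad():
        target = target_encoder(feats)        # targets come from the full, unmasked scene
    # The loss is computed only at masked locations: the model predicts latent
    # features rather than raw points or semantic labels.
    return ((pred - target) ** 2)[mask].mean()

loss = jepa_step(torch.randn(2, 1024, D_IN), torch.rand(2, 1024) > 0.5)
loss.backward()
# After each optimizer step, the target encoder's weights would be updated as an
# exponential moving average of the context encoder's weights (omitted here).
```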
Architecture and Localization Workflow
Locate 3D operates in three distinct stages. First, it processes RGB-D inputs into a voxelized point cloud enriched with lifted 2D features. Next, the 3D-JEPA encoder produces a contextualized feature space over the entire scene. Finally, a language-conditioned decoder matches these features to the natural language query and outputs 3D masks and bounding boxes. The decoder combines multi-stage attention and cross-attention mechanisms with prediction heads that directly infer object locations and their correspondence to the query.
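A schematic of those three stages might look like the following PyTorch sketch; the stubs for feature lifting, the encoder, and the decoder are placeholders chosen to illustrate the data flow, not the released implementation.

```python
import torch
import torch.nn as nn

D = 512  # illustrative feature dimension

# ---- Stage 1 (stub): lift 2D foundation-model features onto a voxelized point cloud. ----
def lift_2d_features(rgbd_frames, n_points=2048):
    # A real implementation would back-project pixels using depth and camera pose and
    # pool CLIP/DINO features per voxel; here we return random placeholders.
    return torch.rand(1, n_points, 3), torch.randn(1, n_points, D)

# ---- Stage 2 (stub): contextualize point features with the pretrained 3D-JEPA encoder. ----
jepa_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2)

# ---- Stage 3: language-conditioned decoder with cross-attention and prediction heads. ----
class Decoder(nn.Module):
    def __init__(self, n_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, D))
        self.text_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
        self.scene_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
        self.mask_head = nn.Linear(D, D)   # dotted with point features -> per-point mask logits
        self.box_head = nn.Linear(D, 6)    # (center_xyz, size_xyz)

    def forward(self, scene_feats, text_feats):
        q = self.queries.unsqueeze(0).expand(scene_feats.size(0), -1, -1)
        q, _ = self.text_attn(q, text_feats, text_feats)     # condition object queries on the text
        q, _ = self.scene_attn(q, scene_feats, scene_feats)  # attend over the encoded 3D scene
        mask_logits = torch.einsum("bqd,bnd->bqn", self.mask_head(q), scene_feats)
        return mask_logits, self.box_head(q)

decoder = Decoder()
points, feats = lift_2d_features(rgbd_frames=None)
scene = jepa_encoder(feats)
text_feats = torch.randn(1, 12, D)         # stand-in for encoded query tokens
masks, boxes = decoder(scene, text_feats)  # 3D masks + bounding boxes for the referred object
```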


Locate 3D Dataset (L3DD) and Evaluation
The team introduced the Locate 3D Dataset (L3DD) to support robust training and evaluation. L3DD includes over 130,000 annotations across 1,346 scenes drawn from ScanNet, ScanNet++, and ARKitScenes. Compared to prior datasets, L3DD provides greater diversity in scenes, objects, and annotations, enabling a more rigorous test of model generalization. When incorporated into training (as Locate 3D+), it yields significant performance boosts, highlighting the value of scene diversity in data.
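Conceptually, each L3DD example pairs a referring expression with a target object in a specific scene. The record below is an assumed, simplified illustration of that pairing; the field names and values are invented for clarity and do not reflect the actual file format.

```python
from dataclasses import dataclass

@dataclass
class ReferringAnnotation:
    """Assumed, simplified view of one L3DD-style example (field names are illustrative)."""
    source_dataset: str     # "ScanNet", "ScanNet++", or "ARKitScenes"
    scene_id: str           # which of the 1,346 scenes the example comes from
    expression: str         # the natural-language referring expression
    target_box: tuple       # (center_xyz, size_xyz) of the referred object
    target_point_ids: list  # indices of scene points belonging to the object

example = ReferringAnnotation(
    source_dataset="ScanNet",
    scene_id="scene0000_00",
    expression="the lamp closest to the window",
    target_box=((1.2, 0.4, 0.8), (0.3, 0.3, 0.6)),
    target_point_ids=[10452, 10453, 10460],
)
```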

Performance Benchmarks and Generalization
Locate 3D achieves state-of-the-art results on referential grounding benchmarks like SR3D, NR3D, and ScanRefer, outperforming prior models that rely on mesh-based input. It also demonstrates strong out-of-domain generalization. Notably, performance improves further with 3D-JEPA pretraining and foundation features from larger models like CLIP-L and DINOv2. When evaluated on unseen datasets and robot environments, Locate 3D consistently maintains robust accuracy, validating its design for real-world applications.


Robot Deployment and Real-World Implications
Locate 3D was deployed on a Spot robot tasked with navigating a multi-room apartment and identifying a plush toy. It succeeded in 8 out of 10 trials, outperforming the baselines it was compared against. This deployment illustrates the system’s ability to operate without post-processing or annotations, making it a compelling choice for embodied AI tasks such as manipulation, navigation, and AR-guided interaction.
Comparison to Prior Work in 3D Referring Expression
Locate 3D advances prior work by removing reliance on region proposals, meshes, and post-processed reconstructions. Unlike models that depend on 2D-to-3D projection or require fine-tuned visual pipelines, Locate 3D provides an integrated architecture that handles raw sensor input. This simplifies deployment and makes the model more adaptable to real-world variability, from home robotics to AR headsets.
Limitations and Future Work
A key limitation of Locate 3D is its reliance on static or quasi-static environments, which allows caching of features for faster inference. Extending the approach to dynamic scenes will require online feature lifting and continuous updates, which pose engineering and research challenges. Furthermore, while 3D-JEPA shows strong generalization, future work may explore integrating real-time adaptation mechanisms or extending support to broader multimodal understanding beyond referential grounding.
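The caching the authors rely on can be pictured as a simple per-scene memo: the expensive steps (feature lifting and 3D-JEPA encoding) run once per scene, while the lightweight language-conditioned decoding runs per query. The snippet below is a generic sketch of that pattern with placeholder function arguments, not code from the paper.

```python
# Generic sketch of per-scene feature caching for quasi-static environments.
# `lift_and_encode` and `decode` stand in for the expensive scene encoding and the
# cheap language-conditioned decoding described above.
_scene_cache = {}

def localize_cached(scene_id, rgbd_frames, query, lift_and_encode, decode):
    if scene_id not in _scene_cache:                 # pay the encoding cost once per scene
        _scene_cache[scene_id] = lift_and_encode(rgbd_frames)
    return decode(_scene_cache[scene_id], query)     # per-query work stays lightweight

def invalidate(scene_id):
    """In a dynamic scene, cached features would need to be refreshed or updated online."""
    _scene_cache.pop(scene_id, None)
```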
Conclusion
Locate 3D introduces a practical and scalable method for localizing objects in 3D environments using self-supervised learning and foundation model features. Through its integration of 3D-JEPA, an innovative self-supervised algorithm, and a flexible decoding architecture, the system delivers robust performance across benchmarks and real-world settings. Locate 3D sets a new standard for referential grounding and opens avenues for more adaptable and general-purpose spatial understanding models in embodied AI.
Tags: ML News