
Meta Unveils V-JEPA 2: A World Model for Physical Understanding, Prediction, and Robot Planning

Created on June 11 | Last edited on June 11
Meta has released V-JEPA 2, its latest advancement in world models for physical reasoning and robot planning. The model builds on Meta’s earlier JEPA (Joint Embedding Predictive Architecture) framework and is designed to process and learn from raw video data. At 1.2 billion parameters, V-JEPA 2 achieves state-of-the-art performance on visual prediction and action understanding, while also enabling zero-shot robot control in unfamiliar settings. By open-sourcing the model and its associated benchmarks, Meta aims to accelerate the development of general-purpose AI agents that can perceive, reason, and act in complex real-world environments.

What World Models Aim to Solve

World models are central to the idea of enabling AI systems to understand and interact with their environment the way humans do. These models don’t just identify objects or actions—they internally simulate how the world might evolve. This ability allows agents to make plans, anticipate consequences, and adapt to new scenarios. Humans, for example, instinctively know that a dropped glass will fall and shatter. That kind of intuitive physics underpins human behavior in everything from sports to cooking. Meta wants AI agents to develop similar intuition using visual experience rather than explicit programming.

Architecture and Training of V-JEPA 2

V-JEPA 2 uses a joint embedding predictive architecture that includes an encoder and a predictor. The encoder transforms video into meaningful embeddings that represent the observed world. The predictor uses these embeddings and context to forecast future states. Importantly, the model is trained in two phases. In the first phase, actionless pre-training, it learns from more than a million hours of video and images. This builds a general understanding of physical interactions. In the second phase, it is fine-tuned with robot data that includes both visual observations and the robot’s control actions, allowing it to predict how its own actions will change the environment.
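To make the two-stage setup concrete, here is a minimal, self-contained sketch of the joint embedding predictive idea in PyTorch. All module names, layer choices, and dimensions are illustrative assumptions (V-JEPA 2's actual encoder and predictor are far larger transformer-based models); the point is only to show an encoder producing embeddings, a predictor forecasting a future embedding, and the same predictor optionally conditioned on actions for the second stage.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Maps a clip of per-frame features to a sequence of latent embeddings."""

    def __init__(self, frame_dim=768, embed_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(frame_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, frames):           # frames: (batch, time, frame_dim)
        return self.proj(frames)         # (batch, time, embed_dim)


class Predictor(nn.Module):
    """Predicts the next embedding from past embeddings (and, optionally, actions)."""

    def __init__(self, embed_dim=256, action_dim=0):
        super().__init__()
        self.rnn = nn.GRU(embed_dim + action_dim, embed_dim, batch_first=True)
        self.head = nn.Linear(embed_dim, embed_dim)

    def forward(self, context, actions=None):
        x = context if actions is None else torch.cat([context, actions], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h[:, -1])       # embedding predicted for the next time step


encoder, predictor = Encoder(), Predictor()

# Stage 1 (actionless pre-training): predict the embedding of a held-out future
# frame from the preceding frames. Computing the target embedding without
# gradients is a simplification of the stop-gradient / target-encoder tricks
# typically used to avoid representation collapse.
frames = torch.randn(8, 16, 768)                       # toy batch of 16-step clips
context, target = frames[:, :-1], frames[:, -1:]
with torch.no_grad():
    target_emb = encoder(target)[:, 0]
pred_emb = predictor(encoder(context))
loss = nn.functional.l1_loss(pred_emb, target_emb)     # regression loss in embedding space
loss.backward()

# Stage 2 (action-conditioned fine-tuning): the same idea, but the predictor also
# receives the robot's control actions, so it learns how its own actions change
# the scene. The 7-dimensional action is an assumed placeholder.
ac_predictor = Predictor(action_dim=7)
actions = torch.randn(8, 15, 7)                        # one action per context step
pred_emb_ac = ac_predictor(encoder(context), actions)
```

The key design choice, consistent with the JEPA framework, is that prediction happens in embedding space rather than pixel space, so the model is not forced to reconstruct every visual detail of a future frame in order to learn how the world evolves.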

This setup allows V-JEPA 2 to excel on multiple tasks. For example, it achieves top results on the Something-Something v2 dataset for motion understanding and on the Epic-Kitchens-100 benchmark for action anticipation. When aligned with a language model, it also achieves state-of-the-art results on video question answering benchmarks such as Perception Test and TempCompass.

Robot Planning with Visual Goals

V-JEPA 2 is not just a theoretical improvement: it performs real-world planning tasks without needing fine-tuning for specific environments. Using only 62 hours of robot data, the model can be deployed for zero-shot control tasks such as picking and placing objects it hasn’t seen before. For short-horizon tasks, the system is given a visual goal state, predicts the outcomes of candidate action sequences, and selects the sequence that brings the robot closest to that goal. For longer-horizon tasks, it is given a sequence of visual subgoals, much as a person follows step-by-step visual instructions. In both settings, V-JEPA 2 shows strong performance, achieving success rates of 65 to 80 percent on manipulation tasks involving unseen objects.
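The planning loop described above can be sketched as a simple model-predictive control routine. The toy function below reuses the Encoder and action-conditioned Predictor from the earlier sketch: it samples random candidate action sequences, rolls each one out in embedding space, and returns the first action of the sequence whose predicted final state lands closest to the goal image's embedding. Random sampling, the zero-filled action history, and the Euclidean distance score are simplifying assumptions for illustration, not the actual optimizer used by V-JEPA 2.

```python
import torch


def plan(encoder, predictor, current_frames, goal_frame,
         horizon=5, num_candidates=256, action_dim=7):
    """Return the first action of the candidate sequence whose predicted
    outcome lands closest to the goal embedding."""
    with torch.no_grad():
        ctx = encoder(current_frames)                         # (1, T, D) observed context
        goal = encoder(goal_frame.unsqueeze(1))[:, 0]         # (1, D) embedding of the goal image

        # Random candidate action sequences: a simple stand-in for a real optimizer.
        candidates = torch.randn(num_candidates, horizon, action_dim)

        emb = ctx.expand(num_candidates, -1, -1).clone()      # one rollout per candidate
        # Actions for the already-observed context are unknown; zeros as a placeholder.
        acts = torch.zeros(num_candidates, emb.shape[1] - 1, action_dim)
        for t in range(horizon):
            acts = torch.cat([acts, candidates[:, t:t + 1]], dim=1)   # action taken now
            next_emb = predictor(emb, acts)                   # (N, D) predicted next state
            emb = torch.cat([emb, next_emb.unsqueeze(1)], dim=1)

        # Score each rollout by how close its final predicted state is to the goal.
        scores = torch.norm(emb[:, -1] - goal, dim=-1)        # (N,)
        best = scores.argmin()
    return candidates[best, 0]    # execute the first action only, then replan


# Example call with the toy modules from the previous sketch (shapes are assumed):
first_action = plan(encoder, ac_predictor,
                    current_frames=torch.randn(1, 16, 768),
                    goal_frame=torch.randn(1, 768))
```

Executing only the first action and then replanning from the new observation is the standard model-predictive control pattern; longer-horizon tasks simply swap in the next visual subgoal once the current one is reached.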

Benchmarking Physical Understanding

To help the broader research community measure progress in this area, Meta is releasing three new benchmarks. IntPhys 2 evaluates whether models can recognize physically plausible scenarios: it uses paired videos in which one clip subtly breaks physical laws, and the model must detect which one is impossible. MVPBench tests video-language models with minimal-change video pairs designed to prevent models from exploiting superficial shortcuts. CausalVQA focuses on cause-and-effect understanding in video, covering counterfactual reasoning and future prediction. While humans perform near-perfectly on these tasks, V-JEPA 2 and similar models still fall short, highlighting how far the field has to go.
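As a rough illustration of how a paired-video benchmark in the style of IntPhys 2 could be scored with a predictive model, the sketch below (again reusing the toy modules from the architecture sketch) measures each clip's "surprise" as the model's own prediction error and flags the higher-surprise clip as the physically implausible one. This scoring rule is an assumption for illustration, not the benchmark's official evaluation protocol.

```python
import torch
import torch.nn.functional as F


def surprise(encoder, predictor, frames):
    """Mean prediction error over a clip: how 'surprised' the model is by it."""
    with torch.no_grad():
        emb = encoder(frames)                             # (1, T, D)
        errors = []
        for t in range(2, emb.shape[1]):
            pred = predictor(emb[:, :t])                  # predict the embedding at step t
            errors.append(F.l1_loss(pred, emb[:, t]))
        return torch.stack(errors).mean()


def classify_pair(encoder, predictor, video_a, video_b):
    """Return the index (0 or 1) of the clip judged physically implausible."""
    return 0 if surprise(encoder, predictor, video_a) > surprise(encoder, predictor, video_b) else 1


# Example call with the toy, untrained modules from the architecture sketch:
guess = classify_pair(encoder, predictor,
                      video_a=torch.randn(1, 16, 768),
                      video_b=torch.randn(1, 16, 768))
```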

Next Steps for Advanced Machine Intelligence

Meta’s long-term goal is to build world models that can operate over multiple time scales and sensory modalities. Future versions of JEPA may support hierarchical planning and use additional inputs like audio and tactile data. By sharing V-JEPA 2, the associated training code, benchmarks, and performance data, Meta is aiming to establish a foundation for further exploration into AI that can adapt to, reason about, and plan within the physical world.
As AI development continues to focus on real-world applicability, V-JEPA 2 marks a significant step in bridging perception and action, bringing general-purpose embodied intelligence closer to reality.
Tags: ML News