INTRODUCING V-JEPA 2
Video Joint Embedding Predictive Architecture 2 (V-JEPA 2) is the first world model trained on video that achieves state-of-the-art visual understanding and prediction, enabling zero-shot robot control in new environments.
CAPABILITIES
V-JEPA 2 is the next step toward our vision of AI that leverages a world model to understand physical reality, anticipate outcomes, and plan efficient strategies, all with minimal supervision.
V-JEPA 2 delivers exceptional motion understanding as well as leading visual reasoning capabilities when combined with language modeling.
V-JEPA 2 can make predictions about how the world will evolve, setting a new state-of-the-art in anticipating actions from contextual cues.
Building on the ability to understand and predict, V-JEPA 2 can be used for zero-shot robot planning to interact with unfamiliar objects in new environments.
We train V-JEPA 2 on 62 hours of robot data from the DROID dataset, then deploy it on a robot arm in new environments. With tasks specified as goal images, the model accomplishes reaching, grasping, and pick-and-place. Because the approach is task-agnostic, it does not require extensive robot data from the target environment or task-specific demonstrations.
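To make the goal-image planning loop concrete, here is a minimal sketch of model-predictive control in latent space. The `encoder` and `predictor` modules, their signatures, and the cross-entropy-method search are illustrative assumptions, not V-JEPA 2's actual interface.

```python
import torch

@torch.no_grad()
def plan_next_action(encoder, predictor, context_frames, goal_image,
                     action_dim=7, n_samples=256, n_iters=3, n_elite=32):
    """Hypothetical sketch: choose the action whose predicted outcome
    embedding is closest to the goal-image embedding (CEM-style search)."""
    z_ctx = encoder(context_frames)        # (1, D) current-state embedding
    z_goal = encoder(goal_image)           # (1, D) desired-state embedding

    mean = torch.zeros(action_dim)
    std = torch.ones(action_dim)
    for _ in range(n_iters):
        # Sample candidate actions from the current search distribution.
        actions = mean + std * torch.randn(n_samples, action_dim)
        # Roll the world model forward one step for every candidate.
        z_pred = predictor(z_ctx.expand(n_samples, -1), actions)
        # Energy of a candidate: distance to the goal in embedding space.
        energy = (z_pred - z_goal).abs().mean(dim=-1)
        # Refit the distribution to the lowest-energy (elite) candidates.
        elite = actions[energy.topk(n_elite, largest=False).indices]
        mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-4
    return mean

# In a control loop, the robot would execute the returned action,
# observe the new state, and re-plan, so rollout errors do not compound.
```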
MODEL ARCHITECTURE
V-JEPA 2 employs a two-phase training approach.
The encoder and predictor are pre-trained through self-supervised learning from visual data, leveraging abundant natural videos to bootstrap physical world understanding and prediction.
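As a rough illustration of this first phase, the sketch below shows one masked latent-prediction update of the kind used in JEPA-style training: a target encoder (an exponential moving average of the context encoder) embeds all patches, and the predictor is trained to regress the hidden patches' embeddings from the visible context. All module names and signatures here are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def jepa_pretrain_step(context_encoder, target_encoder, predictor,
                       optimizer, patches, mask):
    """Hypothetical sketch of one JEPA-style pre-training step.
    patches: (B, N, D_in) tokenized video patches.
    mask:    (N,) boolean, True where patches are hidden from the context."""
    # 1. Targets: embeddings of all patches from the EMA encoder, no grads.
    with torch.no_grad():
        z_tgt = target_encoder(patches)             # (B, N, D)

    # 2. Context: encode only the visible patches.
    z_ctx = context_encoder(patches[:, ~mask])      # (B, N_visible, D)

    # 3. Predict the masked-patch embeddings from the visible context.
    z_pred = predictor(z_ctx, mask)                 # (B, N_masked, D)

    # 4. Regress predictions onto targets in latent space (L1 loss).
    loss = F.l1_loss(z_pred, z_tgt[:, mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 5. Momentum (EMA) update of the target encoder.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(),
                            context_encoder.parameters()):
            p_t.mul_(0.999).add_(p_c, alpha=0.001)
    return loss.item()
```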
Fine-tuning on a small amount of robot data enables efficient planning without requiring extensive expert robot demonstrations, which are much harder to collect at scale.
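A correspondingly minimal sketch of the second phase, under the assumption that the pre-trained encoder is frozen and an action-conditioned predictor learns one-step latent transitions from (frame, action, next frame) triplets in the robot data; the names and signatures are again hypothetical.

```python
import torch
import torch.nn.functional as F

def finetune_step(encoder, action_predictor, optimizer,
                  frames_t, actions_t, frames_t1):
    """Hypothetical sketch: action-conditioned fine-tuning on robot data.
    The frozen encoder embeds frames; the predictor learns the latent
    transition z_{t+1} ~ P(z_t, a_t)."""
    with torch.no_grad():                       # encoder stays frozen
        z_t = encoder(frames_t)                 # embedding before the action
        z_t1 = encoder(frames_t1)               # embedding after the action
    z_pred = action_predictor(z_t, actions_t)   # predicted next embedding
    loss = F.l1_loss(z_pred, z_t1)              # regress onto the true outcome
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once trained this way, the same predictor can drive the goal-image planning loop sketched above.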
WORLD MODELS
What if AI could reason and plan as effortlessly as we do? This is one of the grand scientific challenges we’re tackling at Meta.
APPLICATIONS
Potential model applications
We’re releasing V-JEPA 2 for the community to build upon this work. We expect world models to power novel experiences and groundbreaking applications across diverse domains.
Robotic assistants
We expect world models to unlock a new era of robotics, powering AI agents that navigate physical environments to tackle household chores and complex tasks.
Wearable assistants
World models can enable assistive technology that helps individuals navigate busy environments, providing real-time alerts about approaching obstacles and hazards.