Takeaways:
Today, we’re announcing the Meta Segment Anything Model 2 (SAM 2), the next generation of the Meta Segment Anything Model, now supporting object segmentation in videos and images. We’re releasing SAM 2 under an Apache 2.0 license, so anyone can use it to build their own experiences. We’re also sharing SA-V, the dataset we used to build SAM 2 under a CC BY 4.0 license and releasing a web-based demo experience where everyone can try a version of our model in action.
Object segmentation—identifying the pixels in an image that correspond to an object of interest—is a fundamental task in the field of computer vision. The Meta Segment Anything Model (SAM) released last year introduced a foundation model for this task on images.
Our latest model, SAM 2, is the first unified model for real-time, promptable object segmentation in images and videos, enabling a step-change in the video segmentation experience and seamless use across image and video applications. SAM 2 exceeds previous capabilities in image segmentation accuracy and achieves better video segmentation performance than existing work, while requiring three times less interaction time. SAM 2 can also segment any object in any video or image (commonly described as zero-shot generalization), which means that it can be applied to previously unseen visual content without custom adaptation.
Before SAM was released, creating an accurate object segmentation model for specific image tasks required highly specialized work by technical experts with access to AI training infrastructure and large volumes of carefully annotated in-domain data. SAM revolutionized this space, enabling application to a wide variety of real-world image segmentation and out-of-the-box use cases via prompting techniques—similar to how large language models can perform a range of tasks without requiring custom data or expensive adaptations.
In the year since we launched SAM, the model has made a tremendous impact across disciplines. It has inspired new AI-enabled experiences in Meta’s family of apps, such as Backdrop and Cutouts on Instagram, and catalyzed diverse applications in science, medicine, and numerous other industries. Many of the largest data annotation platforms have integrated SAM as the default tool for object segmentation annotation in images, saving millions of hours of human annotation time. SAM has also been used in marine science to segment Sonar images and analyze coral reefs, in satellite imagery analysis for disaster relief, and in the medical field, segmenting cellular images and aiding in detecting skin cancer.
As Mark Zuckerberg noted in an open letter last week, open source AI “has more potential than any other modern technology to increase human productivity, creativity, and quality of life,” all while accelerating economic growth and advancing groundbreaking medical and scientific research. We’ve been tremendously impressed by the progress the AI community has made using SAM, and we envisage SAM 2 unlocking even more exciting possibilities.
In keeping with our open science approach, we’re sharing our research on SAM 2 with the community so they can explore new capabilities and use cases. The artifacts we’re sharing today include:
As a unified model, SAM 2 can power use cases seamlessly across image and video data and be extended to previously unseen visual domains. For the AI research community and others, SAM 2 could be a component as part of a larger AI system for a more general multimodal understanding of the world. In industry, it could enable faster annotation tools for visual data to train the next generation of computer vision systems, such as those used in autonomous vehicles. SAM 2’s fast inference capabilities could inspire new ways of selecting and interacting with objects in real time or live video. For content creators, SAM 2 could enable creative applications in video editing and add controllability to generative video models. SAM 2 could also be used to aid research in science and medicine—for example, tracking endangered animals in drone footage or localizing regions in a laparoscopic camera feed during a medical procedure. We believe the possibilities are broad, and we’re excited to share this technology with the AI community to see what they build and learn.
How we built SAM 2
SAM was able to learn a general notion of what objects are in images. However, images are only a static snapshot of the dynamic real world in which visual segments can exhibit complex motion. Many important real-world use cases require accurate object segmentation in video data, for example in mixed reality, robotics, autonomous vehicles, and video editing. We believe that a universal segmentation model should be applicable to both images and video.
An image can be considered a very short video with a single frame. We adopt this perspective to develop a unified model that supports both image and video input seamlessly. The only difference in handling video is that the model needs to rely on memory to recall previously processed information for that video in order to accurately segment an object at the current timestep.
Successful segmentation of objects in video requires an understanding of where entities are across space and time. Compared to segmentation in images, videos present significant new challenges. Object motion, deformation, occlusion, lighting changes, and other factors can drastically change from frame to frame. Videos are often lower quality than images due to camera motion, blur, and lower resolution, adding to the difficulty. As a result, existing video segmentation models and datasets have fallen short in providing a comparable “segment anything” capability for video. We solved many of these challenges in our work to build SAM 2 and the new SA-V dataset.
Similar to the methodology we used for SAM, our research on enabling video segmentation capabilities involves designing a new task, a model, and a dataset. We first develop the promptable visual segmentation task and design a model (SAM 2) capable of performing this task. We use SAM 2 to aid in creating a video object segmentation dataset (SA-V), which is an order of magnitude larger than anything that exists currently, and use this to train SAM 2 to achieve state-of-the-art performance.
Promptable visual segmentation
We design a promptable visual segmentation task that generalizes the image segmentation task to the video domain. SAM was trained to take as input points, boxes, or masks in an image to define the target object and predict a segmentation mask. With SAM 2, we train it to take input prompts in any frame of a video to define the spatio-temporal mask (i.e. a “masklet”) to be predicted. SAM 2 makes an immediate prediction of the mask on the current frame based on the input prompt and temporally propagates it to generate the masklet of the target object across all video frames. Once an initial masklet has been predicted, it can be iteratively refined by providing additional prompts to SAM 2 in any frame. This can be repeated as many times as required until the desired masklet is obtained.
Image and video segmentation in a unified architecture
The SAM 2 architecture can be seen as a generalization of SAM from the image to the video domain. SAM 2 can be prompted by clicks (positive or negative), bounding boxes, or masks to define the extent of the object in a given frame. A lightweight mask decoder takes an image embedding for the current frame and encoded prompts to output a segmentation mask for the frame. In the video setting, SAM 2 propagates this mask prediction to all video frames to generate a masklet. Prompts can then be iteratively added on any subsequent frame to refine the masklet prediction.
To predict masks accurately across all video frames, we introduce a memory mechanism consisting of a memory encoder, a memory bank, and a memory attention module. When applied to images, the memory components are empty and the model behaves like SAM. For video, the memory components enable storing information about the object and previous user interactions in that session, allowing SAM 2 to generate masklet predictions throughout the video. If there are additional prompts provided on other frames, SAM 2 can effectively correct its predictions based on the stored memory context of the object.
Memories of frames are created by the memory encoder based on the current mask prediction and placed in the memory bank for use in segmenting subsequent frames. The memory bank consists of both memories from previous frames and prompted frames. The memory attention operation takes the per-frame embedding from the image encoder and conditions it on the memory bank to produce an embedding that is then passed to the mask decoder to generate the mask prediction for that frame. This is repeated for all subsequent frames.
We adopt a streaming architecture, which is a natural generalization of SAM to the video domain, processing video frames one at a time and storing information about the segmented objects in the memory. On each newly processed frame, SAM 2 uses the memory attention module to attend to the previous memories of the target object. This design allows for real-time processing of arbitrarily long videos, which is important not only for annotation efficiency in collecting the SA-V dataset but also for real-world applications—for example, in robotics.
SAM introduced the ability to output multiple valid masks when faced with ambiguity about the object being segmented in an image. For example, when a person clicks on the tire of a bike, the model can interpret this click as referring to only the tire or the entire bike and output multiple predictions. In videos, this ambiguity can extend across video frames. For example, if in one frame only the tire is visible, a click on the tire might relate to just the tire, or as more of the bike becomes visible in subsequent frames, this click could have been intended for the entire bike. To handle this ambiguity, SAM 2 creates multiple masks at each step of the video. If further prompts don’t resolve the ambiguity, the model selects the mask with the highest confidence for further propagation in the video.
In the image segmentation task, there is always a valid object to segment in a frame given a positive prompt. In video, it’s possible for no valid object to exist on a particular frame, for example due to the object becoming occluded or disappearing from view. To account for this new output mode, we add an additional model output (“occlusion head”) that predicts whether the object of interest is present on the current frame. This enables SAM 2 to effectively handle occlusions.
SA-V: Building the largest video segmentation dataset
One of the challenges of extending the “segment anything” capability to video is the limited availability of annotated data for training the model. Current video segmentation datasets are small and lack sufficient coverage of diverse objects. Existing dataset annotations typically cover entire objects (e.g., person), but lack object parts (e.g., person’s jacket, hat, shoes), and datasets are often centered around specific object classes, such as people, vehicles, and animals.
To collect a large and diverse video segmentation dataset, we built a data engine, leveraging an interactive model-in-the-loop setup with human annotators. Annotators used SAM 2 to interactively annotate masklets in videos, and then the newly annotated data was used to update SAM 2 in turn. We repeated this cycle many times to iteratively improve both the model and dataset. Similar to SAM, we do not impose semantic constraints on the annotated masklets and focus on both whole objects (e.g., a person) and object parts (e.g., a person’s hat).
With SAM 2, collecting new video object segmentation masks is faster than ever before. Annotation with our tool and SAM 2 in the loop is approximately 8.4 times faster than using SAM per frame and also significantly faster than combining SAM with an off-the-shelf tracker.
Our released SA-V dataset contains over an order of magnitude more annotations and approximately 4.5 times more videos than existing video object segmentation datasets.
Highlights of the SA-V dataset include:
Results
To create a unified model for image and video segmentation, we jointly train SAM 2 on image and video data by treating images as videos with a single frame. We leverage the SA-1B image dataset released last year as part of the Segment Anything project, the SA-V dataset, and an additional internal licensed video dataset.
Key highlights that we detail in our research paper include:
It’s important that we work to build AI experiences that work well for everyone. In order to measure the fairness of SAM 2, we conducted an evaluation on model performance across certain demographic groups. Our results show that the model has minimal performance discrepancy in video segmentation on perceived gender and little variance among the three perceived age groups we evaluated: ages 18 – 25, 26 – 50, and 50+.
Limitations
While SAM 2 demonstrates strong performance for segmenting objects in images and short videos, the model performance can be further improved—especially in challenging scenarios.
SAM 2 may lose track of objects across drastic camera viewpoint changes, after long occlusions, in crowded scenes, or in extended videos. We alleviate this issue in practice by designing the model to be interactive and enabling manual intervention with correction clicks in any frame so the target object can be recovered.
When the target object is only specified in one frame, SAM 2 can sometimes confuse objects and fail to segment the target correctly, as shown with the horses in the above video. In many cases, with additional refinement prompts in future frames, this issue can be entirely resolved and the correct masklet can be obtained throughout the video.
While SAM 2 supports the ability to segment multiple individual objects simultaneously, the efficiency of the model decreases considerably. Under the hood, SAM 2 processes each object separately, utilizing only shared per-frame embeddings, without inter-object communication. While this simplifies the model, incorporating shared object-level contextual information could aid in improving efficiency.
For complex fast moving objects, SAM 2 can sometimes miss fine details and the predictions can be unstable across frames (as shown in the video of the cyclist above). Adding further prompts to refine the prediction in the same frame or additional frames can only partially alleviate this problem During training we do not enforce any penalty on the model predictions if they jitter between frames, so temporal smoothness is not guaranteed. Improving this capability could facilitate real-world applications that require detailed localization of fine structures.
While our data engine uses SAM 2 in the loop and we’ve made significant strides in automatic masklet generation, we still rely on human annotators for some steps such as verifying masklet quality and selecting frames that require correction. Future developments could include further automating the data annotation process to enhance efficiency.
There’s still plenty more work to be done to propel this research even further. We hope the AI community will join us by building with SAM 2 and the resources we’ve released. Together, we can accelerate open science to build powerful new experiences and use cases that benefit people and society.
Putting SAM 2 to work
While many of Meta FAIR’s models used in public demos are hosted on Amazon SageMaker, the session-based requirements of the SAM 2 model pushed up against the boundaries of what our team believed was previously possible on AWS AI Infra. Thanks to the advanced model deployment and managed inference capabilities offered by Amazon SageMaker, we’ve been able to make the SAM 2 release possible—focusing on building state of the art AI models and unique AI demo experiences.
We encourage the AI community to download the model, use the dataset, and try our demo. By sharing this research, we hope to contribute to accelerating progress in universal video and image segmentation and related perception tasks. We look forward to seeing the new insights and useful experiences that will be created by releasing this research to the community.
Our latest updates delivered to your inbox
Subscribe to our newsletter to keep up with Meta AI news, events, research breakthroughs, and more.
Join us in the pursuit of what’s possible with AI.
Foundational models
Latest news
Foundational models