Open Source
Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images
July 29, 2024
15 minute read

Takeaways:


  • Following up on the success of the Meta Segment Anything Model (SAM) for images, we’re releasing SAM 2, a unified model for real-time promptable object segmentation in images and videos that achieves state-of-the-art performance.
  • In keeping with our approach to open science, we’re sharing the code and model weights with a permissive Apache 2.0 license.
  • We’re also sharing the SA-V dataset, which includes approximately 51,000 real-world videos and more than 600,000 masklets (spatio-temporal masks).
  • SAM 2 can segment any object in any video or image—even for objects and visual domains it has not seen previously, enabling a diverse range of use cases without custom adaptation.
  • SAM 2 has many potential real-world applications. For example, the outputs of SAM 2 can be used with a generative video model to create new video effects and unlock new creative applications. SAM 2 could also aid in faster annotation tools for visual data to build better computer vision systems.

A preview of the SAM 2 web-based demo, which allows segmenting and tracking objects in video and applying effects.

Today, we’re announcing the Meta Segment Anything Model 2 (SAM 2), the next generation of the Meta Segment Anything Model, now supporting object segmentation in videos and images. We’re releasing SAM 2 under an Apache 2.0 license, so anyone can use it to build their own experiences. We’re also sharing SA-V, the dataset we used to build SAM 2 under a CC BY 4.0 license and releasing a web-based demo experience where everyone can try a version of our model in action.

Object segmentation—identifying the pixels in an image that correspond to an object of interest—is a fundamental task in the field of computer vision. The Meta Segment Anything Model (SAM) released last year introduced a foundation model for this task on images.

Our latest model, SAM 2, is the first unified model for real-time, promptable object segmentation in images and videos, enabling a step-change in the video segmentation experience and seamless use across image and video applications. SAM 2 exceeds previous capabilities in image segmentation accuracy and achieves better video segmentation performance than existing work, while requiring three times less interaction time. SAM 2 can also segment any object in any video or image (commonly described as zero-shot generalization), which means that it can be applied to previously unseen visual content without custom adaptation.

Before SAM was released, creating an accurate object segmentation model for specific image tasks required highly specialized work by technical experts with access to AI training infrastructure and large volumes of carefully annotated in-domain data. SAM revolutionized this space, enabling application to a wide variety of real-world image segmentation and out-of-the-box use cases via prompting techniques—similar to how large language models can perform a range of tasks without requiring custom data or expensive adaptations.

In the year since we launched SAM, the model has made a tremendous impact across disciplines. It has inspired new AI-enabled experiences in Meta’s family of apps, such as Backdrop and Cutouts on Instagram, and catalyzed diverse applications in science, medicine, and numerous other industries. Many of the largest data annotation platforms have integrated SAM as the default tool for object segmentation annotation in images, saving millions of hours of human annotation time. SAM has also been used in marine science to segment Sonar images and analyze coral reefs, in satellite imagery analysis for disaster relief, and in the medical field, segmenting cellular images and aiding in detecting skin cancer.

As Mark Zuckerberg noted in an open letter last week, open source AI “has more potential than any other modern technology to increase human productivity, creativity, and quality of life,” all while accelerating economic growth and advancing groundbreaking medical and scientific research. We’ve been tremendously impressed by the progress the AI community has made using SAM, and we envisage SAM 2 unlocking even more exciting possibilities.


SAM 2 can be applied out of the box to a diverse range of real-world use cases—for example, tracking objects to create video effects (left) or segmenting moving cells in videos captured from a microscope to aid in scientific research (right).

In keeping with our open science approach, we’re sharing our research on SAM 2 with the community so they can explore new capabilities and use cases. The artifacts we’re sharing today include:

  • The SAM 2 code and weights, which are being open sourced under a permissive Apache 2.0 license. We’re sharing our SAM 2 evaluation code under a BSD-3 license.
  • The SA-V dataset, which has 4.5 times more videos and 53 times more annotations than the existing largest video segmentation dataset. This release includes ~51k real-world videos with more than 600k masklets. We’re sharing SA-V under a CC BY 4.0 license.
  • A web demo, which enables real-time interactive segmentation of short videos and applies video effects on the model predictions.

As a unified model, SAM 2 can power use cases seamlessly across image and video data and be extended to previously unseen visual domains. For the AI research community and others, SAM 2 could be a component as part of a larger AI system for a more general multimodal understanding of the world. In industry, it could enable faster annotation tools for visual data to train the next generation of computer vision systems, such as those used in autonomous vehicles. SAM 2’s fast inference capabilities could inspire new ways of selecting and interacting with objects in real time or live video. For content creators, SAM 2 could enable creative applications in video editing and add controllability to generative video models. SAM 2 could also be used to aid research in science and medicine—for example, tracking endangered animals in drone footage or localizing regions in a laparoscopic camera feed during a medical procedure. We believe the possibilities are broad, and we’re excited to share this technology with the AI community to see what they build and learn.


How we built SAM 2


SAM was able to learn a general notion of what objects are in images. However, images are only a static snapshot of the dynamic real world in which visual segments can exhibit complex motion. Many important real-world use cases require accurate object segmentation in video data, for example in mixed reality, robotics, autonomous vehicles, and video editing. We believe that a universal segmentation model should be applicable to both images and video.

An image can be considered a very short video with a single frame. We adopt this perspective to develop a unified model that supports both image and video input seamlessly. The only difference in handling video is that the model needs to rely on memory to recall previously processed information for that video in order to accurately segment an object at the current timestep.

Successful segmentation of objects in video requires an understanding of where entities are across space and time. Compared to segmentation in images, videos present significant new challenges. Object motion, deformation, occlusion, lighting changes, and other factors can drastically change from frame to frame. Videos are often lower quality than images due to camera motion, blur, and lower resolution, adding to the difficulty. As a result, existing video segmentation models and datasets have fallen short in providing a comparable “segment anything” capability for video. We solved many of these challenges in our work to build SAM 2 and the new SA-V dataset.

Similar to the methodology we used for SAM, our research on enabling video segmentation capabilities involves designing a new task, a model, and a dataset. We first develop the promptable visual segmentation task and design a model (SAM 2) capable of performing this task. We use SAM 2 to aid in creating a video object segmentation dataset (SA-V), which is an order of magnitude larger than anything that exists currently, and use this to train SAM 2 to achieve state-of-the-art performance.

Promptable visual segmentation

SAM 2 supports selecting and refining objects in any video frame.

We design a promptable visual segmentation task that generalizes the image segmentation task to the video domain. SAM was trained to take as input points, boxes, or masks in an image to define the target object and predict a segmentation mask. With SAM 2, we train it to take input prompts in any frame of a video to define the spatio-temporal mask (i.e. a “masklet”) to be predicted. SAM 2 makes an immediate prediction of the mask on the current frame based on the input prompt and temporally propagates it to generate the masklet of the target object across all video frames. Once an initial masklet has been predicted, it can be iteratively refined by providing additional prompts to SAM 2 in any frame. This can be repeated as many times as required until the desired masklet is obtained.


Image and video segmentation in a unified architecture

The evolution of the architecture from SAM to SAM 2.

The SAM 2 architecture can be seen as a generalization of SAM from the image to the video domain. SAM 2 can be prompted by clicks (positive or negative), bounding boxes, or masks to define the extent of the object in a given frame. A lightweight mask decoder takes an image embedding for the current frame and encoded prompts to output a segmentation mask for the frame. In the video setting, SAM 2 propagates this mask prediction to all video frames to generate a masklet. Prompts can then be iteratively added on any subsequent frame to refine the masklet prediction.

To predict masks accurately across all video frames, we introduce a memory mechanism consisting of a memory encoder, a memory bank, and a memory attention module. When applied to images, the memory components are empty and the model behaves like SAM. For video, the memory components enable storing information about the object and previous user interactions in that session, allowing SAM 2 to generate masklet predictions throughout the video. If there are additional prompts provided on other frames, SAM 2 can effectively correct its predictions based on the stored memory context of the object.

Memories of frames are created by the memory encoder based on the current mask prediction and placed in the memory bank for use in segmenting subsequent frames. The memory bank consists of both memories from previous frames and prompted frames. The memory attention operation takes the per-frame embedding from the image encoder and conditions it on the memory bank to produce an embedding that is then passed to the mask decoder to generate the mask prediction for that frame. This is repeated for all subsequent frames.

We adopt a streaming architecture, which is a natural generalization of SAM to the video domain, processing video frames one at a time and storing information about the segmented objects in the memory. On each newly processed frame, SAM 2 uses the memory attention module to attend to the previous memories of the target object. This design allows for real-time processing of arbitrarily long videos, which is important not only for annotation efficiency in collecting the SA-V dataset but also for real-world applications—for example, in robotics.

SAM introduced the ability to output multiple valid masks when faced with ambiguity about the object being segmented in an image. For example, when a person clicks on the tire of a bike, the model can interpret this click as referring to only the tire or the entire bike and output multiple predictions. In videos, this ambiguity can extend across video frames. For example, if in one frame only the tire is visible, a click on the tire might relate to just the tire, or as more of the bike becomes visible in subsequent frames, this click could have been intended for the entire bike. To handle this ambiguity, SAM 2 creates multiple masks at each step of the video. If further prompts don’t resolve the ambiguity, the model selects the mask with the highest confidence for further propagation in the video.


The occlusion head in the SAM 2 architecture is used to predict if an object is visible or not, helping segment objects even when they become temporarily occluded.

In the image segmentation task, there is always a valid object to segment in a frame given a positive prompt. In video, it’s possible for no valid object to exist on a particular frame, for example due to the object becoming occluded or disappearing from view. To account for this new output mode, we add an additional model output (“occlusion head”) that predicts whether the object of interest is present on the current frame. This enables SAM 2 to effectively handle occlusions.

SA-V: Building the largest video segmentation dataset

Videos and masklet annotations from the SA-V dataset.

One of the challenges of extending the “segment anything” capability to video is the limited availability of annotated data for training the model. Current video segmentation datasets are small and lack sufficient coverage of diverse objects. Existing dataset annotations typically cover entire objects (e.g., person), but lack object parts (e.g., person’s jacket, hat, shoes), and datasets are often centered around specific object classes, such as people, vehicles, and animals.

To collect a large and diverse video segmentation dataset, we built a data engine, leveraging an interactive model-in-the-loop setup with human annotators. Annotators used SAM 2 to interactively annotate masklets in videos, and then the newly annotated data was used to update SAM 2 in turn. We repeated this cycle many times to iteratively improve both the model and dataset. Similar to SAM, we do not impose semantic constraints on the annotated masklets and focus on both whole objects (e.g., a person) and object parts (e.g., a person’s hat).

With SAM 2, collecting new video object segmentation masks is faster than ever before. Annotation with our tool and SAM 2 in the loop is approximately 8.4 times faster than using SAM per frame and also significantly faster than combining SAM with an off-the-shelf tracker.

Our released SA-V dataset contains over an order of magnitude more annotations and approximately 4.5 times more videos than existing video object segmentation datasets.

Highlights of the SA-V dataset include:

  • More than 600,000 masklet annotations on approximately 51,000 videos.
  • Videos featuring geographically diverse, real-world scenarios, collected across 47 countries.
  • Annotations that cover whole objects, object parts, and challenging instances where objects become occluded, disappear, and reappear.

Results

Both models are initialized with the mask of the t-shirt in the first frame. For the baseline, we use the mask from SAM. SAM 2 is able to track object parts accurately throughout a video, compared to the baseline which over-segments and includes the person’s head instead of only tracking the t-shirt.

To create a unified model for image and video segmentation, we jointly train SAM 2 on image and video data by treating images as videos with a single frame. We leverage the SA-1B image dataset released last year as part of the Segment Anything project, the SA-V dataset, and an additional internal licensed video dataset.


SAM 2 (right) improves on SAM’s (left) object segmentation accuracy in images.

Key highlights that we detail in our research paper include:


  • SAM 2 significantly outperforms previous approaches on interactive video segmentation across 17 zero-shot video datasets and requires approximately three times fewer human-in-the-loop interactions.
  • SAM 2 outperforms SAM on its 23 dataset zero-shot benchmark suite, while being six times faster.
  • SAM 2 excels at existing video object segmentation benchmarks (DAVIS, MOSE, LVOS, YouTube-VOS) compared to prior state-of-the-art models.
  • Inference with SAM 2 feels real-time at approximately 44 frames per second.
  • SAM 2 in the loop for video segmentation annotation is 8.4 times faster than manual per-frame annotation with SAM.

It’s important that we work to build AI experiences that work well for everyone. In order to measure the fairness of SAM 2, we conducted an evaluation on model performance across certain demographic groups. Our results show that the model has minimal performance discrepancy in video segmentation on perceived gender and little variance among the three perceived age groups we evaluated: ages 18 – 25, 26 – 50, and 50+.


Limitations


While SAM 2 demonstrates strong performance for segmenting objects in images and short videos, the model performance can be further improved—especially in challenging scenarios.

SAM 2 may lose track of objects across drastic camera viewpoint changes, after long occlusions, in crowded scenes, or in extended videos. We alleviate this issue in practice by designing the model to be interactive and enabling manual intervention with correction clicks in any frame so the target object can be recovered.


SAM 2 can sometimes confuse multiple similar looking objects in crowded scenes.

When the target object is only specified in one frame, SAM 2 can sometimes confuse objects and fail to segment the target correctly, as shown with the horses in the above video. In many cases, with additional refinement prompts in future frames, this issue can be entirely resolved and the correct masklet can be obtained throughout the video.

While SAM 2 supports the ability to segment multiple individual objects simultaneously, the efficiency of the model decreases considerably. Under the hood, SAM 2 processes each object separately, utilizing only shared per-frame embeddings, without inter-object communication. While this simplifies the model, incorporating shared object-level contextual information could aid in improving efficiency.


SAM 2 predictions can miss fine details in fast moving objects.

For complex fast moving objects, SAM 2 can sometimes miss fine details and the predictions can be unstable across frames (as shown in the video of the cyclist above). Adding further prompts to refine the prediction in the same frame or additional frames can only partially alleviate this problem During training we do not enforce any penalty on the model predictions if they jitter between frames, so temporal smoothness is not guaranteed. Improving this capability could facilitate real-world applications that require detailed localization of fine structures.

While our data engine uses SAM 2 in the loop and we’ve made significant strides in automatic masklet generation, we still rely on human annotators for some steps such as verifying masklet quality and selecting frames that require correction. Future developments could include further automating the data annotation process to enhance efficiency.

There’s still plenty more work to be done to propel this research even further. We hope the AI community will join us by building with SAM 2 and the resources we’ve released. Together, we can accelerate open science to build powerful new experiences and use cases that benefit people and society.

Putting SAM 2 to work

While many of Meta FAIR’s models used in public demos are hosted on Amazon SageMaker, the session-based requirements of the SAM 2 model pushed up against the boundaries of what our team believed was previously possible on AWS AI Infra. Thanks to the advanced model deployment and managed inference capabilities offered by Amazon SageMaker, we’ve been able to make the SAM 2 release possible—focusing on building state of the art AI models and unique AI demo experiences.


In the future, SAM 2 could be used as part of a larger AI system to identify everyday items via AR glasses that could prompt users with reminders and instructions.

We encourage the AI community to download the model, use the dataset, and try our demo. By sharing this research, we hope to contribute to accelerating progress in universal video and image segmentation and related perception tasks. We look forward to seeing the new insights and useful experiences that will be created by releasing this research to the community.


Share:

Our latest updates delivered to your inbox

Subscribe to our newsletter to keep up with Meta AI news, events, research breakthroughs, and more.

Join us in the pursuit of what’s possible with AI.

Related Posts
Computer Vision
Introducing Segment Anything: Working toward the first foundation model for image segmentation
April 5, 2023
FEATURED
Research
MultiRay: Optimizing efficiency for large-scale AI models
November 18, 2022
FEATURED
ML Applications
MuAViC: The first audio-video speech translation benchmark
March 8, 2023