
How Meta Segment Anything Model enables Cutouts in the Instagram Edits app

May 1, 2025
4 minute read

We recently launched Edits, a new video creation app by Instagram designed for creators. Edits gives mobile-first creators a full solution for short-form video creation and is available globally on iOS and Android. One of the standout features of our new app is Cutouts, which is enabled by Meta Segment Anything Model (SAM) 2.1, the popular open source segmentation model created by the Meta Fundamental AI Research (FAIR) team.

“In 2024, we built a demo as part of our research and as a way to showcase SAM 2 externally to a research audience, independent developers, and the general population,” says Nikhila Ravi, Research Engineering Manager, Meta. “We developed the demo from our perspective, but it was also clear that this could have a lot of practical value for the people who use Meta technologies.”

Less than a year later, that research, released as Segment Anything Model 2.1, is now an important part of Edits. People can use Cutouts to edit across several layers of video, apply filters to specific parts of a video, and easily place elements like text and stickers behind objects. In the first 24 hours after Edits launched, Cutouts was used hundreds of thousands of times. We see this as an impactful tool for video creation, and it’s easily accessible to creators in Edits without expensive software or advanced editing expertise.

While the experience feels seamless in the app, the Meta FAIR team worked behind the scenes to make sure the Segment Anything Model was ready to ship in Edits.

“There are three main steps: first, a user needs to be able to select the object interactively and correctly,” Ravi explains. “Then they need to be able to track the object through the video correctly, even when the object goes out of frame. And finally, we need to be able to run the SAM 2.1 model fast enough to give the user a real-time experience.”

The Cutouts feature in Edits uses an object detection pipeline to automatically suggest an object in a frame of a video that someone might want to turn into a cutout. A user can also switch to manual mode, which allows interactively adding positive clicks to select regions to include in the cutout, as well as negative clicks for regions to exclude. Segment Anything Model 2.1 predicts a high-quality mask, which defines the boundary of the object in the selected frame. After that, the real creative fun begins. You can hit “track” and SAM 2.1 tracks the object, predicting a consistent mask in every frame of the video to generate the cutout. Once you’ve got the cutout, it can be added to new layers and remixed and edited in creative ways with the numerous other tools offered in Edits.
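
For readers who want to experiment with the same click-then-track workflow, the open source sam2 package on GitHub exposes a video predictor with interactive point prompts and video propagation. The sketch below is illustrative rather than the Edits production pipeline: the config, checkpoint, and frame-directory paths are placeholders, and the click coordinates are made up. Label 1 marks a positive click (include this region) and label 0 a negative click (exclude it).

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths; use the SAM 2.1 config/checkpoint downloaded from the
# facebookresearch/sam2 repo and a directory of extracted video frames.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # Load the video and cache per-frame features for interactive prompting.
    state = predictor.init_state(video_path="./video_frames")

    # Interactive selection on frame 0: label 1 is a positive click (include
    # this region in the cutout), label 0 is a negative click (exclude it).
    points = np.array([[480, 300], [520, 420]], dtype=np.float32)
    labels = np.array([1, 0], dtype=np.int32)
    _, obj_ids, mask_logits = predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1, points=points, labels=labels
    )

    # "Track": propagate the selected object through every frame of the video,
    # producing a consistent mask per frame that defines the cutout boundary.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # binary masks for this frame
```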

While Segment Anything Model 2 introduced real-time, promptable segmentation for video, the FAIR team improved on its capabilities with the release of SAM 2.1 in fall 2024. The update introduced additional data augmentation techniques to simulate visually similar objects and small objects, cases where SAM 2 previously struggled. Segment Anything Model 2.1 also improved SAM 2’s occlusion handling by training the model on longer sequences of frames and adjusting the positional encoding of spatial and object pointer memory. These changes let the Cutouts feature perform well even when the object being tracked is hidden or moves out of frame.

Working with PyTorch and production partners, we made several performance improvements targeting both inference speed and latency. On an NVIDIA H100 GPU, we increased model throughput by 1.8x and reduced end-to-end first-frame preview latency by 3x, ensuring people using the app have a responsive experience. We also shipped speed improvements to our SAM 2 open source repo on GitHub.

“Initially we thought we would need to pursue more aggressive methods to increase model efficiency, like quantization, but we were pleasantly surprised to see how effective Torch Inductor was at optimizing model throughput with a minimal amount of code modification,” says Joseph Greer, Research Scientist, Meta.
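
As a rough illustration of that approach, the snippet below applies torch.compile, which uses the TorchInductor backend by default, to the image encoder of the open source video predictor from the earlier sketch. Which modules were compiled for Edits, and with which options, isn’t detailed here, so treat the target attribute and compile settings as assumptions rather than the production recipe.

```python
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths, as in the earlier sketch.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

# torch.compile defaults to the TorchInductor backend. Compiling the image
# encoder (typically the heaviest per-frame component) is one low-effort way
# to raise throughput; the choice of module and options here is an assumption.
predictor.image_encoder = torch.compile(
    predictor.image_encoder,
    mode="max-autotune",  # let Inductor autotune kernel configurations
    fullgraph=True,       # require a single graph capture (no graph breaks)
)
```

The appeal of this route is that the model code itself stays unchanged: the compiler fuses and tunes kernels, while more invasive techniques like quantization remain available if further speedups are needed.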

With more people than ever now using the Segment Anything Model, the team is focused on their next big release: SAM 3. The next-generation model will be our first to automatically detect, segment, and track objects in images and videos using open-vocabulary text or click prompts, opening up new possibilities across industries, including image and video editing tools. If you’re interested in learning more, join our waitlist to receive the latest updates.

Join us in the pursuit of what’s possible with AI.