Research
How Meta Movie Gen could usher in a new AI-enabled era for content creators
October 4, 2024

Whether a person is an aspiring filmmaker hoping to make it in Hollywood or a creator who enjoys making videos for their audience, we believe everyone should have access to tools that help enhance their creativity. Today, we’re excited to premiere Meta Movie Gen, our breakthrough generative AI research for media, which includes modalities like image, video, and audio. Our latest research demonstrates how you can use simple text inputs to produce custom videos and sounds, edit existing videos, and transform your personal image into a unique video. Movie Gen outperforms similar models in the industry across these tasks when evaluated by humans.

This work is part of our long and proven track record of sharing fundamental AI research with the community. Our first wave of generative AI work started with the Make-A-Scene series of models that enabled the creation of image, audio, video, and 3D animation. With the advent of diffusion models, we had a second wave of work with Llama Image foundation models, which enabled higher quality generation of images and video, as well as image editing. Movie Gen is our third wave, combining all of these modalities and enabling further fine-grained control for the people who use the models in a way that’s never before been possible. Similar to previous generations, we anticipate these models enabling various new products that could accelerate creativity.

While there are many exciting use cases for these foundation models, it’s important to note that generative AI isn’t a replacement for the work of artists and animators. We’re sharing this research because we believe in the power of this technology to help people express themselves in new ways and to provide opportunities to people who might not otherwise have them. Our hope is that perhaps one day in the future, everyone will have the opportunity to bring their artistic visions to life and create high-definition videos and audio using Movie Gen.

Behind the curtain

As the most advanced and immersive storytelling suite of models, Movie Gen has four capabilities: video generation, personalized video generation, precise video editing, and audio generation. We’ve trained these models on a combination of licensed and publicly available datasets. While we’re sharing more technical detail in our research paper, we’re excited to share in this blog post how each of these capabilities performs.


Video generation

Given a text prompt, we use a single joint model, optimized for both text-to-image and text-to-video, to create high-quality, high-definition images and videos. This 30B-parameter transformer can generate videos up to 16 seconds long at 16 frames per second. We find that these models can reason about object motion, subject-object interactions, and camera motion, and that they learn plausible motions for a wide variety of concepts, making them state-of-the-art models in their category.
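To make those numbers concrete, here is a minimal sketch of the text-to-video contract, assuming a hypothetical helper function and a 1080p output resolution; the function name, signature, and resolution are illustrative assumptions, and only the 16-second length at 16 frames per second comes from the research described here.

```python
# Hypothetical sketch only: the function is a stand-in, not a released API.
def text_to_video_shape(prompt: str, duration_s: float = 16.0, fps: int = 16,
                        height: int = 1080, width: int = 1920):
    """Return the (frames, height, width, channels) shape a clip generated
    for `prompt` would have at the stated duration and frame rate."""
    num_frames = int(duration_s * fps)  # 16 s * 16 fps = 256 frames
    return (num_frames, height, width, 3)

print(text_to_video_shape("A red panda rows a canoe across a misty lake"))
# (256, 1080, 1920, 3)
```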


Personalized videos

We also expanded the foundation model above to support personalized video generation. The model takes a person's image as input and combines it with a text prompt to generate a video that features the reference person along with rich visual details informed by the prompt. Our model achieves state-of-the-art results at creating personalized videos that preserve human identity and motion.
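As an illustration of the inputs involved, here is a minimal sketch of a personalized-generation request, assuming a hypothetical container type; the class and field names are assumptions for illustration, not a released interface.

```python
# Hypothetical request structure: one reference image of a person plus a prompt.
from dataclasses import dataclass

@dataclass
class PersonalizedVideoRequest:
    reference_image_path: str  # photo of the person whose identity is preserved
    prompt: str                # text describing the scene and the person's action

request = PersonalizedVideoRequest(
    reference_image_path="portrait.jpg",  # placeholder path for illustration
    prompt="the person paints a landscape in a sunlit studio",
)
print(request.prompt)
```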


Precise video editing

The editing variant of the same foundation model takes both video and text prompt as input, executing tasks with precision to generate the desired output. It combines video generation with advanced image editing, performing localized edits like adding, removing, or replacing elements, and global changes such as background or style modifications. Unlike traditional tools that require specialized skills or generative ones that lack precision, Movie Gen preserves the original content, targeting only the relevant pixels.
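To illustrate the idea of targeting only the relevant pixels, here is a minimal sketch, assuming the edit can be expressed as a per-frame mask; the mask-and-blend step is a stand-in for illustration, not Movie Gen's actual mechanism.

```python
import numpy as np

def apply_localized_edit(original: np.ndarray, edited: np.ndarray,
                         mask: np.ndarray) -> np.ndarray:
    """Blend an edited clip into the original, leaving untouched pixels intact.

    original, edited: (T, H, W, 3) clips of the same shape
    mask:             (T, H, W) boolean array marking the region to replace
    """
    out = original.copy()
    out[mask] = edited[mask]
    return out

# Toy example: replace a small square region in every frame of a 4-frame clip.
original = np.zeros((4, 8, 8, 3), dtype=np.uint8)
edited = np.full((4, 8, 8, 3), 255, dtype=np.uint8)
mask = np.zeros((4, 8, 8), dtype=bool)
mask[:, 2:5, 2:5] = True
result = apply_localized_edit(original, edited, mask)
print(result[0, 3, 3], result[0, 0, 0])  # [255 255 255] [0 0 0]
```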


Audio generation

Finally, we trained a 13B parameter audio generation model that can take a video and optional text prompts and generate high-quality and high-fidelity audio up to 45 seconds, including ambient sound, sound effects (Foley), and instrumental background music—all synced to the video content. Further, we introduce an audio extension technique that can generate coherent audio for videos of arbitrary lengths—overall achieving state-of-the-art performance in audio quality, video-to-audio alignment, and text-to-audio alignment.
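The paper describes the audio extension technique in detail; as a rough illustration of the general segment-and-stitch idea, here is a sketch that joins fixed-length segments with a short cross-fade. The sample rate, overlap, and stitching scheme are assumptions for illustration, not the method from the paper.

```python
import numpy as np

SAMPLE_RATE = 48_000   # assumed audio sample rate
OVERLAP_S = 2.0        # assumed overlap between consecutive segments

def crossfade(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Join two waveforms, fading `a` out and `b` in over `overlap` samples."""
    fade = np.linspace(1.0, 0.0, overlap)
    mixed = a[-overlap:] * fade + b[:overlap] * (1.0 - fade)
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])

def stitch_segments(segments: list[np.ndarray],
                    overlap_s: float = OVERLAP_S) -> np.ndarray:
    """Stitch per-segment audio (each covering at most ~45 s of video, generated
    with overlapping context) into one track for an arbitrarily long video."""
    overlap = int(overlap_s * SAMPLE_RATE)
    track = segments[0]
    for seg in segments[1:]:
        track = crossfade(track, seg, overlap)
    return track

# Toy example: three 45-second segments become one ~131-second track.
segments = [np.random.randn(45 * SAMPLE_RATE) for _ in range(3)]
print(stitch_segments(segments).shape[0] / SAMPLE_RATE)  # 131.0 seconds
```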


Results

Building these foundation models required multiple technical innovations in architecture, training objectives, data recipes, evaluation protocols, and inference optimization.

Below, we present A/B human evaluation comparisons across our four capabilities. A positive net win rate means that human evaluators preferred our model's results over those of competing industry models. For further details and evaluations, please refer to our paper.
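For reference when reading these comparisons, a net win rate can be computed from pairwise judgments as the share of comparisons won minus the share lost; the exact protocol is defined in the paper, but a minimal sketch of this common definition looks like the following.

```python
def net_win_rate(wins: int, losses: int, ties: int = 0) -> float:
    """Share of A/B judgments won minus share lost (ties count in the total)."""
    total = wins + losses + ties
    return (wins - losses) / total if total else 0.0

# e.g. 600 wins, 300 losses, 100 ties out of 1,000 judgments -> +0.30 net win rate
print(net_win_rate(600, 300, 100))  # 0.3
```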



While the research we're sharing today shows tremendous potential for future applications, we acknowledge that our current models have limitations. Notably, there are many optimizations that could further decrease inference time, and scaling the models up could improve their quality.

The road ahead

As we continue to improve our models and move toward a potential future release, we’ll work closely with filmmakers and creators to integrate their feedback. By taking a collaborative approach, we want to ensure we’re creating tools that help people enhance their inherent creativity in new ways they may never have dreamed possible. Imagine animating a “day in the life” video to share on Reels and editing it using text prompts, or creating a customized animated birthday greeting for a friend and sending it to them on WhatsApp. With creativity and self-expression taking charge, the possibilities are infinite.

