The field of generative AI is rapidly evolving, showing remarkable potential to augment human creativity and self-expression. In 2022, we made the leap from image generation to video generation in the span of a few months. And at this year’s Meta Connect, we announced several new developments, including Emu, our first foundational model for image generation. Technology from Emu underpins many of our generative AI experiences, including some AI image editing tools for Instagram that let you take a photo and change its visual style or background, and the Imagine feature within Meta AI, which lets you generate photorealistic images directly in messages with that assistant or in group chats across our family of apps. Our work in this exciting field is ongoing, and today we’re announcing two pieces of new research: controlled image editing based solely on text instructions, and a method for text-to-video generation based on diffusion models.
Emu Video: A simple factorized method for high-quality video generation
Whether or not you’ve personally used an AI image generation tool, you’ve likely seen the results: Visually distinct, often highly stylized and detailed, these images on their own can be quite striking—and the impact increases when you bring them to life by adding movement.
With Emu Video, which leverages our Emu model, we present a simple method for text-to-video generation based on diffusion models. This is a unified architecture for video generation tasks that can respond to a variety of inputs: text only, image only, and both text and image. We’ve split the process into two steps: first, generating images conditioned on a text prompt, and then generating video conditioned on both the text and the generated image. This “factorized” or split approach to video generation lets us train video generation models efficiently. We show that factorized video generation can be implemented via a single diffusion model. We present critical design decisions, like adjusting noise schedules for video diffusion, and multi-stage training that allows us to directly generate higher-resolution videos.
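The two-step factorization described above can be sketched in a few lines. This is only an illustration of the control flow, not Emu Video itself: `generate_image`, `generate_video`, and `text_to_video` are hypothetical names, the "models" are random placeholders, and the tiny 8x8 grayscale grid stands in for a real image.

```python
import random
from dataclasses import dataclass
from typing import List

Image = List[List[float]]  # H x W grayscale grid, a stand-in for a real image


@dataclass
class Video:
    frames: List[Image]
    fps: int


def generate_image(prompt: str, size: int = 8) -> Image:
    """Step 1 (placeholder): produce an image conditioned on the text prompt.

    A real system would run a text-to-image diffusion model here. This stub
    only seeds a random generator from the prompt, so the same prompt always
    yields the same image.
    """
    rng = random.Random(prompt)
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def generate_video(prompt: str, first_frame: Image,
                   seconds: int = 4, fps: int = 16) -> Video:
    """Step 2 (placeholder): produce a video conditioned on text AND an image.

    The conditioning image becomes the first frame; small random drift,
    clamped to [0, 1], stands in for learned motion.
    """
    rng = random.Random(prompt + "|video")
    frames = [first_frame]
    for _ in range(seconds * fps - 1):
        prev = frames[-1]
        frames.append([
            [min(1.0, max(0.0, p + rng.uniform(-0.05, 0.05))) for p in row]
            for row in prev
        ])
    return Video(frames=frames, fps=fps)


def text_to_video(prompt: str) -> Video:
    """Factorized generation: text -> image, then (text, image) -> video."""
    image = generate_image(prompt)
    return generate_video(prompt, image)


video = text_to_video("a corgi surfing a wave")
print(len(video.frames), "frames at", video.fps, "fps")
```

Because the second step takes an image as input, the same function also covers the image-plus-text case: calling `generate_video` directly with a user-provided image skips the first step.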
Unlike prior work that requires a deep cascade of models (e.g., five models for Make-A-Video), our state-of-the-art approach is simple to implement and uses just two diffusion models to generate 512x512 four-second-long videos at 16 frames per second. In human evaluations, our video generations are strongly preferred over prior work: this model was preferred over Make-A-Video by 96% of respondents based on quality and by 85% of respondents based on faithfulness to the text prompt. Finally, the same model can “animate” user-provided images based on a text prompt, where it once again sets a new state of the art, outperforming prior work by a significant margin.
Emu Edit: Precise image editing via recognition and generation tasks
Of course, using generative AI is often an iterative process. You try a prompt, the generated image isn’t quite what you had in mind, so you keep tweaking the prompt until you get closer to the outcome you wanted. That’s why prompt engineering has become a thing. And while instructable image generation models have made significant strides in recent years, they still face limitations when it comes to offering precise control. That’s why we’re introducing Emu Edit, a novel approach that aims to streamline various image manipulation tasks and bring enhanced capabilities and precision to image editing.
Emu Edit is capable of free-form editing through instructions, encompassing tasks such as local and global editing, removing and adding a background, color and geometry transformations, detection and segmentation, and more. Current methods often lean towards either over-modifying or under-performing on various editing tasks. We argue that the primary objective shouldn’t just be about producing a “believable” image. Instead, the model should focus on precisely altering only the pixels relevant to the edit request. Unlike many generative AI models today, Emu Edit precisely follows instructions, ensuring that pixels in the input image unrelated to the instructions remain untouched. For instance, when adding the text “Aloha!” to a baseball cap, the cap itself should remain unchanged.
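One way to make the "unrelated pixels remain untouched" goal concrete is to measure how many pixels outside the intended edit region changed. The sketch below is an illustrative check only, not the evaluation used for Emu Edit. The function name `changed_outside_mask` and the binary mask marking the edit region are assumptions introduced here.

```python
from typing import List

Grid = List[List[int]]  # H x W grid of pixel values (or 0/1 flags for a mask)


def changed_outside_mask(before: Grid, after: Grid, mask: Grid) -> float:
    """Return the fraction of pixels outside the edit region that changed.

    mask[y][x] == 1 marks pixels the edit is allowed to touch;
    mask[y][x] == 0 marks pixels that should stay identical.
    A result of 0.0 means the edit left everything else untouched.
    """
    outside = 0
    changed = 0
    for row_before, row_after, row_mask in zip(before, after, mask):
        for b, a, m in zip(row_before, row_after, row_mask):
            if m == 0:
                outside += 1
                if a != b:
                    changed += 1
    return changed / outside if outside else 0.0


before = [[0, 0], [0, 0]]
mask = [[1, 0], [0, 0]]             # only the top-left pixel may change
precise_edit = [[9, 0], [0, 0]]     # touches only the allowed pixel
over_edit = [[9, 0], [0, 5]]        # also alters an unrelated pixel

print(changed_outside_mask(before, precise_edit, mask))
print(changed_outside_mask(before, over_edit, mask))
```

In the baseball cap example, the mask would cover the area where “Aloha!” is written, and a precise edit would score 0.0 on everything else.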
Our key insight is that incorporating computer vision tasks as instructions to image generation models offers unprecedented control in image generation and editing. Through a detailed examination of both local and global editing tasks, we highlight the vast potential of Emu Edit in executing detailed edit instructions.
To train the model, we’ve developed a dataset that contains 10 million synthesized samples, each including an input image, a description of the task to be performed, and the targeted output image. We believe it’s the largest dataset of its kind to date. As a result, our model achieves unprecedented edit results in terms of both instruction faithfulness and image quality. In our evaluations, Emu Edit demonstrates superior performance over current methods, producing new state-of-the-art results in both qualitative and quantitative evaluations for a range of image editing tasks.
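The three parts of each training sample map naturally onto a small record type. The sketch below is a hypothetical representation for illustration: the class name `EditSample`, its field names, and the use of file paths are assumptions, since the post does not describe the dataset's actual storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditSample:
    """One training example: input image, task description, target image."""
    input_image_path: str   # the image before editing
    instruction: str        # description of the task to be performed
    target_image_path: str  # the intended result after editing

    def is_complete(self) -> bool:
        """True when all three parts of the sample are present."""
        return all(
            part.strip()
            for part in (self.input_image_path, self.instruction,
                         self.target_image_path)
        )


sample = EditSample(
    input_image_path="cap.png",             # hypothetical file name
    instruction='Add the text "Aloha!" to the baseball cap',
    target_image_path="cap_aloha.png",      # hypothetical file name
)
print(sample.is_complete())
```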
The road ahead
Although this work is purely fundamental research right now, the potential use cases are clear. Imagine generating your own animated stickers or clever GIFs on the fly to send in the group chat rather than having to search for the perfect media for your reply. Or editing your own photos and images, no technical skills required. Or adding some extra oomph to your Instagram posts by animating static photos. Or generating something entirely new.
While certainly no replacement for professional artists and animators, Emu Video, Emu Edit, and new technologies like them could help people express themselves in new ways, whether that’s an art director ideating on a new concept, a creator livening up their latest reel, or a best friend sharing a unique birthday greeting. And we think that’s something worth celebrating.