July 14, 2022
Creative expression is central to human connection, and using artificial intelligence (AI) to augment human creativity is a powerful use of technology — whether it’s by generating expressive avatars, animating children's drawings, creating new virtual worlds in the metaverse, or producing stunning digital artwork using just text descriptions.
It’s not enough for an AI system to just generate content, though. To realize AI’s potential to push creative expression forward, people should be able to shape and control the content a system generates. These systems should be intuitive and easy to use, so people can leverage whatever modes of expression work best for them, whether speech, text, gestures, eye movements, or even sketches, to bring their vision to life in whatever medium works best for them, including audio, images, animations, video, and 3D. Imagine creating beautiful impressionist paintings in compositions you envision without ever picking up a paintbrush. Or instantly generating imaginative storybook illustrations to accompany the words.
Today, we’re showcasing an exploratory AI research concept called Make-A-Scene that demonstrates AI’s potential for empowering anyone to bring their imagination to life. This multimodal generative AI method puts creative control in the hands of people who use it by allowing them to describe and illustrate their vision through both text descriptions and freeform sketches.
Prior state-of-the-art AI systems that generated awe-inspiring images primarily used a text description as input. But text prompts, like “a painting of a zebra riding a bike,” generate images with compositions that can be difficult to predict. The zebra might be on the left side of the image or the right, for example, or it might be much bigger than the bicycle or much smaller, or the zebra and bicycle may be facing the camera or facing sideways. As a result, the image may not reflect a person’s creative voice, and they may not feel a strong sense of pride and ownership over the content. If, for instance, you wanted to specify the relative size of the bicycle wheel, the orientation of the handlebars, and the width of the road, there would be no easy way to convey all of these elements using just a text description.
With Make-A-Scene, this is no longer the case. It demonstrates how people can use both text and simple drawings to convey their vision with greater specificity, using a variety of elements, forms, arrangements, depth, compositions, and structures.
We validated this premise using human evaluators. Each was shown two images generated by Make-A-Scene: one generated from only a text prompt, and one from both a sketch and a text prompt. The latter used the segmentation map of an image from a public dataset as the sketch. Both used the corresponding image caption as the text input. We found that the image generated from both text and sketch was almost always (99.54 percent of the time) rated as better aligned with the original sketch. It was often (66.3 percent of the time) more aligned with the text prompt too. This demonstrates that Make-A-Scene generations are indeed faithful to a person’s vision communicated via the sketch.
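To make the evaluation protocol concrete, here is a minimal sketch of how pairwise preferences like these can be tallied. The vote format and field names below are hypothetical, not Meta's actual evaluation data; they simply illustrate how per-criterion preference rates are computed from head-to-head judgments.

```python
# Hypothetical tally of pairwise human-evaluation votes comparing images
# generated from a text-only prompt versus text + sketch input.
# The vote format below is illustrative, not Meta's actual evaluation data.
from collections import Counter

# Each vote records which condition the evaluator preferred on each criterion.
votes = [
    {"sketch_alignment": "text+sketch", "text_alignment": "text+sketch"},
    {"sketch_alignment": "text+sketch", "text_alignment": "text_only"},
    {"sketch_alignment": "text+sketch", "text_alignment": "text+sketch"},
]

def preference_rate(votes, criterion, condition="text+sketch"):
    """Fraction of votes in which `condition` was preferred on `criterion`."""
    counts = Counter(vote[criterion] for vote in votes)
    return counts[condition] / len(votes)

print(f"Sketch alignment: {preference_rate(votes, 'sketch_alignment'):.1%}")
print(f"Text alignment:   {preference_rate(votes, 'text_alignment'):.1%}")
```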
Like other generative AI models, Make-A-Scene learns the relationship between visuals and text by training on millions of example images. Among other factors, the bias reflected in that training data affects the output of these models. The AI industry is still in the early days of understanding and addressing these challenges, and there’s a lot more work to be done. We believe transparency will accelerate progress toward addressing them. As a step toward promoting transparency in this line of research, we trained Make-A-Scene on publicly available datasets so the broader AI community can analyze, study, and understand the system’s existing biases.
Make-A-Scene uses a novel intermediate representation that captures the scene layout, which is what makes nuanced sketches possible as input. It can also generate its own scene layout from a text-only prompt, if that’s what the creator chooses. The model focuses on learning key aspects of the imagery that are more likely to matter to the creator, such as objects or animals. This technique helped improve generation quality, as measured by FID (Fréchet Inception Distance), a widely used metric that compares the statistics of generated images with those of real images; lower scores indicate more realistic generations.
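For readers who want to see how FID works in practice, the snippet below computes it with the open source torchmetrics library on random stand-in tensors. This is only an illustration of the metric itself, not the evaluation pipeline used for Make-A-Scene, and the batch sizes, image shapes, and feature dimension are arbitrary choices for the demo.

```python
# Illustrative FID computation with the torchmetrics library; this is not the
# evaluation pipeline used for Make-A-Scene, just an example of how the metric
# compares the feature statistics of real and generated images.
# (torchmetrics' FID relies on the torch-fidelity package for its InceptionV3
# feature extractor, so that package must be installed as well.)
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=64)  # small feature dim keeps the demo light

# Stand-in batches of uint8 RGB images, shape (N, 3, H, W); in real use these
# would be dataset images and model outputs rather than random noise.
real_images = torch.randint(0, 256, (128, 3, 128, 128), dtype=torch.uint8)
generated_images = torch.randint(0, 256, (128, 3, 128, 128), dtype=torch.uint8)

fid.update(real_images, real=True)        # accumulate real-image features
fid.update(generated_images, real=False)  # accumulate generated-image features

print(f"FID: {fid.compute():.2f}")  # lower scores indicate closer distributions
```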
So, how exactly would people use Make-A-Scene to bring their imaginations to life? As part of our research and development process, we’re sharing access to our Make-A-Scene demo with well-known AI artists, including Sofia Crespo, Scott Eaton, Alexander Reben, and Refik Anadol — all of whom have experience using state-of-the-art generative AI. We asked these artists to use Make-A-Scene as part of their creative process and to provide feedback along the way.
Crespo, for instance, is a generative artist focusing on the intersection between nature and technology. She’s interested in imagining artificial life forms that have never existed, and she used Make-A-Scene’s sketch and text prompts to create new hybrid creatures, like jellyfish in the shape of a flower. Using its freeform drawing capabilities, she found that she could iterate quickly across new ideas. “It’s going to help move creativity a lot faster and help artists work with interfaces that are more intuitive,” Crespo says.
Eaton is an artist, an educator, and a creative technologist whose work investigates contemporary situations and relationships with our technologies. He similarly leveraged Make-A-Scene as a way to deliberately compose scenes but still explore variations by experimenting with different prompts, like “skyscrapers sunken and decaying in the desert” to highlight the climate crisis.
Reben is an artist, researcher, and roboticist who says that having more control over the output helps get your artistic intent across. He incorporated the tool into his ongoing series focused on creating art in real life as described by AI systems. In this case, he took AI-generated text from another AI system, created a sketch to interpret that text, and then used both the sketch and the text as input for Make-A-Scene. “It made quite a difference to be able to sketch things in, especially to tell the system where you wanted things to give it suggestions of where things should go, but still be surprised at the end,” Reben says.
For media artist and director Refik Anadol, the tool was a way to prompt the imagination and explore uncharted territories. “I was prompting ideas, mixing and matching different worlds — you are literally dipping the brush in the mind of a machine and painting with machine consciousness,” he says.
The prototype tool is not just for people with a penchant for art. We believe it could help anyone better express themselves, including people without artistic skill sets. As a starting point, we’ve provided access on a limited basis to Meta employees who are testing and providing feedback about their experience with Make-A-Scene. Andy Boyatzis, a program manager at Meta, used Make-A-Scene to generate art with his young children, ages two and four. They used playful drawings to bring their ideas and imagination to life.
Trying new tools like Make-A-Scene is a fundamental way for our employees to stay connected to cutting-edge AI research at Meta and to influence how we refine exploratory concepts that could shape the generative AI tools we develop and release in the future.
Through scientific research and exploratory projects like Make-A-Scene, we believe we can expand the boundaries of creative expression — regardless of artistic ability. We want to make it as easy for people to bring their vision to life in the physical world and in the metaverse as it is to post across our apps today. This research endeavor is part of Meta’s commitment to exploring ways in which AI can empower creativity — whether that’s bringing your 2D sketches to life, using natural language and other modalities to create 3D objects, building entire virtual spaces, or any other creative project. It could one day enable entirely new forms of AI-powered expression, while putting creators and their vision at the center of the process — whether that’s an art director ideating on their next creative campaign, a social media influencer creating more personalized content, an author developing unique illustrations for their books and stories, or just someone sharing a fun, unique greeting for a friend’s birthday.
We’re making progress in this space, but this is just the beginning. We’ll continue to push the boundaries of what’s possible with this new class of generative creative tools, building methods for richer, more expressive messaging in 2D and 3D, and more general communication between people in mixed reality and virtual worlds.
Since the research paper was released, we’ve added a super-resolution network to Make-A-Scene that generates imagery at 2048 x 2048 pixels, 4x the previous resolution. We’re also continuously improving our generative AI models, and we aim to provide broader access to our research demos in the future to give more people the opportunity to be in control of their own creations and unlock entirely new forms of expression.
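As a rough illustration of what a learned 4x upsampler looks like structurally, here is a toy sub-pixel (PixelShuffle) module in PyTorch. Meta has not published the architecture of Make-A-Scene’s super-resolution network in this post, so this sketch is generic and purely illustrative; the module name and channel count are arbitrary.

```python
# A generic 4x super-resolution block sketched in PyTorch. This is NOT
# Make-A-Scene's actual super-resolution network, only a toy illustration
# of how a learned upsampler quadruples the side length of an image.
import torch
from torch import nn

class ToyUpsampler4x(nn.Module):
    """Two stacked 2x sub-pixel (PixelShuffle) stages give a 4x upscale."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1),
            nn.PixelShuffle(2),   # (H, W) -> (2H, 2W)
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1),
            nn.PixelShuffle(2),   # (2H, 2W) -> (4H, 4W)
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# A small input keeps the demo cheap; at full scale, upsampling 4x per side
# takes a 512 x 512 image to 2048 x 2048.
image = torch.rand(1, 3, 64, 64)
print(ToyUpsampler4x()(image).shape)  # torch.Size([1, 3, 256, 256])
```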
In the meantime, check out the wide range of fascinating outputs from Make-A-Scene below. And you can catch our oral presentation at this year’s ECCV conference, held in Tel Aviv on October 23–27, 2022.
This blog post reflects the research contributions of Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman.
We would also like to acknowledge Kelly Freed, Deb Banerji, Somya Jain, Sasha Sheng, Maria Ruiz, and Aran Mun. Thank you for your contributions!