Interest and research in generative AI models have accelerated in recent months, driven by advances in natural language processing that let machines understand and express language, as well as by systems that can generate images from text input. Today, we’re showcasing CM3leon (pronounced like “chameleon”), a single foundation model that does both text-to-image and image-to-text generation.
CM3leon is the first multimodal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multitask supervised fine-tuning (SFT) stage. This recipe is simple, produces a strong model, and shows that tokenizer-based transformers can be trained as efficiently as existing generative diffusion-based models. CM3leon achieves state-of-the-art performance for text-to-image generation despite being trained with one-fifth of the compute used by previous transformer-based methods. CM3leon has the versatility and effectiveness of autoregressive models while maintaining low training costs and inference efficiency. It is a causal masked mixed-modal (CM3) model because it can generate sequences of text and images conditioned on arbitrary sequences of other image and text content. This greatly expands the functionality of previous models, which were either only text-to-image or only image-to-text.
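To make the "causal masked" idea more concrete, here is a minimal sketch of the span-moving transformation used by causally masked objectives in earlier CM3 work, as we understand it: a span is cut out of the sequence, replaced by a sentinel, and appended at the end, so a left-to-right model sees context on both sides before generating the span. The sentinel name `<mask>`, the helper names, and the way the span is sampled are assumptions made for illustration, not CM3leon's actual training code.

```python
import random

MASK = "<mask>"  # hypothetical sentinel token name


def causal_mask(tokens, start, length):
    """Move tokens[start:start+length] to the end of the sequence.

    The span is replaced in place by a sentinel, and the same sentinel
    followed by the removed span is appended, so a left-to-right model
    sees both sides of the gap before it has to generate the span.
    """
    if length <= 0 or start < 0 or start + length > len(tokens):
        raise ValueError("span must lie inside the sequence")
    span = tokens[start:start + length]
    return tokens[:start] + [MASK] + tokens[start + length:] + [MASK] + span


def random_causal_mask(tokens, rng):
    """Pick a random non-empty span and apply causal_mask to it."""
    length = rng.randint(1, len(tokens))
    start = rng.randint(0, len(tokens) - length)
    return causal_mask(tokens, start, length)


# Toy mixed-modal sequence: text tokens with three made-up image tokens.
example = ["a", "photo", "of", "<img_17>", "<img_4>", "<img_99>", "at", "dusk"]
masked = causal_mask(example, 3, 3)
print(masked)
```

In this toy example the three image tokens end up after the text on both sides of them, which is one way a single autoregressive model can be asked to fill in content conditioned on surrounding text and image tokens.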
Although text-only generative models are commonly multitask instruction-tuned on a wide range of different tasks to improve their ability to follow instruction prompts, image generation models are instead typically specialized for particular tasks. We apply large-scale multitask instruction tuning to CM3leon for both image and text generation, and show that it significantly improves performance on tasks such as image caption generation, visual question answering, text-based editing, and conditional image generation. This provides another strong example of how the scaling recipes developed for text-only models generalize directly to our tokenization-based image generation models.
When comparing performance on the most widely used image generation benchmark (zero-shot MS-COCO), CM3leon achieves an FID (Fréchet Inception Distance) score of 4.88, establishing a new state of the art in text-to-image generation and outperforming Google’s text-to-image model, Parti. This achievement underscores the potential of retrieval augmentation and highlights the impact of scaling strategies on the performance of autoregressive models. CM3leon also shows an impressive ability to generate complex compositional objects, such as the potted cactus with sunglasses and a hat in the examples below. CM3leon performs well across a variety of vision-language tasks, including visual question answering and long-form captioning. Even though it was trained on a dataset of only three billion text tokens, CM3leon’s zero-shot performance compares favorably against larger models trained on more extensive datasets.
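For readers unfamiliar with the metric, FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images, and lower is better. The symbols below are introduced here for illustration: \mu_r, \Sigma_r are the mean and covariance of the real-image features, and \mu_g, \Sigma_g those of the generated-image features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

The first term measures how far apart the two feature distributions are on average, and the trace term measures how much their spread differs.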
How CM3leon performs across tasks
With CM3leon’s capabilities, image generation tools can produce more coherent imagery that better follows the input prompts. For example, many image generation models struggle to recover both global shapes and local details. CM3leon performs strongly in this area. Here’s a look at CM3leon’s capabilities across a variety of tasks, all performed with a single model:
Text-guided image generation and editing
Image generation can be challenging when it comes to complex objects or when the prompt includes many constraints that must all be reflected in the output. Text-guided image editing (e.g. “change the color of the sky to bright blue”) is challenging because it requires the model to understand both textual instructions and visual content at the same time. CM3leon excels in both cases, as seen in the examples below.
Given prompt text with potentially highly compositional structure, generate a coherent image that follows the prompt. For example, the following four images were created for the prompts: (1) A small cactus wearing a straw hat and neon sunglasses in the Sahara desert. (2) A close-up photo of a human hand, hand model. High quality. (3) A raccoon main character in an Anime preparing for an epic battle with a samurai sword. Battle stance. Fantasy, Illustration. (4) A stop sign in a Fantasy style with the text “1991.”
Text-guided image editing
Given an image and a text prompt, edit the image according to the instructions in the text. Because of the generality of the CM3leon models, this was achieved with the same model as all of the other tasks above and below, unlike previous models such as InstructPix2Pix which are tuned only for text-guided image editing.
The CM3leon model can also follow a range of different prompts to generate short or long captions and answer questions about an image.
For example, imagine a dog carrying a stick.
Prompt Question: What is the dog carrying?
Model Generation: Stick
Prompt: Describe the given image in very fine detail.
Model Generation: In this image, there is a dog holding a stick in its mouth. There is grass on the surface. In the background of the image, there are trees.
We empirically evaluated our instruction-tuned models on various image caption generation and visual question answering tasks and compared them with previous state-of-the-art models. Even though CM3leon models saw significantly less text data (~3B tokens) than Flamingo (100B) and OpenFlamingo (40B), they match the zero-shot performance of OpenFlamingo on MS-COCO captioning and VQA2 question answering, and even beat Flamingo by nearly 10 points on the VizWiz task.
Structure-guided image editing
Structure-guided image editing involves understanding and interpreting not only textual instructions but also structural or layout information that’s provided as input. This enables CM3leon models to create visually coherent and contextually appropriate edits to an image while adhering to the given structure or layout guidelines.
Given a text description of the bounding box segmentation of the image, generate an image.
Given an image containing only the segmentation (with no text classes), generate an image. The input here denotes the image from which we extract the segmentation.
All of the generated images above show raw outputs from the CM3leon model. However, a common trick for image generation is to add a separately trained super-resolution stage to produce higher-resolution images from the original model outputs. This works very well with CM3leon too, as we show in the examples below for the text-to-image generation task.
Four example images for each of the prompts: (1) A steaming cup of coffee with mountains in the background. Resting during road trip. (2) Beautiful, majestic road during sunset. Aesthetic. (3) Small circular island in the middle of a lake. Forests surrounding the lake. High Contrast.
More examples for the prompts: (1) Turtle swimming underwater. Aesthetic. Fantasy. (2) Elephant swimming underwater. Aesthetic. Fantasy. (3) Flock of sheep. Aesthetic. Fantasy.
How we built CM3leon
CM3leon’s architecture uses a decoder-only transformer akin to well-established text-based models. What sets CM3leon apart is its ability to take both text and images as input and to generate both. This lets CM3leon handle the variety of tasks we shared above.
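One way a decoder-only transformer can read and write both modalities is to turn images into discrete tokens and place them in the same id space as text tokens. The sketch below illustrates that idea only. The vocabulary sizes, the boundary token, and the function names are hypothetical values chosen for the example, not details taken from CM3leon.

```python
TEXT_VOCAB_SIZE = 50_000   # hypothetical text vocabulary size
IMAGE_VOCAB_SIZE = 8_192   # hypothetical image codebook size
BREAK_ID = TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE  # hypothetical modality-boundary token


def image_code_to_token(code):
    """Shift an image codebook index past the text ids so the two never collide."""
    if not 0 <= code < IMAGE_VOCAB_SIZE:
        raise ValueError("image code out of range")
    return TEXT_VOCAB_SIZE + code


def build_sequence(segments):
    """Flatten ("text", ids) / ("image", codes) segments into one id sequence.

    A BREAK_ID is placed between consecutive segments.
    """
    sequence = []
    for index, (kind, values) in enumerate(segments):
        if index > 0:
            sequence.append(BREAK_ID)
        if kind == "text":
            sequence.extend(values)
        elif kind == "image":
            sequence.extend(image_code_to_token(code) for code in values)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


def modality_of(token_id):
    """Report which part of the shared id space a token id falls in."""
    if token_id == BREAK_ID:
        return "break"
    return "text" if token_id < TEXT_VOCAB_SIZE else "image"


# A made-up caption (three text ids) followed by a made-up image (three codes).
caption_then_image = build_sequence([("text", [11, 42, 7]), ("image", [0, 5, 8191])])
print(caption_then_image)
print([modality_of(t) for t in caption_then_image])
```

Because everything is a single stream of ids, the same next-token predictor can continue a text prompt with image tokens (text-to-image) or continue an image with text tokens (captioning or question answering).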
CM3leon’s training is retrieval-augmented, following our recent work, which greatly improves the efficiency and controllability of the resulting model. Finally, as described above, we performed instruction fine-tuning on a wide range of image and text generation tasks.
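As a rough illustration of what retrieval augmentation means in general, the sketch below looks up the documents most similar to a query in a small memory bank and prepends them to the query as extra context. The embedding vectors, the cosine-similarity ranking, the value of `k`, and all names are assumptions made for this example. The retriever used for CM3leon may select and order documents differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, memory, k=2):
    """Return the k memory entries most similar to the query, best first."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]


def augment(query_doc, query_vec, memory, k=2):
    """Prepend the retrieved documents to the query document as context."""
    return [item["doc"] for item in retrieve(query_vec, memory, k)] + [query_doc]


# Toy memory bank with made-up 2-D embeddings.
memory = [
    {"doc": "cactus photo + caption", "vec": [1.0, 0.0]},
    {"doc": "desert landscape + caption", "vec": [0.8, 0.6]},
    {"doc": "city street + caption", "vec": [0.0, 1.0]},
]
context = augment("a small cactus wearing a hat", [0.9, 0.1], memory, k=2)
print(context)
```

The model then sees the retrieved image-text documents in its context window before the query, so relevant examples can inform what it generates.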
As the AI industry continues to evolve, generative models like CM3leon are becoming increasingly sophisticated. These models learn the relationship between visuals and text by training on millions of example images, but they can also reflect any biases present in the training data. While the industry is still in its early stages of understanding and addressing these challenges, we believe that transparency will be key to accelerating progress.
As such, and as described in our paper, we’ve trained CM3leon using a licensed dataset. This demonstrates that strong performance is possible with a very different data distribution from what all previous models used. By making our work transparent, we hope to encourage collaboration and innovation in the field of generative AI. We believe that by working together, we can create models that are not only more accurate, but also more fair and equitable for everyone.
Paving the way for multimodal language models
With the goal of creating high-quality generative models, we believe CM3leon’s strong performance across a variety of tasks is a step toward higher-fidelity image generation and understanding. Models like CM3leon could ultimately help boost creativity and enable better applications in the metaverse. We look forward to exploring the boundaries of multimodal language models and releasing more models in the future.