Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

September 27, 2023


Training text-to-image models with web-scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often struggle to generate highly aesthetic images, which creates the need for aesthetic alignment after pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve generation quality. We pre-train a latent diffusion model on 1.1 billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of 82.9% compared with its pre-trained-only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred on visual appeal 68.4% of the time on the standard PartiPrompts benchmark and 71.3% of the time on our Open User Input benchmark, which is based on real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.
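To make the win-rate figures above concrete, here is a minimal sketch of how a pairwise preference win rate could be computed from annotator votes. The evaluation protocol is not described on this page, so the per-prompt majority vote, the treatment of ties, and the names `majority_label` and `win_rate` are all assumptions made for illustration, not the authors' procedure.

```python
from collections import Counter


def majority_label(votes):
    """Return 'A', 'B', or 'tie' for one prompt.

    `votes` is a list of annotator choices, each 'A', 'B', or 'tie'.
    Assumption: the side with strictly more votes wins the prompt.
    """
    counts = Counter(votes)
    a, b = counts.get("A", 0), counts.get("B", 0)
    if a > b:
        return "A"
    if b > a:
        return "B"
    return "tie"


def win_rate(per_prompt_votes, model="A"):
    """Fraction of prompts on which `model` wins the majority vote.

    Assumption: tied prompts stay in the denominator, so the win rates
    of the two models need not sum to 1.
    """
    if not per_prompt_votes:
        raise ValueError("need at least one prompt")
    labels = [majority_label(v) for v in per_prompt_votes]
    return labels.count(model) / len(labels)


# Hypothetical votes for four prompts (not data from the paper).
votes = [
    ["A", "A", "B"],
    ["A", "B", "B"],
    ["A", "A", "A"],
    ["A", "B", "tie"],
]
print(win_rate(votes, "A"))  # 0.5
print(win_rate(votes, "B"))  # 0.25
```

Under this convention a reported win rate of 82.9% would mean the model was preferred on roughly 83 of every 100 prompts; a protocol that drops ties from the denominator would yield different numbers from the same votes.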


Written by

Xiaoliang Dai

Ji Hou

Kevin Chih-Yao Ma

Sam Tsai

Jialiang Wang

Rui Wang

Peizhao Zhang

Simon Vandenhende

Xiaofang Wang

Abhimanyu Dubey

Matthew Yu

Abhishek Kadian

Filip Radenovic

Dhruv Mahajan

Kunpeng Li

Yue (R) Zhao

Vladan Petrovic

Mitesh Kumar Singh

Simran Motwani

Yiwen Song

Yi Wen

Roshan Sumbaly

Vignesh Ramanathan

Zijian He

Peter Vajda

Devi Parikh



Research Topics

Computer Vision

Related Publications

June 17, 2024


Move Anything with Layered Scene Diffusion

Jiawei Ren, Frost Xu, Jerry Wu, Ziwei Liu, Tao Xiang, Antoine Toisoul

June 14, 2024


Decomposed evaluations of geographic disparities in text-to-image models

Abhishek Sureddy, Dishant Padalia, Nandhinee Periyakaruppa, Oindrila Saha, Adina Williams, Adriana Romero Soriano, Megan Richards, Polina Kirichenko, Melissa Hall

June 05, 2024


Cache Me if You Can: Accelerating Diffusion Models through Block Caching

Felix Wimbauer, Bichen Wu, Edgar Schoenfeld, Ji Hou, Zijian He, Artsiom Sanakoyeu, Peizhao Zhang, Sam Tsai, Jonas Kohler, Christian Rupprecht, Daniel Cramers, Peter Vajda, Jialiang Wang

May 06, 2024



Solving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient Viewpoint

Haoyue Tang, Tian Xie
