December 12, 2024
As the use of text-to-image generative models increases, so does the adoption of automatic benchmarking methods used in their evaluation. However, while metrics and datasets abound, there are few unified benchmarking libraries that provide a framework for performing evaluations across many datasets and metrics. Furthermore, the rapid introduction of increasingly robust benchmarking methods requires that evaluation libraries remain flexible to new datasets and metrics. Finally, there remains a gap in synthesizing evaluations in order to deliver actionable takeaways about model performance. To enable unified, flexible, and actionable evaluations, we introduce EvalGIM (pronounced “EvalGym”), a library for evaluating generative image models. EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency of text-to-image generative models. In addition, EvalGIM is designed with flexibility for user customization as a top priority and contains a structure that allows plug-and-play additions of new datasets and metrics. To enable actionable evaluation insights, we introduce “Evaluation Exercises” that highlight takeaways for specific evaluation questions. The Evaluation Exercises contain easy-to-use and reproducible implementations of two state-of-the-art evaluation methods of text-to-image generative models: consistency-diversity-realism Pareto Fronts and disaggregated measurements of performance disparities across groups. EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models: robustness analyses of model rankings and balanced evaluations across different prompt styles. In this paper, we outline the EvalGIM library and provide guidance for how others can add new datasets, metrics, and visualizations to customize the library for their own use cases. 
We also demonstrate the utility of EvalGIM by using its Evaluation Exercises to explore several research questions about text-to-image generative models, such as the role of re-captioning training data or the relationship between quality and diversity in early training stages. We encourage text-to-image model exploration with EvalGIM and invite contributions at https://github.com/facebookresearch/EvalGIM/.
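To make the idea of a consistency-diversity-realism Pareto front concrete, here is a minimal sketch of how one could identify the non-dominated models from a table of scores. It is not EvalGIM's implementation or API: the function names `dominates` and `pareto_front`, the model names, and the score values are all hypothetical, and the sketch assumes every metric has been oriented so that higher is better (a lower-is-better metric would be negated first).

```python
from typing import Dict, List, Tuple

# Assumed layout: (consistency, diversity, realism), each oriented so higher is better.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if `a` is at least as good as `b` on every metric
    and strictly better on at least one."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(models: Dict[str, Scores]) -> List[str]:
    """Return the sorted names of models that no other model dominates."""
    front = []
    for name, scores in models.items():
        dominated = any(
            dominates(other_scores, scores)
            for other_name, other_scores in models.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Illustrative, made-up scores (not results from the paper).
scores = {
    "model_a": (0.80, 0.60, 0.70),
    "model_b": (0.75, 0.70, 0.65),
    "model_c": (0.70, 0.55, 0.60),
}

print(pareto_front(scores))
```

In this toy example `model_c` is worse than `model_a` on all three metrics and so falls off the front, while `model_a` and `model_b` trade consistency against diversity and both remain on it.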
Written by
Reyhane Askari
Pietro Astolfi
Tariq Berrada Ifriqi
Marton Havasi
Yohann Benchetrit
Karen Ullrich
Carolina Braga
Abhishek Charnalia
Maeve Ryan
Michal Drozdzal
Jakob Verbeek
Publisher
arXiv
