November 17, 2023
Evaluating the factuality of long-form text generated by large language models (LMs) is nontrivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of people biographies generated by several state-of-the-art commercial LMs—InstructGPT, ChatGPT, and the retrievalaugmented PerplexityAI—and report new analysis demonstrating the need for such a finegrained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models.
Written by
Mike Lewis
Hannaneh Hajishirzi
Kalpesh Krishna
Mohit Iyyer
Pang Wei Koh
Sewon Min
Xinxi Lyu
Publisher
EMNLP
Research Topics
June 05, 2026
Anshumali Shrivastava, Jason Chen, Qi Ma, Zeyu Yang
June 05, 2026
May 20, 2026
Alvin W. M. Tan, Nicolas Hamilakis, Manel Khentout, Sho Tsuji, Balázs Kégl, Michael C. Frank, Angel Villar Corrales, Charles-Eric Saint-James, Dongyan Lin, Emmanuel Dupoux, Jiayi Shen, Juan Pino, Mahi Luthra, Martin Gleize, Phillip Rust, Rashel Moritz, Sheila Krogh-Jespersen, Surya Parimi, Tom Fizycki, Vanessa Stark, Yosuke Higuchi, Youssef Benchekroun
May 20, 2026
May 18, 2026
Alexandre Rezende, Rohit Patel, Steven McClain
May 18, 2026
May 12, 2026
Corentin Bel, Linnea Evanson, Julien Gadonneix, Andrea Santos Revilla, Mingfang (Lucy) Zhang, Julie Bonnaire, Charlotte Caucheteux, Alexandre Défossez, Théo Desbordes, Pablo Diego-Simón, Shubh Khanna, Juliette Millet, Pierre Orhan, Saarang Panchavati, Antoine Ratouchniak, Alexis Thual, Hubert Jacob Banville, Jarod Levy, Jean Remi King, Josephine Raugel, Jérémy Rapin, Katelyn Begany, Marlene Careil, Simon Dahan, Sophia Houhamdi, Stéphane d'Ascoli, Teon Brooks, Yohann Benchetrit
May 12, 2026

Our approach
Latest news
Foundational models