GIM: Evaluating models via tasks that integrate multiple cognitive domains

May 18, 2026

Abstract

As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) in which difficulty comes from integration: individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, and the majority use rubric-decomposed scoring (median 6 independently judged criteria). A balanced public–private split provides a built-in contamination diagnostic. We calibrate a continuous-response two-parameter logistic (2PL) item response theory (IRT) model over more than 200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model × thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection, and that increasing thinking tokens yields diminishing marginal returns. We release the evaluation framework, calibrated IRT parameters, and all public problems.
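To make the IRT claim concrete, the following is a minimal illustrative sketch of how a 2PL ability estimate can stay stable when some responses are missing. It is not the paper's implementation: the item parameters, the score values, the function names, and the cross-entropy quasi-likelihood used for continuous scores are all assumptions chosen for illustration.

```python
import math

def expected_score(theta, a, b):
    """2PL item response function: expected score in [0, 1] for ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(scores, items, steps=500, lr=0.5):
    """Estimate theta by gradient ascent on a cross-entropy quasi-likelihood
    (an assumed objective). Scores are continuous in [0, 1]; missing
    responses (None) are skipped rather than counted as failures."""
    theta = 0.0
    for _ in range(steps):
        grad, n = 0.0, 0
        for s, (a, b) in zip(scores, items):
            if s is None:
                continue
            # d/dtheta of s*log(p) + (1-s)*log(1-p) is a * (s - p)
            grad += a * (s - expected_score(theta, a, b))
            n += 1
        if n == 0:
            break
        theta += lr * grad / n
    return theta

def raw_accuracy(scores):
    """Naive mean score that counts missing responses as zero."""
    return sum(0.0 if s is None else s for s in scores) / len(scores)

# Hypothetical calibrated items as (discrimination, difficulty) pairs.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0), (1.0, 2.0)]

# A stronger configuration (true theta = 1.0) that lost its two easiest items
# to errors, and a weaker one (true theta = 0.0) with complete data.
strong_scores = [None, None] + [expected_score(1.0, a, b) for a, b in items[2:]]
weak_scores = [expected_score(0.0, a, b) for a, b in items]

print("raw accuracy:", raw_accuracy(strong_scores), raw_accuracy(weak_scores))
print("IRT ability: ", estimate_ability(strong_scores, items),
      estimate_ability(weak_scores, items))
```

In this toy setup, raw accuracy ranks the stronger configuration below the weaker one because its missing easy items count as zeros, while the ability estimates recover the intended ordering because only observed responses enter the objective.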

AUTHORS

Written by

Rohit Patel

Alexandre Rezende

Steven McClain

Publisher

arXiv

Research Topics

Conversational AI

Natural Language Processing (NLP)

Core Machine Learning
