AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

February 10, 2026

Abstract

LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle, including idea generation, experiment analysis, and iterative refinement, without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.

AUTHORS

Alisia Lupidi

Bhavul Gauri

Thomas Simon Foster

Bassel Al Omari

Despoina Magka

Alberto Pepe

Alexis Audran-Reiss

Muna Aghamelu

Nicolas Baldwin

Lucia Cipolina-Kun

Jean-Christophe Gagnon-Audet

Chee Hau Leow

Sandra Lefdal

Hossam Mossalam

Abhinav Moudgil

Saba Nazir

Emanuel Tewolde

Isabel Urrego

Jordi Armengol-Estape

Amar Budhiraja

Gaurav Chaurasia

Abhishek Charnalia

Derek Dunfield

Karen Hambardzumyan

Daniel Izcovich

Martin Josifoski

Ishita Mediratta

Kelvin Niu

Parth Pathak

Michael Shvartsman

Edan Toledo

Anton Protopopov

Roberta Raileanu

Alexander Miller

Tatiana Shavrina

Jakob Foerster

Yoram Bachrach

Publisher

arXiv

Research Topics

Natural Language Processing (NLP)
