NLP

AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

February 10, 2026

Abstract

LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle (including idea generation, experiment analysis, and iterative refinement) without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.

AUTHORS

Written by

Alisia Lupidi

Bhavul Gauri

Thomas Simon Foster

Bassel Al Omari

Despoina Magka

Alberto Pepe

Alexis Audran-Reiss

Muna Aghamelu

Nicolas Baldwin

Lucia Cipolina-Kun

Jean-Christophe Gagnon-Audet

Chee Hau Leow

Sandra Lefdal

Hossam Mossalam

Abhinav Moudgil

Saba Nazir

Emanuel Tewolde

Isabel Urrego

Jordi Armengol-Estape

Amar Budhiraja

Gaurav Chaurasia

Abhishek Charnalia

Derek Dunfield

Karen Hambardzumyan

Daniel Izcovich

Martin Josifoski

Ishita Mediratta

Kelvin Niu

Parth Pathak

Michael Shvartsman

Edan Toledo

Anton Protopopov

Roberta Raileanu

Alexander Miller

Tatiana Shavrina

Jakob Foerster

Yoram Bachrach

Publisher

arXiv
