February 10, 2026
LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle (including idea generation, experiment analysis, and iterative refinement) without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.
Written by
Alisia Lupidi
Bhavul Gauri
Thomas Simon Foster
Bassel Al Omari
Despoina Magka
Alberto Pepe
Alexis Audran-Reiss
Muna Aghamelu
Nicolas Baldwin
Lucia Cipolina-Kun
Jean-Christophe Gagnon-Audet
Chee Hau Leow
Sandra Lefdal
Hossam Mossalam
Abhinav Moudgil
Saba Nazir
Emanuel Tewolde
Isabel Urrego
Jordi Armengol-Estape
Amar Budhiraja
Gaurav Chaurasia
Abhishek Charnalia
Derek Dunfield
Karen Hambardzumyan
Daniel Izcovich
Martin Josifoski
Ishita Mediratta
Kelvin Niu
Parth Pathak
Michael Shvartsman
Edan Toledo
Anton Protopopov
Roberta Raileanu
Tatiana Shavrina
Jakob Foerster
Yoram Bachrach
Publisher
arXiv
