CORE MACHINE LEARNING

SYSTEMS RESEARCH

Croissant: A Metadata Format for ML-Ready Datasets

December 12, 2024

Abstract

Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, enabling easy loading into the most commonly-used ML frameworks, regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, complete, yet concise.

Download the Paper

AUTHORS

Written by

Mubashara Akhtar

Omar Benjelloun

Costanza Conforti

Luca Foschini

Pieter Gijsbers

Joan Giner-Miguelez

Sujata Goswami

Nitisha Jain

Michalis Karamousadakis

Satyapriya Krishna

Michael Kuchnik

Sylvain Lesage

Quentin Lhoest

Pierre Marcenac

Manil Maskey

Peter Mattson

Luis Oala

Hamidah Oderinwale

Pierre Ruyssen

Tim Santos

Rajat Shinde

Elena Simperl

Arjun Suresh

Goeffry Thomas

Slava Tykhonov

Joaquin Vanschoren

Susheel Varma

Jos van der Velde

Steffen Vogler

Carole-Jean Wu

Luyao Zhang

Publisher

NeurIPS

Research Topics

Systems Research

Core Machine Learning

Related Publications

May 14, 2025

RESEARCH

CORE MACHINE LEARNING

UMA: A Family of Universal Models for Atoms

Brandon M. Wood, Misko Dzamba, Xiang Fu, Meng Gao, Muhammed Shuaibi, Luis Barroso-Luque, Kareem Abdelmaqsoud, Vahe Gharakhanyan, John R. Kitchin, Daniel S. Levine, Kyle Michel, Anuroop Sriram, Taco Cohen, Abhishek Das, Ammar Rizvi, Sushree Jagriti Sahoo, Zachary W. Ulissi, C. Lawrence Zitnick

May 14, 2025

May 14, 2025

HUMAN & MACHINE INTELLIGENCE

SPEECH & AUDIO

Emergence of Language in the Developing Brain

Linnea Evanson, Christine Bulteau, Mathilde Chipaux, Georg Dorfmüller, Sarah Ferrand-Sorbets, Emmanuel Raffo, Sarah Rosenberg, Pierre Bourdillon, Jean Remi King

May 14, 2025

April 04, 2025

NLP

CORE MACHINE LEARNING

Multi-Token Attention

Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar

April 04, 2025

February 28, 2025

SYSTEMS RESEARCH

Revisiting Reliability in Large-Scale Machine Learning Research Clusters

Apostolos Kokolis, Michael Kuchnik, John Hoffman, Adithya Kumar, Parth Malani, Faye Ma, Zachary DeVito, Shubho Sengupta, Kalyan Saladi, Carole-Jean Wu

February 28, 2025

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.