Croissant: A Metadata Format for ML-Ready Datasets

December 12, 2024

Abstract

Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, and enables easy loading into the most commonly used ML frameworks, regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, and complete, yet concise.
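The paper is the authoritative reference, but as a rough illustration of the workflow the abstract describes, the sketch below shows how a Croissant-described dataset can be loaded with the open-source mlcroissant reference library. The metadata URL and record-set name are illustrative placeholders, not values taken from the paper.

```python
# Minimal sketch: loading a Croissant dataset with the mlcroissant
# reference library (pip install mlcroissant). The URL and record set
# name are placeholders for illustration only.
import itertools

import mlcroissant as mlc

# Point the loader at a dataset's Croissant metadata (JSON-LD).
url = "https://example.org/my-dataset/croissant.json"  # illustrative URL
dataset = mlc.Dataset(jsonld=url)

# Croissant metadata describes one or more record sets; stream records
# from one of them, regardless of where the underlying files are hosted.
records = dataset.records(record_set="default")  # placeholder record set name
for record in itertools.islice(records, 5):
    print(record)
```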


AUTHORS

Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Luca Foschini, Pieter Gijsbers, Joan Giner-Miguelez, Sujata Goswami, Nitisha Jain, Michalis Karamousadakis, Satyapriya Krishna, Michael Kuchnik, Sylvain Lesage, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Hamidah Oderinwale, Pierre Ruyssen, Tim Santos, Rajat Shinde, Elena Simperl, Arjun Suresh, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Susheel Varma, Jos van der Velde, Steffen Vogler, Carole-Jean Wu, Luyao Zhang

Publisher

NeurIPS

Research Topics

Systems Research

Core Machine Learning
