NLP

BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation

February 07, 2025

Abstract

This paper presents BOUQuET, a multicentric and multi-register/domain dataset and benchmark, and its broader collaborative extension initiative. This dataset is handcrafted in non-English languages first, each of these source languages being represented among the 23 languages commonly used by half of the world’s population and therefore having the potential to serve as pivot languages that will enable more accurate translations. The dataset is specially designed to avoid contamination and be multicentric, so as to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation (MT) datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. Therefore, BOUQuET is specially suitable for the open initiative and call for translation participation that we are launching to extend it to a multi-way parallel corpus to any written language

Download the Paper

AUTHORS

Written by

The Omnilingual MT Team

Pierre Andrews

Mikel Artetxe

Mariano Coria Meglioli

Marta R. Costa-jussa

Joe Chuang

David Dale

Cynthia Gao

Jean Maillard

Alexandre Mourachko

Christophe Ropers

Safiyyah Saleem

Eduardo Sánchez

Yiannis Tsiamas

Arina Turkatenko

Albert Ventayol

Shireen Yates

Publisher

arXiv

Related Publications

December 12, 2025

NLP

COMPUTER VISION

Text-Guided Semantic Image Encoder

Raghuveer Thirukovalluru, Xiaochuang Han, Bhuwan Dhingra, Emily Dinan, Maha Elbayad

December 12, 2025

November 10, 2025

RESEARCH

SPEECH & AUDIO

Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages

Omnilingual ASR team, Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, Kevin Chan, Chierh Cheng, Joe Chuang, Caley Drooff, Mark Duppenthaler, Paul-Ambroise Duquenne, Alexander Erben, Cynthia Gao, Gabriel Mejia Gonzalez, Kehan Lyu, Sagar Miglani, Vineel Pratap, Kaushik Ram Sadagopan, Safiyyah Saleem, Arina Turkatenko, Albert Ventayol-Boada, Zheng-Xin Yong, Yu-An Chung, Jean Maillard, Rashel Moritz, Alexandre Mourachko, Mary Williamson, Shireen Yates

November 10, 2025

October 19, 2025

RESEARCH

NLP

Controlling Multimodal LLMs via Reward-guided Decoding

Oscar Mañas, Pierluca D'Oro, Koustuv Sinha, Adriana Romero Soriano, Michal Drozdzal, Aishwarya Agrawal

October 19, 2025

October 13, 2025

REINFORCEMENT LEARNING

RESEARCH

SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu

October 13, 2025

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.