February 07, 2025
This paper presents BOUQuET, a multicentric and multi-register/domain dataset and benchmark, and its broader collaborative extension initiative. The dataset is handcrafted in non-English languages first. Each of these source languages is among the 23 languages commonly used by half of the world’s population and therefore has the potential to serve as a pivot language that enables more accurate translations. The dataset is specifically designed to avoid contamination and to be multicentric, so as to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation (MT) datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. BOUQuET is therefore especially suitable for the open initiative and call for translation participation that we are launching to extend it into a multi-way parallel corpus covering any written language.
Written by
The Omnilingual MT Team
Pierre Andrews
Mikel Artetxe
Mariano Coria Meglioli
Marta R. Costa-jussà
Joe Chuang
David Dale
Cynthia Gao
Jean Maillard
Alexandre Mourachko
Christophe Ropers
Safiyyah Saleem
Eduardo Sánchez
Yiannis Tsiamas
Arina Turkatenko
Albert Ventayol
Shireen Yates
Publisher
arXiv