February 07, 2025
This paper presents BOUQuET, a multicentric and multi-register/domain dataset and benchmark, and its broader collaborative extension initiative. This dataset is handcrafted in non-English languages first, each of these source languages being represented among the 23 languages commonly used by half of the world’s population and therefore having the potential to serve as pivot languages that will enable more accurate translations. The dataset is specially designed to avoid contamination and be multicentric, so as to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation (MT) datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. Therefore, BOUQuET is specially suitable for the open initiative and call for translation participation that we are launching to extend it to a multi-way parallel corpus to any written language
Written by
The Omnilingual MT Team
Pierre Andrews
Mikel Artetxe
Mariano Coria Meglioli
Marta R. Costa-jussa
Joe Chuang
David Dale
Cynthia Gao
Jean Maillard
Alexandre Mourachko
Christophe Ropers
Safiyyah Saleem
Eduardo Sánchez
Yiannis Tsiamas
Arina Turkatenko
Albert Ventayol
Shireen Yates
Publisher
arXiv
Research Topics
February 06, 2025
Jarod Levy, Mingfang (Lucy) Zhang, Svetlana Pinet, Jérémy Rapin, Hubert Jacob Banville, Stéphane d'Ascoli, Jean Remi King
February 06, 2025
February 06, 2025
Mingfang (Lucy) Zhang, Jarod Levy, Stéphane d'Ascoli, Jérémy Rapin, F.-Xavier Alario, Pierre Bourdillon, Svetlana Pinet, Jean Remi King
February 06, 2025
January 04, 2025
January 04, 2025
December 17, 2024
Jack Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Scott Yih, Xilun Chen
December 17, 2024
Foundational models
Our approach
Latest news
Foundational models