March 17, 2026
Cross-lingual sentence encoders have traditionally been limited to a few hundred languages and have sacrificed downstream performance to achieve better alignment across languages, limiting their adoption. In this work, we introduce OmniSONAR, a novel family of omnilingual, cross-lingual and cross-modal sentence embedding models that breaks this barrier. We establish a unified semantic space, natively encompassing text, speech, code and mathematical expressions, while achieving state-of-the-art downstream performance for an unprecedented scale of thousands of languages, from high-resource languages to extremely low-resource varieties. To achieve this scale without representation collapse and while maintaining top-tier performance in the high-resource languages, we employ a progressive training strategy. We first build a state-of-the-art foundational embedding space for 200 languages using an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Leveraging this strong foundational space, we expand to several thousand language varieties via a specialized two-stage teacher-student encoder distillation framework. Further modeling extensions derived from OmniSONAR address long-context inputs and token-centric representations. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR redefines the state of the art for multilingual representation learning. It halves the cross-lingual similarity search error rate of the previous best models on the 200 languages of FLORES, while also achieving a staggering 15-fold error rate reduction across 1,560 languages in the BIBLE benchmark.
Furthermore, our embedding model enables unprecedented translation capabilities, outperforming NLLB-3B on several multilingual benchmarks, and surpassing all previous models, including multi-billion-parameter LLMs, by 15 chrF++ points in 1,560→English translation in the BIBLE benchmark. Beyond alignment and translation, OmniSONAR demonstrates strong general-purpose capabilities across downstream embedding tasks on MTEB and programming languages on XLCoST. For the speech modality, our massively multilingual extension exhibits a 43% lower error rate in cross-lingual and cross-modal similarity search, while achieving 97% of SeamlessM4T performance in speech-to-text translation, despite being a zero-shot translation model trained only with ASR data. Finally, by training an encoder-decoder language model, Spectrum, exclusively on English text that processes OmniSONAR sequences, we unlock immediate high-performance transfer to thousands of languages and the speech modality for complex downstream tasks. These outstanding results position OmniSONAR as a robust, language- and modality-agnostic foundation for any downstream usage.
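To make the headline metric concrete, the following is a minimal sketch of how a cross-lingual similarity search error rate can be computed over aligned sentence embeddings. It assumes plain cosine similarity and nearest-neighbor retrieval; the abstract does not specify the scoring function used in the paper (margin-based scoring is a common alternative), and the function names and toy vectors below are hypothetical illustrations, not part of OmniSONAR.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_search_error_rate(src, tgt):
    """Fraction of source embeddings whose nearest target (by cosine
    similarity) is not the aligned target at the same index.

    Assumption: src[i] and tgt[i] are embeddings of translations of the
    same sentence, so a correct retrieval returns index i.
    """
    errors = 0
    for i, s in enumerate(src):
        scores = [cosine(s, t) for t in tgt]
        best = max(range(len(tgt)), key=scores.__getitem__)
        if best != i:
            errors += 1
    return errors / len(src)


# Toy 2-D embeddings (hypothetical values, for illustration only).
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.2, 0.8], [1.0, -0.2]]

# The third source vector retrieves the wrong target, so one of the
# three retrievals fails.
print(similarity_search_error_rate(src, tgt))
```

Under this definition, halving the error rate means that half as many source sentences retrieve the wrong translation from the target pool.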
Written by
Omnilingual SONAR Team
Ioannis Tsiamas
Yen Meng
Vivek Iyer
Guillem Ramirez
Jaehyeong Jo
Alexandre Mourachko
Yu-An Chung
Artyom Kozhevnikov
Belen Alastruey
Christophe Ropers
David Dale
João Maria Janeiro
Kevin Heffernan
Marta R. Costa-jussa
Paul-Ambroise Duquenne
Pere Lluís Huguet Cabot
Publisher
arXiv