November 10, 2025
While automatic speech recognition (ASR) systems have made remarkable progress in many high-resource languages, most of the world’s 7,000+ languages remain unsupported, with thousands of long-tail languages effectively left behind. Expanding ASR coverage has long been regarded as prohibitively expensive and of limited benchmark value. It is further hampered by architectures that restrict language coverage to a fixed set, which makes extension inaccessible to most communities, and it raises ethical concerns when pursued without community collaboration. To transcend these limitations, this article introduces Omnilingual ASR, the first large-scale ASR system designed for extensibility. More specifically, Omnilingual ASR enables communities to introduce unserved languages with only a handful of their own data samples. On the modeling side, Omnilingual ASR scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder–decoder architecture designed for zero-shot generalization, using a decoder inspired by large language models to exploit these representations effectively. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to previously unseen languages. Combining public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to more than 1,600 languages, the largest such effort to date, including over 500 never before served by any ASR system. Automatic evaluations show substantial gains over prior systems, especially in extreme low-resource conditions, and strong generalization to languages never encountered during training. Crucially, Omnilingual ASR is released as a family of models ranging from compact 300M variants for low-power devices to large 7B models for maximum accuracy.
Throughout the paper, we reflect on the ethical considerations shaping this design and conclude by discussing its broader societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities alike, inviting new forms of participation without requiring onerous expertise or heavy compute. All open-source artifacts from this effort are available at https://github.com/facebookresearch/omnilingual-asr.
Written by
Omnilingual ASR team
Skyler Wang
Ife Adebara
Michael Auli
Kaushik Ram Sadagopan
Zheng-Xin Yong
Albert Ventayol-Boada
Alexandre Mourachko
Alexander Erben
Yu-An Chung
Arina Turkatenko
Artyom Kozhevnikov
Caley Drooff
Can Balioglu
Chierh Cheng
Christophe Ropers
Cynthia Gao
Gabriel Mejia Gonzalez
Gil Keren
Jean Maillard
Joe Chuang
Kehan Lyu
Kevin Chan
Mark Duppenthaler
Mary Williamson
Matthew Setzler
Paul-Ambroise Duquenne
Rashel Moritz
Safiyyah Saleem
Sagar Miglani
Shireen Yates
Vineel Pratap
Yen Meng
Publisher
arXiv
