FEATURED
Natural Language Processing
Bringing the world closer together with a foundational multimodal model for speech translation
August 22, 2023
7 minute read

The world we live in has never been more interconnected—the global proliferation of the internet, mobile devices, social media, and communication platforms gives people access to more multilingual content than ever before. In such a context, having an on-demand ability to communicate and understand information in any language becomes increasingly important. While such a capability has long been dreamed of in science fiction, AI is on the verge of bringing this vision into technical reality.


Today, we’re introducing SeamlessM4T, a foundational multilingual and multitask model that seamlessly translates and transcribes across speech and text. SeamlessM4T supports:

  • Automatic speech recognition for nearly 100 languages
  • Speech-to-text translation for nearly 100 input and output languages
  • Speech-to-speech translation, supporting nearly 100 input languages and 35 (+ English) output languages
  • Text-to-text translation for nearly 100 languages
  • Text-to-speech translation, supporting nearly 100 input languages and 35 (+ English) output languages

In keeping with our approach to open science, we’re publicly releasing SeamlessM4T under CC BY-NC 4.0 to allow researchers and developers to build on this work. We’re also releasing the metadata of SeamlessAlign, the biggest open multimodal translation dataset to date, totaling 470,000 hours of mined speech and text alignments. We make it easy for the community to perform mining on their own monolingual datasets with SONAR, a complete suite of speech and text sentence encoders, and stopes, our library for multimodal data processing and parallel data mining. All research advancements are supported by fairseq2, our next-generation sequence modeling library.

Building a universal language translator, like the fictional Babel Fish in The Hitchhiker’s Guide to the Galaxy, is challenging because existing speech-to-speech and speech-to-text systems cover only a small fraction of the world’s languages. SeamlessM4T represents a significant breakthrough for speech-to-speech and speech-to-text translation by addressing two long-standing challenges: limited language coverage and a reliance on separate systems that divide speech-to-speech translation into multiple stages across subsystems. Such cascaded systems can leverage large amounts of data, but each generally performs well for only one modality. Our challenge was to create a unified multilingual model that could do it all.

We believe the work we’re announcing today is a significant step forward in this journey. Our single model provides on-demand translations that enable people who speak different languages to communicate more effectively. We significantly improve performance for the low- and mid-resource languages we support (languages with smaller digital linguistic footprints), while maintaining strong performance on high-resource languages such as English, Spanish, and German. SeamlessM4T also implicitly recognizes the source language, without the need for a separate language identification model.


This work builds on advancements Meta and others have made over the years in the quest to create a universal translator. Last year, we released No Language Left Behind (NLLB), a text-to-text machine translation model that supports 200 languages and has since been integrated into Wikipedia as one of its translation providers. A few months later, we shared a demo of our Universal Speech Translator, which was the first direct speech-to-speech translation system for Hokkien, a language without a widely used writing system. Through this, we developed SpeechMatrix, the first large-scale multilingual speech-to-speech translation dataset, derived from SpeechLASER, a breakthrough in supervised representation learning. Earlier this year, we also shared Massively Multilingual Speech, which provides automatic speech recognition, language identification, and speech synthesis technology across more than 1,100 languages. SeamlessM4T draws on findings from all of these projects to enable a multilingual and multimodal translation experience stemming from a single model, built across a wide range of spoken data sources and with state-of-the-art results.

Our approach

Building a unified model requires a sequence modeling toolkit that is lightweight and easily composable with other modern PyTorch ecosystem libraries. To that end, we redesigned fairseq, our original sequence modeling toolkit, as fairseq2. With more efficient modeling and data loader APIs, fairseq2 helps power the modeling behind SeamlessM4T.

For the model, we use the multitask UnitY model architecture, which is capable of directly generating translated text and speech. This new architecture also supports automatic speech recognition, text-to-text, text-to-speech, speech-to-text, and speech-to-speech translation, building on the vanilla UnitY model. The multitask UnitY model consists of three main sequential components. Text and speech encoders recognize text or speech input in nearly 100 languages. The text decoder then transfers that meaning into text in nearly 100 languages, followed by a text-to-unit model that decodes it into discrete acoustic units for 36 speech languages. The self-supervised speech encoder, the speech-to-text and text-to-text translation components, and the text-to-unit model are pre-trained to improve model quality and training stability. The decoded discrete units are then converted into speech using a multilingual HiFi-GAN unit vocoder.
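
Conceptually, the data flows through these components in order: speech (or text) is encoded, the text decoder produces the translation, and the text-to-unit model turns that translation into acoustic units for the vocoder. The sketch below illustrates this flow with toy PyTorch modules; all class names and dimensions are illustrative placeholders, not the released SeamlessM4T implementation or the fairseq2 API.

```python
# Minimal sketch of the multitask UnitY data flow described above.
# Toy modules and dimensions only; not the released SeamlessM4T code.
import torch
import torch.nn as nn

D = 256  # toy model dimension


class SpeechEncoder(nn.Module):
    """Stand-in for the self-supervised speech encoder (w2v-BERT 2.0)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, D)  # e.g. 80-dim filterbank frames -> model dim
        self.layers = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)

    def forward(self, feats):  # (batch, frames, 80)
        return self.layers(self.proj(feats))


class LengthAdaptor(nn.Module):
    """Shortens the frame sequence toward a roughly word-level rate."""
    def __init__(self, stride=4):
        super().__init__()
        self.conv = nn.Conv1d(D, D, kernel_size=stride, stride=stride)

    def forward(self, x):  # (batch, frames, D) -> (batch, frames // stride, D)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)


class TextDecoder(nn.Module):
    """Stand-in for the NLLB-style text decoder (first pass: translated text)."""
    def __init__(self, vocab=1000):
        super().__init__()
        self.emb = nn.Embedding(vocab, D)
        self.dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D, nhead=4, batch_first=True), num_layers=2)
        self.out = nn.Linear(D, vocab)

    def forward(self, prev_tokens, enc_states):
        h = self.dec(self.emb(prev_tokens), enc_states)
        return self.out(h), h  # text logits + hidden states for the T2U stage


class TextToUnit(nn.Module):
    """Stand-in for the T2U model (second pass: discrete acoustic units)."""
    def __init__(self, n_units=1000):
        super().__init__()
        self.net = nn.Linear(D, n_units)

    def forward(self, text_states):
        return self.net(text_states)


# Forward pass for speech-to-speech translation:
speech = torch.randn(1, 200, 80)                      # input audio features
enc = LengthAdaptor()(SpeechEncoder()(speech))        # encode speech, shorten sequence
prev = torch.zeros(1, 10, dtype=torch.long)           # previously generated target tokens
text_logits, text_states = TextDecoder()(prev, enc)   # first pass: translated text
unit_logits = TextToUnit()(text_states)               # second pass: acoustic units
# A multilingual HiFi-GAN unit vocoder would then turn the predicted units into
# a waveform; that final step is omitted here.
```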

How the encoder processes speech

Our self-supervised speech encoder, w2v-BERT 2.0, is an improved version of w2v-BERT with better training stability and representation quality. It learns to find structure and meaning in speech by analyzing millions of hours of multilingual speech. The encoder takes the audio signal, breaks it down into smaller parts, and builds an internal representation of what is being said. Because spoken words are made up of many such sounds and characters, we use a length adaptor to roughly map them to actual words.
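
To give a sense of how such an encoder can learn from unlabeled audio, the sketch below shows the general masked-prediction idea behind w2v-BERT-style self-supervision: some frames are hidden, and the model is trained to predict discrete targets for the masked positions. The module names, masking rate, and target codebook here are illustrative assumptions, not the actual w2v-BERT 2.0 training recipe.

```python
# Simplified masked-prediction objective in the spirit of w2v-BERT-style
# self-supervision (illustrative only; not the actual w2v-BERT 2.0 recipe).
import torch
import torch.nn as nn
import torch.nn.functional as F

D, N_TARGETS = 256, 512   # toy feature dimension and codebook size

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
to_targets = nn.Linear(D, N_TARGETS)   # predicts a discrete target per frame

frames = torch.randn(8, 100, D)                       # batch of unlabeled speech features
targets = torch.randint(0, N_TARGETS, (8, 100))       # discrete targets (e.g. from quantization)

mask = torch.rand(8, 100) < 0.3                       # hide ~30% of the frames
masked = frames.clone()
masked[mask] = 0.0                                    # zero out masked frames (a learned mask embedding in practice)

logits = to_targets(encoder(masked))                  # (8, 100, N_TARGETS)
loss = F.cross_entropy(logits[mask], targets[mask])   # predict only the masked positions
loss.backward()
```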

How the encoder processes text

Similarly, we have a text encoder that is based on the NLLB model. It has been trained to understand text in nearly 100 languages and produce representations that are useful for translation.

Producing text

Our text decoder is trained to take encoded speech or text representations and generate text. This can be applied to tasks in the same language, such as automatic speech recognition, as well as to multilingual translation tasks. For example, someone can say the word “bonjour” in French and expect the translated text in Swahili to be “habari.” With multitask training, we leverage the strengths of a strong text-to-text translation model (NLLB) to guide our speech-to-text translation model via token-level knowledge distillation.
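
As a concrete illustration of token-level knowledge distillation, the sketch below trains a speech-to-text student to match the per-token output distribution of a text-to-text teacher while still learning from the reference translation. The tensors, temperature, and mixing weight alpha are illustrative stand-ins, not the actual SeamlessM4T training configuration.

```python
# Token-level knowledge distillation sketch: the student (speech-to-text model)
# matches the frozen teacher's (text-to-text model) per-token distributions.
import torch
import torch.nn.functional as F

vocab, batch, tgt_len = 1000, 4, 20
temperature, alpha = 1.0, 0.5

# Logits over the target vocabulary at each decoding step.
student_logits = torch.randn(batch, tgt_len, vocab, requires_grad=True)  # from the speech-to-text student
with torch.no_grad():
    teacher_logits = torch.randn(batch, tgt_len, vocab)                   # from the frozen text-to-text teacher
gold = torch.randint(0, vocab, (batch, tgt_len))                          # reference target tokens

# KL divergence between teacher and student token distributions...
kd_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
)
# ...combined with the usual cross-entropy against the reference translation.
ce_loss = F.cross_entropy(student_logits.view(-1, vocab), gold.view(-1))
loss = alpha * kd_loss + (1 - alpha) * ce_loss
loss.backward()
```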

Producing speech

We use acoustic units to represent speech on the target side. The text-to-unit (T2U) component in the UnitY model generates these discrete speech units based on the text output and is pre-trained on ASR data prior to UnitY fine-tuning. A multilingual HiFi-GAN unit vocoder is then used to convert these discrete units into audio waveforms.
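
Discrete acoustic units of this kind are commonly obtained by clustering the frame-level features of a self-supervised speech encoder, so each frame is replaced by the ID of its nearest cluster centroid. The sketch below illustrates that general idea; the centroids, dimensions, and the run-collapsing step are assumptions for illustration, not the exact unit-extraction pipeline used for SeamlessM4T.

```python
# Turning continuous speech features into discrete acoustic units by
# nearest-centroid assignment (a common approach; illustrative only).
import torch

n_units, D = 1000, 256
centroids = torch.randn(n_units, D)        # e.g. k-means centroids learned offline on encoder features

def to_units(features: torch.Tensor) -> torch.Tensor:
    """Map (frames, D) encoder features to a sequence of unit IDs."""
    dists = torch.cdist(features, centroids)      # (frames, n_units)
    units = dists.argmin(dim=-1)                  # nearest centroid per frame
    # Collapse runs of repeated units so the sequence is closer to phone-like granularity.
    keep = torch.ones_like(units, dtype=torch.bool)
    keep[1:] = units[1:] != units[:-1]
    return units[keep]

features = torch.randn(120, D)                    # frame-level features for one target utterance
unit_ids = to_units(features)                     # these IDs are what the T2U model learns to predict
# A unit vocoder such as multilingual HiFi-GAN then maps unit sequences back to waveforms.
```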

Data scaling

Data-driven models like SeamlessM4T usually benefit from large amounts of high-quality end-to-end data, namely speech-to-text and speech-to-speech data. Relying only on human transcribed and translated speech does not scale to tackle the challenging task of speech translation for 100 languages. We build upon our pioneering work on text-to-text mining using a similarity measure in a joint embedding space, and initial work in speech mining to create additional resources to train the SeamlessM4T model.

First, we build SONAR (Sentence-level mOdality- and laNguage-Agnostic Representations), a new massively multilingual and multimodal sentence embedding space for 200 languages, which substantially outperforms existing approaches like LASER3 or LaBSE in multilingual similarity search. We then apply a teacher-student approach to extend this embedding space to the speech modality, currently covering 36 languages. Mining is performed on data from publicly available repositories of web text (tens of billions of sentences) and speech (4 million hours). In total, we were able to automatically align more than 443,000 hours of speech with texts and create about 29,000 hours of speech-to-speech alignments. This corpus, dubbed SeamlessAlign, is the largest open speech/speech and speech/text parallel corpus to date in terms of total volume and language coverage.
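
The mining rests on a simple principle: if a speech segment and a text sentence (possibly in different languages) land close together in the joint embedding space, they are likely to be translations of each other. The sketch below shows a greatly simplified version of that idea using plain cosine scores and a threshold; the embeddings are random stand-ins for SONAR encoder outputs, and the threshold is an arbitrary illustrative value. The real pipeline (implemented in stopes) operates at web scale with margin-based scoring and approximate nearest-neighbor search.

```python
# Greatly simplified similarity-based mining sketch: embed speech segments and
# text sentences into a shared space, keep pairs whose similarity clears a
# threshold. Embeddings here are random stand-ins for SONAR encoder outputs.
import torch
import torch.nn.functional as F

emb_dim = 1024
speech_embs = F.normalize(torch.randn(500, emb_dim), dim=-1)   # stand-in for SONAR speech embeddings
text_embs = F.normalize(torch.randn(2000, emb_dim), dim=-1)    # stand-in for SONAR text embeddings
threshold = 0.6                                                # trades recall for precision

scores = speech_embs @ text_embs.T            # cosine similarity (embeddings are unit-normalized)
best_score, best_text = scores.max(dim=-1)    # best text candidate for each speech segment

mined_pairs = [
    (speech_idx, text_idx.item(), score.item())
    for speech_idx, (text_idx, score) in enumerate(zip(best_text, best_score))
    if score >= threshold
]
print(f"kept {len(mined_pairs)} speech-text pairs out of {len(speech_embs)} segments")
```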

Results

SeamlessM4T achieves state-of-the-art results for nearly 100 languages, with multitask support across automatic speech recognition, speech-to-text, speech-to-speech, text-to-speech, and text-to-text translation, all in a single model. We also significantly improve performance for the low- and mid-resource languages we support while maintaining strong performance on high-resource languages.

To more accurately evaluate the system without depending on text-based metrics, we extended our text-less metric into BLASER 2.0, which now enables evaluation across speech and text units with accuracy similar to its predecessor. When tested for robustness, our system performs better against background noise and speaker variation in speech-to-text tasks (average improvements of 37% and 48%, respectively) compared to the current state-of-the-art model.
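
The core idea behind such a text-free metric is to compare the source and the translation in a shared embedding space, regardless of whether each side is speech or text. The sketch below shows only that basic idea with a raw cosine score over stand-in embeddings; BLASER 2.0 itself builds on SONAR-style embeddings and is trained to predict human quality ratings rather than relying on this simple similarity.

```python
# Text-free quality proxy sketch: compare source and hypothesis in a shared
# embedding space. Illustrative only; not the BLASER 2.0 implementation.
import torch
import torch.nn.functional as F

def embedding_similarity(src_emb: torch.Tensor, hyp_emb: torch.Tensor) -> float:
    """Cosine similarity between source and hypothesis sentence embeddings."""
    return F.cosine_similarity(src_emb, hyp_emb, dim=-1).item()

# Stand-ins for embeddings of a source speech segment and a translated output.
src_emb = torch.randn(1, 1024)
hyp_emb = torch.randn(1, 1024)
print(f"quality proxy: {embedding_similarity(src_emb, hyp_emb):.3f}")
```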

SeamlessM4T also outperforms previous state-of-the-art competitors.

How we built SeamlessM4T responsibly

It is important that translation systems are accurate. As with all AI systems, there are inherent risks that the model could mistranscribe what a person wants to say or generate outputs that are toxic or inaccurate.

At Meta, our AI research and development follows a responsible framework that is guided by our five pillars of Responsible AI. In line with our commitment to responsible AI, we conducted research on toxicity and bias to help us understand which areas of the model might be sensitive. For toxicity, we extended our highly multilingual toxicity classifier to speech to help identify toxic words in speech inputs and outputs. We also filtered unbalanced toxicity in the training data: if the input or output contained different amounts of toxicity, we removed that training pair.
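
This filtering rule can be expressed as a simple check over each training pair: count the toxic items detected on each side and drop the pair when the counts differ. In the sketch below, toxicity_count is a hypothetical stand-in for the multilingual toxicity classifier, and the keyword-based counter in the usage example is a toy detector for illustration only.

```python
# Sketch of unbalanced-toxicity filtering: drop a training pair when the input
# and output sides contain different amounts of detected toxicity.
from typing import Callable, List, Tuple

def filter_unbalanced_toxicity(
    pairs: List[Tuple[str, str]],
    toxicity_count: Callable[[str], int],
) -> List[Tuple[str, str]]:
    """Keep only pairs where source and target have equal toxicity counts."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if toxicity_count(src) == toxicity_count(tgt)
    ]

# Toy usage with a keyword-list detector standing in for the real classifier.
TOXIC_WORDS = {"badword"}
count = lambda text: sum(word in TOXIC_WORDS for word in text.lower().split())
data = [("hello there", "bonjour"), ("hello badword", "bonjour")]
print(filter_unbalanced_toxicity(data, count))   # the second pair is dropped
```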

The demo we’re releasing today showcases the capabilities of SeamlessM4T and is an important part of the research. For the demo, we detect toxicity in both the input and the output. If toxicity is detected only in the output, it means toxicity was added by the translation; in this case, we include a warning and do not show the output. When comparing our models to the state of the art, we significantly reduce added toxicity on both speech-to-speech and speech-to-text translation.

Gender bias, where the results unfairly favor a gender and sometimes default to gender stereotypes, is another area we are beginning to evaluate in languages at scale. We are now able to quantify gender bias in dozens of speech translation directions by extending our previously designed Multilingual HolisticBias dataset to speech.

Our work around safety and security is an ongoing effort. We’ll continue to research and take action in this area to continuously improve SeamlessM4T and reduce any instances of toxicity we see in the model.

Providing access to our technology

With state-of-the-art results, we believe SeamlessM4T is an important breakthrough in the AI community’s quest to create universal multitask systems. In keeping with our approach to open science, we’re excited to share our model publicly to allow researchers and developers to build on this technology.

This is only the latest step in our ongoing effort to build AI-powered technology that helps connect people across languages. In the future, we want to explore how this foundational model can enable new communication capabilities—ultimately bringing us closer to a world where everyone can be understood.


Read the paper
Try the demo
Download the code, model, and data

This blog post was made possible by the work of Bapi Akula, Pierre Andrews, Can Balioglu, Loïc Barrault, Onur Çelebi, Peng-Jen Chen, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Naji El Hachem, Maha Elbayad, Brian Ellis, Hady Elsahar, Cynthia Gao, Hongyu Gong, Francisco Guzmán, Justin Haaheim, Prangthip Hansanti, Kevin Heffernan, John Hoffman, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Justine Kao, Christopher Klaiber, Ilia Kulikov, Janice Lam, Ann Lee, Daniel Li, Pengwei Li, Daniel Licht, Xutai Ma, Jean Maillard, Ruslan Mavlyutov, Gabriel Mejia Gonzalez, Alexandre Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Marta R. Costa-jussà, Alice Rakotoarison, Kaushik Ram Sadagopan, Mohamed Ramadan, Abinesh Ramakrishnan, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Changhan Wang, Jeff Wang, Skyler Wang, Guillaume Wenzek, Carleigh Wood, Yilin Yang, Ethan Ye, and Bokai Yu.
