August 5, 2021
We’re proud to announce that we won first place at this year’s annual multilingual speech translation competition hosted by the International Conference on Spoken Language Translation (IWSLT), the premier annual venue for AI research on speech translation.
We achieved significant accuracy gains in all seven language directions featured in the competition, including four supervised directions and three zero-shot directions (using no labeled examples), with a 4.4 BLEU improvement on average over the second-best system submitted. To put this in perspective, a gain of even 1 BLEU is considered meaningful in this field.
To build our system, we used our latest speech-to-text translation techniques, including transfer learning across modalities, tasks, and languages, along with previous research from Facebook AI such as mBART and wav2vec 2.0, which learn from unlabeled data in multiple languages, and open source data sets like VoxPopuli.
To demonstrate that our cutting-edge research model can also work well in real-world scenarios, we used it to power subtitles from Portuguese to English for Facebook’s 2021 Connectivity Research Workshop earlier this year. You can read the full paper and download our models as part of fairseq.
Speech-to-text translation — taking audio in one language and creating captions in another language — is important in lowering language barriers and making multimedia content more accessible to everyone. Unfortunately, most AI-powered speech translation systems today require building separate models and gathering millions of examples for each language. We can translate English speech to Spanish text, for example, because public data for training AI systems is widely available in both languages. What we want is to do this efficiently and accurately for all languages — using one powerful, flexible system.
Over the past couple of years, we’ve made rapid progress in multilingual speech technology research. We’ve built and open-sourced the largest multilingual data sets, and we’ve recently created new pretrained models like wav2vec 2.0, wav2vec Unsupervised, and mBART, which learn multiple languages using unlabeled data. Most recently, we’ve built new innovative techniques on top of these models using multi-task learning and more efficient fine-tuning of pre-trained models.
Our models outperformed the best end-to-end models as well as the cascaded systems by more than 4 BLEU on average. We also narrowed the gap between speech-to-text translation and text-to-text translation, which, historically, has been more extensively studied and has progressed faster than speech translation. Unlike text-to-text, speech translation has the added complexity of converting speech to text. Specifically, our system is on average only 3 BLEU behind a strong text-to-text translation system that relies on oracle speech transcripts. In fact, for English to Spanish, we achieved performance on par with the text-to-text translation system.
It’s effective in all language directions, including four supervised directions and three zero-shot directions, where there are no labeled examples for training.
To demonstrate its practicality, we’ve used our system to provide subtitles from Portuguese to English for Facebook’s 2021 Connectivity Research Workshop. The results were extremely promising, as demonstrated in the featured video above. We believe our new advancements are an important step toward breaking down language barriers, bridging speech and text, and helping people connect around the world.
We overcame three core technical challenges to develop our system: augmenting the training data to make full use of available public data sets; fine-tuning pretrained models on relevant speech-to-text data; and enabling knowledge transfer between different tasks and modalities.
Like most neural models, speech-to-text translation systems typically require large amounts of parallel training data: speech in the source language accompanied by corresponding texts in the target language. The more data we could use to train our model, the more accurate its translations would be, so in addition to training it on CoVoST V2 (which we’ll present at Interspeech this year) and EuroParl, we also mined parallel data from publicly available speech and text corpora, such as Common Voice and CCNet. We used the same mining pipeline as the CCMatrix project, which relies on the multilingual LASER text encoder: source speech is aligned with target texts by matching the transcripts of the source speech against the target texts based on semantic similarity.
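The core of this mining step can be sketched in a few lines. The example below is a minimal, hypothetical illustration (the function names are ours, not the CCMatrix API): given LASER-style sentence embeddings for source transcripts and target texts, it pairs each source with its best target using the margin-based scoring that CCMatrix popularized, where raw cosine similarity is normalized by the average similarity to each side's nearest neighbors.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between two sets of row vectors."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def align_by_margin(src_emb, tgt_emb, k=2):
    """Margin-based alignment in the style of CCMatrix: score each
    candidate pair by its cosine similarity relative to the average
    similarity of each side's k nearest neighbors, then pick, for each
    source sentence, the highest-scoring target."""
    sim = cosine_sim(src_emb, tgt_emb)
    # average similarity of each source row / target column to its k nearest neighbors
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    margin = sim / ((src_knn + tgt_knn) / 2)
    return margin.argmax(axis=1), margin.max(axis=1)
```

In practice the embeddings come from a pretrained multilingual encoder such as LASER, and pairs below a margin threshold are discarded rather than kept; this sketch omits that filtering.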
In addition to data augmentation, pre-training offers a workaround to data scarcity, and enables models to learn general knowledge about speech and text from a large amount of unlabeled data. We further fine-tuned the pre-trained speech encoder, text encoder and decoder on parallel speech-to-text data.
The backbone of our system is a multilingual speech translation model built by fine-tuning pretrained models. The speech encoder is a wav2vec 2.0 model pre-trained on unlabeled multilingual audio corpora using a contrastive loss. It takes speech inputs and transforms them into high-quality hidden representations. The text encoder and decoder are initialized with a pre-trained multilingual BART model (mBART), which is first trained on monolingual textual data and then tuned on parallel texts.
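To make the contrastive pretraining objective concrete, here is a minimal numpy sketch of an InfoNCE-style loss of the kind wav2vec 2.0 uses: each context representation must identify its true quantized target among distractors from the same utterance. This is an illustrative simplification (real wav2vec 2.0 samples a fixed number of distractors and operates on masked timesteps), not the fairseq implementation.

```python
import numpy as np

def info_nce(context, quantized, temperature=0.1):
    """InfoNCE-style contrastive loss: for each timestep t, the context
    vector context[t] should be most similar to its own quantized target
    quantized[t] (the diagonal), with all other timesteps as distractors."""
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    q = quantized / np.linalg.norm(quantized, axis=1, keepdims=True)
    logits = (c @ q.T) / temperature              # (T, T) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # negative log-likelihood of the true (diagonal) targets
    return -np.log(np.diag(probs)).mean()
```

When the context network reproduces its targets exactly, the loss approaches zero; mismatched representations are penalized, which is what drives the encoder to learn useful speech representations from unlabeled audio.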
Because different modalities are involved in the computation, we adopted multi-task learning techniques and trained models jointly on speech-to-text and text-to-text translation, which encourages knowledge transfer between the tasks and modalities.
Within this multi-task learning framework, cross-attentive regularization and online knowledge distillation are used to enhance knowledge transfer between tasks. In the last stage, our models are fine-tuned on the speech-to-text task only, without using the text input or the text encoder.
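The online knowledge distillation idea can be sketched as a joint loss: the speech-to-text (ST) student is trained both on the gold targets via cross-entropy and toward the output distribution of the text-to-text (MT) teacher via a KL-divergence term. The function below is our own simplified illustration, assuming per-token logits from both branches; it is not the exact loss used in the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_distillation_loss(st_logits, mt_logits, targets, alpha=0.5):
    """Sketch of online knowledge distillation in multi-task training:
    combine cross-entropy on the gold targets with a KL term pulling the
    ST student's distribution toward the MT teacher's distribution."""
    p_st = softmax(st_logits)
    p_mt = softmax(mt_logits)
    # cross-entropy of the student against the gold target tokens
    ce = -np.log(p_st[np.arange(len(targets)), targets]).mean()
    # KL divergence from the teacher's distribution to the student's
    kl = (p_mt * (np.log(p_mt) - np.log(p_st))).sum(axis=-1).mean()
    return (1 - alpha) * ce + alpha * kl
```

Note how the two terms decouple: if the student already matches the teacher exactly, the KL term vanishes and only the cross-entropy on gold targets remains, which is consistent with the final fine-tuning stage dropping the text branch altogether.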
This milestone brings us one step closer on the path to building a universal speech translator that breaks barriers, makes content accessible to everyone in their own language, and enables cross-lingual communication. But with thousands of languages spoken around the world, there’s still a long road ahead. We’ll continue to expand our research and develop techniques to reduce compute resources and model size to deploy these state-of-the-art models.
Since the goal of IWSLT’s spoken language translation competition is to provide a platform for researchers to share their ideas and spur research, we’re making the models for our winning systems available for everyone to download as part of fairseq. We hope our milestone will provide a strong test bed to spur additional research and advancements as we push progress in multilingual translation.