Equipping machines with the ability to recognize and produce speech can make information accessible to many more people, including those who rely entirely on voice to access information. However, producing good-quality machine learning models for these tasks requires large amounts of labeled data — in this case, many thousands of hours of audio, along with transcriptions. For most languages, this data simply does not exist. For example, existing speech recognition models only cover approximately 100 languages — a fraction of the 7,000+ known languages spoken on the planet. Even more concerning, nearly half of these languages are in danger of disappearing in our lifetime.
In the Massively Multilingual Speech (MMS) project, we overcome some of these challenges by combining wav2vec 2.0, our pioneering work in self-supervised learning, and a new dataset that provides labeled data for over 1,100 languages and unlabeled data for nearly 4,000 languages. Some of these, such as the Tatuyo language, have only a few hundred speakers, and for most of these languages, no prior speech technology exists. Our results show that the Massively Multilingual Speech models outperform existing models and cover 10 times as many languages. Meta is focused on multilinguality in general: For text, the NLLB project scaled multilingual translation to 200 languages, and the Massively Multilingual Speech project scales speech technology to many more languages.
Today, we are publicly sharing our models and code so that others in the research community can build upon our work. Through this work, we hope to make a small contribution to preserve the incredible language diversity of the world.
Collecting audio data for thousands of languages was our first challenge because the largest existing speech datasets cover at most 100 languages. To overcome it, we turned to religious texts, such as the Bible, that have been translated into many different languages and whose translations have been widely studied for text-based language translation research. These translations have publicly available audio recordings of people reading these texts in different languages. As part of this project, we created a dataset of readings of the New Testament in over 1,100 languages, which provided on average 32 hours of data per language.
By considering unlabeled recordings of various other Christian religious readings, we increased the number of languages available to over 4,000. While this data is from a specific domain and is often read by male speakers, our analysis shows that our models perform equally well for male and female voices. And while the content of the audio recordings is religious, our analysis shows that this does not overly bias the model to produce more religious language. We believe this is because we use a Connectionist Temporal Classification (CTC) approach, which is far more constrained than large language models (LLMs) or sequence-to-sequence models for speech recognition.
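To illustrate why a CTC output is tightly tied to the audio, here is a minimal sketch of greedy CTC decoding: each frame votes for one symbol, repeated symbols are merged, and blanks are dropped. This is a generic illustration of the CTC collapse rule, not the project's decoder; the blank symbol `_`, the toy vocabulary, and the function name are assumptions made for the example.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Decode per-frame scores with the standard CTC collapse rule.

    frame_scores: one list of scores per audio frame, aligned with vocab.
    vocab: list of output symbols, including the blank.
    """
    # 1. Pick the highest-scoring symbol independently for each frame.
    best = [
        vocab[max(range(len(vocab)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    # 2. Merge runs of the same symbol into a single symbol.
    collapsed = [symbol for symbol, _ in groupby(best)]
    # 3. Drop blanks; what remains is the transcript.
    return "".join(symbol for symbol in collapsed if symbol != blank)


if __name__ == "__main__":
    vocab = ["_", "a", "b"]
    # Frame-wise winners are: a a _ a b b _
    scores = [
        [0.1, 0.8, 0.1],
        [0.2, 0.7, 0.1],
        [0.9, 0.05, 0.05],
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8],
        [0.1, 0.2, 0.7],
        [0.8, 0.1, 0.1],
    ]
    print(ctc_greedy_decode(scores, vocab))
```

Because every emitted character has to be supported by at least one audio frame, the decoder has little room to produce text that is not grounded in the recording.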
We preprocessed the data to improve quality and to make it usable by our machine learning algorithms. To do so, we trained an alignment model on existing data in over 100 languages and used this model together with an efficient forced alignment algorithm that can process very long recordings of about 20 minutes or more. We applied multiple rounds of this process and performed a final cross-validation filtering step based on model accuracy to remove potentially misaligned data. To enable other researchers to create new speech datasets, we added the alignment algorithm to PyTorch and released the alignment model.
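As a rough illustration of what forced alignment computes, the sketch below finds the best monotonic assignment of audio frames to a known token sequence with a Viterbi-style dynamic program. It is deliberately simplified and is not the released algorithm: it assumes no CTC blank symbol, holds the whole score table in memory, and uses hypothetical names (`force_align`, `log_probs`, `tokens`) and toy inputs.

```python
import math


def force_align(log_probs, tokens):
    """Align frames to tokens monotonically, maximizing total log-probability.

    log_probs: log_probs[t][v] is the log-probability of vocab entry v at frame t.
    tokens: the known transcript as a list of vocab indices.
    Returns a list of (token, start_frame, end_frame) with end exclusive.
    """
    num_frames, num_tokens = len(log_probs), len(tokens)
    if num_tokens == 0 or num_frames < num_tokens:
        raise ValueError("need at least one frame per token")

    neg_inf = -math.inf
    # score[t][j]: best score of any path that puts frame t on token position j.
    score = [[neg_inf] * num_tokens for _ in range(num_frames)]
    # advanced[t][j]: True if the best path moved from position j-1 to j at frame t.
    advanced = [[False] * num_tokens for _ in range(num_frames)]
    score[0][0] = log_probs[0][tokens[0]]

    for t in range(1, num_frames):
        for j in range(min(t + 1, num_tokens)):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else neg_inf
            if move > stay:
                advanced[t][j] = True
                best = move
            else:
                best = stay
            score[t][j] = best + log_probs[t][tokens[j]]

    # Backtrack from the last frame, which must sit on the last token.
    path = [0] * num_frames
    j = num_tokens - 1
    for t in range(num_frames - 1, -1, -1):
        path[t] = j
        if advanced[t][j]:
            j -= 1

    # Turn the per-frame path into one span per token position.
    spans, start = [], 0
    for t in range(1, num_frames + 1):
        if t == num_frames or path[t] != path[t - 1]:
            spans.append((tokens[path[start]], start, t))
            start = t
    return spans


if __name__ == "__main__":
    hi, lo = math.log(0.9), math.log(0.1)
    toy = [[hi, lo], [hi, lo], [lo, hi], [lo, hi]]  # 4 frames, 2 vocab entries
    print(force_align(toy, [0, 1]))
```

A production aligner additionally has to handle blanks and recordings far longer than a toy table can represent, which is where the efficiency of the algorithm matters.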
Thirty-two hours of data per language is not enough to train conventional supervised speech recognition models. This is why we built on wav2vec 2.0, our prior work on self-supervised speech representation learning, which greatly reduced the amount of labeled data needed to train good systems. Concretely, we trained self-supervised models on about 500,000 hours of speech data in over 1,400 languages — this is nearly five times more languages than any known prior work. The resulting models were then fine-tuned for a specific speech task, such as multilingual speech recognition or language identification.
To get a better understanding of how well models trained on the Massively Multilingual Speech data perform, we evaluated them on existing benchmark datasets, such as FLEURS.
We trained multilingual speech recognition models on over 1,100 languages using a 1B parameter wav2vec 2.0 model. As the number of languages increases, performance does decrease, but only very slightly: Moving from 61 to 1,107 languages increases the character error rate by only about 0.4 percent but increases the language coverage by over 18 times.
In a like-for-like comparison with OpenAI’s Whisper, we found that models trained on the Massively Multilingual Speech data achieve half the word error rate while covering 11 times more languages. This demonstrates that our model can perform very well compared with the best current speech models.
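For readers unfamiliar with the metrics quoted above, character error rate and word error rate are both an edit distance divided by the reference length, computed over characters or over words. The sketch below shows that standard definition; the function names and the whitespace tokenization are assumptions for illustration, and real evaluations usually apply text normalization first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference unit
                curr[j - 1] + 1,         # insertion of a hypothesis unit
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character error rate: units are individual characters."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word error rate: units are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


if __name__ == "__main__":
    ref, hyp = "the cat sat", "the cat sit"
    print(f"CER = {cer(ref, hyp):.3f}, WER = {wer(ref, hyp):.3f}")
    # Coverage ratio from the figures in the text: 61 -> 1,107 languages.
    print(f"coverage ratio = {1107 / 61:.1f}x")
```

A single wrong character changes one word out of three but only one character out of eleven here, which is why the two metrics are not directly comparable.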
Next, we trained a language identification (LID) model for over 4,000 languages using our datasets as well as existing datasets, such as FLEURS and CommonVoice, and evaluated it on the FLEURS LID task. It turns out that supporting 40 times the number of languages still results in very good performance.
We also built text-to-speech systems for over 1,100 languages. Current text-to-speech models are typically trained on speech corpora that contain only a single speaker. A limitation of the Massively Multilingual Speech data is that it contains relatively few different speakers for many languages, and often only a single speaker. For building text-to-speech systems, however, this is an advantage. We found that the speech produced by these systems is of good quality, as the examples below show.
We are encouraged by our results, but as with all new AI technologies, our models aren’t perfect. For example, there is some risk that the speech-to-text model may mistranscribe select words or phrases. Depending on the output, this could result in offensive and/or inaccurate language. We continue to believe that collaboration across the AI community is critical to the responsible development of AI technologies.
Toward a single speech model supporting thousands of languages
Many of the world’s languages are in danger of disappearing, and the limitations of current speech recognition and speech generation technology will only accelerate this trend. We envision a world where technology has the opposite effect, encouraging people to keep their languages alive since they can access information and use technology by speaking in their preferred language.
The Massively Multilingual Speech project presents a significant step forward in this direction. In the future, we want to increase the language coverage to support even more languages, and also tackle the challenge of handling dialects, which is often difficult for existing speech technology. Our goal is to make it easier for people to access information and to use devices in their preferred language. There are also many concrete use cases for speech technology, from VR/AR technology that can be used in a person’s preferred language to messaging services that can understand everyone’s voice.
We also envision a future where a single model can solve several speech tasks for all languages. While we trained separate models for speech recognition, speech synthesis, and language identification, we believe that in the future, a single model will be able to accomplish all these tasks and more, leading to better overall performance.
This blog post was made possible by the work of Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Ali Elkahky, Zhaoheng Ni, Sayani Kundu, Maryam Fazel-Zarandi, Apoorv Vyas, Alexei Baevski, Yossef Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli.