March 10, 2022
There’s a massive amount of publicly shared audio data out in the world, from audiobooks to archived radio programs. Using Meta AI’s continuously developing automatic speech recognition (ASR) systems, we’re able to build ASR models that make the most of that information, including raw, unlabeled audio. Traditionally, to build a speech recognition system, you need both an audio sample and the corresponding transcript, which are used together to train the model. Transcribing large quantities of audio is incredibly labor-intensive, however, and does not scale to hundreds or thousands of languages and dialects. To help address this gap, Meta AI is developing a new high-performance open-source multilingual ASR model that uses pseudo labeling, a popular machine learning technique that leverages unlabeled data. Our latest work in pseudo labeling makes it possible to build an effective ASR model using unlabeled data across 60 languages.
Pseudo labeling complements Meta AI’s recent advances with wav2vec 2.0, HuBERT, textless NLP and data2vec in learning self-supervised speech representations. Pseudo labeling aims to solve the same overall challenge: how to make the best use of massive amounts of audio data that has not been transcribed by humans. Most importantly, this work extends pseudo labeling to a multilingual setting, where we use a small number of labeled examples to predict labels for an entire data set spanning multiple languages. This enables the development of high-performance open-source multilingual ASR models for many more languages and dialects than previously possible.
For the last few years, pseudo labeling has helped create better speech recognition in monolingual systems but not in multilingual ones. Our most recent work in pseudo labeling builds on advances we have made over the years in two areas: iterative pseudo-labeling, a semi-supervised algorithm that efficiently performs pseudo-labeling on unlabeled data, and massively multilingual ASR, which trains a single acoustic model across more than 50 languages.
We are also releasing a large-scale multilingual open-source ASR model available to the community, trained only with open-source data.
Pseudo labeling works by using a model trained on labeled data to predict the labels for unlabeled data, and then using those “pseudo labels” to train the model in a supervised way on the unlabeled data. It enables accurate ASR models to be built using far less transcript data. This means that when working with many thousands of hours of unlabeled audio, we need transcripts for only a small subset of the data. We then use pseudo labeling to predict labels for the remaining hours, which allows us to use all of the audio to train the speech recognition system.
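To make the loop concrete, here is a minimal sketch of one round of pseudo labeling. It is a toy illustration, not our ASR training code: the "model" is a hypothetical nearest-centroid classifier over single numbers standing in for audio, and the names `fit_centroids`, `predict` and `pseudo_label_round` are invented for this example.

```python
from statistics import mean

def fit_centroids(features, labels):
    """Train a toy 'model': one mean feature value per label."""
    by_label = {}
    for x, y in zip(features, labels):
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))

def pseudo_label_round(labeled_x, labeled_y, unlabeled_x):
    # 1. Train a seed model on the small labeled subset.
    seed = fit_centroids(labeled_x, labeled_y)
    # 2. Use the seed model to predict "pseudo labels" for the unlabeled data.
    pseudo_y = [predict(seed, x) for x in unlabeled_x]
    # 3. Retrain in a supervised way on labeled plus pseudo-labeled data.
    final = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)
    return pseudo_y, final

# Two labeled points stand in for the transcribed subset;
# four unlabeled points stand in for the untranscribed audio.
labeled_x, labeled_y = [0.0, 10.0], ["a", "b"]
unlabeled_x = [1.0, 2.0, 8.0, 9.0]
pseudo_y, model = pseudo_label_round(labeled_x, labeled_y, unlabeled_x)
print(pseudo_y)
print(model)
```

In an iterative variant, the retrained model would be used to regenerate the pseudo labels and the round repeated; the sketch shows a single pass only.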
The greatest challenge here comes in doing pseudo labeling effectively across many languages, because each one will have its own character set that could interfere with others. Making sure multilingual ASR systems perform well across many languages was the hard part. Multilingual systems also require a lot of engineering innovation to efficiently train these large models. This is where the experience within our group of Meta AI researchers working on pseudo labeling was helpful, as it enabled us to scale to a multilingual model effectively.
Leveraging pseudo labeling to train multilingual ASR systems in particular has many advantages. Building speech recognition systems for 60 languages would normally mean training 60 different ASR models. But multilingual pseudo labeling takes into account commonalities between languages (in a manner similar to our M2M-100 translation model). Rather than developing 60 different models, you train a single compact model that can perform better on all 60 of those languages by using cross-lingual learning. Having a multilingual system helps share knowledge between languages, which facilitates cross-learning for similar languages like Spanish and Italian. We transcribe one to two percent of the data, and the remainder is labeled with pseudo labeling, which allows us to leverage a large amount of data far more efficiently.
We use a simple pseudo labeling recipe that works well even with low-resource languages. To make this possible, we train on audio from all 60 languages, combine their character sets into a single output vocabulary, and predict both the characters and the language they belong to. We train a supervised multilingual model, fine-tune it with semi-supervised learning on a target language, generate pseudo labels for that language, and train a final model using pseudo labels for all languages, either from scratch or by fine-tuning.
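One way to picture the combined character set and the language prediction is the sketch below. It is an assumption-laden illustration rather than a description of our implementation: the post does not specify how the language is represented in the output, so the example assumes a hypothetical language tag (such as `<es>`) placed in front of the character sequence, and the helper names `build_joint_vocab` and `encode_target` are made up for this example.

```python
def build_joint_vocab(transcripts_by_lang):
    """Merge every language's characters into one vocabulary.

    Assumption: one extra symbol per language (e.g. "<es>") is added so the
    model can also predict which language the characters belong to.
    """
    chars = sorted({ch
                    for texts in transcripts_by_lang.values()
                    for text in texts
                    for ch in text})
    lang_tags = [f"<{lang}>" for lang in sorted(transcripts_by_lang)]
    return {symbol: index for index, symbol in enumerate(lang_tags + chars)}

def encode_target(vocab, lang, text):
    """Encode a transcript as: language tag index, then character indices."""
    return [vocab[f"<{lang}>"]] + [vocab[ch] for ch in text]

# Tiny stand-in corpus; real training data would cover all languages.
transcripts = {"es": ["hola"], "it": ["ciao"]}
vocab = build_joint_vocab(transcripts)
print(vocab)
print(encode_target(vocab, "es", "hola"))
print(encode_target(vocab, "it", "ciao"))
```

Characters shared by several languages (here `a` and `o`) map to the same index, which is one simple way a single model can share what it learns across similar languages.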
Meta AI is committed to an open-science approach to research, and one of our main goals with this work is to open-source these pseudo labeling models in order to enable others to use them. For this reason, it was important for us to use public data sets to help promote open research. We chose the Common Voice and VoxPopuli data sets, given that they are the most popular multilingual data sets available. Training on the 19 languages of VoxPopuli with pseudo labels improves performance not only on the Common Voice test sets for those 19 languages but also on many other languages, enables training a larger model without overfitting, and helps the model generalize better to a new domain such as LibriSpeech audiobooks.
Large quantities of labeled data are difficult to produce (or simply not available) for many languages. But these advances in pseudo labeling will go a long way in training multilingual ASR systems because this technique is much more efficient than traditional labeling methods.
A multilingual system would be simpler to maintain than a collection of monolingual models. It would also enable users to comfortably speak any language without needing to tell the system in advance which language to expect, and it would share knowledge between all languages for improved performance.
This blog post was made possible by the work of Ronan Collobert, Tatiana Likhomanenko, Loren Lugosch, Vineel Pratap, Gabriel Synnaeve, Qiantong Xu (in alphabetical order).