High-performance speech recognition with no supervision at all

May 21, 2021

What the research is:

Whether it’s giving directions, answering questions, or carrying out requests, speech recognition makes life easier in countless ways. But today the technology is available for only a small fraction of the thousands of languages spoken around the globe. This is because high-quality systems need to be trained with large amounts of transcribed speech audio. This data simply isn’t available for every language, dialect, and speaking style. Transcribed recordings of English-language novels, for example, will do little to help machines learn to understand a Basque speaker ordering food off a menu or a Tagalog speaker giving a business presentation.

This is why we developed wav2vec Unsupervised (wav2vec-U), a way to build speech recognition systems that require no transcribed data at all. It rivals the performance of the best supervised models from only a few years ago, which were trained on nearly 1,000 hours of transcribed speech. We’ve tested wav2vec-U with languages such as Swahili and Tatar, which do not currently have high-quality speech recognition models available because they lack extensive collections of labeled training data.

Wav2vec-U is the result of years of Facebook AI’s work in speech recognition, self-supervised learning, and unsupervised machine translation. It is an important step toward building machines that can solve a wide range of tasks just by learning from their observations. We think this work will bring us closer to a world where speech technology is available for many more people.

How it works:


Wav2vec-U learns purely from recorded speech audio and unpaired text, eliminating the need for any transcriptions. Our framework takes a novel approach compared with previous ASR systems: The method begins by learning the structure of speech from unlabeled audio. Using our self-supervised model wav2vec 2.0 and a simple k-means clustering method, we segment the voice recordings into speech units that loosely correspond to individual sounds. (The word cat, for example, includes three sounds: “/K/”, “/AE/”, and “/T/”.)
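The segmentation step can be sketched in a few lines. This is a toy illustration, not the released implementation: the random vectors below stand in for wav2vec 2.0 frame representations, and a plain Lloyd's k-means stands in for the clustering used in the actual system.

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means; a stand-in for the clustering step
    (the real system clusters wav2vec 2.0 representations)."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest cluster center
        dists = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned frames
        for j in range(k):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(axis=0)
    return labels

def segment(labels):
    """Collapse runs of identical cluster ids into units, loosely
    analogous to grouping frames into phone-like speech units."""
    units = [labels[0]]
    for l in labels[1:]:
        if l != units[-1]:
            units.append(l)
    return units

# toy "feature frames" (three well-separated clusters) instead of
# real wav2vec 2.0 outputs
frames = np.vstack([np.random.default_rng(1).normal(c, 0.1, (30, 16))
                    for c in (0.0, 1.0, 2.0)])
units = segment(kmeans(frames, k=3))
print(units)  # a short sequence of pseudo speech-unit ids
```

The collapsing step matters: many consecutive audio frames belong to the same sound, so runs of identical cluster ids are merged into a single unit before the units are handed to the generator.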

To learn to recognize the words in an audio recording, we train a generative adversarial network (GAN) consisting of a generator and a discriminator network. The generator takes each audio segment embedded in self-supervised representations and predicts a phoneme corresponding to a sound in the language. It is trained by trying to fool the discriminator, which assesses whether the predicted phoneme sequences look realistic. Initially, the transcriptions are very poor, but over time, with the feedback of the discriminator, they become more accurate.
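The adversarial setup can be illustrated with a deliberately minimal numpy sketch. Everything here is a simplification for exposition: the dimensions are made up, both networks are reduced to single linear layers (the paper's networks are convolutional), and a random Dirichlet vector stands in for the phoneme statistics of real text. Only the training dynamic is faithful: the discriminator learns to score real phoneme frequencies above generated ones, while the generator updates its weights to fool it.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, T = 16, 8, 12  # feature dim, phoneme inventory, segments per utterance

X = rng.normal(size=(T, D))         # stand-in for segment representations
f_real = rng.dirichlet(np.ones(K))  # stand-in for real-text phoneme frequencies

Wg = rng.normal(scale=0.1, size=(D, K))  # generator: features -> phoneme logits
wd = rng.normal(scale=0.1, size=K)       # discriminator: linear score
bd = 0.0

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

lr = 0.5
for step in range(200):
    P = softmax(X @ Wg)      # (T, K): predicted phoneme distribution per segment
    f_fake = P.mean(axis=0)  # utterance-level phoneme frequencies
    d_real = sigmoid(f_real @ wd + bd)
    d_fake = sigmoid(f_fake @ wd + bd)

    # discriminator step: push d_real toward 1 and d_fake toward 0
    wd += lr * ((1 - d_real) * f_real - d_fake * f_fake)
    bd += lr * ((1 - d_real) - d_fake)

    # generator step: ascend log d_fake to fool the discriminator,
    # backpropagating through the mean and the softmax
    df = (1 - d_fake) * wd / T
    dlogits = P * (df - (P * df).sum(axis=1, keepdims=True))
    Wg += lr * X.T @ dlogits
```

Note that the generator's softmax outputs, not hard phoneme choices, are fed to the discriminator; that keeps the whole pipeline differentiable so the generator can receive gradient feedback.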


Facebook AI’s new unsupervised speech recognition model is the latest development in several years of work on speech recognition models, data sets, and training techniques. In this timeline, we highlight key achievements, including wav2letter, unsupervised machine translation, wav2vec, Librilight, wav2vec 2.0, XLSR, and wav2vec 2.0 + self-training.

The discriminator itself is also a neural network. We train it by feeding it the output of the generator as well as showing it real text from various sources that has been phonemized, that is, converted into sequences of phonemes. This way, it learns to distinguish between the speech recognition output of the generator and real text.
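Phonemization is what puts the real text into the same units the generator predicts. Real systems use an off-the-shelf phonemizer or pronunciation lexicon for each language; the tiny hard-coded lexicon below is purely a hypothetical illustration of the mapping.

```python
# Toy grapheme-to-phoneme lexicon (ARPABET-style symbols).
# A real pipeline would use a full phonemizer for the target language.
LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}

def phonemize(sentence):
    """Map each word of unpaired text to its phoneme sequence, so the
    discriminator sees text in the same units the generator predicts."""
    phones = []
    for word in sentence.lower().split():
        phones.extend(LEXICON[word])
    return phones

print(phonemize("The cat sat"))
# ['DH', 'AH', 'K', 'AE', 'T', 'S', 'AE', 'T']
```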

To get a sense of how well wav2vec-U works, we evaluated it first on the TIMIT benchmark, where it reduced the error rate by 57 percent compared with the next best unsupervised method.

Wav2vec-U compared with the previous best unsupervised method on the TIMIT benchmark.

We were also interested in how wav2vec-U performed compared with supervised models on the much larger Librispeech benchmark, where models typically use 960 hours of transcribed speech data. We found wav2vec-U to be as accurate as the state of the art from only a few years ago, while using no labeled training data at all. This shows that speech recognition systems trained with no supervision can achieve very good quality.
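The figures on these benchmarks are error rates: word error rate (WER) on Librispeech, and the analogous phone error rate on TIMIT. Both are computed as the edit distance between the system's output and the reference, divided by the reference length. A minimal sketch of the standard dynamic-programming computation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with standard Levenshtein dynamic programming over words.
    (Phone error rate is the same computation over phoneme sequences.)"""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

# one substitution ("the" -> "a") out of six reference words -> ~0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```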

Wav2vec-U on the Librispeech benchmark (test-other) compared with the best systems over time, which typically use 960+ hours of transcribed data.

TIMIT and Librispeech measure performance on English speech, for which good speech recognition technology already exists, thanks to large, widely available labeled data sets. However, unsupervised speech recognition is most impactful for languages with little or no labeled data. We therefore tried our method on other languages, and we think this technology is particularly promising for languages with few data resources, such as Swahili, Tatar, and Kyrgyz.

We also trained wav2vec-U on other languages.

Why it matters:

AI technologies like speech recognition should not benefit only people who are fluent in one of the world’s most widely spoken languages. Reducing our dependence on annotated data is an important part of expanding access to these tools. Facebook AI has recently made rapid progress in this area, first with the introduction of wav2vec and then wav2vec 2.0, and now with wav2vec-U. We hope this will lead to highly effective speech recognition technology for many more languages and dialects around the world. We are releasing the code to build speech recognition systems using just unlabeled speech audio recordings and unlabeled text.

More generally, people learn many speech-related skills just by listening to others around them. This suggests that there is a better way to train speech recognition models, one that does not require large amounts of labeled data. Developing these sorts of more intelligent systems is an ambitious, long-term scientific vision, and we believe wav2vec-U will help us advance toward that important and exciting goal.

Get it on GitHub
Read the paper

Written By

AI Researcher
