Today, speech technology is available for only a small fraction of the thousands of languages spoken around the world, because traditional systems must be trained on large amounts of speech audio annotated with transcriptions. Obtaining that kind of data for every human language and dialect is nearly impossible.
Wav2vec works around this limitation by requiring little to no transcribed data. The model uses self-supervision to learn from unlabeled training data, which makes speech recognition possible for many more languages and dialects, such as Kyrgyz and Swahili, that have little transcribed speech audio. Self-supervision is the key to leveraging unannotated data and building better systems.
I don't want speech technology to be accessible only to people who speak English, and who speak it without an accent. I want this to be accessible to more people. I want to support all the languages in the world.
The grand motivation behind all of this is to build systems that can learn similarly to how humans do and make connections between different pieces of data: models that learn really good representations simply by observing the world around them.
How do you access information if you don't know how to read or write? You can speak. There are nearly 7,000 languages in the world. We know how to do speech recognition decently for maybe a hundred, and really well for maybe 10. If we can lower the bar to build speech technology for many more languages, then we can make information more accessible.
Our researchers understand how essential speech recognition can be in everyday life, from asking a virtual assistant what the weather will be to telling a GPS app to find the nearest gas station. But for those who don't speak English, or who speak it with a heavy accent, speech recognition devices can be difficult to use. The wav2vec team believes that technology like this should benefit all people, not only those who are fluent in one of the world's most widely spoken languages. That's why our researchers have open-sourced the code and pretrained models, in the hope of scaling the impact and enabling other researchers to build better speech recognition systems that do not depend on annotated data.
The wav2vec model is trained by predicting speech units for masked parts of the speech audio. It learns basic units that are 25ms long, which enables it to learn high-level contextualized representations. This allows us to build speech recognition systems that can outperform the best semi-supervised methods, even with 100 times less labeled training data. The model learns a finite set of speech units, shorter than phonemes, to describe the speech audio sequence. Because this set is finite, the units encourage the model to focus on the factors that matter most for representing the speech audio.
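To make the masked-prediction objective more concrete, here is a minimal sketch in PyTorch. It is not the released wav2vec 2.0 implementation: the layer sizes, the shallow encoder, the hard nearest-neighbor quantizer (the actual model uses a differentiable Gumbel-softmax quantizer), the simple in-utterance distractor sampling, and the names TinyWav2Vec and contrastive_loss are all simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWav2Vec(nn.Module):
    """Toy model: convolutional feature encoder -> quantized targets ->
    Transformer context network over partially masked latents."""

    def __init__(self, dim=256, codebook_size=320):
        super().__init__()
        # Feature encoder: strided convolutions turn raw audio into a sequence
        # of latent frames (the real encoder downsamples 16 kHz audio to about
        # one frame every 20-25ms; this toy stack is shallower).
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.GELU(),
        )
        # Simplified quantizer: a finite codebook of learned speech units.
        self.codebook = nn.Embedding(codebook_size, dim)
        # Context network: a small Transformer builds contextualized representations.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_emb = nn.Parameter(torch.randn(dim))

    def quantize(self, z):
        # Snap each latent frame to its nearest codebook entry (hard assignment).
        codes = self.codebook.weight.expand(z.size(0), -1, -1)   # (B, V, dim)
        idx = torch.cdist(z, codes).argmin(dim=-1)               # (B, T)
        return self.codebook(idx)

    def forward(self, audio, mask_prob=0.5):
        z = self.encoder(audio.unsqueeze(1)).transpose(1, 2)     # (B, T, dim)
        q = self.quantize(z)                                     # target units
        mask = torch.rand(z.shape[:2], device=z.device) < mask_prob
        z_masked = torch.where(mask.unsqueeze(-1), self.mask_emb, z)
        c = self.context(z_masked)                               # contextual reps
        return c, q, mask

def contrastive_loss(c, q, mask, temperature=0.1):
    # For each masked frame, its own quantized unit is the positive and the
    # units of the other masked frames act as distractors.
    c, q = c[mask], q[mask]                                      # (N, dim)
    sim = F.cosine_similarity(c.unsqueeze(1), q.unsqueeze(0), dim=-1) / temperature
    return F.cross_entropy(sim, torch.arange(c.size(0), device=c.device))

# Usage on dummy data: two one-second clips of 16 kHz audio.
model = TinyWav2Vec()
c, q, mask = model(torch.randn(2, 16000))
loss = contrastive_loss(c, q, mask)
loss.backward()
```

The design choice mirrored here is the finite codebook: because every frame must snap to one of a fixed set of learned units, the contrastive task cannot be solved by memorizing low-level waveform detail, which pushes the context network toward the factors that actually distinguish speech sounds.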
Wav2vec learns from recorded speech audio and unpaired text, lessening the need for transcriptions. The self-supervised model segments the voice recording into speech units that loosely correspond to individual sounds. Training a generative adversarial network (GAN), consisting of a generator and a discriminator network, helps the model learn to recognize the words in the audio recording. Reducing our dependence on annotated data through self-supervised learning is an important part of expanding access to speech recognition tools for many more people around the world.
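A rough sketch of that adversarial setup looks like the following. This is not the released unsupervised wav2vec code: the feature dimensionality, the phoneme inventory size, the network shapes, the loss formulation, and the helper train_step are illustrative assumptions, and the real system also uses additional regularization not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_PHONEMES = 40   # assumed phoneme inventory size
FEAT_DIM = 512      # assumed dimensionality of the segmented speech features

# Generator: maps each speech-segment representation to a distribution over phonemes.
generator = nn.Sequential(
    nn.Linear(FEAT_DIM, 256), nn.ReLU(),
    nn.Linear(256, NUM_PHONEMES),
)

# Discriminator: scores a phoneme sequence as "real phonemized text" or
# "generated from audio".
discriminator = nn.Sequential(
    nn.Conv1d(NUM_PHONEMES, 256, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
    nn.Conv1d(256, 1, kernel_size=3, padding=1),
)

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def train_step(speech_segments, text_phonemes):
    """speech_segments: (B, T, FEAT_DIM) self-supervised segment features.
    text_phonemes: (B, T) phoneme ids from phonemized, unpaired text."""
    fake = F.softmax(generator(speech_segments), dim=-1)          # (B, T, P)
    real = F.one_hot(text_phonemes, NUM_PHONEMES).float()         # (B, T, P)

    # Discriminator step: real text should score high, generated output low.
    d_real = discriminator(real.transpose(1, 2)).mean()
    d_fake = discriminator(fake.detach().transpose(1, 2)).mean()
    d_loss = F.softplus(-d_real) + F.softplus(d_fake)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: make the phoneme sequences predicted from audio
    # indistinguishable from real phonemized text.
    g_loss = F.softplus(-discriminator(fake.transpose(1, 2)).mean())
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()

# Usage on dummy data.
d_loss, g_loss = train_step(torch.randn(4, 50, FEAT_DIM),
                            torch.randint(0, NUM_PHONEMES, (4, 50)))
```

Because neither network ever sees paired audio and transcriptions, the only training signal is whether the phoneme sequences predicted from audio look statistically like real phonemized text, which is what allows the approach to work without transcribed data.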
Alexei Baevski, Steffen Schneider, Wei-Ning Hsu, Alexis Conneau, Henry Zhou, Ronan Collobert, Abdelrahman Mohamed, Michael Auli
Facebook AI and Carnegie Mellon University's Department of Chemical Engineering are collaborating on the Open Catalyst Project.
AI at Meta and NYU Langone Health have developed a way to use AI to produce MRI scans from only a quarter of the raw data traditionally required for a full MRI.