Self-supervision and building more robust speech recognition systems

September 19, 2019

Advancing automatic speech recognition (ASR) systems is important for improving accessibility and communication across communities through a range of applications — from video and photo captioning, to identifying harmful content, to building more useful AI assistive technologies. But building highly accurate speech recognition models typically requires large amounts of computing resources and thousands of hours of manually annotated audio transcriptions, which are often difficult to obtain or simply not available for many languages.

As part of Facebook AI’s long-term efforts to advance self-supervised systems, we’re sharing details on three new research projects that push the boundaries of speech recognition. We’ve introduced wav2vec, a new self-supervised approach that beats traditional ASR systems that rely solely on transcribed audio, including a 22 percent relative error reduction over Deep Speech 2, while using two orders of magnitude less labeled data. We’ve created an acoustic model architecture that’s an order of magnitude faster and more efficient than previous methods, a significant advance for semi-supervised and self-supervised training. And we’ve developed a more accurate and versatile approach for transcribing proper names and other words that are outside of ASR systems’ lexicons. To help the ASR community push research forward, we’ve open-sourced our models for all three projects (here, here, and here, respectively).

These advances make it possible to build more robust ASR systems for low-resource languages, which lack large annotated datasets, and languages like Thai and Japanese, which are challenging for traditional systems because they are written without spaces between words.

Wav2vec: State-of-the-art speech recognition through self-supervision

Our new self-supervised approach to ASR, wav2vec, achieved the best result to date on the popular WSJ benchmark while using two orders of magnitude less labeled training data than a comparable system.

The algorithm works with existing ASR systems and uses raw audio as training data without the need for additional written transcriptions, demonstrating that self-supervision can make even high-performing speech recognition models more effective. For example, our wav2vec-based system yielded a 22 percent relative error reduction over Deep Speech 2, the best comparable system in the literature today.

Building state-of-the-art speech recognition systems typically requires thousands of hours of transcribed audio. Deep Speech 2, for instance, uses 12,000 hours of transcribed audio. We fine-tuned our pretrained model on only 80 hours of transcribed audio. This reduces the amount of labeled data while also improving on word error rate (WER). This is an important milestone in our efforts to expand speech recognition capabilities to languages without the high volumes of labeled speech required for standard ASR systems.
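To make the scale of that reduction concrete, the short sketch below works through the arithmetic using the two figures quoted above. The `relative_error_reduction` helper and the WER values passed to it are hypothetical, chosen only to show how a relative reduction is computed; they are not measured results.

```python
import math

# Hours of transcribed audio quoted in the text above.
DEEP_SPEECH_2_HOURS = 12_000
WAV2VEC_FINETUNE_HOURS = 80

ratio = DEEP_SPEECH_2_HOURS / WAV2VEC_FINETUNE_HOURS
orders_of_magnitude = math.log10(ratio)
print(f"{ratio:.0f}x less labeled data (about {orders_of_magnitude:.1f} orders of magnitude)")


def relative_error_reduction(baseline_wer, new_wer):
    """Fraction of the baseline system's errors removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values, for illustration only (not reported numbers).
print(f"{relative_error_reduction(10.0, 7.8):.0%} relative error reduction")
```

A relative reduction is measured against the baseline's own error rate, which is why it can be large even when the absolute difference in WER is only a few points.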

Wav2vec trains models by making them pick between existing 10-millisecond audio clips and distractor clips swapped in from elsewhere in the same example. Models must also predict the correct audio clips further into the future, increasing the difficulty and utility of the task for training.
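One way to picture this training task is as a binary contrastive loss: score the clip that really comes next highly, and score the distractors low. The sketch below uses NumPy with toy random features. The function names, the feature sizes, and the use of a plain dot product as the score are assumptions made for illustration; this is not the released wav2vec code.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)


def contrastive_loss(context, true_future, distractors):
    """Loss for one prediction step.

    context:     (d,) summary of the audio seen so far
    true_future: (d,) representation of the clip that really comes next
    distractors: (n, d) clips sampled from elsewhere in the same example
    """
    pos = log_sigmoid(true_future @ context)            # real clip should score high
    neg = log_sigmoid(-(distractors @ context)).sum()   # distractors should score low
    return float(-(pos + neg))


def sample_distractors(features, exclude_index, n, rng):
    """Draw n clips from other positions in the same example."""
    candidates = np.delete(np.arange(len(features)), exclude_index)
    picked = rng.choice(candidates, size=n, replace=False)
    return features[picked]


rng = np.random.default_rng(0)
features = rng.normal(size=(50, 16))  # 50 clips, 16 dimensions each (toy sizes)
i, k = 10, 3                          # current position, and how far ahead to predict
context = features[i + k]             # stand-in for a perfect prediction of the future clip
distractors = sample_distractors(features, i + k, n=5, rng=rng)

print(contrastive_loss(context, features[i + k], distractors))
```

Increasing `k` asks the model to predict further into the future, which corresponds to the harder variant of the task described above.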

For more technical details about how we built wav2vec, read our in-depth blog here.

Enabling recognition of out-of-vocabulary words with lexicon-free beam-search decoding

We’ve used self-supervision to achieve state-of-the-art performance in correctly recognizing words that are outside of the training lexicon. Our method uses self-supervised language modeling at the character level, where we predict whole words one letter at a time, to effectively handle these out-of-vocabulary (OOV) words. While the standard lexicon-based approach is inherently unable to recognize any OOV words, our lexicon-free approach with character-based gated convolutional language models, ConvLM, was able to correctly recognize up to 33 percent of OOV occurrences for clear speech with no background noise.

In this demonstration, we show that our new lexicon-free decoder with a character-level language model recognizes the out-of-vocabulary word more accurately than the standard word-based, lexicon approach.

Most algorithms that work to transcribe words define a vocabulary by computing the frequency of all words. Typically, words are not recognized if they don’t meet a specific threshold (or are OOV). This process results in one largely unsolved challenge in the industry — accurately handling words that are names or locations or are otherwise absent from the vocabulary.
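The frequency-threshold idea can be sketched in a few lines. The helper names, the toy transcripts, and the threshold of 2 below are all hypothetical; the point is only that a rare name drops out of a word-level vocabulary while every one of its letters remains available to a character-level model.

```python
from collections import Counter


def build_lexicon(transcripts, min_count):
    """Keep only words that occur at least min_count times."""
    counts = Counter(word for line in transcripts for word in line.lower().split())
    return {word for word, count in counts.items() if count >= min_count}


def oov_words(sentence, lexicon):
    """Words in the sentence that a lexicon-based decoder could never output."""
    return [word for word in sentence.lower().split() if word not in lexicon]


transcripts = [
    "please call the office",
    "please call the doctor",
    "the office called sam",
]

lexicon = build_lexicon(transcripts, min_count=2)
missing = oov_words("please call sam", lexicon)
print(missing)

# A character-level vocabulary is just the set of symbols seen in training,
# so any word spelled with those symbols can still be produced.
char_vocab = set("".join(transcripts))
print(all(set(word) <= char_vocab for word in missing))
```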

We leveraged our wav2letter++ framework for the acoustic model and our fairseq-py toolkit for the language model, in order to focus on language model training on the LibriSpeech and WSJ datasets. We show that with a large enough character context, our approach produces significant improvements in WER and character error rate (CER) on utterances that include OOV words. Our system delivers a better WER and CER than any previous character-based ASR model without a lexicon.
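For readers less familiar with the two metrics, WER and CER are both edit distances normalized by the length of the reference, computed over words and characters respectively. The following is a minimal sketch with hypothetical function names and a made-up example sentence; it is not the scoring code used for the reported results.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "call sam now"
hypothesis = "call some now"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this example a single misrecognized name costs one word out of three but only two characters out of twelve, which is why CER is a useful complement to WER on utterances that contain OOV words.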

The graphic illustrates ASR using a standard lexicon-based approach with word-level language modeling. Here, the ASR system is unable to recognize the name “Sam.”

The graphic illustrates ASR using our new lexicon-free approach with a character-level language model. Here, the ASR system recognizes the name “Sam.”

In fact, because the character-level language model also reduces overfitting during beam-search decoding, our new self-supervised wav2vec algorithm leverages this character-level beam-search decoder to help improve its efficiency as well. Once we proved this lexicon-free approach worked well, we also used it with our new, faster seq2seq model to perform word-piece modeling (an intermediary representation of text between words and characters).

This lexicon-free approach also opens up possibilities for solutions not only to recognize names and other OOV words, but also to improve speech recognition for languages that lack spaces between words, such as Japanese and Thai. To encourage further research, we’ve prepared a standalone library for our beam-search decoder with a Python wrapper, so people can use PyTorch acoustic models and fairseq language models and plug them into our wav2letter beam-search decoder. And we’ve open-sourced our trained models on LibriSpeech with hopes that the AI community can experiment and make further progress on handling OOV using character-level modeling.
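To give a feel for what lexicon-free decoding involves, here is a deliberately simplified beam search that combines per-step character scores with a character-level language model. Everything in it is a toy assumption: the function names, the hand-written acoustic scores, the bigram table, and the `lm_weight` parameter. It does not use or reproduce the wav2letter decoder API.

```python
import math


def beam_search(step_log_probs, lm_log_prob, beam_size=3, lm_weight=0.5):
    """Return the best character string, with no word lexicon involved.

    step_log_probs: list of {char: acoustic log-probability}, one per output step
    lm_log_prob:    function (prefix, char) -> language model log-probability
    """
    beams = [("", 0.0)]
    for step in step_log_probs:
        candidates = [
            (prefix + char, score + acoustic + lm_weight * lm_log_prob(prefix, char))
            for prefix, score in beams
            for char, acoustic in step.items()
        ]
        # Keep only the highest-scoring partial transcriptions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0]


# Toy acoustic scores: the second step slightly prefers "o" over "a".
steps = [
    {"s": math.log(0.9), "z": math.log(0.1)},
    {"a": math.log(0.45), "o": math.log(0.55)},
    {"m": math.log(0.9), "n": math.log(0.1)},
]

# Toy character bigram model: "a" is more likely than "o" after "s".
BIGRAMS = {("s", "a"): 0.6, ("s", "o"): 0.2}


def toy_char_lm(prefix, char):
    # Unlisted character pairs get a small default probability.
    return math.log(BIGRAMS.get((prefix[-1:], char), 0.1))


with_lm = beam_search(steps, toy_char_lm)
acoustic_only = beam_search(steps, toy_char_lm, lm_weight=0.0)
print(with_lm, acoustic_only)
```

Because hypotheses are extended one character at a time, nothing restricts the output to a fixed word list; the language model only nudges the search toward plausible spellings.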

Faster, more lightweight seq2seq model for speech recognition

Self-supervised algorithms like wav2vec dramatically decrease the need for labeled training data, but they still require extremely large amounts of unlabeled data. Given this need, building a lightweight, highly efficient architecture for ASR is an important step in improving runtime performance and accuracy.

We’ve built a new sequence-to-sequence (seq2seq) encoder-decoder model for speech recognition that requires 75 percent fewer parameters and is an order of magnitude more efficient than previous models while still delivering a better WER.

The time-depth separable convolution block was important to achieving this level of efficiency because it dramatically reduced the number of parameters in the model. This novel connectivity structure works well for speech recognition because it’s both efficient and able to have a large receptive field. Furthermore, the decoder part of the model is lightweight and highly parallelizable during training. Compared with the self-attention found in the Transformer model, our architecture scales linearly with the input sequence length rather than quadratically, so it can be much more efficient with the long inputs commonly found in speech recognition.

In this flow chart, we present the time-depth separable convolution model architecture. The sub-blocks of its convolution layer are a 2D convolution over time followed by a fully connected block.
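A back-of-the-envelope count shows where the savings and the linear scaling come from. The sketch assumes the input is viewed as time x width x channels, so that the 2D convolution over time mixes only the channels; that layout, the function names, and the sizes used are assumptions for illustration. Biases and the fully connected block are left out of the count.

```python
def tds_conv_params(kernel_width, channels):
    """Weights in a 2D convolution with a (kernel_width x 1) kernel, channels -> channels."""
    return kernel_width * channels * channels


def dense_conv_params(kernel_width, channels, width):
    """Weights in a 1D convolution over time that mixes all width * channels features."""
    features = width * channels
    return kernel_width * features * features


def attention_pair_count(num_frames):
    """Self-attention compares every frame with every other frame."""
    return num_frames * num_frames


def conv_window_count(num_frames, kernel_width):
    """A convolution looks at a fixed-size window around each frame."""
    return num_frames * kernel_width


# Hypothetical sizes, chosen only for illustration.
k, c, w = 21, 10, 80
print("conv weights:", tds_conv_params(k, c), "vs", dense_conv_params(k, c, w))

# Doubling the input length doubles the convolution work but quadruples attention.
for frames in (1_000, 2_000, 4_000):
    print(frames, conv_window_count(frames, k), attention_pair_count(frames))
```

Under these assumptions the convolution over time needs a factor of `w * w` fewer weights than a dense convolution over the same features, and its cost grows in step with the number of audio frames.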

Similar to our lexicon-free beam-search decoding research, we leveraged the wav2letter++ framework for training and evaluating end-to-end speech models, and we coupled our new architecture with a convolutional language model. This fast seq2seq model enables ASR deployment on smaller devices and scales well to the large amount of data needed by self-supervised and semi-supervised learning algorithms.

This new architecture is simple for researchers to implement in their own work in order to achieve a performance boost. First, our robust beam-search decoder integrates well with an externally trained language model and provides a strong improvement in WER. Second, our time-depth separable convolution architecture is accurate, efficient, and simple to implement using only off-the-shelf, highly optimized software for 2D convolutions available from cuDNN.

The self-supervised future of speech recognition

This emphasis on self-supervised techniques, which require far less labeled training data and are less reliant on language-specific fine-tuning, will help ensure that state-of-the-art ASR benefits everyone, including speakers of low-resource languages, moving beyond English toward a more global perspective.

We’ve seen promising results using self-supervision in our recent advances in natural language processing, particularly with machine translation. With approximately 6,500 languages spoken around the world — and with over 50 percent of the Facebook community speaking a language other than English — exploring self-supervised methods that can speed ASR development is an important research pursuit for Facebook as well as for the broader AI research community.

We’re currently working on making these new advances fast enough for full deployment, which would further improve our systems to proactively find and flag harmful content and deliver other benefits to the people using our platforms. But we’re still a long way from unlocking the true potential of self-supervised learning. We hope that releasing our code and providing off-the-shelf solutions will help accelerate progress as part of our ongoing commitment to open science.

Each of these research projects will be presented at the Interspeech 2019 conference. See the full list of research papers from Facebook AI here.