Facebook research at Interspeech 2020

October 23, 2020

Facebook will present 23 papers at Interspeech 2020, one of the main conferences for the research community to share progress in the scientific and technological aspects of speech. This research represents significant milestones in our ongoing effort to advance AI in the area of spoken language processing, including speech recognition, speech synthesis, speech translation, voice conversion, audio processing, and more. While our contributions to Interspeech this year span a diverse set of topics, there are some notable themes in the work we are pursuing.

Progress in transformers for ASR

Transformer models have become the industry standard in the natural language processing (NLP) field, delivering state-of-the-art results across a variety of text-based tasks. More recently, we have shown that transformer-based models are able to achieve state-of-the-art performance in speech recognition as well. Some hurdles remain, however, before transformers can become a standard modeling approach for speech. For example, transformers typically process an entire sentence at once so that the model can learn relationships among any and all words in the sentence. In the speech world, this is considered offline or batch processing, and it is unsuitable for many speech applications. With live captioning or digital assistants, for instance, the speech recognition system must process incoming audio in a streaming fashion, continually outputting the hypothesized words as they are spoken.
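To make the contrast between offline and streaming processing concrete, here is a minimal sketch of which audio frames each frame is allowed to "see" in the two settings. The function names, the fixed segment size, and the optional left context are assumptions made for illustration; they are not taken from any of the papers.

```python
def offline_context(num_frames):
    """Offline (batch) processing: every frame may attend to every frame."""
    return [list(range(num_frames)) for _ in range(num_frames)]


def streaming_context(num_frames, segment_size, left_context=0):
    """Segment-based streaming (illustrative assumption).

    A frame sees its own segment plus `left_context` frames before the
    segment, and never anything from later segments.
    """
    visible = []
    for t in range(num_frames):
        seg_start = (t // segment_size) * segment_size
        seg_end = min(seg_start + segment_size, num_frames)
        start = max(0, seg_start - left_context)
        visible.append(list(range(start, seg_end)))
    return visible


if __name__ == "__main__":
    print(offline_context(6)[0])       # [0, 1, 2, 3, 4, 5]
    print(streaming_context(6, 2)[0])  # [0, 1]
```

In the streaming case the system can emit words for a segment as soon as that segment has arrived, so the added delay is bounded by the segment length rather than by the length of the whole utterance.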

At Interspeech, we are proposing an augmented memory transformer model that processes only short segments of audio by leveraging a bank of memories that compactly represents the information from prior segments. This approach leads to significant gains over prior attempts to create a streamable transformer model as well as a 15 percent lower word error rate (WER) compared with the standard recurrent networks typically used for streaming applications. While we draw inspiration from transformer models developed for NLP, there are significant differences between text and audio. Compared with text, the neighboring frames in a segment of audio are much more correlated with each other and far less correlated with distant frames. Based on this observation, we created a mechanism for weak attention suppression for transformer-based speech recognition. This modification to the self-attention mechanism used in transformer models provides a 5-10 percent lower WER on a standard benchmark dataset. Finally, we show how both transformer networks and convolutional networks can be used to create highly efficient speech recognition systems, something that can help when performing speech recognition at Facebook scale.
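As a rough illustration of the idea behind weak attention suppression, the sketch below zeroes out small attention probabilities and renormalizes the rest. The specific threshold rule (row mean minus `gamma` times the row standard deviation) and the default value of `gamma` are assumptions for this example, not a statement of the paper's exact formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_weak_attention(scores, gamma=0.5):
    """Zero out probabilities below mean - gamma * std, then renormalize.

    The threshold rule and `gamma` are assumptions made for this sketch.
    """
    probs = softmax(scores)
    mean = sum(probs) / len(probs)
    std = math.sqrt(sum((p - mean) ** 2 for p in probs) / len(probs))
    threshold = mean - gamma * std
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # > 0 because the largest probability is >= the mean
    return [p / total for p in kept]


if __name__ == "__main__":
    # Two nearby frames score highly; three distant frames score poorly.
    print(suppress_weak_attention([4.0, 3.5, 0.0, -1.0, -2.0]))
```

With these scores, the three weakly attended frames receive exactly zero weight and the remaining two share all of the attention mass.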

Fewer labels via self- and semi-supervised learning

One of the biggest challenges in AI is the need for labeled training data. For speech recognition, this typically means manually annotating recorded audio with verbatim transcriptions. For translation, this means having transcriptions in one language manually translated into another. Obtaining such annotations can be time-consuming, and in some cases, the expertise required to create such annotations is difficult to find. We have been working to reduce our reliance on labeled data across a variety of AI domains by leveraging self-supervised and semi-supervised learning. This enables us to build AI systems with a greatly reduced need for labeled data. In self-supervision, representations suitable for model learning are created without any labels at all, typically by having a model predict one part of the input from another. In speech, this can mean predicting the segments of audio that occur in the future given the audio observed so far. In semi-supervised learning, self-training is one of the most successful approaches. A small amount of labeled data is used to train an initial “teacher” ASR model. This model is then used to generate hypothesized transcriptions for a much larger set of audio that doesn’t have annotations, and those transcriptions are then used to train a “student” ASR model with a similar or sometimes smaller number of parameters.
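The teacher/student recipe can be sketched in a few lines. This is a toy stand-in under stated assumptions: a one-dimensional nearest-centroid classifier plays the role of the ASR model, and class labels play the role of transcriptions. All function names are hypothetical.

```python
def train_centroids(examples):
    """Fit a nearest-centroid classifier: one mean feature value per label."""
    sums, counts = {}, {}
    for feature, label in examples:
        sums[label] = sums.get(label, 0.0) + feature
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, feature):
    """Return the label whose centroid is closest to the feature."""
    return min(centroids, key=lambda label: abs(centroids[label] - feature))


def self_train(labeled, unlabeled):
    """One round of self-training with pseudo-labels."""
    teacher = train_centroids(labeled)                      # small labeled set
    pseudo = [(x, predict(teacher, x)) for x in unlabeled]  # teacher labels the rest
    student = train_centroids(labeled + pseudo)             # student sees both
    return teacher, student


if __name__ == "__main__":
    labeled = [(0.0, "a"), (10.0, "b")]
    unlabeled = [1.0, 2.0, 3.0, 8.0, 9.0]
    teacher, student = self_train(labeled, unlabeled)
    print(teacher)  # {'a': 0.0, 'b': 10.0}
    print(student)  # {'a': 1.5, 'b': 9.0}
```

The student's centroids move toward where the unlabeled data actually lies, which is the benefit self-training aims for. Pseudo-labels can also be wrong, so a practical system would typically filter or weight them.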

In a series of papers, we show how both semi-supervised learning and self-supervised learning can lead to improvements across several speech tasks. For example, we compare multiple approaches to self-training with a previously proposed version of weakly supervised learning on the task of transcribing social media videos. We show that when only a small amount of transcribed data is available, we can obtain up to 20 percent fewer errors by leveraging 20,000 to 50,000 hours of unlabeled audio for two low-resource languages. We show that this approach to self-training can be further improved through an iterative process of model training and labeling. Finally, we apply these approaches to a speech translation task, where the AI system has the more complicated job of ingesting audio in one language and outputting text translated into another language. We show how both self-supervised representations and semi-supervised self-training can improve speech translation in these scenarios.

Transfer learning to benefit low-resource scenarios

In some cases, such as with languages that are less commonly spoken in the world, the amount of data available for training can be quite limited. One well-known approach to serving such languages with high-quality AI systems for speech recognition is transfer learning. In the context of speech recognition, transfer learning can be used to transfer linguistic information from one system to another. In particular, an AI model trained on a data-rich language (such as English) can be used to seed a model to be trained in a language with limited data. This enables our systems to exploit the commonalities among languages and transfer knowledge from a system trained in one language to benefit a system trained for another.
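Seeding one model from another can be illustrated with a deliberately tiny example. The sketch below assumes a one-dimensional linear model trained by gradient descent. The "source" task has plenty of data, the "target" task has three examples and a small training budget, and the two tasks are similar but not identical. The data, function names, and hyperparameters are all made up for illustration.

```python
def mse(params, data):
    """Mean squared error of y = w * x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def fit(data, init=(0.0, 0.0), steps=20, lr=0.05):
    """Full-batch gradient descent on y = w * x + b, starting from `init`."""
    w, b = init
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


if __name__ == "__main__":
    # Data-rich source task: y = 2x + 1.
    source = [(i / 10, 2 * (i / 10) + 1) for i in range(-20, 21)]
    # Low-resource target task: y = 2.2x + 0.8, only three examples.
    target = [(-1.0, -1.4), (0.0, 0.8), (1.0, 3.0)]

    pretrained = fit(source, steps=200)
    seeded = fit(target, init=pretrained, steps=5)  # transfer learning
    scratch = fit(target, steps=5)                  # no transfer

    print("seeded loss: ", mse(seeded, target))
    print("scratch loss:", mse(scratch, target))
```

Because the source and target tasks share most of their structure, the seeded model starts close to a good solution and reaches a much lower loss within the same small training budget.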

We share two approaches that prove effective at improving speech recognition in low-resource settings with limited training data. In one paper, we show that we can improve ASR performance by applying transfer learning using an auxiliary speech translation task. In another approach, we build a large-scale multilingual speech recognition system capable of recognizing 50 different languages. By sharing much of the model across a large number of languages, we can effectively share knowledge across these systems and reduce the word error rate by over 20 percent for low-resource languages. In addition, we are releasing a large-scale multilingual speech corpus based on the well-known LibriVox collection of audiobooks. We hope this dataset will encourage further work in this area.
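For readers less familiar with the metric, word error rate is the word-level edit distance between the reference transcript and the recognizer's output, divided by the number of reference words. The sketch below computes it and also shows how a percentage improvement is obtained, assuming the figure is a relative reduction (the post does not say whether it is relative or absolute). The function names are illustrative, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the previous ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Going from 10% WER to 8% WER is a 20 percent relative reduction.
    print(relative_reduction(0.10, 0.08))
```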

Speech synthesis and reconstruction

There has been growing interest in the research community in applications requiring the generation, enhancement, and manipulation of speech signals. To address the issue of speech captured in noisy environments, we present an autoencoder-based method that combines time-domain and frequency-domain reconstruction terms to remove various kinds of noise, including room reverberation. The method yields state-of-the-art speech enhancement results while running in real time on a laptop CPU.
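One way to picture a loss that combines time-domain and frequency-domain reconstruction terms is sketched below. The choice of L1 distance, the single whole-signal DFT, and the `freq_weight` parameter are assumptions for illustration; the post does not specify how the actual terms are defined.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each bin of a naive discrete Fourier transform."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def reconstruction_loss(estimate, clean, freq_weight=1.0):
    """Time-domain L1 plus weighted L1 between magnitude spectra (a sketch)."""
    time_term = l1(estimate, clean)
    freq_term = l1(dft_magnitudes(estimate), dft_magnitudes(clean))
    return time_term + freq_weight * freq_term


if __name__ == "__main__":
    clean = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
    noisy = [x + 0.1 for x in clean]
    print(reconstruction_loss(clean, clean))  # 0.0
    print(reconstruction_loss(noisy, clean))  # greater than zero
```

The two terms complement each other. A sign-flipped copy of a signal, for example, has the same magnitude spectrum as the original, so only the time-domain term penalizes it.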

To support multiple voices in speech synthesis and to enable various entertainment applications, we present new research in voice conversion, that is, the task of converting audio from one person’s voice to another’s, for both speech and singing. Both methods leverage a pretrained speech recognition model to encode the speech signal. The singing method is also conditioned on a pitch extraction network. While these models are fully convolutional and very efficient, they are not causal at the moment. We also present an approach for performing speech synthesis that uses style transfer from an incoming voice request. This would enable a digital assistant to respond to a user based on both the content of the request and how it was spoken. Was the user happy, frustrated, sad, or in a hurry? Experiments showed that users preferred styled synthesis responses and demonstrated the system’s ability to mimic the style of the incoming speech query.

Here’s the full list of papers we’ll be presenting: