Online speech recognition with wav2letter@anywhere


The process of transcribing speech in real time from an input audio stream is known as online speech recognition. Most automatic speech recognition (ASR) research focuses on improving accuracy without the constraint of performing the task in real time. For applications like live video captioning or on-device transcription, however, it is important to reduce the latency between the audio and the corresponding transcription. In these cases, online speech recognition with limited time delay is needed to provide a good user experience. To address this need, we have developed and open-sourced wav2letter@anywhere, an inference framework that can be used to perform online speech recognition. Wav2letter@anywhere builds upon Facebook AI’s previous releases of wav2letter and wav2letter++.

Most existing online speech recognition solutions support only recurrent neural networks (RNNs). For wav2letter@anywhere, we use a fully convolutional acoustic model instead, which results in a 3x throughput improvement on certain inference models and state-of-the-art performance on LibriSpeech. To run at production scale (on server CPUs or on-device in a low-power environment), a system must be computationally efficient. Taking an ASR system from a research environment to a low-latency, computationally efficient system that is also highly accurate involves nontrivial changes to both the implementation and the algorithms. This post explains how we created wav2letter@anywhere.

This diagram shows how our online system processes speech. Each chunk of speech is first fed into an acoustic model, which computes word-piece scores. These scores are then combined with a language model via a lightweight beam search decoder, which outputs the most likely sequence of words based on the input sequence and the selected language model.
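To make the data flow concrete, here is a minimal Python sketch of the decoding step described above. It is not wav2letter@anywhere code: the function names, the toy bigram language model, and the acoustic scores are all made up for illustration, and each chunk is assumed to yield a single token, whereas the real system scores word pieces with a CTC-trained model.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# A hypothesis is (tokens decoded so far, cumulative log score).
Hypothesis = Tuple[Tuple[str, ...], float]
LmScorer = Callable[[Tuple[str, ...], str], float]


def beam_search_step(
    beam: List[Hypothesis],
    token_scores: Dict[str, float],
    lm_score: LmScorer,
    lm_weight: float,
    beam_size: int,
) -> List[Hypothesis]:
    """Extend every hypothesis with every candidate token, keep the best few."""
    candidates: List[Hypothesis] = []
    for tokens, score in beam:
        for token, am_logp in token_scores.items():
            total = score + am_logp + lm_weight * lm_score(tokens, token)
            candidates.append((tokens + (token,), total))
    candidates.sort(key=lambda hyp: hyp[1], reverse=True)
    return candidates[:beam_size]


def toy_bigram_lm(history: Tuple[str, ...], token: str) -> float:
    """Hypothetical bigram language model with a flat fallback probability."""
    bigrams = {("the", "cat"): 0.6, ("the", "cap"): 0.1}
    prev = history[-1] if history else "<s>"
    return math.log(bigrams.get((prev, token), 0.3))


def decode_stream(
    chunks: Sequence[Dict[str, float]],
    lm_score: LmScorer,
    lm_weight: float = 1.0,
    beam_size: int = 3,
) -> Tuple[str, ...]:
    """Consume acoustic scores chunk by chunk and return the best hypothesis."""
    beam: List[Hypothesis] = [((), 0.0)]
    for token_scores in chunks:  # one acoustic-model output per audio chunk
        beam = beam_search_step(beam, token_scores, lm_score, lm_weight, beam_size)
    return beam[0][0]


# Made-up acoustic log probabilities for two consecutive chunks of audio.
chunks = [
    {"the": math.log(0.9), "a": math.log(0.1)},
    {"cap": math.log(0.55), "cat": math.log(0.45)},
]
acoustic_only = decode_stream(chunks, toy_bigram_lm, lm_weight=0.0)
with_lm = decode_stream(chunks, toy_bigram_lm, lm_weight=1.0)
```

With the language model switched off (`lm_weight=0.0`), the acoustic scores alone favor "the cap". With it switched on, the bigram score tips the result to "the cat", which is the kind of correction the decoder is there to make.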

Wav2letter@anywhere inference platform

Part of the wav2letter++ repository, wav2letter@anywhere can be used to perform online speech recognition. The framework was built with the following objectives:

  • The streaming API inference should be efficient yet modular enough to handle various types of speech recognition models.

  • The framework should support concurrent audio streams, which are necessary for high throughput when performing tasks at production scale.

  • The API should be flexible enough that it can be easily used on different platforms (personal computers, iOS, Android).

Our modular streaming API allows the framework to support various models, including RNNs and convolutional neural networks (which are faster). Written in C++, wav2letter@anywhere is stand-alone and as efficient as possible, and it can be embedded anywhere. We use efficient back ends, such as FBGEMM, and specific routines for iOS and Android. From the beginning, it was developed with streaming in mind (unlike some alternatives that rely on a generic inference pipeline), allowing us to implement an efficient memory allocation design.
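One way to picture a streaming module that serves several audio streams at once is to keep the weights shared and read-only while each stream carries its own small state buffer. The Python sketch below illustrates that idea with a single 1-D convolution. The class and method names (`StreamingConv1D`, `new_state`, `run`) are hypothetical and do not mirror the actual C++ API.

```python
from typing import List, Sequence, Tuple


class StreamingConv1D:
    """1-D convolution (cross-correlation, as in deep learning) over chunks.

    The kernel is shared by all streams. Per-stream state is just the last
    len(kernel) - 1 samples, which the caller passes back in on the next call.
    """

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)

    def new_state(self) -> List[float]:
        return []

    def run(
        self, state: Sequence[float], chunk: Sequence[float]
    ) -> Tuple[List[float], List[float]]:
        """Return (new_state, outputs) for one chunk of one stream."""
        data = list(state) + list(chunk)
        n_out = len(data) - len(self.kernel) + 1
        if n_out <= 0:
            return data, []  # not enough samples yet: buffer everything
        outputs = [
            sum(w * data[t + i] for i, w in enumerate(self.kernel))
            for t in range(n_out)
        ]
        return data[n_out:], outputs  # keep only the unfinished tail


layer = StreamingConv1D([1, 2, 1])
stream_a = [1, 0, 2, 3, 1, 4, 0, 2]
stream_b = [5, 5, 0, 1, 0, 0, 3, 3]

# Offline reference: process each full signal in one call.
_, offline_a = layer.run(layer.new_state(), stream_a)
_, offline_b = layer.run(layer.new_state(), stream_b)

# Online: interleave chunks of up to 3 samples from two independent streams.
state_a, state_b = layer.new_state(), layer.new_state()
online_a: List[float] = []
online_b: List[float] = []
for start in range(0, len(stream_a), 3):
    state_a, out = layer.run(state_a, stream_a[start:start + 3])
    online_a += out
    state_b, out = layer.run(state_b, stream_b[start:start + 3])
    online_b += out
```

Because the state is explicit and bounded by the kernel size, the chunked outputs match the offline ones exactly, and adding another concurrent stream only costs one more small buffer.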

Much of the recent work in latency-controlled ASR uses latency-controlled bidirectional LSTM (LC-BLSTM) RNNs, RNN Transducers (RNN-T), or variants of these methods. Departing from these previous works, we propose a fully convolutional acoustic model trained with a connectionist temporal classification (CTC) criterion. Our paper shows that such a system is significantly more efficient to deploy while also achieving a better word error rate (WER) and lower latency.
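For readers new to CTC, the output of a CTC-trained model is one score vector per frame over the tokens plus a special blank symbol. The simplest way to read it is greedy decoding: take the best token per frame, merge consecutive repeats, then drop blanks. The Python sketch below shows that rule only. It is a simplification of the beam search decoder described earlier, and the labels and scores are invented for the example.

```python
from typing import List, Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    labels: Sequence[str],
    blank: int = 0,
) -> List[str]:
    """Best token per frame, with repeats merged and blanks removed."""
    decoded: List[str] = []
    previous = None
    for scores in frame_scores:
        best = max(range(len(scores)), key=lambda i: scores[i])
        # Emit only when the token changes and is not the blank symbol.
        if best != previous and best != blank:
            decoded.append(labels[best])
        previous = best
    return decoded


# Hypothetical word-piece inventory; index 0 is the CTC blank.
labels = ["<blank>", "_he", "llo"]
frame_scores = [
    [0.10, 0.80, 0.10],  # _he
    [0.20, 0.70, 0.10],  # _he again: merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # llo
    [0.10, 0.10, 0.80],  # llo again: merged
    [0.80, 0.10, 0.10],  # blank separates the two "llo" emissions
    [0.20, 0.10, 0.70],  # llo, emitted again because a blank intervened
]
pieces = ctc_greedy_decode(frame_scores, labels)
```

The blank is what lets the model emit the same token twice in a row, as the last frame shows.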

Low-latency acoustic modeling

An important building block of wav2letter@anywhere is the time-depth separable (TDS) convolution, which yields dramatic reductions in model size and floating-point operations (FLOPs) while maintaining accuracy. We use asymmetric padding for all the convolutions, adding more padding toward the beginning of the input. This reduces the future context seen by the acoustic model, thus reducing the latency.
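The effect of asymmetric padding is easy to see on a single impulse. In the Python sketch below, a plain 1-D convolution (not a full TDS block) is applied with symmetric and with left-heavy padding. The kernel size, padding split, and 10 ms frame shift are illustrative assumptions, not the values used in our models.

```python
from typing import List, Sequence


def conv1d_padded(
    signal: Sequence[float], kernel: Sequence[float], pad_left: int, pad_right: int
) -> List[float]:
    """Stride-1 convolution with explicit zero padding on each side."""
    padded = [0] * pad_left + list(signal) + [0] * pad_right
    n_out = len(padded) - len(kernel) + 1
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel)) for t in range(n_out)
    ]


def first_affected_output(output: Sequence[float]) -> int:
    """Index of the earliest output frame that 'sees' the impulse."""
    return next(t for t, value in enumerate(output) if value != 0)


def lookahead_ms(pad_rights: Sequence[int], frame_shift_ms: float) -> float:
    """Future context of stacked stride-1 layers: right paddings add up."""
    return sum(pad_rights) * frame_shift_ms


impulse = [0] * 10
impulse[6] = 1  # a single event at frame 6
kernel = [1] * 5

# Output frame t reads input frames t - pad_left .. t + pad_right.
symmetric = conv1d_padded(impulse, kernel, pad_left=2, pad_right=2)
asymmetric = conv1d_padded(impulse, kernel, pad_left=3, pad_right=1)

symmetric_start = first_affected_output(symmetric)    # frame 4: 2 frames of lookahead
asymmetric_start = first_affected_output(asymmetric)  # frame 5: 1 frame of lookahead
```

With symmetric padding, output frame 4 already depends on input frame 6, so it cannot be computed until two future frames have arrived. Shifting one frame of padding to the left cuts that wait to a single frame. Under the stride-1 assumption, the per-layer lookaheads simply add up, so `lookahead_ms([1, 1, 1], 10)` gives 30 ms for three such layers.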

TDS convolution block.

In comparing our system with two strong baselines (an LC-BLSTM + lattice-free MMI hybrid system and an LC-BLSTM + RNN-T end-to-end system) on the same benchmark, we were able to achieve better WER performance, throughput, and latency. Most notably, our models are 3x faster even though their inference is run in FP16, while the inference for the baselines is run in INT8.

Experimental results comparing our TDS + CTC system with other systems.
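As a reminder of what the WER numbers in such comparisons measure, the Python sketch below computes word error rate as the word-level edit distance (substitutions, insertions, and deletions) divided by the reference length, plus a helper for relative improvement between two systems. The sentences are made up, and this is a textbook definition, not the scoring script used for our experiments.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,             # deletion of a reference word
                    cur[j - 1] + 1,          # insertion of a hypothesis word
                    prev[j - 1] + (r != h),  # match or substitution
                )
            )
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors in 6 words.
example_wer = wer("the cat sat on the mat", "the cat sit on mat")
```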

In a recent work, we leveraged wav2letter++ with modern acoustic and language model architectures in both supervised and semi-supervised settings. We revisited a standard semi-supervised technique, generating pseudo-labels on 60,000 hours of unlabeled audio, using an acoustic model trained on 1,000 hours of labeled data. We then trained a new acoustic model using the combined 61,000 hours of labeled and pseudo-labeled data, which established a new state of the art on LibriSpeech. We saw a relative improvement of more than 16 percent in comparison with state-of-the-art models trained in a supervised setting. We are releasing models related to this paper as well as latency-constrained models for fast real-time inference suitable for wav2letter@anywhere.
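The pseudo-labeling recipe itself is simple enough to show on a toy problem. In the Python sketch below, a nearest-centroid classifier on one-dimensional features stands in for the acoustic model. That stand-in, the function names, and the data are assumptions made purely for illustration and bear no relation to the actual training code.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Example = Tuple[float, str]  # (feature, label)


def train_centroids(examples: Sequence[Example]) -> Dict[str, float]:
    """'Train' a model: one centroid per label."""
    by_label: Dict[str, List[float]] = {}
    for feature, label in examples:
        by_label.setdefault(label, []).append(feature)
    return {label: mean(features) for label, features in by_label.items()}


def predict(centroids: Dict[str, float], feature: float) -> str:
    """Label of the nearest centroid."""
    return min(centroids, key=lambda label: abs(centroids[label] - feature))


def self_train(labeled: Sequence[Example], unlabeled: Sequence[float]):
    # 1. Train a teacher on the small labeled set.
    teacher = train_centroids(labeled)
    # 2. Use it to generate pseudo-labels for the unlabeled data.
    pseudo = [(feature, predict(teacher, feature)) for feature in unlabeled]
    # 3. Train a new model on labeled plus pseudo-labeled data.
    student = train_centroids(list(labeled) + pseudo)
    return teacher, pseudo, student


labeled = [(0.0, "a"), (10.0, "b")]
unlabeled = [1.0, 2.0, 8.0, 9.0]
teacher, pseudo, student = self_train(labeled, unlabeled)
```

Here the student's centroids move from 0 and 10 to 1 and 9, closer to where the unlabeled data actually lies. The same three steps apply at scale, with a far more capable model and a language model helping to produce the pseudo-labels.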

We have made extensive improvements since open-sourcing wav2letter++ a year ago, including improving decoder performance (a 10x speedup on seq2seq decoding); adding Python bindings for features, the decoder, criterions, and more; and improving the documentation. We believe wav2letter@anywhere represents another leap forward by enabling online speech recognition and significantly reducing the latency between audio and transcription. We are excited to share the open source framework with the community. For more information about wav2letter@anywhere, read the full paper and visit the wiki.

We’d like to thank Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, and Gabriel Synnaeve for their work on wav2letter@anywhere.

Written By

Vineel Pratap

Research Engineer