May 1, 2020
Facebook AI researchers are presenting their work virtually at the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP) from May 4 to May 8, 2020. This year, we’ve made significant progress in major areas of speech recognition, spanning state-of-the-art research in automatic speech recognition (ASR) including Transformer-based models, using spatial attention for improving far-field ASR, new techniques and resources for semi- and self-supervised training, data augmentation for speech translation, weakly labeled acoustic event detection, and environment-aware noise suppression.
In addition to the list of abstracts below, we’re sharing details on three of these papers. We’ve set a new record for the standard LibriSpeech benchmark using a Transformer-based hybrid ASR system; introduced a new semi-supervised pretraining approach that leverages video comments and descriptions as contextual training data; and released Libri-Light, a large open source dataset that provides three new benchmark tasks, with an emphasis on self-supervision.
These are just a few of the research papers that Facebook AI researchers are presenting at ICASSP this year. We will present our accepted ICASSP 2020 papers via video format. You can find the full list of abstracts below, and we’ll add a link to each video presentation as it becomes available. A full day-by-day schedule of the research being presented at ICASSP is available here.
Our novel Transformer-based acoustic model achieved the lowest word error rate (WER) on LibriSpeech, one of the most popular public datasets in speech recognition, bringing hybrid performance on par with alternative end-to-end approaches. This is an important milestone because hybrid systems are a relatively mature technology and still serve billions of people every day.
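For readers less familiar with the metric, WER is conventionally the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the scoring tool behind the LibriSpeech numbers, and the function names are invented for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer
```

Calling `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deleted word against six reference words. `relative_reduction` shows how a "relative" improvement compares two such rates.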
This is the first time the popular Transformer architecture has been successfully applied to acoustic modeling for hybrid ASR. Some of the key techniques in optimizing Transformers for ASR include using an iterated loss for training deep neural networks as well as replacing the traditional sinusoid embedding with a convolutional method to encode the necessary position information. Both of these methods directly improved WER for this model.
To reduce the need for labeled training data, we developed a novel weakly supervised training approach that leverages text metadata surrounding public videos. Although this text isn’t a word-for-word transcription of the speech segments, it is effective as a loosely related, distant label for the audio.
We used public videos and text from the accompanying post or comments as clues to train an encoder-decoder Transformer model to generate sentences that would be in the video. In our large-scale study using 50,000 hours of public videos, our best encoder-decoder models achieve an average of 20.8 percent WER reduction over a 1,000-hour supervised baseline and an average of 13.4 percent WER reduction when using only the weakly supervised encoder component.
One major hurdle in advancing self-supervised learning in speech research has been the lack of sufficient benchmarks and datasets. To help accelerate research, we’ve introduced Libri-Light, the largest open-source dataset for speech recognition. It includes 60,000 hours of unlabeled data, three new benchmark tasks, and baseline systems and evaluations.
Because Libri-Light uses the standard LibriSpeech as its test dataset, it’s the first benchmark that enables direct comparisons of methods using different learning techniques so researchers can better understand the progress of self-supervised learning compared with the performance of supervised learning. You can learn more about how researchers have used this additional unlabeled training data to improve performance in speech recognition through self-supervised representation learning.
Eliya Nachmani, Lior Wolf
Hypernetworks were recently shown to improve the performance of message passing algorithms for decoding error correcting codes. In this work, we demonstrate how hypernetworks can be applied to decode polar codes by employing a new formalization of the polar belief propagation decoding scheme. We demonstrate that our method improves the previous results of neural polar decoders and achieves, for large SNRs, the same bit-error-rate performances as the successive list cancellation method, which is known to be better than any belief propagation decoders and very close to the maximum likelihood decoder. Watch the virtual presentation here.
Yi-Chen Chen, Zhaojun Yang, Ching-Feng Yeh, Mahaveer Jain, Michael Seltzer
As one of the major sources of speech variability, accents have posed a grand challenge to the robustness of speech recognition systems. In this paper, our goal is to build a unified end-to-end speech recognition system that generalizes well across accents. For this purpose, we propose AIPNet (Accent Invariant Pre-training Networks), a novel pre-training framework based on generative adversarial nets (GAN) for accent-invariant representation learning. We pre-train AIPNet to disentangle accent-invariant and accent-specific characteristics from acoustic features through adversarial training on accented data for which transcriptions are not necessarily available. We further fine-tune AIPNet by connecting the accent-invariant module with an attention-based encoder-decoder model for multi-accent speech recognition. In the experiments, our approach is compared against four baselines including both accent-dependent and accent-independent models. Experimental results on 9 English accents show that the proposed approach outperforms all the baselines by 2.3 to 4.5 percent relative reduction on average WER when transcriptions are available in all accents and by 1.6 to 6.1 percent relative reduction when transcriptions are only available in the US accent.
Ke Li, Zhe Liu, Tianxing He, Hongzhao Huang, Fuchun Peng, Daniel Povey, Sanjeev Khudanpur
We explore two adaptation approaches of deep Transformer-based neural language models (LMs) for automatic speech recognition. The first approach is a pretrain–fine-tune framework, where we first pretrain a Transformer LM on a large-scale text corpus from scratch and then adapt it to relatively small target domains via fine-tuning. The second approach is a mixer of dynamically weighted models that are separately trained on source and target domains, aiming to improve simple linear interpolation with dynamic weighting. We compare the two approaches with three baselines (without adaptation, merging data, and simple interpolation) on Switchboard (SWBD) and the Wall Street Journal (WSJ). Experiments show that the mixer model generally performs better than the baselines and fine-tuning. Compared with no adaptation, fine-tuning and the mixer approach obtain relative WER reductions of up to 11.5 percent and 14.1 percent on SWBD, respectively. The mixer model also outperforms linear interpolation and merging data. On WSJ, the mixer approach achieves a new state-of-the-art WER result.
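To make the contrast between simple interpolation and dynamic weighting concrete, here is a minimal sketch. Simple linear interpolation mixes the source-domain and target-domain next-word probabilities with one fixed weight, while a dynamically weighted mixture lets that weight change at every prediction step. The abstract does not say how the mixer computes its weights, so in this illustration they are passed in as plain numbers; all names here are hypothetical.

```python
from typing import Dict, List

Distribution = Dict[str, float]  # next-word probabilities from one LM


def interpolate(p_source: Distribution, p_target: Distribution, weight: float) -> Distribution:
    """Return weight * p_source + (1 - weight) * p_target over the joint vocabulary."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    vocab = set(p_source) | set(p_target)
    return {
        word: weight * p_source.get(word, 0.0) + (1.0 - weight) * p_target.get(word, 0.0)
        for word in vocab
    }


def interpolate_dynamic(
    source_steps: List[Distribution],
    target_steps: List[Distribution],
    weights: List[float],
) -> List[Distribution]:
    """Same mixture, but with a separate weight for each prediction step.

    Assumption: the weights come from elsewhere (for example a small learned
    model); this sketch only shows how they would be applied.
    """
    if not (len(source_steps) == len(target_steps) == len(weights)):
        raise ValueError("all inputs must cover the same number of steps")
    return [
        interpolate(src, tgt, lam)
        for src, tgt, lam in zip(source_steps, target_steps, weights)
    ]
```

With a constant weight list, `interpolate_dynamic` reduces to simple interpolation, which is why the fixed-weight mixture is a natural baseline for the dynamic one.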
Andros Tjandra, Chunxi Liu, Frank Zhang, Xiaohui Zhang, Yongqiang Wang, Gabriel Synnaeve, Satoshi Nakamura, Geoffrey Zweig
Deep acoustic models typically receive features in the first layer of the network, and process increasingly abstract representations in the subsequent layers. Here, we propose to feed the input features at multiple depths in the acoustic model. As our motivation is to allow acoustic models to re-examine their input features in light of partial hypotheses, we introduce intermediate model heads and loss functions. We study this architecture in the context of deep Transformer networks, and we use an attention mechanism over both the previous layer activations and the input features. To train this model's intermediate output hypothesis, we apply the objective function at each layer right before feature re-use. We find that the use of such iterated loss significantly improves performance by itself, as well as enabling input feature re-use. We present results on both LibriSpeech and a large-scale video dataset, with relative improvements of 10 to 20 percent for LibriSpeech and 3.2 to 13 percent for videos.
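One way to picture an iterated loss is as the usual training objective applied to the final output plus the same objective applied to the outputs of intermediate heads. The sketch below assumes a cross-entropy objective on a single frame and a single interpolation weight `aux_weight`; both the weight and its value are assumptions for illustration, not settings taken from the paper.

```python
import math
from typing import List, Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])


def iterated_loss(head_probs: List[Sequence[float]], target: int, aux_weight: float = 0.3) -> float:
    """Combine the final-layer loss with losses from intermediate heads.

    head_probs lists the output distribution of each head in depth order;
    the last entry is the final output of the network.
    aux_weight is a hypothetical interpolation weight for the extra heads.
    """
    *intermediate, final = head_probs
    loss = cross_entropy(final, target)
    for probs in intermediate:
        loss += aux_weight * cross_entropy(probs, target)
    return loss
```

Because every intermediate head is scored against the same target, the lower layers receive a direct training signal rather than only the gradient that reaches them from the top of a deep stack.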
Alexei Baevski, Michael Auli, Abdelrahman Mohamed
We compare self-supervised representation learning algorithms which either explicitly quantize the audio data or learn representations without quantization. We find the former to be more accurate since it builds a good vocabulary of the data through vq-wav2vec to enable learning of effective representations in subsequent BERT training. Unlike previous work, we directly fine-tune the pretrained BERT models on transcribed speech using a connectionist temporal classification (CTC) loss instead of feeding the representations into a task-specific model. We also propose a BERT-style model learning directly from the continuous audio data and compare pretraining on raw audio to spectral features. Fine-tuning a BERT model on 10 hours of labeled LibriSpeech data with a vq-wav2vec vocabulary is almost as good as the best known reported system trained on 100 hours of labeled data on test-clean, while achieving a 25 percent WER reduction on test-other. When using only 10 minutes of labeled data, WER is 25.2 on test-other and 16.3 on test-clean. This demonstrates that self-supervision can enable speech recognition systems trained on a near-zero amount of transcribed data.
Jun Yang, Joshua Bingham
The paper proposes an efficient, robust, and reconfigurable technique to suppress various types of noises for any sampling rate. The theoretical analyses and the subjective and objective test results show that the proposed noise suppression (NS) solution significantly enhances the speech transmission index (STI), speech intelligibility (SI), signal-to-noise ratio (SNR), and subjective listening experience. The STI and SI consist of five levels: bad, poor, fair, good, and excellent. The most common noisy conditions have SNRs ranging from -5 to 8 dB. For input SNRs between -5 and 2.5 dB, the proposed NS improves the STI and SI from fair to good. For input SNRs between 2.5 and 8 dB, the proposed NS improves the STI and SI from good to excellent. The proposed NS can be adopted in both capture and playback paths for voice over internet protocol, voice trigger, and automatic speech recognition applications.
Duc Le, Thilo Koehler, Christian Fuegen, Michael Seltzer
Grapheme-based acoustic modeling has recently been shown to outperform phoneme-based approaches in both hybrid and end-to-end automatic speech recognition (ASR), even on non-phonemic languages like English. However, graphemic ASR still has problems with low-frequency words that do not follow the standard spelling conventions seen in training, such as entity names. In this work, we present a novel method to train a statistical grapheme-to-grapheme (G2G) model on text-to-speech data that can rewrite an arbitrary character sequence into more phonetically consistent forms. We show that using G2G to provide alternative pronunciations during decoding reduces word error rate by 3 percent to 11 percent relative over a strong graphemic baseline and bridges the gap on rare name recognition with an equivalent phonetic setup. Unlike many previously proposed methods, our method does not require any change to the acoustic model training procedure. This work reaffirms the efficacy of grapheme-based modeling and shows that specialized linguistic knowledge, when available, can be leveraged to improve graphemic ASR.
Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux
We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.
Jianyu Wang, Vinayak Tantia, Nicolas Ballas, Mike Rabbat
The Lookahead optimizer [Zhang et al., 2019] was recently proposed and demonstrated to improve performance of stochastic first-order methods for training deep neural networks. Lookahead can be viewed as a two time-scale algorithm, where the fast dynamics (inner optimizer) determine a search direction and the slow dynamics (outer optimizer) perform updates by moving along this direction. We prove that, with appropriate choice of step-sizes, Lookahead converges to a stationary point of smooth non-convex functions. Although Lookahead is described and implemented as a serial algorithm, our analysis is based on viewing Lookahead as a multiagent optimization method with two agents communicating periodically.
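The two time-scale structure described above can be sketched in a few lines. In this illustration the inner optimizer is plain gradient descent on a toy quadratic, and the hyperparameter values are arbitrary; the sketch does not reproduce the step-size conditions used in the convergence proof, and the function name is invented for this example.

```python
from typing import Callable, List

Vector = List[float]


def lookahead_sgd(
    grad: Callable[[Vector], Vector],
    init: Vector,
    inner_lr: float = 0.1,
    outer_step: float = 0.5,
    inner_steps: int = 5,
    outer_iters: int = 50,
) -> Vector:
    """Lookahead with plain gradient descent as the inner (fast) optimizer."""
    slow = list(init)
    for _ in range(outer_iters):
        # Fast dynamics: start from the slow weights and take several inner steps.
        fast = list(slow)
        for _ in range(inner_steps):
            g = grad(fast)
            fast = [w - inner_lr * gi for w, gi in zip(fast, g)]
        # Slow dynamics: move part of the way along the direction the fast weights found.
        slow = [s + outer_step * (f - s) for s, f in zip(slow, fast)]
    return slow


# Toy objective: f(x) = sum((x_i - t_i)^2), whose gradient is 2 * (x - t).
target = [1.0, 2.0]
solution = lookahead_sgd(lambda x: [2.0 * (xi - ti) for xi, ti in zip(x, target)], [5.0, -3.0])
```

On this toy problem `solution` ends up very close to `target`. The difference `fast - slow` plays the role of the search direction, and `outer_step` controls how far the slow weights move along it.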
Xiaohui Zhang, Dan Povey, Sanjeev Khudanpur
In this paper, we investigate out-of-vocabulary (OOV) word recovery in hybrid automatic speech recognition (ASR) systems, with emphasis on dynamic vocabulary expansion for both Weighted Finite-State Transducer (WFST)-based decoding and word-level RNNLM rescoring. We first describe our OOV candidate generation method based on a hybrid lexical model (HLM) with phoneme-sequence constraints. Next, we introduce a framework for efficient second-pass OOV recovery with a dynamically expanded vocabulary, showing that, by calibrating OOV candidates’ language model (LM) scores, it significantly improves OOV recovery and overall decoding performance compared to HLM-based first-pass decoding. Finally, we propose an open-vocabulary word-level recurrent neural network language model (RNNLM) rescoring framework, making it possible to rescore ASR hypotheses containing recovered OOVs using a single word-level RNNLM that was trained without knowledge of those OOVs. By evaluating OOV recovery and overall decoding performance on Spanish/English ASR tasks, we show the proposed OOV recovery pipeline has the potential to serve as an efficient open-vocabulary word-based ASR decoding framework, with minimal extra computation versus a standard WFST-based decoding and RNNLM rescoring pipeline.
Felix Kreuk, Yaniv Sheena, Joseph Keshet, Yossi Adi
Phoneme boundary detection is an essential first step for a variety of speech processing applications, such as speaker diarization, speech science, and keyword spotting. In this work, we propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection. First, we evaluated our model when the spoken phonemes were not given as input. Results on the TIMIT and Buckeye corpora suggest that the proposed model is superior to the baseline models and reaches state-of-the-art performance in terms of F1 and R-value. We further explore the use of phonetic transcription as additional supervision and show this yields minor improvements in performance but substantially better convergence rates. We additionally evaluate the model on a Hebrew corpus and demonstrate such phonetic supervision can be beneficial in a multilingual setting.
Anurag Kumar, Vamsi Krishna Ithapu
Weakly supervised learning algorithms are critical for scaling audio event detection to several hundreds of sound categories. Such learning models should not only disambiguate sound events efficiently with minimal class-specific annotation but also be robust to label noise, which is more apparent with weak labels than with strong annotations. In this work, we propose a new framework for designing learning models with weak supervision by bridging ideas from sequential learning and knowledge distillation. We refer to the proposed methodology as SeCoST (pronounced “sequest”), short for sequential co-supervision for training generations of students. SeCoST incrementally builds a cascade of student-teacher pairs via a novel knowledge transfer method. Our evaluations on AudioSet (the largest weakly labeled dataset available) show that SeCoST achieves a mean average precision of 0.383 while outperforming prior state of the art by a considerable margin.
Jacob Kahn, Ann Lee, Awni Hannun
We revisit self-training in the context of end-to-end speech recognition. We demonstrate that training with pseudo-labels can substantially improve the accuracy of a baseline model. Our approach uses a strong baseline acoustic and language model to generate the pseudo-labels, filtering mechanisms tailored to common errors from sequence-to-sequence models, and a novel ensemble approach to increase pseudo-label diversity. Experiments on the LibriSpeech corpus show that with an ensemble of four models and label filtering, self-training yields a 33.9 percent relative improvement in WER compared with a baseline trained on 100 hours of labeled data in the noisy speech setting. In the clean speech setting, self-training recovers 59.3 percent of the gap between the baseline and an oracle model, which is at least 93.8 percent relatively higher than what previous approaches can achieve.
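The abstract does not spell out the filtering mechanisms, so the sketch below uses two heuristics purely as assumptions to show what filtering pseudo-labels can look like: rejecting transcripts that loop over the same n-gram, and rejecting transcripts whose average per-word log-likelihood is low. The thresholds and all names are hypothetical.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    words: List[str]
    log_prob: float  # total log-likelihood assigned by the labeling model


def has_repeated_ngram(words: List[str], n: int = 3) -> bool:
    """True if any n-gram occurs more than once, a sign of decoder looping."""
    seen = set()
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


def keep_pseudo_label(hyp: Hypothesis, min_avg_log_prob: float = -1.0, n: int = 3) -> bool:
    """Decide whether a generated transcript is trustworthy enough to train on."""
    if not hyp.words:
        return False
    if has_repeated_ngram(hyp.words, n):
        return False
    # Length-normalize so that long transcripts are not penalized for their length alone.
    return hyp.log_prob / len(hyp.words) >= min_avg_log_prob
```

Only the hypotheses for which `keep_pseudo_label` returns `True` would be added to the training set alongside the labeled data.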
Arya D. McCarthy, Liezl Puzon, Juan Pino
We propose auto-encoding speaker conversion for training data augmentation in automatic speech translation (AST). This technique directly transforms an audio sequence, resulting in audio synthesized to resemble another speaker’s voice. Our method compares favorably to SpecAugment on English–French and English–Romanian automatic speech translation tasks as well as on a low-resource English automatic speech recognition (ASR) task. Further, in ablations, we show the benefits of both quantity and diversity in augmented data. Finally, we show that we can combine our approach with augmentation by machine-translated transcripts to obtain a competitive end-to-end AST model that outperforms a very strong cascade model on an English–French AST task. Our method is sufficiently general that it can be applied to other speech generation and analysis tasks.
Weipeng He, Lu Lu, Biqiao Zhang, Jay Mahadeokar, Kaustubh Kalgaonkar, Christian Fuegen
In this paper, we introduce spatial attention for refining the information in a multidirection neural beamformer for far-field automatic speech recognition. Previous approaches of neural beamformers with multiple look directions, such as the factored complex linear projection, have shown promising results. However, the features extracted by such methods contain redundant information, as only the direction of the target speech is relevant. We propose using a spatial attention subnet to weigh the features from different directions, so that the subsequent acoustic model can focus on the most relevant features for speech recognition. Our experimental results show that spatial attention achieves up to 9 percent relative word error rate improvement over methods without the attention.
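As a rough picture of weighing features from different look directions, the sketch below turns one relevance score per direction into normalized weights and scales each direction's feature vector accordingly. The softmax normalization is an assumption, and the scores are passed in directly because the abstract does not describe how the attention subnet produces them; the function names are made up for this example.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_spatial_attention(
    direction_features: List[List[float]],
    direction_scores: List[float],
) -> List[List[float]]:
    """Scale each look direction's features by its attention weight."""
    if len(direction_features) != len(direction_scores):
        raise ValueError("need exactly one score per look direction")
    weights = softmax(direction_scores)
    return [
        [weight * value for value in features]
        for weight, features in zip(weights, direction_features)
    ]
```

Directions with low scores are scaled toward zero, so the model that consumes these features sees mostly the directions judged relevant.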
Kritika Singh, Dmytro Okhonko, Jun Liu, Yongqiang Wang, Frank Zhang, Ross Girshick, Sergey Edunov, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labeled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly supervised learning for speech recognition by using loosely related contextual information as a surrogate for ground-truth labels. For weakly supervised training, we use 50,000 hours of public English social media videos along with their respective titles and post text to train an encoder-decoder Transformer model. Our best encoder-decoder models achieve an average of 20.8 percent WER reduction over a 1,000-hour supervised baseline and an average of 13.4 percent WER reduction when using only the weakly supervised encoder for CTC fine-tuning. Our results show that our setup for weak supervision improved both the encoder acoustic representations and the decoder language generation abilities.
Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, Christian Fuegen, Geoffrey Zweig, Michael L. Seltzer
We propose and evaluate Transformer-based acoustic models (AMs) for hybrid speech recognition. Several modeling choices are discussed in this work, including various positional embedding methods and an iterated loss to enable training deep Transformers. We also present a preliminary study of using limited right context in Transformer models, which makes streaming applications possible. We demonstrate that on the widely used LibriSpeech benchmark, our Transformer-based AM outperforms the best published hybrid result by 19 percent to 26 percent relative when the standard n-gram language model (LM) is used. Combined with neural network LM for rescoring, our proposed approach achieves state-of-the-art results on LibriSpeech. Our findings are also confirmed on a much larger internal dataset.
Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, Emmanuel Dupoux
Cross-lingual and multi-lingual training of automatic speech recognition (ASR) has been extensively investigated in the supervised setting. This assumes the existence of a parallel corpus of speech and orthographic transcriptions. Recently, contrastive predictive coding (CPC) algorithms have been proposed to pretrain ASR systems with unlabeled data. In this work, we investigate whether unsupervised pretraining transfers well across languages. We show that a slight modification of the CPC pretraining extracts features that transfer well to other languages, being on par with or even outperforming supervised pretraining. This shows the potential of unsupervised methods for languages with few linguistic resources.