May 19, 2022
Building new augmented reality experiences will require technical breakthroughs beyond just computer vision. In particular, intelligent assistants — ones that can understand natural, nuanced conversational language — will need next-gen speech systems that can do much more than just help us make a hands-free phone call or open an app on our phone.
Tomorrow’s speech recognition systems will need to be far more efficient so they can run on-device on ultralight, compact, and stylish glasses. They will also need to be much more accurate and robust: capable of disambiguating words and understanding context much as people do, handling a large vocabulary and uncommon words, and working well even in challenging conditions with lots of background noise and multiple people speaking.
Meta AI is committed to advancing the state of the art in speech technology and building what’s needed to create new augmented reality and virtual reality experiences, which will be an important part of the metaverse. Speech recognition systems are already an increasingly important part of our products and services. Meta recently deployed new speech features to support video captioning across many of our apps. This is a great outcome for accessibility, as people who are deaf or have hearing loss can read high-quality captions on videos across our products. Captions in Facebook and Instagram Stories have even become an integral part of a story’s visual character, with people adjusting font, color, and placement to express themselves creatively. Meta’s speech technology also powers hands-free voice interaction on Portal, Quest, and Ray-Ban Stories devices.
In this blog post, we’re highlighting new speech recognition research from Meta, including some of the papers to be presented at the International Conference on Acoustics, Speech and Signal Processing (ICASSP) this month. These projects will help advance efforts both at Meta AI and in Meta’s Reality Labs to build the next generation of devices to help people connect.
We’re excited to push the cutting edge further and enable people to interact with their devices, with content, and with other people in new, more useful, more enjoyable ways.
Speech recognition researchers across industry and academia continually publish ever-improving results on widely used public benchmarks. But despite this important progress, big challenges remain. To some extent, solving them requires a shift in focus away from typical speech recognition metrics, like the total number of errors on a test set (average word error rate), toward newer metrics that better capture the deficiencies of current systems.
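As a concrete reference point, here is a minimal sketch of that standard metric: average word error rate is simply the total number of word-level edit operations (substitutions, insertions, deletions) divided by the total number of reference words. The transcripts below are made-up examples, not data from any benchmark.

```python
# Minimal sketch of average word error rate (WER): total word-level
# edit-distance errors divided by total reference words. Toy data only.

def word_errors(ref: str, hyp: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))           # DP row for an empty reference
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,           # deletion (reference word dropped)
                      d[j - 1] + 1,       # insertion (extra hypothesis word)
                      prev + (rw != hw))  # substitution, or match at no cost
            prev, d[j] = d[j], cur
    return d[-1], len(r)

def average_wer(pairs) -> float:
    errs, words = map(sum, zip(*(word_errors(r, h) for r, h in pairs)))
    return errs / words

# Two hypothetical (reference, hypothesis) pairs: one substitution each.
pairs = [("play the new song", "play a new song"),
         ("call ana", "call anna")]
print(f"average WER: {average_wer(pairs):.2%}")
```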
In many instances, even if the word error rate is quite low on average, misrecognizing certain critical words is enough to ruin the experience. Consider the importance of the esoteric jargon in a science video or the names of your friends in a dictated message. Recognizing these rare or previously unseen words is particularly challenging for modern “end-to-end” speech recognition systems, such as the widely used RNN-T models.
To address this problem, we previously developed a multipronged approach that improved on standard shallow fusion by incorporating trie-based deep biasing and neural network language model contextualization, resulting in 20 percent fewer errors than shallow fusion alone. At ICASSP, we are presenting the Neural-FST Class Language Model (NFCLM), which further improves on this work. The NFCLM models generic background text and structured queries with entities (e.g., song requests) in a unified mathematical framework. The result is a model that achieves a better tradeoff between recognizing rare words and more common ones, with the additional benefit of being more than 10 times smaller.
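For intuition, here is a minimal sketch of the shallow-fusion baseline that the NFCLM improves upon (not the NFCLM itself): during beam search, the ASR model’s token score is interpolated with an external language model, and candidate words that appear in a user-specific biasing list (such as contact names) receive an extra boost. The weights, names, and probabilities below are illustrative assumptions, not values from the paper.

```python
# Sketch of shallow fusion with simple contextual biasing during beam
# search. All weights and example scores are assumptions for illustration.

import math

LM_WEIGHT = 0.3    # shallow-fusion interpolation weight (assumed)
BIAS_BONUS = 1.5   # log-score boost for words in the biasing list (assumed)

def fused_score(asr_logprob: float, lm_logprob: float,
                new_word: str, bias_words: set[str]) -> float:
    """Score one candidate word extension of a beam-search hypothesis."""
    score = asr_logprob + LM_WEIGHT * lm_logprob
    if new_word in bias_words:
        # Contextual biasing: reward rare but personally relevant words.
        score += BIAS_BONUS
    return score

# Example: "anna" is in the user's contact list, so the correct but rarer
# spelling is nudged ahead of the more common word "ana".
contacts = {"anna", "priya"}
print(fused_score(math.log(0.20), math.log(0.05), "anna", contacts))
print(fused_score(math.log(0.22), math.log(0.10), "ana", contacts))
```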
Another area we have focused on is fairness and responsible AI. The primary metric used in the research community, word error rate, reduces performance to a single number representing the total errors in a dataset, so it does not capture differences in performance across different populations. Meta AI recently released the Casual Conversations dataset, a set of videos designed to measure fairness in computer vision systems along the dimensions of gender, age, and apparent skin tone. At ICASSP, we are sharing a recent analysis of speech recognition performance on this corpus along these same dimensions, in which we observed significant variation across gender and skin tone. We are making the transcriptions from the Casual Conversations dataset publicly available in the hope of motivating other researchers to study this problem and create speech systems that work well for all populations. We are also introducing a method to more accurately measure and interpret any differences in speech recognition accuracy among subgroups of interest.
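A per-subgroup breakdown of this kind can be sketched in a few lines: rather than one dataset-wide WER, compute WER separately within each annotated subgroup and examine the gaps. The utterances and group labels below are invented for illustration, and `jiwer` is a third-party WER library used here for brevity.

```python
# Sketch of per-subgroup WER analysis. Toy transcripts and group labels;
# `jiwer` (pip install jiwer) computes WER over lists of sentences.

from collections import defaultdict
import jiwer

# (reference transcript, ASR hypothesis, subgroup label) -- toy examples
utterances = [
    ("thanks for calling me back",  "thanks for calling me back", "group_a"),
    ("see you at seven tonight",    "see you at eleven tonight",  "group_a"),
    ("the meeting moved to friday", "the meeting moved friday",   "group_b"),
    ("can you share the slides",    "can you share slides",       "group_b"),
]

# Group references and hypotheses by subgroup label.
by_group = defaultdict(lambda: ([], []))
for ref, hyp, group in utterances:
    by_group[group][0].append(ref)
    by_group[group][1].append(hyp)

# WER within each subgroup, plus the largest gap between subgroups.
wers = {g: jiwer.wer(refs, hyps) for g, (refs, hyps) in by_group.items()}
for group, wer in sorted(wers.items()):
    print(f"{group}: WER = {wer:.2%}")
print(f"max gap between subgroups: {max(wers.values()) - min(wers.values()):.2%}")
```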
One of the challenges in improving fairness is access to representative training data. An alternative to training a model on data matched to each population is to build a more universal model that can then be easily fine-tuned for any particular task (or user group). We recently leveraged large-scale semi-supervised training to create ASR models with up to 10 billion parameters, using over 4.5 million hours of automatically labeled data. We evaluated this model on a publicly available dataset of aphasic speech. Aphasia is a speech-language disorder caused by damage to portions of the brain, most commonly from a stroke, and such speech is extremely challenging for speech recognition systems to transcribe accurately. We applied few-shot learning, fine-tuning our universal model on a relatively small amount of aphasic speech. This resulted in over 60 percent fewer errors than a system trained only on the aphasic speech, demonstrating that universal models are a promising avenue for providing high-quality transcription to everyone.
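As an illustration of the general few-shot recipe (not Meta’s unreleased 10-billion-parameter model), here is a minimal sketch that fine-tunes a publicly available wav2vec 2.0 CTC checkpoint on a handful of labeled examples, freezing the low-level feature extractor and adapting the rest with a small learning rate. The tiny in-memory dataset is purely a placeholder.

```python
# Sketch of few-shot fine-tuning of a pretrained ASR model, using a public
# wav2vec 2.0 checkpoint as a stand-in for a large universal model.
# The two random "waveforms" below are placeholders for real 16 kHz audio.

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

# Freeze the convolutional feature extractor; adapt only the transformer
# layers and output head, a common choice with very little target data.
for p in model.wav2vec2.feature_extractor.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)

# Placeholder few-shot examples: (16 kHz waveform, reference transcript).
few_shot = [(torch.randn(16000), "HELLO WORLD"),
            (torch.randn(16000), "GOOD MORNING")]

model.train()
for epoch in range(3):
    for waveform, text in few_shot:
        inputs = processor(waveform.numpy(), sampling_rate=16000,
                           return_tensors="pt")
        labels = processor.tokenizer(text, return_tensors="pt").input_ids
        loss = model(input_values=inputs.input_values, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```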
While speech recognition has made incredible progress over the last several years, there are still big challenges to building systems that work well across all use cases and for everyone. We’ve made significant progress toward this end over the past year, but as we say at Meta, the journey is 1 percent finished.
Read the papers we’re presenting at ICASSP:
Neural-FST class language model for end-to-end speech recognition
Towards measuring fairness in speech recognition: Casual Conversations dataset transcriptions
Omni-sparsity DNN: Fast sparsity optimization for on-device streaming E2E ASR via supernet
Streaming transformer transducer-based speech recognition using non-causal convolution
Pseudo-labeling for massively multilingual speech recognition