Speech recognition systems should be robust enough to work well for all groups of people — including those with different speaking styles, accents, and other characteristics. In order to measure how automatic speech recognition (ASR) tools perform for different demographic groups, AI practitioners need diverse speech data. In keeping with our approach to open science, we’re releasing a fairness-oriented evaluation dataset that consists of audio commands taken from a diverse group of consenting paid participants. Along with the dataset, we also developed a privacy-preserving approach that improves the robustness and fairness of an automatic speech recognition system by using an unsupervised clustering method.
This technique enables researchers to improve ASR performance without relying on demographic data or on speaker embeddings, which are learned representations of an individual's voice. Instead of dividing a dataset based on speakers' demographic information, such as their age group or gender, our proposed algorithm clusters speech at the utterance level. A single cluster contains similar utterances from a diverse group of speakers. We can then train our model on the various clusters and use fairness datasets to measure how the model affects outcomes across different demographic groups. The clustering is performed with unsupervised learning, which uses algorithms to analyze and group unlabeled data without human intervention.
During testing, we observed that a model trained in this manner improved speech recognition accuracy for all measured demographic groups, and in particular across accents (in sociolinguistics, an accent is a way of pronouncing a language that is distinctive to a country, area, social class, or individual). While our proposed algorithm was built using English-language data, we hope these approaches can be extended to other languages as well.
We believe fairness is a process. Meta continues to introduce research like this and produce new innovations to support AI systems that are robust, fair, and inclusive. In the future, both the modeling technique and the dataset could help improve automatic speech recognition systems for use cases including AI assistants, translation tools, and much more, supporting billions of people around the globe across a variety of languages.
How data clustering preserves privacy: Our approach
We run unsupervised clustering on the utterances. Because no human annotators label the clusters, we have no insight into the content of each cluster or into why particular utterances are grouped together.
We then trained our model on de-identified, publicly available Facebook videos in English. It was evaluated on two datasets. The first was a de-identified dataset collected from a data supplier for ASR that includes 48,000 utterances from 867 speakers. Speakers had the option to self-identify across demographic categories such as age, gender, ethnicity, English accent, and first or home language. The second dataset is Casual Conversations v1, a dataset of transcribed speech that Meta built and made publicly available in 2021. That data includes age and gender self-reported by participants in the U.S., as well as annotator-assigned apparent skin tone categories. (After we concluded our research, Meta released Casual Conversations v2, an expanded dataset of video recordings of people in seven countries that includes additional self-identified and annotated categories.)
First, we segment the training data into 10-second chunks and extract an utterance-level embedding for each segment. Using these embeddings, we train a principal component analysis (PCA) model for dimensionality reduction, then cluster the reduced embeddings with the well-known K-means algorithm.
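The embedding-to-cluster pipeline can be sketched roughly as follows, assuming utterance-level embeddings are already available as a NumPy array. The embedding dimension, number of PCA components, and number of clusters here are illustrative choices, not the values used in the research:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical utterance-level embeddings: one vector per 10-second segment.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 512))  # 1,000 segments, 512-dim each

# Reduce dimensionality with PCA, then group the utterances with K-means.
reduced = PCA(n_components=32).fit_transform(embeddings)
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(reduced)  # one cluster ID per segment
```

No labels enter the process at any point: the cluster IDs come purely from the geometry of the embeddings, which is what keeps the grouping free of demographic annotations.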
Because most parameters are implicitly shared by all the clusters, the speech recognition model learns to generalize across them. We also use a "masking" strategy: during training, each example's cluster ID is replaced with an "unknown" ID with some probability. This ensures that the model learns to handle data whose cluster is not known, making it more robust even when the correct cluster ID is missing. At inference time, we simply assign the "unknown" cluster ID to all incoming data, avoiding the need to run a speaker identification model, which would be problematic from both a privacy and a latency perspective. Because our proposed algorithm doesn't rely on demographic information, it's much easier to use in production applications.
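A minimal sketch of the masking idea follows; the sentinel value, masking probability, and function names are our own illustrative choices, not part of the published method:

```python
import random

UNKNOWN_CLUSTER = -1  # sentinel ID for the "unknown" cluster (illustrative)

def train_time_cluster_id(true_id: int, mask_prob: float = 0.2,
                          rng: random.Random = random) -> int:
    """With probability mask_prob, hide the true cluster ID so the model
    learns to make predictions when the cluster is unknown."""
    return UNKNOWN_CLUSTER if rng.random() < mask_prob else true_id

def inference_time_cluster_id() -> int:
    """At inference, no clustering or speaker identification is run on
    incoming audio; every utterance is simply tagged as "unknown"."""
    return UNKNOWN_CLUSTER
```

Because the model sees the "unknown" ID regularly during training, feeding it exclusively at inference time does not push the inputs out of distribution.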
How the results of the proposed algorithm stack up
Experimental results show that our approach improves model performance for all demographic groups in our evaluation datasets, with by far the largest gains coming across accents. The results in our research paper show that we don't have to sacrifice overall performance to improve the fairness of the model.
The cluster-based model improves performance for all age groups, with the largest gain for people ages 66–85, a group that is typically underrepresented in ASR training data. Because this group is underrepresented, there has been no clear way to improve the model for it through normal data collection and training; with our method, we show improvements for this group without collecting additional training data. Overall, measured by word error rates across groups defined by age, gender, ethnicity, and native vs. non-native language, our results show a 10% relative improvement on an ASR model trained for dictation, messaging, and voice commands after training with our proposed method.
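For reference, relative improvement in word error rate (WER) is the fractional drop from the baseline error rate. The numbers below are illustrative, not results from the paper:

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which the word error rate drops relative to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# An illustrative drop from 10.0% to 9.0% WER is a 10% relative improvement.
print(relative_wer_improvement(10.0, 9.0))  # 0.1
```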
We also found that geographic and self-identified ethnicity labels are poor indicators of accent or speaker variability. Therefore, we suggest that unsupervised clustering is better than using metadata to train for fairness and robustness. We’re exploring how to incorporate this fairness approach into our products.
Our open-sourced dataset
The current public datasets for speech recognition tend not to focus specifically on improving fairness. Our dataset includes 27,055 utterances of recorded speech from 595 people in the U.S. who were paid to record and submit audio of themselves saying commands. They self-identified demographic information such as age, gender, ethnicity, geographic location, and whether they consider themselves native English speakers.
The verbal commands in this dataset are categorized into seven domains that primarily serve voice assistant use cases: music, capture, utilities, notification control, messaging, calling, and dictation. These can support researchers who are building or evaluating models in those areas. In response to prompts for each domain, participants provided their own audio commands; example prompts asked how they would search for a song or make plans with friends, including deciding where to meet. Our dataset includes both the audio and the transcription of participants' utterances.
By releasing this dataset, we hope to further motivate the AI community to continue improving the fairness of speech recognition models, which will help everyone have a better experience using applications with ASR.
Building more inclusive models
We believe that introducing our privacy-preserving approach to improve fairness and robustness of ASR models will inspire AI systems that work well for different speakers under different speaking conditions. That can include voice commands for AI assistants or medical transcription for better health care.
Measuring fairness often calls for the analysis of demographic data to see if an AI model is fair across all groups. Using collected data — in which participants self-identify across different dimensions — is one way in which we navigate the tension between measuring fairness and preserving privacy.
Our proposed algorithm is part of Meta's long-term focus on responsible AI and just one part of our holistic approach to addressing fairness issues. The new fairness dataset was collected with the explicit permission of paid participants, and our new training approach is one of our many efforts to build AI systems that preserve privacy. In the future, we want to explore adapting this approach to other languages.
Meanwhile, our team continues to invest in building more fair and inclusive models, while maintaining high standards of overall accuracy. The hope is that fairness will become an integral part of how speech models are trained and evaluated moving forward.
This blog post was made possible by the work of Irina-Elena Veliche, Vineeth Ayyat Kochaniyan, Mike Seltzer, Fuchun Peng, and Pascale Fung.