February 1, 2022
Just as in other applications of AI, fairness is an important concern for speech recognition systems. If a tool performs well only for certain accents or speaking styles or vocal ranges, it fails to serve everyone as intended. But how can we assess whether a speech recognition system is working well for different groups of people — young and old; men, women, and nonbinary people; different ethnicities and cultural groups; and others?
AI systems need data and benchmarks in order to measure performance, and the research community simply hasn’t had adequate ways to assess fairness concerns for speech recognition systems. There have been only a handful of studies of how well speech recognition models perform for different groups of people, and these analyses have relied on existing speech recognition data sets, each of which reflects only some attributes of the speakers. For example, one data set may have gender data but not age data, while another has the reverse. That is why we’ve added thousands of expert human speech transcriptions to Meta AI’s open source Casual Conversations data set, so that researchers can use a shared data set for detecting speech recognition systems’ performance gaps for different groups of people. Researchers can also use the audio and transcription data, in compliance with the data set terms and conditions, to develop new speech recognition systems that exhibit less bias.
The Casual Conversations data set contains 45,186 videos of people having unscripted conversations. Meta AI shared it last spring to help researchers evaluate their computer vision models across a diverse set of age groups, genders, apparent skin tones, and ambient lighting conditions. To our knowledge, it’s the first publicly available data set featuring paid individuals who provided their age and gender themselves — as opposed to information labeled by third parties or estimated using AI models.
We believe this human-centered approach has produced a valuable resource for researchers, one that represents the released attributes more accurately. We hope this can help the research community work collaboratively to make AI fairer and more inclusive. Now, with the addition of transcriptions to the data set, we're excited to expand Casual Conversations' utility to a new domain: automatic speech recognition.
In order to explore how model architectures and training regimens would affect performance for different groups of people, we trained four new speech recognition systems from scratch using state-of-the-art architectures and evaluated them using the Casual Conversations data set. All four used recurrent neural network transducer models, but two were trained with supervised data only, while two included semi-supervised training. We also varied the training data, with one model using audiobook data and the others using audio from publicly shared social media videos. Details are available in our accompanying research paper. We found several noteworthy results:
All four models demonstrated a performance gap (as measured by Word Error Rate, or WER) between gender groups, with fewer speech recognition errors for females than for males. For the “other” category, the amount of available data is insufficient to draw any conclusions at the moment.
We did not observe statistically significant performance gaps for different age groups.
The speech recognition systems performed worse for speakers with relatively darker skin tones than for those with lighter skin tones. We made this measurement because we had the relevant labels and because it is important to understand performance on this axis. Apparent skin tone is not the most directly relevant characteristic for testing audio models, but the performance differences we see between skin tone groups suggest that it may correlate with auditory characteristics such as accent. More investigation is needed to identify the cause; for now, this explanation remains speculation.
We found that some models exhibited a larger variation in performance across the different subgroups. For example, the models trained with data from audiobooks showed a larger performance gap between subgroups than the ones trained using public social media videos. The video data set contains a diverse array of speakers, accents, topics, and acoustic conditions, whereas the audiobooks data set consists solely of books read aloud by a small group of speakers in controlled environments. This may account for some of the disparity between these models’ performance.
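To make the disaggregated measurement behind these findings concrete, here is a minimal Python sketch of computing WER separately for each group. WER is the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of words in the reference. This is an illustration only, not the evaluation code used in our paper: the function names, the group labels, and the toy transcripts are all hypothetical, and the text normalization (lowercasing and splitting on whitespace) is a simplifying assumption.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Compute WER per group.

    samples: iterable of (group, reference, hypothesis) strings.
    Errors and reference words are pooled within each group, so longer
    utterances carry proportionally more weight.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        ref_tokens = reference.lower().split()
        hyp_tokens = hypothesis.lower().split()
        errors[group] += edit_distance(ref_tokens, hyp_tokens)
        words[group] += len(ref_tokens)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Toy data, invented for illustration (not from Casual Conversations).
samples = [
    ("group_a", "the cat sat on the mat", "the cat sat on the mat"),
    ("group_a", "hello world", "hello word"),
    ("group_b", "the cat sat on the mat", "the cat sat mat"),
    ("group_b", "good morning", "good morning everyone"),
]
rates = wer_by_group(samples)
for group, rate in sorted(rates.items()):
    print(f"{group}: WER = {rate:.3f}")
```

In this toy example, group_a has 1 error over 8 reference words and group_b has 3 errors over 8, so the gap between the two groups is directly visible. On real data, a gap like this should be checked for statistical significance before drawing conclusions, particularly for groups with few samples.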
While the models built from scratch and tested in this work are not currently in production at Meta, these findings can inform research at Meta and in the community at large to ensure that ASR models of all kinds are built with fairness in mind.
These early findings are limited to the specific data set and models explored above, but they show that more work is needed to assess model bias in speech recognition systems. They also demonstrate how data disaggregated by demographic attributes can be useful in evaluating the extent of these challenges. Our findings do not show how best to modify AI systems to improve fairness, however. For example, our initial analysis found that using Casual Conversations data to fine-tune pre-trained models did not significantly reduce the error rate gaps between groups. The data set can be used to develop novel techniques for training models that perform similarly across all groups without knowing the specific protected attributes (e.g., gender or age) of the data points. Other possible mitigation strategies include using training data that features a diverse set of speakers, recording situations, and speaking styles. More research is needed to develop either of these options.
There is another important limitation of the Casual Conversations data set: It was collected only in the United States and does not cover all English-speaking populations. To measure performance differences across groups more thoroughly, we will need data sets and benchmarks that work across different languages and locales. A further limitation to note is that the data set does not have information on the speakers’ accents or first language. We recommend that further research consider responsibly collecting accent or first language characteristics so that researchers can evaluate performance differences across these relevant subpopulations.
Benchmark data sets are crucial to advancing AI research. Given how much the availability of relevant data sets has advanced solutions to other problems (e.g., ImageNet for object recognition), Casual Conversations is an initial effort aimed at driving research in the under-investigated area of fairness in speech recognition. Although some speech data sets focus on specific underrepresented groups (e.g., CORAAL), the Casual Conversations data set provides more diversity and larger amounts of data for this type of research.
Building on this starting point, Meta AI is planning to perform research on speech recognition models that exhibit minimal performance differences across subpopulations. Our findings will inform the work of our cross-disciplinary Responsible AI (RAI) team, which has built and deployed tools like Fairness Flow to help our AI engineers detect certain forms of potential statistical bias in certain types of AI models and labels commonly used at Meta.
We hope the research community extends annotations of Casual Conversations for their own computer vision and audio applications, in line with our data use agreement. Collaborative research — through open source tools, papers, and discussions — is vital to address fairness issues in AI. By leveraging a diverse range of perspectives from the research community, we can identify new fairness concerns and questions and look for new ways of addressing them to ensure this technology is being developed responsibly and serving everyone.