March 9, 2023
For AI to serve communities fairly, researchers need diverse and inclusive datasets to rigorously and thoughtfully evaluate fairness in the models they build. In applications of computer vision and speech recognition in particular, AI researchers need data to assess how well a model works for different demographic groups. This data can be difficult to gather because of complex geographic and cultural contexts, inconsistency between different sources, and challenges with labeling accuracy.
Today, we are releasing Casual Conversations v2, a consent-driven, publicly available resource that enables researchers to better evaluate the fairness and robustness of certain types of AI models. The dataset was informed and shaped by a comprehensive literature review around relevant demographic categories, and was created in consultation with internal experts in fields such as civil rights. It offers a granular list of 11 self-provided and annotated categories to further measure algorithmic fairness and robustness in these AI systems. The dataset contains 26,467 video monologues recorded in seven countries by 5,567 paid participants, who provided self-identified attributes such as age and gender. It is the next generation of the original consent-driven Casual Conversations dataset, which we released in 2021. To our knowledge, it’s the first open source dataset with videos collected from multiple countries using highly accurate and detailed demographic information to help test AI models for fairness and robustness.
Since launching the original dataset two years ago, we’ve continued to collaborate with experts and expand the dataset, including adding expert human speech transcriptions, to help the research community assess fairness concerns in additional domains. Like the original dataset, Casual Conversations v2 is made available to the public under a dataset license agreement to aid as many researchers as possible in their efforts to measure fairness and support robustness. By leveraging this dataset, researchers can investigate, for example, whether a speech recognition system is working consistently across a variety of demographic characteristics and environments.
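As a rough illustration of that kind of check, the sketch below computes word error rate (WER) separately for each demographic group, given reference transcriptions and a speech recognition system's output. It is a minimal example under stated assumptions, not the evaluation protocol used for Casual Conversations v2: the function names and the `(group, reference, hypothesis)` tuple layout are hypothetical, and the text is assumed to be already normalized and whitespace-tokenized.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Aggregate WER per group.

    `samples` is an iterable of (group, reference, hypothesis) tuples; this
    layout is an assumption for the example, not the dataset's actual schema.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    # Skip groups with no reference words to avoid dividing by zero.
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}
```

A large difference between the per-group values returned by `wer_by_group` would be one signal that the system performs inconsistently across groups and warrants closer investigation.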
While the first version of Casual Conversations was a major step to help researchers establish fairness benchmarks, there were some limitations. The first dataset’s labels included only age, three subcategories of gender (female, male, and other), apparent skin tone, and ambient lighting. With the understanding that there are numerous underrepresented communities of people, languages, and attributes, we wanted to dig deeper into subcategories to identify potential model gaps in fairness and robustness.
Along with Prof. Pascale Fung, director of the Centre for AI Research, and other researchers from Hong Kong University of Science and Technology, we conducted a robust literature review of governmental and academic resources for potential categories and then published our findings for other researchers to build upon this work. We also consulted internal civil rights experts and domain experts at Meta.
To be more inclusive and to mitigate issues of subjective annotations for many of the dataset’s categories, we asked participants to provide some of their information in their preferred language. Even though participants agreed via consent forms to share their information for use in AI tasks, providing this self-labeled information was optional, and participants could enter certain information in their own words.
Of the 11 categories included in Casual Conversations v2, seven were provided by the participants, while the remaining four were manually labeled by annotators. The self-provided categories are age, gender, language/dialect, geolocation, disability, physical adornments, and physical attributes. For the remaining categories (voice timbre, apparent skin tone, recording setup, and activity), we trained vendors with detailed guidelines to enhance consistency and reduce the likelihood of subjective annotations during the labeling process.
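For readers working with the labels programmatically, the split described above could be captured as follows. This is only a sketch: the category strings and helper names are hypothetical and are not taken from the dataset's actual file format.

```python
# Categories that participants provided themselves (optional to fill in).
SELF_PROVIDED = (
    "age",
    "gender",
    "language/dialect",
    "geolocation",
    "disability",
    "physical adornments",
    "physical attributes",
)

# Categories labeled by trained annotators.
ANNOTATED = (
    "voice timbre",
    "apparent skin tone",
    "recording setup",
    "activity",
)


def label_source(category: str) -> str:
    """Return who supplied a category's labels: 'participant' or 'annotator'."""
    if category in SELF_PROVIDED:
        return "participant"
    if category in ANNOTATED:
        return "annotator"
    raise KeyError(f"unknown category: {category}")


def missing_self_provided(record: dict) -> list[str]:
    """List the self-provided categories that are absent or None in a record."""
    return [c for c in SELF_PROVIDED if record.get(c) is None]
```

Because self-provided information was optional, a helper like `missing_self_provided` makes it explicit which attributes are unavailable for a given participant before results are grouped by them.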
With the expansion of Casual Conversations, we wanted to support a multilingual dataset, particularly as language understanding can support the development of inclusive natural language processing models.
In addition to an expanded list of categories, Casual Conversations v2 differs from the first version with the inclusion of participant monologues recorded outside the United States. The seven countries included in v2 are Brazil, India, Indonesia, Mexico, Vietnam, the Philippines, and the United States. In the future, we hope to further expand the dataset to additional geographies. Another difference in the latest dataset is that participants were given the chance to speak in both their primary and secondary languages. The monologues include both scripted and unscripted speech.
While recording participants in multiple geographies brought a new set of logistical challenges and opportunities, it also made it more complex to identify categories relevant to even more diverse communities. With increasing concerns over the performance of AI systems across different skin tones, we decided to leverage two different scales for skin tone annotation. The first is the six-tone Fitzpatrick scale, the most commonly used numerical classification scheme for skin tone. The second is the 10-tone Monk Skin Tone scale, which was introduced by Google and is used in its search and photo services. Including both scales in Casual Conversations v2 provides a clearer comparison with previous works that use the Fitzpatrick scale while also enabling measurement based on the more inclusive Monk scale.
“To increase nondiscrimination, fairness, and safety in AI, it’s important to have inclusive data and diversity within the data categories so researchers can better assess how well a specific model or AI-powered product is working for different demographic groups,” said Roy Austin, Vice President and Deputy General Counsel for Civil Rights at Meta. “This dataset has an important role in ensuring the technology we build has equity in mind for all from the outset.”
Previously, researchers measuring algorithmic fairness and robustness typically identified gaps in models only for categories available in public datasets. For example, the categories of age, gender, and apparent skin tone typically support computer vision tasks, while language/dialect and voice timbre are used in audio/speech research.
Because we have expanded data collection across multiple categories and countries, we hope researchers can leverage this dataset to evaluate models along the expanded categories, such as apparent skin tone, disability, accent, dialect, location, and recording setup.
Because researchers train and evaluate AI systems with data, identifying fairness concerns and increasing robustness remains an urgent priority as AI increasingly affects individuals and communities. The goal for Casual Conversations has been to support more inclusive AI and technology by providing a standardized resource to identify ways in which models can perform more robustly.
Researchers and practitioners in dozens of countries have used the first version of Casual Conversations. At Meta, we proactively leverage this dataset along with other available datasets to assess computer vision, language, and speech models. Given the positive reception from the AI community for the first version, we believe v2 helps address a large gap in data that we are still in the early stages of recognizing.
While there is no single solution to address AI fairness and robustness, nor are there universally accepted metrics to detect fairness concerns, we are committed to collaborating with experts in the field to tackle these issues and hope to spark further research in these areas.
We would also like to thank Prof. Pascale Fung and Dr. Ellis Monk for their contributions.