Casual Conversations dataset version 2 is designed to help researchers evaluate their computer vision, audio and speech models for accuracy across a diverse set of ages, genders, languages/dialects, geographies, disabilities, physical adornments, physical attributes, voice timbres, skin tones, activities, and recording setups.
Casual Conversations v2 comprises 5,567 participants (26,467 videos) and is intended mainly for assessing the performance of already trained models in computer vision and audio applications, for the purposes permitted in our data license agreement. The videos feature paid individuals who agreed to participate in the project and who explicitly provided the Age, Gender, Language/Dialect, Geo-location, Disability, Physical adornments, and Physical attributes labels themselves. The videos were recorded in Brazil, India, Indonesia, Mexico, Philippines, United States, and Vietnam with a diverse set of adults in various categories. A group of trained annotators labeled the participants’ apparent skin tone using the Fitzpatrick scale and the Monk scale, and also annotated Voice timbre, Activity, and Recording setup. Spoken words in all videos are either scripted (a sample paragraph from The Idiot by Fyodor Dostoevsky, provided with the dataset) or nonscripted (an answer to one of five predetermined questions).
Casual Conversations v2 dataset is designed to help researchers evaluate their computer vision, audio and speech models for accuracy across certain attributes.
Machine learning, ML Fairness and Robustness
Assist in measuring algorithmic fairness and robustness across age, gender, apparent skin tone, language/dialect, geo-location, disability, physical adornment, physical attributes, voice timbre, activity, and recording setup conditions.
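As a rough illustration of this kind of measurement, the sketch below computes accuracy separately for each value of one attribute and reports the largest gap between groups. It is a minimal example under stated assumptions, not the dataset's official evaluation protocol: the record layout, the field names (`age_group`, `prediction`, `target`), and the toy values are hypothetical and do not reflect the actual annotation file format.

```python
from collections import defaultdict


def accuracy_by_group(records, attribute):
    """Return {group value: accuracy} for one annotation attribute.

    `records` is a list of dicts. The keys "prediction" and "target" and
    the attribute name are hypothetical, chosen only for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[attribute]
        total[group] += 1
        correct[group] += int(rec["prediction"] == rec["target"])
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group):
    """Difference between the best and worst per-group accuracy."""
    values = list(per_group.values())
    return max(values) - min(values)


# Toy records standing in for model outputs joined with annotations.
records = [
    {"age_group": "18-30", "prediction": 1, "target": 1},
    {"age_group": "18-30", "prediction": 0, "target": 1},
    {"age_group": "31-45", "prediction": 1, "target": 1},
    {"age_group": "31-45", "prediction": 0, "target": 0},
]

per_group = accuracy_by_group(records, "age_group")
gap = max_accuracy_gap(per_group)
print(per_group, gap)
```

The same function can be called once per attribute (gender, apparent skin tone, recording setup, and so on) to compare how evenly a trained model performs across groups.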
Video (mp4)
Testing, training for certain categories (without using provided annotations)
Total number of subjects/actors: 5,567
Total number of video recordings: 26,467
Average length per video: ~1 minute
Labels
Age (self-provided): 5,567
Gender (self-provided): 5,567
Language/Dialect (self-provided): 5,567
Geo-location (self-provided): 5,567
Disability (self-provided): 5,567
Physical adornment (self-provided): 5,567
Physical attributes (self-provided): 5,567
Voice timbre (human labeled): 5,567
Apparent skin tone (human labeled): 5,567
Activity (human labeled): 11,056
Recording setup (human labeled): 26,467
Video recordings of individuals giving nonscripted answers to predetermined questions from a pre-approved list, as well as video recordings of them reading from a scripted text
Participants de-identified with unique numbers
Limited; see full license language for use
Summary of license permissions
You can evaluate models on the provided labels
You can only train your model on certain labels - refer to license permissions
Open access
Data sources
Vendor data collection efforts
Data selection
Human validators flagged personally identifiable information (PII)
All videos are provided by the participants for the purpose of creating this dataset.
Unsampled
Geographic distribution
Brazil, India, Indonesia, Mexico, Philippines, United States, Vietnam
Human Labels
Label types
Human-labels: free-form text labels
Labeling procedure - Human
Participants provided the age, gender, language, disability, geo-location, physical adornment, and physical attributes labels
Annotators labeled apparent skin tone, voice timbre, activity, and recording setup
Human validated
Human validators verify labels
Human validators flag PII
Human validators filter data
All labels are verified by human validators
Validators flag any PII content
The AI research community can use Casual Conversations v2 as one important step toward promoting fairness and robustness research. With Casual Conversations v2, we hope to spur further research in this important, emerging field.
If you are an individual who appears in this dataset and would like for your data to be removed from this dataset, please contact: casualconversations@meta.com