Current public datasets for speech recognition don’t focus specifically on improving fairness. Our dataset includes 27,055 utterances recorded by 595 people in the United States who were paid to record and submit audio of themselves saying commands. They self-identified demographic information, such as age, gender, ethnicity, geographic location, and whether they consider themselves native English speakers.
The commands are categorized into seven domains: music, capture, utilities, notification control, messaging, calling, and dictation. In response to prompts for each of these domains, participants provided their own audio commands; for example, a prompt might ask how they would search for a song or make plans with friends about where to meet. Our dataset includes both the audio and transcriptions of participants’ utterances.
By releasing this dataset, we hope to further motivate the AI community to make strides toward improving the fairness of speech recognition models, which will help all users have a better experience using applications with ASR.
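To make the intended use concrete, here is a minimal sketch of a per-group evaluation. It assumes a hypothetical manifest file (`metadata.csv` with `transcription`, `hypothesis`, and `gender` columns — these names are illustrative, not the dataset’s actual schema) and uses the open-source `jiwer` package to compute word error rate (WER) for each self-identified group:

```python
# Minimal sketch of per-group WER evaluation; the file name and column
# names are hypothetical, not the dataset's actual schema.
import jiwer
import pandas as pd

# Hypothetical manifest: one row per utterance, with the human reference
# transcription, an ASR system's hypothesis, and a self-identified label.
df = pd.read_csv("metadata.csv")

for group, rows in df.groupby("gender"):
    # jiwer.wer accepts lists of reference/hypothesis strings and
    # returns the corpus-level word error rate for that slice.
    wer = jiwer.wer(rows["transcription"].tolist(), rows["hypothesis"].tolist())
    print(f"{group}: WER = {wer:.3f} ({len(rows)} utterances)")
```

The same grouping can be repeated over the age, ethnicity, socio-economic background, or first-language labels; the gap between the best- and worst-served group is a common summary of how equitably a model performs.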
ASR model evaluation for fairness & inclusion
Assists in measuring algorithmic fairness of ASR across age, gender, ethnicity, income level, and accent in various voice assistant domains.
Audio (WAV)
Evaluation
Total number of unique speakers: 595
Total number of utterances: 27,055
Average utterance length: 7.42 seconds
Labels (utterances)
Gender (self-provided): 27,055
Age (self-provided): 27,055
Ethnicity (self-provided): 27,055
Socio-economic background (self-provided): 27,055
Education (self-provided): 27,055
First language (self-provided): 27,055
Audio recordings of paid participants who were asked to provide unscripted voice commands and messages, drawn from a pre-approved list of topics (music, photo/video capture, utilities, notification control, calling, messaging, dictation), directed at a digital assistant.
Participants de-identified with unique numbers
Limited; see full license language for use
Open access
Data sources
Vendor data collection efforts
Geographic distribution
100% US (excludes speakers in IL, WA and TX)
Human Labels
Label types
Human transcription: free-form text
Labeling procedure - Human
Participants self-reported their age, gender, ethnicity, education level, household income level, first/second (L1/L2) languages, and accent
Spoken words are manually transcribed by human annotators
Human validated (transcription and PII)
Human annotators flag PII
Human annotators transcribe speech audio that is clear, unambiguous, and can be transcribed with high confidence. Unintelligible data is flagged and is not transcribed.
All transcriptions are produced by human annotators based outside the U.S.
Validators flag any PII content
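Because the human transcriptions above are free-form text, WER comparisons across groups are only meaningful if references and hypotheses are normalized the same way before scoring. Here is a minimal sketch in plain Python; the normalization rules are illustrative assumptions, not the dataset’s official scoring protocol:

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    superficial formatting differences do not count as word errors."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

# Example: two renderings of the same utterance now compare equal.
assert normalize("Call Mom, please!") == normalize("call mom please")
```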