Current public datasets for speech recognition don’t focus specifically on improving fairness. Our dataset includes 26,471 utterances of recorded speech from 593 people in the United States who were paid to record and submit audio of themselves saying commands. They self-reported demographic information, such as age, gender, ethnicity, geographic location, and whether they consider themselves native English speakers.
The commands are categorized into seven domains: music, capture, utilities, notification control, messaging, calling, and dictation. Participants provided their own audio commands in response to prompts for each of these domains; for example, prompts asked how they would search for a song or make plans with friends about where to meet. Our dataset includes both the audio and a transcription of each participant’s utterance.
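For a concrete picture of how a release like this is typically consumed, here is a minimal sketch of loading per-utterance metadata and tallying the groups described above. The file name (metadata.tsv), the column names (speaker_id, gender, domain), and the directory layout are all illustrative assumptions, not the dataset’s actual schema.

```python
# Minimal sketch of inspecting a speech-fairness dataset release.
# Assumes a hypothetical layout: one metadata.tsv row per utterance,
# with a path to a WAV file plus self-reported labels. Column names
# here are illustrative, not the dataset's actual schema.
import csv
from collections import Counter

with open("metadata.tsv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

print(f"utterances: {len(rows)}")                             # expected: 26,471
print(f"speakers:   {len({r['speaker_id'] for r in rows})}")  # expected: 593

# Count utterances per self-reported demographic group and per domain.
by_gender = Counter(r["gender"] for r in rows)
by_domain = Counter(r["domain"] for r in rows)
print(by_gender)
print(by_domain)  # music, capture, utilities, notification control, ...
```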
By releasing this dataset, we hope to motivate the AI community to make further strides toward improving the fairness of speech recognition models, helping all users have a better experience with applications that use ASR.
ASR model evaluation for fairness & inclusion
Assists in measuring the algorithmic fairness of ASR across age, gender, ethnicity, income level, and accent in a variety of voice assistant domains.
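A common way to use an evaluation set like this is to compare word error rate (WER) across the demographic groups listed below. The sketch that follows reuses the hypothetical per-utterance rows from the earlier sketch, stands in a placeholder asr_transcribe function for whatever ASR system is under test, and uses the open-source jiwer package to compute per-group WER; a materially higher WER for one group than another is the kind of disparity this dataset is meant to surface.

```python
# Sketch: per-group word error rate (WER) for fairness evaluation.
# `rows` is the hypothetical per-utterance metadata from the previous
# sketch; `asr_transcribe` is a placeholder for the system under test.
from collections import defaultdict

import jiwer  # pip install jiwer


def asr_transcribe(wav_path: str) -> str:
    raise NotImplementedError("plug in the ASR system being evaluated")


def per_group_wer(rows, group_key: str) -> dict[str, float]:
    refs, hyps = defaultdict(list), defaultdict(list)
    for r in rows:
        group = r[group_key]                    # e.g. "gender" or "age"
        refs[group].append(r["transcription"])  # human reference text
        hyps[group].append(asr_transcribe(r["wav_path"]))
    # jiwer.wer accepts lists of reference and hypothesis strings.
    return {g: jiwer.wer(refs[g], hyps[g]) for g in refs}


# Example usage, once asr_transcribe is implemented:
# print(per_group_wer(rows, "gender"))
```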
Modality
Audio (WAV)
Intended use
Evaluation
Total number of unique speakers: 593
Total number of utterances: 26,471
Average utterance length: 7.42 seconds
Labels (utterances)
Gender (self-provided): 26,471
Age (self-provided): 26,471
Ethnicity (self-provided): 26,471
Socio-economic background (self-provided): 26,471
Education (self-provided): 26,471
First language (self-provided): 26,471
Audio recordings of paid participants who were asked to direct “unscripted” voice commands and messages at a digital assistant, drawn from a pre-approved list of topics (music, photo/video capture, utilities, notification control, calling, messaging, dictation).
Participants de-identified with unique numbers
License
Limited; see full license language for use
Access
Open access
Data sources
Vendor data collection efforts
Geographic distribution
100% US (excludes speakers in IL, WA and TX)
Human Labels
Label types
Human transcription: free-form text
Labeling procedure - Human
Participants self-reported their age, gender, ethnicity, education level, household income level, L1/L2 languages, and accent labels
Spoken words are manually transcribed by human annotators
Human validated (transcription and PII)
Human annotators flag PII
Human annotators transcribe speech audio that is clear, unambiguous, and can be transcribed with high confidence. Unintelligible data is flagged and is not transcribed.
All transcriptions are produced by human annotators based outside the U.S.
Validators flag any PII content