Overview

Current public datasets for speech recognition do not focus specifically on improving fairness. Our dataset includes 26,471 recorded utterances from 593 people in the United States who were paid to record and submit audio of themselves saying commands. Participants self-identified demographic information such as age, gender, ethnicity, geographic location and whether they consider themselves native English speakers.

The commands are categorized into seven domains: music, capture, utilities, notification control, messaging, calling and dictation. Participants provided their own audio commands in response to prompts relating to each of these domains. Example prompts asked how they would search for a song or make plans with friends about where to meet. Our dataset includes the audio and transcription of participants’ utterances.

Fair-speech dataset for assistant domains

By releasing this dataset, we hope to further motivate the AI community to make strides toward improving the fairness of speech recognition models, which will help all users have a better experience using applications with ASR.


Key Application

ASR model evaluation for fairness & inclusion

Intended Use Cases

Assist in measuring algorithmic fairness for age, gender, ethnicity, income variations & accents in various voice assistant domains.
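One way to measure fairness along these lines is to compute the word error rate (WER) separately for each self-reported group and compare the results. The following is a minimal sketch, not the dataset authors' evaluation code: the record fields (`reference`, `hypothesis`, `gender`) and the toy records are invented for illustration, and the text normalization (lowercasing and whitespace splitting) is an assumption.

```python
from collections import defaultdict


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(records, attribute):
    """Pool errors and reference words per group, then divide."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for rec in records:
        ref = rec["reference"].lower().split()
        hyp = rec["hypothesis"].lower().split()
        group = rec[attribute]
        errors[group] += word_edit_distance(ref, hyp)
        words[group] += len(ref)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Toy records, invented for illustration only (not taken from the dataset).
records = [
    {"gender": "female", "reference": "play some jazz music",
     "hypothesis": "play some jazz music"},
    {"gender": "female", "reference": "call my mom",
     "hypothesis": "call my mum"},
    {"gender": "male", "reference": "take a photo",
     "hypothesis": "take photo"},
    {"gender": "male", "reference": "turn off notifications",
     "hypothesis": "turn off notifications"},
]

rates = wer_by_group(records, "gender")
for group, rate in sorted(rates.items()):
    print(f"{group}: WER = {rate:.3f}")
```

Pooling errors over all reference words in a group, rather than averaging per-utterance WER, keeps short utterances from dominating the estimate. The same function can be called with any other self-reported attribute as the grouping key.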

Primary Data Type

Audio (WAV)

Data Function

Evaluation

Dataset Characteristics

Total number of unique speakers: 593

Total number of utterances: 26,471

Average utterance length: 7.42 seconds
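The three figures above imply an approximate total audio duration and an average number of utterances per speaker. The short calculation below derives both; the results are estimates only, since the average length is reported rounded to two decimals.

```python
# Figures reported in the dataset characteristics.
NUM_SPEAKERS = 593
NUM_UTTERANCES = 26_471
AVG_UTTERANCE_SECONDS = 7.42

# Derived estimates (approximate, because the average is rounded).
total_seconds = NUM_UTTERANCES * AVG_UTTERANCE_SECONDS
total_hours = total_seconds / 3600
utterances_per_speaker = NUM_UTTERANCES / NUM_SPEAKERS

print(f"Approximate total audio: {total_hours:.1f} hours")
print(f"Average utterances per speaker: {utterances_per_speaker:.1f}")
```

This works out to roughly 54.6 hours of audio and about 44.6 utterances per speaker.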

Labels (utterances)

Gender (self-provided): 26,471

Age (self-provided): 26,471

Ethnicity (self-provided): 26,471

Socio-Economic Background (self-provided): 26,471

Education (self-provided): 26,471

First Language (self-provided): 26,471

Nature Of Content

Audio recordings of paid participants who were asked to provide their “unscripted” voice commands and messages, directed at a digital assistant, on a pre-approved list of topics (music, photo/video capture, utilities, notification control, calling, messaging, dictation).

Privacy PII

Participants de-identified with unique numbers

License

Limited; see full license language for use

Access Cost

Open access

Data Collection

Data sources

Vendor data collection efforts

Geographic distribution

100% US (excludes speakers in IL, WA and TX)

Labelling Methods

Human Labels

Label types

Human transcription: free-form text

Labeling procedure - Human

Participants self-reported their age, gender, ethnicity, education level, household income level, L1/L2 languages and accent labels

Spoken words are manually transcribed by human annotators

Validation Methods

Human validated (transcription and PII)

Validator description(s)

Human validated (transcription and PII)

Validation tasks

Human annotators flag PII

Human annotators transcribe speech audio that is clear, unambiguous, and can be transcribed with high confidence. Unintelligible data is flagged and is not transcribed.

Validation policy summary

All transcription labels are produced by human annotators based outside the U.S.

Validators flag any PII content