CoVoST V2: Expanding the largest, most diverse multilingual speech-to-text translation dataset

July 21, 2020

What the research is:

CoVoST V2 expands on our CoVoST dataset, a speech-to-text translation (ST) corpus targeted at multilingual translation. This new release is the largest multilingual ST dataset to date. CoVoST V2 supports translation from 21 languages into English, as well as from English into 15 languages. To support wider research and applications in multilingual speech translation, we have released CoVoST V2 under a Creative Commons (CC0) license, which makes it free to use.

Developed in 2019, the initial version of CoVoST used Mozilla’s open source Common Voice database of crowdsourced voice recordings to create a corpus for translating 11 languages into English, with diverse speakers and accents. This first version was developed to foster research into many-to-one multilingual speech translation, as previous datasets involved very specific domains, were low resource, or included only language pairs with English as the source language.

How it works:

With more than 11,000 speakers and 60 accents represented, CoVoST V1 included a total of 708 hours of French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian, and Chinese samples. With V2, we’ve added speech translation data for Welsh, Catalan, Slovenian, Estonian, Indonesian, Arabic, Tamil, Portuguese, Latvian, and Japanese. In total, there are now 2,900 hours of speech represented in the corpus.
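As a quick sanity check on these figures, the short Python sketch below tallies the language lists given above and compares the reported hour totals. The variable and function names are illustrative only and are not part of any CoVoST tooling.

```python
# Language lists copied from the paragraph above.
V1_LANGUAGES = [
    "French", "German", "Dutch", "Russian", "Spanish", "Italian",
    "Turkish", "Persian", "Swedish", "Mongolian", "Chinese",
]
V2_ADDED_LANGUAGES = [
    "Welsh", "Catalan", "Slovenian", "Estonian", "Indonesian",
    "Arabic", "Tamil", "Portuguese", "Latvian", "Japanese",
]

# Hour totals as reported in the post.
V1_HOURS = 708
V2_HOURS = 2900


def into_english_pairs(languages):
    """Build (source, target) pairs for the X-to-English direction."""
    return [(language, "English") for language in languages]


pairs = into_english_pairs(V1_LANGUAGES + V2_ADDED_LANGUAGES)
growth = round(V2_HOURS / V1_HOURS, 1)

print(len(V1_LANGUAGES), len(V2_ADDED_LANGUAGES), len(pairs))  # 11 10 21
print(growth)  # 4.1
```

The 11 original languages plus the 10 additions give the 21 into-English directions, and the corpus is roughly four times larger than V1 by hours.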

We chose the Common Voice database because it provides a wide variety of samples (different genders, age groups, and accents) and because every audio clip within the database is carefully vetted and validated by the Common Voice community. We then applied a series of our own checks in order to control the quality of translations.

We are providing baselines using the official training-development-test split on the following tasks: automatic speech recognition, machine translation, and speech translation. We also evaluated how accurately models trained on CoVoST V1 translate the same phrase across different speakers. The Tatoeba database provides the option of using additional data to evaluate models trained on CoVoST.
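To illustrate the idea of checking the same phrase across different speakers, here is a minimal Python sketch. It is not the metric used for the CoVoST baselines. It assumes a hypothetical input format of (sentence_id, speaker_id, translation) tuples and uses a simple exact-match agreement rate, chosen only because it is easy to follow.

```python
from collections import defaultdict


def cross_speaker_agreement(outputs):
    """Fraction of multi-speaker sentences whose translations all match.

    `outputs` is an iterable of (sentence_id, speaker_id, translation)
    tuples. This format is an assumption made for illustration.
    Returns None when no sentence was read by more than one speaker.
    """
    translations = defaultdict(set)
    speakers = defaultdict(set)
    for sentence_id, speaker_id, translation in outputs:
        # Normalize lightly so casing and edge whitespace do not count.
        translations[sentence_id].add(translation.strip().lower())
        speakers[sentence_id].add(speaker_id)

    # Only sentences spoken by at least two speakers are informative.
    shared = [sid for sid in translations if len(speakers[sid]) > 1]
    if not shared:
        return None
    agreeing = sum(1 for sid in shared if len(translations[sid]) == 1)
    return agreeing / len(shared)


# Hypothetical model outputs, invented for this example.
example_outputs = [
    ("s1", "speaker_a", "Good morning"),
    ("s1", "speaker_b", "good morning"),
    ("s2", "speaker_a", "Thank you"),
    ("s2", "speaker_c", "Thanks a lot"),
    ("s3", "speaker_a", "Hello"),
]
print(cross_speaker_agreement(example_outputs))  # 0.5
```

In practice a softer comparison, such as a translation quality score computed per speaker, would likely be more informative than exact matching; the sketch only shows how outputs can be grouped by source sentence.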

Why it matters:

With CoVoST V2, our aim is to foster research into massively multilingual speech translation and move toward a single model that covers many language pairs. Doing this will improve maintainability and quality, especially for the pairs with less data.

People on Facebook speak or read over 100 different languages. Our goal is to reduce the online communication barriers between different cultures. Unless someone is bilingual or multilingual, it can be difficult for them to share their voice with others efficiently or effectively.

We want no language left behind, and that’s why we’re open-sourcing CoVoST V2. With this and our own baseline results, we aim to lay the foundation for researchers and developers to create new tools to help eliminate language barriers and create a global communication experience for the people who use Facebook and the internet as a whole.

Read the full paper:

CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus