CoVoST is a large-scale multilingual speech-to-text translation corpus based on the Common Voice project. It provides translations from English into 15 languages and from 21 languages into English, with a total of 78K speakers and 2,880 hours of speech. The data is available under a CC0 license.
End-to-end speech translation (ST) has recently attracted growing interest because of its simpler system design, lower inference latency, and fewer compounding errors compared to cascaded ST (speech recognition followed by machine translation). End-to-end ST model training, however, is often hampered by the lack of parallel data. While open datasets such as MuST-C and Europarl-ST have been developed to alleviate this issue, current datasets cover only a limited number of languages.
To foster research in massively multilingual ST and ST for low-resource language pairs, we've released CoVoST. It provides translations from English into 15 languages (Arabic, Catalan, Welsh, German, Estonian, Persian, Indonesian, Japanese, Latvian, Mongolian, Slovenian, Swedish, Tamil, Turkish, and Chinese) and from 21 languages into English: the same 15 languages plus Spanish, French, Italian, Dutch, Portuguese, and Russian. It has a total of 78K speakers and 2,880 hours of speech. This makes it the largest open ST dataset available to date in terms of total volume and language coverage.
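To make the language coverage concrete, here is a minimal Python sketch that enumerates the translation directions described above. The variable and function names are illustrative, not part of the CoVoST release, and we use plain language names because the identifiers used in the released files may differ.

```python
# Languages named in the text. Plain names are used on purpose: the codes
# or identifiers used in the actual data release are not assumed here.
EN_TO_X_TARGETS = [
    "Arabic", "Catalan", "Welsh", "German", "Estonian", "Persian",
    "Indonesian", "Japanese", "Latvian", "Mongolian", "Slovenian",
    "Swedish", "Tamil", "Turkish", "Chinese",
]

# Languages that appear only as sources for translation into English.
EXTRA_X_TO_EN_SOURCES = [
    "Spanish", "French", "Italian", "Dutch", "Portuguese", "Russian",
]


def translation_directions():
    """Return (source, target) pairs for every direction listed in the text."""
    x_to_en_sources = EN_TO_X_TARGETS + EXTRA_X_TO_EN_SOURCES
    en_to_x = [("English", lang) for lang in EN_TO_X_TARGETS]
    x_to_en = [(lang, "English") for lang in x_to_en_sources]
    return en_to_x + x_to_en


directions = translation_directions()
print(len(EN_TO_X_TARGETS))                               # 15
print(len(EN_TO_X_TARGETS) + len(EXTRA_X_TO_EN_SOURCES))  # 21
print(len(directions))                                    # 36
```

Counting both directions separately, this gives 36 translation directions, 15 out of English and 21 into English.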
We've released the data under a CC0 license. Please see the instructions for download and use.