June 13, 2022
To make it possible for people to easily understand each other while speaking in different languages, we need more than just text-based translation systems. But the conventional approach to building speech-to-speech translation systems has faced two significant shortcomings. It uses a cascaded series of steps — speech recognition, then text-to-text translation, and finally conversion of translated text back to speech — where the computational costs and inference latency accumulate in each stage. In addition, more than 40 percent of the world’s languages are without text writing systems, making this approach infeasible for extending translations to every spoken language.
To enable faster inference and support translation between unwritten languages, Meta AI is sharing new work on our direct speech-to-speech translation (S2ST) approach, which does not rely on text generation as an intermediate step. Our method outperforms previous approaches and is the first direct S2ST system trained on real-world open sourced audio data instead of synthetic audio for multiple language pairs.
Recent speech-to-speech modeling work takes the same approach as traditional text-to-speech synthesis. These models directly translate source speech into target speech spectrograms, which represent the frequency spectrum of the audio as multidimensional continuous-value vectors. Training translation models with speech spectrograms as the target can be difficult, however, because the models must learn several different aspects of the relationship between two languages, such as how the languages align with one another and how their acoustic and linguistic characteristics compare.
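To make "multidimensional continuous-value vectors" concrete, here is a minimal sketch of a magnitude spectrogram computed with a naive discrete Fourier transform. The frame length, hop size, and test tone are illustrative assumptions, and production systems typically use optimized FFTs and mel-scaled features rather than this simplified version.

```python
import cmath
import math


def spectrogram(signal, frame_len=64, hop=32):
    """Return one list of magnitudes (bins 0..frame_len // 2) per frame."""
    # A Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        magnitudes = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT for frequency bin k (an FFT would be used in practice).
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Hypothetical toy input: a sine wave completing 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(tone)
peak_bins = [frame.index(max(frame)) for frame in spec]
```

Every frame here is a vector of 33 real numbers, and the energy of the toy tone concentrates in bin 8. A model that predicts spectrograms has to regress all of these continuous values for every frame of the target speech.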
Instead of spectrograms, we use discretized speech units obtained from the clustering of self-supervised speech representations. Compared with spectrograms, discrete units can disentangle linguistic content from prosodic speech information and take advantage of existing natural language processing modeling techniques. Using discretized speech units, we’ve produced three notable advancements: Our S2ST system outperforms previous direct S2ST systems; it is the first direct S2ST system trained on real S2ST data for multiple language pairs; and it leverages pretraining with unlabeled speech data.
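The idea of turning continuous speech representations into discrete units can be sketched with a small k-means example. This is an illustration under stated assumptions, not the pipeline used in our papers: the two-dimensional `frames` are hypothetical stand-ins for per-frame features from a self-supervised speech encoder, and the function names and the choice of `k=2` are ours for the example.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to vec (ties go to the lower index)."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vec, centroids[i]))


def kmeans(features, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; initial centroids are sampled from the data."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(features, k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for vec in features:
            buckets[nearest(vec, centroids)].append(vec)
        for i, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def to_units(features, centroids):
    """Map each frame to the ID of its nearest centroid: its discrete unit."""
    return [nearest(vec, centroids) for vec in features]


# Hypothetical per-frame features forming two well-separated groups.
frames = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
          [5.0, 5.1], [5.2, 4.9], [5.1, 5.0],
          [0.1, 0.0], [0.0, 0.2]]
centroids = kmeans(frames, k=2)
units = to_units(frames, centroids)
```

The resulting `units` list is a sequence of integers, one per frame, in which similar-sounding frames share an ID. Because the targets are now symbols from a small vocabulary rather than continuous vectors, a decoder can predict them with the same classification-style objectives used in text modeling.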
To facilitate direct speech-to-speech translation with discrete units (audio samples), we use self-supervised discrete units as targets (speech-to-unit translation, or S2UT) for training the direct S2ST system. In the graphic below, we propose a transformer-based sequence-to-sequence model with a speech encoder and a discrete unit decoder that incorporates auxiliary tasks (shown in dashed lines).
We perform our experiments using the Fisher Spanish-English speech translation corpus consisting of 139K sentences (approximately 170 hours) from telephone conversations in Spanish and the corresponding Spanish and English text transcriptions. We use a high-quality in-house text-to-speech engine to prepare synthetic target speech with a single female voice as the training target. All our experiments — including the baselines — are performed with the synthetic target speech and do not rely on the TTS engine for other uses. The proposed system can be trained in a textless setup by using discrete units in the source language as the auxiliary task target, which helps it achieve significant improvement compared with previous work. Using discrete units yields an improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features.
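For readers unfamiliar with the metric, the sketch below shows how a BLEU score is computed from clipped n-gram precision and a brevity penalty. It is a simplified, unsmoothed, whitespace-tokenized version for illustration only, and it assumes text transcripts of the translated speech are already available. The function names and example sentences are hypothetical, and the scores reported in this post come from the evaluation setups described in the papers, not from this code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipping: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Hypotheses shorter than the references are penalized.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


exact = corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])
partial = corpus_bleu(["the cat sat on the mat today"], ["the cat sat on the mat"])
```

An exact match scores 100, while the hypothesis with one extra word scores lower because some of its n-grams have no counterpart in the reference. A difference of several BLEU points between two systems on the same test set therefore reflects a consistent difference in n-gram overlap with the reference translations.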
Due to the lack of parallel S2ST training data, previous work on direct S2ST mainly relies on TTS to generate synthetic target speech for model training, which is impractical for supporting languages without a standard text writing system. Given the recent release of large-scale S2ST data from the FAIR team at Meta AI, in “Textless speech-to-speech translation on real data” (audio samples), we train our proposed S2UT system on real data from VoxPopuli S2S data (download) and automatically mined S2S data (download) without any extra text supervision. The key is a speech normalization technique that can be trained with as little as one hour of speech data. This method removes the variations in real target speech from multiple speakers without changing the lexical content and leads to improvement in the S2UT performance compared to unnormalized targets.
Additionally, our best textless direct speech translation model achieves similar performance to that of cascaded text-based systems without needing human annotations for building the ASR models to transcribe target speech. Further incorporating automatically mined S2ST data during training shows an additional 2.0 BLEU gain. This is the first time a textless S2ST system has been successfully trained with publicly available real-world data on multiple languages while also showing competitive results. We believe it is also the first empirical study to demonstrate the usefulness of the mined S2ST data.
Lastly, we continue to improve upon the S2UT performance of the systems in the previous two papers through pretraining with unlabeled speech data in “Enhanced direct speech-to-speech translation using self-supervised pretraining and data augmentation” (audio samples). We show that pretraining techniques inspired by state-of-the-art speech-to-text translation (S2T) systems transfer well to direct S2ST when discrete units are used as the target, bringing a gain of at least 6.5 BLEU and closing the gap with, or even exceeding, the performance of cascaded systems. Furthermore, we augment the training data with weakly supervised data generated from more than 1K hours of speech, which leads to an additional 2.7 BLEU gain. Our effort opens up a path for future speech-to-speech translation research to further improve translation quality and produce more seamless communication experiences for users.
Spanish to English: Source and translation
Direct speech-to-speech modeling with discrete units presents an exciting future for building better translation systems. Beyond just translation quality, benchmarks also show that our proposed system is the most efficient in terms of runtime, FLOPS, and max memory compared with spectrogram-based S2ST systems and cascaded systems.
The work discussed here also moves us closer to translation systems that work well for unwritten languages and dialects, which are spoken all over the world yet remain largely unsupported. With the release of our papers and code, we hope to enable future direct speech-to-speech translation advancements across the research community. Our evaluations are done with open sourced models, and we hope that our measurement protocols can be leveraged so that all future progress can be compared fairly and openly.
The series of work discussed in this blog post was made possible by Yossi Adi, Peng-Jen Chen, Paul-Ambroise Duquenne, Hongyu Gong, Jiatao Gu, Qing He, Wei-Ning Hsu, Ann Lee, Xutai Ma, Juan Pino, Adam Polyak, Sravya Popuri, Holger Schwenk, Yun Tang, and Changhan Wang (listed in alphabetical order). We thank Necip Fazil Ayan, Brian Bui, Andy Chung, Jade Copet, Ning Dong, Emmanuel Dupoux, Hirofumi Inaguma, Semarley Jarrett, Justine Kao, Evgeny Kharitonov, Felix Kreuk, Ilia Kulikov, Kushal Lakhotia, Abdelrahman Mohamed, Tu Anh Nguyen, Brian O'Horo, Gustavo Gandia Rivera, Morgane Rivière, Chris Summers, Jeff Wang, Carleigh Wood, Ethan Ye, and Al Youngblood (listed in alphabetical order) for their support and discussions of this work.