SPEECH & AUDIO

NLP

SeamlessM4T—Massively Multilingual & Multimodal Machine Translation

August 22, 2023

Abstract

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded pipelines composed of multiple subsystems that perform translation progressively, putting scalable and high-performing unified speech translation systems out of reach. To address these gaps, we introduce SeamlessM4T—Massively Multilingual & Multimodal Machine Translation—a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations, dubbed SeamlessAlign. Filtered and combined with human-labeled and pseudo-labeled data (totaling 406,000 hours), we developed the first multilingual system capable of translating from and into English for both speech and text. On Fleurs, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous state-of-the-art in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. On CVSS, SeamlessM4T-Large outperforms a 2-stage cascaded model for speech-to-speech translation by 58%. Preliminary human evaluations of speech-to-text translation outputs evinced similarly impressive results; for translations from English, XSTS scores for the 24 evaluated languages are consistently above 4 (out of 5). For into-English directions, we see significant improvement over the Whisper-Large-v2 baseline for 7 out of 24 languages. To further evaluate our system, we developed BLASER 2.0, which enables evaluation across speech and text with accuracy similar to that of its predecessor for quality estimation. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks (average improvements of 38% and 49%, respectively) compared to the current state-of-the-art model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Compared to the state-of-the-art, we report a reduction of up to 63% in added toxicity in our translation outputs. Finally, all contributions in this work—including models, inference code, finetuning recipes backed by our improved modeling toolkit fairseq2, and metadata to recreate the unfiltered 470,000 hours of SeamlessAlign—are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication.
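As a concrete illustration of the unified, multitask interface, the sketch below shows how the released checkpoints can be invoked for two of the five supported tasks via the repository's Translator class. It is a minimal sketch based on the initial seamless_communication release; the module path, checkpoint names, and predict() signature are assumptions that may differ across versions, so the repository README remains the authoritative reference.

```python
import torch

# Module path and class name assume the initial seamless_communication
# release; later versions may expose a different import path.
from seamless_communication.models.inference import Translator

# Load the multitask SeamlessM4T model plus a unit vocoder for speech
# output (checkpoint names are assumptions taken from the public README).
translator = Translator(
    "seamlessM4T_large",
    vocoder_name_or_card="vocoder_36langs",
    device=torch.device("cuda:0"),
)

# Speech-to-speech translation (S2ST): French audio in, English speech out.
translated_text, wav, sample_rate = translator.predict(
    "input_fra.wav",  # path to a 16 kHz mono audio file
    "s2st",           # task: one of s2st, s2tt, t2st, t2tt, asr
    "eng",            # target language code
)

# Text-to-text translation (T2TT) additionally requires the source language.
translated_text, _, _ = translator.predict(
    "Bonjour, tout le monde.",
    "t2tt",
    "eng",
    src_lang="fra",
)
print(translated_text)
```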

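Several of the numbers above are ASR-BLEU scores, the standard way to evaluate speech output: the generated speech is first transcribed by an ASR model, and BLEU is then computed between those transcripts and the reference text translations. Below is a minimal sketch using the sacrebleu package; the placeholder transcripts stand in for the output of whatever ASR model is used.

```python
import sacrebleu

# Transcripts of the system's translated speech, obtained by running an
# ASR model (e.g., Whisper) on each generated waveform. Shown here as
# placeholder strings so the example is self-contained.
asr_transcripts = [
    "the weather is nice today",
    "where is the train station",
]

# Reference translations in the target language, one per test utterance.
references = [
    "The weather is nice today.",
    "Where is the train station?",
]

# ASR-BLEU is corpus-level BLEU computed over the ASR transcripts.
bleu = sacrebleu.corpus_bleu(asr_transcripts, [references])
print(f"ASR-BLEU: {bleu.score:.1f}")
```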

AUTHORS

Seamless Communication

Loic Barrault

Andy Chung

David Dale

Ning Dong

Paul-Ambroise Duquenne

Hady Elsahar

Hongyu Gong

Kevin Heffernan

John Hoffman

Christopher Klaiber

Peng-Jen Chen

Daniel Licht

Jean Maillard

Alice Rakotoarison

Kaushik Ram Sadagopan

Guillaume Wenzek

Abinesh Ramakrishnan

Alexandre Mourachko

Amanda Kallet

Ann Lee

Anna Sun

Bapi Akula

Benjamin Peloquin

Bernie Huang

Bokai Yu

Brian Ellis

Can Balioglu

Carleigh Wood

Changhan Wang

Christophe Ropers

Cynthia Gao

Daniel Li

Elahe Kalbassi

Ethan Ye

Gabriel Mejia Gonzalez

Hirofumi Inaguma

Holger Schwenk

Igor Tufanov

Ilia Kulikov

Janice Lam

Jeff Wang

Juan Pino

Justin Haaheim

Justine Kao

Prangthip Hasanti

Kevin Tran

Maha Elbayad

Marta R. Costa-jussa

Mohamed Ramadan

Naji El Hachem

Onur Çelebi

Paco Guzmán

Paden Tomasello

Pengwei Li

Pierre Andrews

Ruslan Mavlyutov

Russ Howes

Safiyyah Saleem

Skyler Wang

Somya Jain

Sravya Popuri

Tuan Tran

Vish Vogeti

Xutai Ma

Yilin Yang

Publisher

Meta AI

Related Publications

October 16, 2024

SPEECH & AUDIO

COMPUTER VISION

Movie Gen: A Cast of Media Foundation Models

Movie Gen Team

October 04, 2024

HUMAN & MACHINE INTELLIGENCE

CONVERSATIONAL AI

Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents

Bandhav Veluri, Benjamin Peloquin, Bokai Yu, Hongyu Gong, Shyam Gollakota

October 03, 2024

NLP

BLASER 2.0: a metric for evaluation and quality estimation of massively multilingual speech and text translation

David Dale, Marta R. Costa-jussa

September 26, 2024

SPEECH & AUDIO

NLP

Unveiling the Role of Pretraining in Direct Speech Translation

Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussa
