SPEECH & AUDIO

NLP

Seamless: Multilingual Expressive and Streaming Speech Translation

November 30, 2023

Abstract

Recent advancements in automatic speech translation have dramatically expanded language coverage, improved multimodal capabilities, and enabled a wide range of tasks and functionalities. That said, large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model—SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. The expanded version of SeamlessAlign adds 114,800 hours of automatically aligned data for a total of 76 languages. SeamlessM4T v2 provides the foundation on which our two newest models, SeamlessExpressive and SeamlessStreaming, are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one’s voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention (EMMA) mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To understand the performance of these models, we combined novel and modified versions of existing automatic metrics to evaluate prosody, latency, and robustness. For human evaluations, we adapted existing protocols tailored for measuring the most relevant attributes in the preservation of meaning, naturalness, and expressivity. 
To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. In sum, Seamless gives us a pivotal look at the technical foundation needed to turn the Universal Speech Translator from a science fiction concept into a real-world technology. Finally, contributions in this work—including models, code, and a watermark detector—are publicly released and accessible at the link below.

Download the Paper

AUTHORS

Written by

Seamless Communication

Loïc Barrault

Yu-An Chung

Mariano Coria Meglioli

David Dale

Ning Dong

Mark Duppenthaler

Paul-Ambroise Duquenne

Brian Ellis

Hady Elsahar

Justin Haaheim

John Hoffman

Min-Jae Hwang

Hirofumi Inaguma

Christopher Klaiber

Ilia Kulikov

Pengwei Li

Daniel Licht

Jean Maillard

Ruslan Mavlyutov

Alice Rakotoarison

Kaushik Ram Sadagopan

Abinesh Ramakrishnan

Tuan Tran

Guillaume Wenzek

Yilin Yang

Ethan Ye

Ivan Evtimov

Pierre Fernandez

Cynthia Gao

Prangthip Hansanti

Elahe Kalbassi

Amanda Kallet

Artyom Kozhevnikov

Gabriel Mejia Gonzalez

Robin San Roman

Christophe Touret

Corinne Wong

Carleigh Wood

Bokai Yu

Pierre Andrews

Can Balioglu

Peng-Jen Chen

Marta R. Costa-jussà

Maha Elbayad

Hongyu Gong

Francisco Guzmán

Kevin Heffernan

Somya Jain

Justine Kao

Ann Lee

Xutai Ma

Alexandre Mourachko

Benjamin Peloquin

Juan Pino

Sravya Popuri

Christophe Ropers

Safiyyah Saleem

Holger Schwenk

Anna Sun

Paden Tomasello

Changhan Wang

Jeff Wang

Skyler Wang

Mary Williamson

Publisher

arXiv

Related Publications

April 14, 2024

SPEECH & AUDIO

NLP

CoLLD: Contrastive Layer-to-Layer Distillation for Compressing Multilingual Pre-Trained Speech Encoders

Heng-Jui Chang, Ning Dong (AI), Ruslan Mavlyutov, Sravya Popuri, Andy Chung

February 21, 2024

INTEGRITY

NLP

Watermarking Makes Language Models Radioactive

Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, Teddy Furon

December 11, 2023

SPEECH & AUDIO

Audiobox: Unified Audio Generation with Natural Language Prompts

Wei-Ning Hsu, Akinniyi Akinyemi, Alice Rakotoarison, Andros Tjandra, Apoorv Vyas, Baishan Guo, Bapi Akula, Bowen Shi, Brian Ellis, Ivan Cruz, Jeff Wang, Jiemin Zhang, Mary Williamson, Matt Le, Rashel Moritz, Robbie Adkins, William Ngan, Xinyue Zhang, Yael Yungster, Yi-Chiao Wu

December 07, 2023

CONVERSATIONAL AI

NLP

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Davide Testuggine, Madian Khabsa
