SPEECH & AUDIO

NLP

Seamless: Multilingual Expressive and Streaming Speech Translation

November 30, 2023

Abstract

Recent advancements in automatic speech translation have dramatically expanded language coverage, improved multimodal capabilities, and enabled a wide range of tasks and functionalities. That said, large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. The expanded version of SeamlessAlign adds 114,800 hours of automatically aligned data for a total of 76 languages. SeamlessM4T v2 provides the foundation on which our two newest models, SeamlessExpressive and SeamlessStreaming, are built. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one’s voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention (EMMA) mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To understand the performance of these models, we combined novel and modified versions of existing automatic metrics to evaluate prosody, latency, and robustness. For human evaluations, we adapted existing protocols tailored for measuring the most relevant attributes in the preservation of meaning, naturalness, and expressivity.
To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. We then bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real time. In sum, Seamless gives us a pivotal look at the technical foundation needed to turn the Universal Speech Translator from a science fiction concept into a real-world technology. Finally, contributions in this work, including models, code, and a watermark detector, are publicly released and accessible at the link below.

Download the Paper

AUTHORS

Written by

Seamless Communication

Loïc Barrault

Yu-An Chung

Mariano Coria Meglioli

David Dale

Ning Dong

Mark Duppenthaler

Paul-Ambroise Duquenne

Brian Ellis

Hady Elsahar

Justin Haaheim

John Hoffman

Min-Jae Hwang

Hirofumi Inaguma

Christopher Klaiber

Ilia Kulikov

Pengwei Li

Daniel Licht

Jean Maillard

Ruslan Mavlyutov

Alice Rakotoarison

Kaushik Ram Sadagopan

Abinesh Ramakrishnan

Tuan Tran

Guillaume Wenzek

Yilin Yang

Ethan Ye

Ivan Evtimov

Pierre Fernandez

Cynthia Gao

Prangthip Hansanti

Elahe Kalbassi

Amanda Kallet

Artyom Kozhevnikov

Gabriel Mejia Gonzalez

Robin San Roman

Christophe Touret

Corinne Wong

Carleigh Wood

Bokai Yu

Pierre Andrews

Can Balioglu

Peng-Jen Chen

Marta R. Costa-jussà

Maha Elbayad

Hongyu Gong

Francisco Guzmán

Kevin Heffernan

Somya Jain

Justine Kao

Ann Lee

Xutai Ma

Alexandre Mourachko

Benjamin Peloquin

Juan Pino

Sravya Popuri

Christophe Ropers

Safiyyah Saleem

Holger Schwenk

Anna Sun

Paden Tomasello

Changhan Wang

Jeff Wang

Skyler Wang

Mary Williamson

Publisher

arXiv
