November 30, 2023
Recent advancements in automatic speech translation have dramatically expanded language coverage, improved multimodal capabilities, and enabled a wide range of tasks and functionalities. That said, large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model: SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. The expanded version of SeamlessAlign adds 114,800 hours of automatically aligned data for a total of 76 languages. SeamlessM4T v2 provides the foundation on which our two newest models, SeamlessExpressive and SeamlessStreaming, are built. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one’s voice. SeamlessStreaming leverages the Efficient Monotonic Multihead Attention (EMMA) mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To understand the performance of these models, we combined novel metrics with modified versions of existing automatic metrics to evaluate prosody, latency, and robustness. For human evaluations, we adapted existing protocols to measure the most relevant attributes in the preservation of meaning, naturalness, and expressivity.
To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. We then bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real time. In sum, Seamless gives us a pivotal look at the technical foundation needed to turn the Universal Speech Translator from a science fiction concept into a real-world technology. Finally, contributions in this work (including models, code, and a watermark detector) are publicly released and accessible at the link below.
Written by
Seamless Communication
Loïc Barrault
Yu-An Chung
Mariano Coria Meglioli
David Dale
Ning Dong
Mark Duppenthaler
Paul-Ambroise Duquenne
Brian Ellis
Hady Elsahar
Justin Haaheim
John Hoffman
Min-Jae Hwang
Hirofumi Inaguma
Christopher Klaiber
Ilia Kulikov
Pengwei Li
Daniel Licht
Jean Maillard
Ruslan Mavlyutov
Alice Rakotoarison
Kaushik Ram Sadagopan
Abinesh Ramakrishnan
Tuan Tran
Guillaume Wenzek
Yilin Yang
Ethan Ye
Ivan Evtimov
Pierre Fernandez
Cynthia Gao
Prangthip Hansanti
Elahe Kalbassi
Amanda Kallet
Artyom Kozhevnikov
Gabriel Mejia Gonzalez
Robin San Roman
Christophe Touret
Corinne Wong
Carleigh Wood
Bokai Yu
Pierre Andrews
Can Balioglu
Peng-Jen Chen
Marta R. Costa-jussà
Kevin Heffernan
Somya Jain
Justine Kao
Xutai Ma
Alexandre Mourachko
Benjamin Peloquin
Sravya Popuri
Christophe Ropers
Safiyyah Saleem
Anna Sun
Paden Tomasello
Jeff Wang
Skyler Wang
Mary Williamson
Publisher
arXiv