SeamlessM4T: The first all-in-one multimodal translation model
Another step forward in removing language barriers
SeamlessM4T (Massive Multilingual Multimodal Machine Translation) is the first all-in-one multimodal model for speech-to-speech and speech-to-text translation and transcription, and a significant breakthrough in those tasks. Publicly released under a CC BY-NC 4.0 license, the model supports nearly 100 languages for input (speech + text), 100 languages for text output and 35 languages (plus English) for speech output.
Overcoming the challenges of written and spoken communication
Existing translation systems have two shortcomings: limited language coverage, creating barriers for multilingual communication, and the reliance on multiple models, often causing translation errors, delays, and deployment complexities. SeamlessM4T addresses these challenges with its greater language coverage, accuracy and all-in-one model capabilities. These advances enable more effortless communication between people of different linguistic backgrounds, and greater translation capabilities from a model that can be used and built upon with ease.
Rather than relying on multiple, separate models, SeamlessM4T can perform multiple tasks across speech and text: speech-to-text, speech-to-speech, text-to-speech, text-to-text translation, and speech recognition. This single system approach reduces errors and delays, increasing the efficiency and quality of the translation process, bringing us closer to making seamless translation possible.
Multilingual speech generation
SeamlessM4T is the first many-to-many direct speech-to-speech translation system. On the input side, the model supports up to 100 languages depending on the task. Additionally, SeamlessM4T implicitly recognizes the source language, or languages, without the need for a separate language identification model. As a unified model, it can also reduce latency in comparison to cascaded systems.
High-quality, accurate translation
SeamlessM4T achieves state-of-the-art quality for speech translation on multiple lengths of audio and text, a step change when compared to other leading direct systems. The model leverages Fairseq2, our newest modeling toolkit, which was redesigned from scratch with speed and ease of use in mind.
SeamlessM4T also utilizes our SeamlessAlign corpus, the largest open dataset for multimodal translation to date, totaling 470k hours. This advance in multimodal data mining was achieved with SONAR, a new SOTA sentence embedding space for speech and text.
Evaluation beyond semantics
SeamlessM4T was thoroughly evaluated across all languages with both automatic metrics (ASR-BLEU, BLASER 2) and human evaluation. It was also tested for robustness, bias and added toxicity, where it significantly outperformed previous state-of-the-art models.
Explore the SeamlessM4T demo
Listen to some of the translations powered by the model across a selection of languages.
How the model works
SeamlessM4T is divided into two kinds of components: encoders and decoders. The speech and text encoders take in speech or text input in nearly 100 languages and capture its meaning. Decoders then transfer that meaning into nearly 100 languages for text and 35 (plus English) languages for speech.
Source speech languages: 100+1
Source text languages: 95+1
Target speech languages: 35+1
Target text languages: 95+1
(+1 is for English)
Our self-supervised speech encoder learns to find structure and meaning in speech by listening to millions of hours of multilingual speech. The encoder takes the audio signal corresponding to human speech and breaks it down into a sequence of speech segments, each representing a selection of the sounds that make up human language, and then builds an internal representation of what is being said. Because spoken words are made up of many of those segments, we use a length adaptor to roughly map them to actual words.
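To make the idea of a length adaptor concrete, here is a minimal sketch of one way to shorten a long sequence of encoder frames into fewer, roughly word-sized vectors. It is an illustration only: the adaptor in the model is a learned component, whereas this stand-in simply averages fixed-size windows. The function name `mean_pool_adaptor` and the `stride` parameter are hypothetical and not part of SeamlessM4T.

```python
from typing import List

Frame = List[float]  # one encoder output vector


def mean_pool_adaptor(frames: List[Frame], stride: int) -> List[Frame]:
    """Shorten a frame sequence by averaging each run of `stride` frames.

    Assumption: plain mean pooling stands in for the learned adaptor.
    A trailing window shorter than `stride` is averaged over its own length.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    pooled: List[Frame] = []
    for start in range(0, len(frames), stride):
        window = frames[start:start + stride]
        dim = len(window[0])
        pooled.append(
            [sum(frame[d] for frame in window) / len(window) for d in range(dim)]
        )
    return pooled


# Six 2-dimensional frames, pooled three at a time, become two vectors.
frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
          [0.0, 4.0], [0.0, 5.0], [0.0, 6.0]]
shortened = mean_pool_adaptor(frames, stride=3)
print(len(frames), "->", len(shortened))  # 6 -> 2
print(shortened)                          # [[2.0, 0.0], [0.0, 5.0]]
```

The point of the sketch is only the change in sequence length: many short acoustic frames go in, and a shorter sequence closer to word granularity comes out for the decoder to attend to.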
Our text encoder is based on the NLLB model. It has been trained to understand text in nearly 100 languages and produce representations that are useful for translation.
Our decoder is trained to take sequences of written words and speech units, and translate them into text. This can be applied to tasks in the same language, such as speech recognition, and multilingual translation tasks. With multitask training, we leverage the strengths of the strong text-to-text NLLB translation model to guide our speech-to-text translation model via token-level knowledge distillation.
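Token-level knowledge distillation can be sketched as follows. At every output position, the student (the speech-to-text model) is pushed toward the teacher's (the text-to-text model's) probability distribution over the vocabulary. The page does not state which divergence is used, so this sketch assumes the common choice of KL divergence averaged over positions. All names here (`log_softmax`, `token_level_kd_loss`) are illustrative and not taken from the SeamlessM4T code.

```python
import math
from typing import List

Logits = List[float]  # unnormalized scores over the vocabulary at one position


def log_softmax(logits: Logits) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def token_level_kd_loss(teacher: List[Logits], student: List[Logits]) -> float:
    """Mean over positions of KL(teacher || student).

    Assumption: KL divergence is the distillation objective; the source
    only says the distillation is applied at the token level.
    """
    if not teacher or len(teacher) != len(student):
        raise ValueError("teacher and student need the same non-zero length")
    total = 0.0
    for t_logits, s_logits in zip(teacher, student):
        t_logp = log_softmax(t_logits)
        s_logp = log_softmax(s_logits)
        total += sum(math.exp(tp) * (tp - sp) for tp, sp in zip(t_logp, s_logp))
    return total / len(teacher)


# Two output positions over a toy 3-token vocabulary.
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
matching_student = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
uniform_student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

print(token_level_kd_loss(teacher_logits, matching_student))  # zero: distributions agree
print(token_level_kd_loss(teacher_logits, uniform_student))   # positive: student is off
```

In training, a loss of this kind would typically be added to the usual cross-entropy against the reference translation, so the student learns from both the ground truth and the teacher's softer predictions.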
We use acoustic units to represent speech on the target side. The text-to-unit component in the UnitY model generates these discrete speech units based on the text output and is pretrained on ASR data prior to UnitY fine-tuning. A multilingual unit HiFi-GAN vocoder is then used to convert these discrete units into audio waveforms.
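Putting the pieces together, the following sketch shows how a single model can serve every task by routing input through different components. It is a schematic of the description above, not the model's actual API. The task labels (`S2TT`, `S2ST`, `T2TT`, `T2ST`, `ASR`) and the function `components_for` are names chosen here for illustration.

```python
from typing import Dict, List, Tuple

# Task label -> (input modality, output modality). Labels are illustrative.
TASKS: Dict[str, Tuple[str, str]] = {
    "S2TT": ("speech", "text"),    # speech-to-text translation
    "S2ST": ("speech", "speech"),  # speech-to-speech translation
    "T2TT": ("text", "text"),      # text-to-text translation
    "T2ST": ("text", "speech"),    # text-to-speech translation
    "ASR": ("speech", "text"),     # speech recognition (same language)
}


def components_for(task: str) -> List[str]:
    """List the components a task passes through, in order."""
    source, target = TASKS[task]
    if source == "speech":
        path = ["speech encoder", "length adaptor"]
    else:
        path = ["text encoder"]
    # Every task produces text first; speech output is generated from it.
    path.append("text decoder")
    if target == "speech":
        path += ["text-to-unit", "unit vocoder"]
    return path


for name in TASKS:
    print(name, "->", " > ".join(components_for(name)))
```

Note that speech output always passes through the text decoder first: the text-to-unit component works from the text output, and the vocoder turns the resulting discrete units into a waveform.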
Learn more about SeamlessM4T
Explore the multiple resources we have available for SeamlessM4T here.