AI translation

SeamlessM4T: The first all-in-one multimodal translation model

SeamlessM4T is a foundational speech and text translation and transcription model that overcomes the limitations of previous systems, delivering state-of-the-art results.

Download the model from GitHub

The research

Another step forward in removing language barriers

Read the research paper

SeamlessM4T (Massively Multilingual and Multimodal Machine Translation) is the first multimodal model of its kind, representing a significant breakthrough in speech-to-speech and speech-to-text translation and transcription. Publicly released under a CC BY-NC 4.0 license, the model supports nearly 100 languages for input (speech and text), nearly 100 languages for text output, and 35 languages (plus English) for speech output.


SeamlessM4T draws on the findings and capabilities of Meta’s No Language Left Behind (NLLB), Universal Speech Translator, and Massively Multilingual Speech projects, delivering their combined advances from a single model.


Key breakthroughs

Overcoming the challenges of written and spoken communication

Existing translation systems have two key shortcomings: limited language coverage, which creates barriers to multilingual communication, and reliance on multiple separate models, which often causes translation errors, delays, and deployment complexity. SeamlessM4T addresses these challenges with greater language coverage, higher accuracy, and an all-in-one design. These advances enable more effortless communication between people of different linguistic backgrounds, and greater translation capabilities from a model that can be used and built upon with ease.

See the demo

A multimodal, multitasking model

Rather than relying on multiple separate models, SeamlessM4T performs multiple tasks across speech and text: speech-to-text, speech-to-speech, text-to-speech, and text-to-text translation, as well as speech recognition. This single-system approach reduces errors and delays, increases the efficiency and quality of translation, and brings us closer to making seamless translation possible.
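To make the single-system idea concrete, here is a minimal sketch of one checkpoint handling several tasks. It assumes the Hugging Face transformers integration of SeamlessM4T and the "facebook/hf-seamless-m4t-medium" checkpoint; names and details may differ across releases.

```python
from transformers import AutoProcessor, SeamlessM4TModel

# One checkpoint, several tasks (assumed checkpoint name; swap in the one you use).
processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# English text as the source input.
text_inputs = processor(text="Hello, how are you?", src_lang="eng", return_tensors="pt")

# Text-to-text translation (T2TT): generate French tokens, then decode them.
tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))

# Text-to-speech translation (T2ST): the same model emits a French waveform.
audio = model.generate(**text_inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()
```

Switching tasks is a matter of changing tgt_lang, toggling generate_speech, or feeding audio instead of text, rather than swapping models.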


Multilingual speech generation

SeamlessM4T is the first many-to-many direct speech-to-speech translation system. On the input side, the model supports up to 100 languages, depending on the task. SeamlessM4T also recognizes the source language implicitly, without the need for a separate language identification model. And as a unified model, it can reduce latency compared with cascaded systems.
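Continuing the sketch above, a direct speech-to-speech call might look as follows; note that no source-language flag is passed for the audio input. The file name is a placeholder, and 16 kHz mono audio is assumed.

```python
import torchaudio
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# "my_clip.wav" is a placeholder; resample to the 16 kHz the model expects.
waveform, sample_rate = torchaudio.load("my_clip.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
audio_inputs = processor(audios=waveform.squeeze().numpy(),
                         sampling_rate=16_000, return_tensors="pt")

# Direct speech-to-speech translation (S2ST) into Korean in a single pass:
# no src_lang argument and no intermediate transcription step.
audio_out = model.generate(**audio_inputs, tgt_lang="kor")[0].cpu().numpy().squeeze()
```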


High-quality, accurate translation

SeamlessM4T achieves state-of-the-art quality for speech translation across multiple lengths of audio and text, a step change compared with other leading direct systems. The model leverages fairseq2, our newest modeling toolkit, redesigned from scratch with speed and ease of use in mind.

SeamlessM4T also uses our SeamlessAlign corpus, the largest open dataset for multimodal translation to date, totaling 470,000 hours. This advance in multimodal data mining was achieved with SONAR, a new state-of-the-art sentence embedding space for speech and text.
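For intuition about how such a corpus is mined, here is an illustrative sketch of the margin-based scoring used in embedding-space bitext mining of this kind. The random arrays stand in for real SONAR speech and text embeddings, and the threshold is made up for illustration.

```python
import numpy as np

def margin_scores(src: np.ndarray, tgt: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin score between every source/target embedding pair."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T                                     # pairwise cosine similarity
    knn_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)   # avg similarity to k nearest targets
    knn_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)   # avg similarity to k nearest sources
    return cos / ((knn_src[:, None] + knn_tgt[None, :]) / 2)

# Keep candidate pairs whose margin score clears a tuned threshold.
scores = margin_scores(np.random.randn(8, 1024), np.random.randn(10, 1024))
pairs = np.argwhere(scores > 1.06)  # threshold value is illustrative only
```

Pairs that score well above their neighborhoods are kept as aligned speech/text training data; pairs that merely look similar to everything are discarded.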

Evaluation beyond semantics

SeamlessM4T was thoroughly evaluated across all languages with both automatic metrics (ASR-BLEU, BLASER 2) and human evaluation. It was also tested for robustness, bias, and added toxicity, where it significantly outperformed previous state-of-the-art models.
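As a rough illustration of the ASR-BLEU idea, the sketch below transcribes translated speech with an off-the-shelf ASR model and scores the transcript against a text reference. The ASR model and file name are stand-ins, not the evaluation setup from the paper, which also normalizes text before scoring.

```python
import sacrebleu
import whisper  # stand-in ASR model, installed via the openai-whisper package

asr = whisper.load_model("base")
hypothesis = asr.transcribe("translated_output.wav")["text"]  # placeholder file
reference = "the weather is lovely today"

# BLEU between the ASR transcript of the output speech and the reference text.
bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]])
print(bleu.score)
```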

Explore the SeamlessM4T demo

Try the demo to experience the model’s capabilities firsthand. Simply record an audio clip as input, select an output language, and see the results.

Try it now

Listen to some of the translations powered by the model across a selection of languages: English, French, Korean, Romanian, Italian, Modern Standard Arabic, and Western Persian.

How the model works

SeamlessM4T is built from two kinds of components: encoders and decoders. Speech and text encoders understand input in nearly 100 languages; decoders then render that meaning into nearly 100 languages for text and 35 languages (plus English) for speech.

STATS

Source speech languages: 100+1
Source text languages: 95+1
Target speech languages: 35+1
Target text languages: 95+1
(+1 is for English)

Our unsupervised speech encoder learns to find structure and meaning in speech by training on millions of hours of multilingual audio. The encoder takes the audio signal of human speech and breaks it down into a sequence of speech units, each representing a chunk of the sounds that make up spoken language, and then builds an internal representation of what is being said. Because spoken words are made up of many such units, we use a length adaptor to roughly map them to actual words.

Our text encoder is based on the NLLB model. It has been trained to understand text in nearly 100 languages and to produce representations that are useful for translation.

Our decoder is trained to take sequences of written words or speech units and translate them into text. This applies both to tasks within a single language, such as speech recognition, and to multilingual translation tasks. With multitask training, we leverage the strong NLLB text-to-text translation model to guide our speech-to-text translation model via token-level knowledge distillation.
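A minimal sketch of what token-level knowledge distillation can look like, assuming a standard KL-divergence formulation; the shapes and temperature are illustrative rather than the paper’s exact recipe.

```python
import torch
import torch.nn.functional as F

def token_kd_loss(student_logits: torch.Tensor,
                  teacher_logits: torch.Tensor,
                  temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student), averaged over all target tokens.

    Both tensors are (batch, target_len, vocab): the teacher (NLLB) decodes
    the target from text input, the student from speech input.
    """
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1).flatten(0, 1)
    teacher_prob = F.softmax(teacher_logits / t, dim=-1).flatten(0, 1)
    return F.kl_div(student_logp, teacher_prob, reduction="batchmean") * (t * t)

# Example: distill a 4-token target over a toy 32-entry vocabulary.
loss = token_kd_loss(torch.randn(2, 4, 32), torch.randn(2, 4, 32))
```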

We use acoustic units to represent speech on the target side. The text-to-unit component in the UnitY model generates these discrete speech units based on the text output and is pretrained on ASR data prior to UnitY fine-tuning. A multilingual unit HiFi-GAN vocoder is then used to convert these discrete units into audio waveforms.
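To show the data flow only, here is a toy, shape-level sketch of that target-side pipeline. Every component below is a hypothetical stand-in, not the real UnitY text-to-unit model or the HiFi-GAN vocoder.

```python
import torch
import torch.nn as nn

NUM_UNITS, SAMPLES_PER_UNIT = 10_000, 320  # illustrative sizes

# Hypothetical stand-ins: a text-to-unit mapper and a "vocoder" that turns
# unit embeddings into waveform samples, kept trivially simple on purpose.
text_to_unit = lambda text: torch.randint(NUM_UNITS, (text.size(0), 50))
unit_embed = nn.Embedding(NUM_UNITS, 256)
to_waveform = nn.Linear(256, SAMPLES_PER_UNIT)

text_tokens = torch.randint(1000, (1, 12))        # decoded target-text tokens
units = text_to_unit(text_tokens)                 # (1, 50) discrete unit ids
wave = to_waveform(unit_embed(units)).flatten(1)  # (1, 50 * 320) audio samples
```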

Resources

Learn more about SeamlessM4T

Explore the resources available for SeamlessM4T below.

Download the model from GitHub
Read the research paper
Try out the demo
Read the AI at Meta blog post