Seamless Communication | AI research by Meta
Unlocking new AI translation capabilities with a suite of publicly available models
A unified model
Seamless merges the quality and multilinguality of SeamlessM4T v2, the low latency of SeamlessStreaming and the expression preservation of SeamlessExpressive into one unified system. It’s the first streaming translation model to maintain both vocal style and prosody, which can be particularly challenging in streaming, where the system only has access to partial input.
Cross-lingual prosody and vocal style transfer
In an effort to preserve the speaker’s vocal style across languages, we incorporate an expressivity encoder into the SeamlessM4T v2 foundational model. This process ensures unit generation is guided by intended speech rate and rhythm. Additionally, replacing the HiFi-GAN unit vocoder in SeamlessM4T v2 with an expressive unit-to-speech generator that is conditioned on the source speech allows for seamless transfer of tones, emotional expression and vocal styles.
Building upon our past work with WikiMatrix, CCMatrix, NLLB, SpeechMatrix and SeamlessM4T, we’re introducing the first expressive speech alignment procedure, which we also used to create SeamlessExpressive. Starting with raw data, the procedure automatically discovers pairs of audio segments that share not only the same meaning but also the same overall expressivity. To showcase this procedure, we are making metadata available to create a benchmarking dataset called SeamlessAlignExpressive, which can be used to validate the quality of our alignment method. SeamlessAlignExpressive is the first large-scale collection of multilingual audio alignments for benchmarking expressive translation.
Our state-of-the-art streaming model, SeamlessStreaming, is able to intelligently decide when it has enough context to output the next target text or speech segment. It does so through a learned read/write policy, which determines, based on partial audio input, whether it should “write” and generate output or “read” and continue waiting for more input. The model automatically adapts to different language structures, enabling stronger performance across many different language pairs.
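The read/write loop described above can be sketched in a few lines. This is a minimal illustration, not the SeamlessStreaming implementation: the learned policy is swapped for a fixed wait-k rule (read k source tokens ahead, then alternate), and the names `wait_k_policy`, `simultaneous_decode` and `toy_translate_step` are hypothetical helpers invented for this example.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# A policy sees how much has been read and written so far and returns
# "read" or "write". Here it is a fixed rule; in the real system this
# decision is learned (assumption: interface invented for illustration).
Policy = Callable[[int, int, bool], str]


def wait_k_policy(k: int) -> Policy:
    """Fixed stand-in for a learned policy: stay k source tokens ahead."""
    def decide(num_read: int, num_written: int, source_done: bool) -> str:
        if source_done or num_read - num_written >= k:
            return "write"
        return "read"
    return decide


def toy_translate_step(read_so_far: List[str], written: List[str],
                       source_done: bool) -> Optional[str]:
    """Toy 'translator': upper-cases the next source token, None when done."""
    i = len(written)
    if i < len(read_so_far):
        return read_so_far[i].upper()
    return None


def simultaneous_decode(
    source: Iterable[str],
    policy: Policy,
    translate_step: Callable[[List[str], List[str], bool], Optional[str]],
) -> Tuple[List[str], List[str]]:
    """Interleave reading partial input with writing partial output."""
    stream = iter(source)
    read_so_far: List[str] = []
    written: List[str] = []
    actions: List[str] = []
    source_done = False
    while True:
        action = policy(len(read_so_far), len(written), source_done)
        if action == "read":
            try:
                read_so_far.append(next(stream))
            except StopIteration:
                # No more input: let the policy flush the remaining output.
                source_done = True
                continue
            actions.append("read")
        else:
            token = translate_step(read_so_far, written, source_done)
            if token is None:
                break
            written.append(token)
            actions.append("write")
    return written, actions
```

Calling `simultaneous_decode(["a", "b", "c"], wait_k_policy(2), toy_translate_step)` reads two tokens before the first write and then alternates, flushing the last output once the input ends. A learned policy would replace the fixed `k` with a decision conditioned on the partial audio itself.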
Translating with high quality and high accuracy
The upgraded foundational multilingual and multitask model, SeamlessM4T v2, features a non-autoregressive text-to-unit decoder. The w2v-BERT 2.0 encoder is trained on 4.5 million hours of speech data, compared to the previous version, which was trained on 1 million hours. Additionally, SeamlessM4T v2 is supplemented with more data from SeamlessAlign for low-resource languages.
SeamlessM4T v2 was thoroughly evaluated across all tasks and languages with automatic metrics (BLEU, ASR-BLEU, BLASER 2, etc.), where it significantly outperformed previous state-of-the-art models. It was also tested for robustness, bias and hallucinated toxicity.
Source speech languages: 100+1
Source text languages: 95+1
Target speech languages: 35+1
Target text languages: 95+1
(+1 is for English)
Download the Seamless Communication models
In keeping with our approach to open science, we’re publicly releasing the full suite of Seamless Communication models, along with metadata, data and tools to allow the research community to build on this work.
The family of Seamless Communication models includes SeamlessExpressive, SeamlessStreaming, SeamlessM4T v2 and the unified Seamless model.
Safety and responsibility
Communication can suffer when ideas are mistranslated, which is why we are dedicated to promoting a safer and more responsible AI ecosystem.
In order to confirm authenticity, all generated audio outputs from our expressive models include watermarking. When creating a translation, an inaudible signature is added to the generated audio signal for tracing and auditability, enhancing safety. Our approach can watermark shorter segments and is more robust than current state-of-the-art methods.
Audio example: English input (whispered): “Please keep the volume down. We just put the baby to sleep.” Spanish output is provided both without and with the watermark.
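To make the idea of an inaudible, traceable signature concrete, here is a toy sketch. It is not the watermarking method used by the Seamless models: it swaps in a classic spread-spectrum scheme, where a low-amplitude pseudo-random sequence derived from a secret key is added to the samples and later detected by correlation. The function names, the `strength` value and the detection threshold are all assumptions chosen for illustration.

```python
import random
from typing import List


def keyed_sequence(key: int, n: int) -> List[float]:
    """Pseudo-random +/-1 sequence that only the key holder can regenerate."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def embed_watermark(samples: List[float], key: int,
                    strength: float = 0.02) -> List[float]:
    """Add the keyed sequence at low amplitude (toy value, not tuned for audibility)."""
    seq = keyed_sequence(key, len(samples))
    return [s + strength * c for s, c in zip(samples, seq)]


def detection_score(samples: List[float], key: int) -> float:
    """Mean correlation with the keyed sequence; near `strength` if marked."""
    seq = keyed_sequence(key, len(samples))
    return sum(s * c for s, c in zip(samples, seq)) / len(samples)


def is_watermarked(samples: List[float], key: int,
                   threshold: float = 0.01) -> bool:
    """Threshold the correlation score (threshold is an illustrative choice)."""
    return detection_score(samples, key) > threshold
```

With the right key, the correlation score of a marked signal rises by exactly `strength`, while a wrong key or unmarked audio stays near zero. A production system would also need to survive compression, resampling and editing, which this sketch does not attempt.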
Reducing hallucinated toxicity
Building upon our previous work on toxic hallucination detection, we developed a new method that meaningfully reduces hallucinated toxicity in translation outputs (i.e. toxic words that were not present in the input). This new method works at inference time; it can be applied to any translation model without retraining or performance loss.
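As a rough illustration of what "hallucinated" (added) toxicity means and how a check can run at inference time without retraining, the sketch below flags a candidate translation as problematic when it contains a listed toxic term although the source contains none, then picks the first clean candidate from an n-best list. Both the reranking strategy and the word lists are assumptions made for this example; they are not the method described above, and `badword` / `palabrota` are placeholder entries.

```python
from typing import List, Set

_PUNCT = ".,!?;:\"'()"


def toxic_terms(text: str, lexicon: Set[str]) -> Set[str]:
    """Return the lexicon entries that appear as words in the text."""
    words = {w.strip(_PUNCT).lower() for w in text.split()}
    return words & lexicon


def has_added_toxicity(source: str, hypothesis: str,
                       src_lexicon: Set[str], tgt_lexicon: Set[str]) -> bool:
    """Toxic output with a non-toxic input counts as hallucinated toxicity."""
    return (bool(toxic_terms(hypothesis, tgt_lexicon))
            and not toxic_terms(source, src_lexicon))


def pick_hypothesis(source: str, candidates: List[str],
                    src_lexicon: Set[str], tgt_lexicon: Set[str]) -> str:
    """Return the best-ranked candidate without added toxicity.

    Falls back to the top candidate if every option is flagged.
    """
    for cand in candidates:
        if not has_added_toxicity(source, cand, src_lexicon, tgt_lexicon):
            return cand
    return candidates[0]
```

Because the check only inspects inputs and outputs, it can wrap any translation model that returns ranked candidates. If the source itself contains a listed term, the top candidate is kept unchanged, since the toxicity was not added by the model.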
Explore the SeamlessExpressive demo yourself
Try the SeamlessExpressive demo to hear how you sound in a different language while maintaining elements of your expression and tone.