November 29, 2023
Massively multilingual and multimodal sentence representations like SONAR are usually trained to capture only the meaning of the encoded text or speech. We complement this semantic embedding by a generic speech characteristic embedding which captures the expressive properties of a speech signal. We describe an iterative training procedure which aims to disentangle the semantics and expressive speech properties, and which does not need labeled data. We show the effectiveness of our method on the FLEURS and mExpresso benchmark test sets using multiple metrics which aim to measure the preservation of the meaning and prosody for zero-shot speech-to-speech translation from five languages into English.
Written by
Paul-Ambroise Duquenne
Kevin Heffernan
Alexandre Mourachko
Benoit Sagot (INRIA)
Publisher
arXiv
Research Topics
June 05, 2026
Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava
June 05, 2026
May 20, 2026
Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan, Mahi Luthra, Charles-Eric Saint-James, Rashel Moritz, Sheila Krogh-Jespersen, Vanessa Stark, Surya Parimi, Jiayi Shen, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Tom Fizycki, Nicolas Hamilakis, Manel Khentout, Sho Tsuji, Balázs Kégl, Juan Pino, Michael C. Frank, Emmanuel Dupoux
May 20, 2026
May 18, 2026
Rohit Patel, Alexandre Rezende, Steven McClain
May 18, 2026
May 12, 2026
Jean Remi King, Corentin Bel, Linnea Evanson, Julien Gadonneix, Sophia Houhamdi, Jarod Levy, Josephine Raugel, Andrea Santos Revilla, Mingfang (Lucy) Zhang, Julie Bonnaire, Charlotte Caucheteux, Alexandre Défossez, Théo Desbordes, Pablo Diego-Simón, Shubh Khanna, Juliette Millet, Pierre Orhan, Saarang Panchavati, Antoine Ratouchniak, Alexis Thual, Teon Brooks, Katelyn Begany, Yohann Benchetrit, Marlene Careil, Hubert Jacob Banville, Stéphane d'Ascoli, Simon Dahan, Jérémy Rapin
May 12, 2026

Our approach
Latest news
Foundational models