SPEECH & AUDIO

Generative Pre-training for Speech with Flow Matching

March 05, 2024

Abstract

Generative models have gained more and more attention in recent years for their remarkable success in tasks that require estimating and sampling data distributions to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoders are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step in this direction by showing that a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experimental results show that the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggests that a foundational model for generation tasks in speech can be built with generative pre-training.
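
To make "Flow Matching and masked conditions" more concrete, the following is a minimal sketch of how one training example for such an objective might be constructed. It is not the authors' implementation: the abstract does not give the path, features, or masking scheme, so the sketch assumes the commonly used straight-line (optimal-transport) conditional flow matching path, and the names `make_training_example`, `mse`, `sigma_min`, and `mask_fraction`, along with their default values, are hypothetical.

```python
import random


def make_training_example(x1, rng, sigma_min=1e-4, mask_fraction=0.7):
    """Build one flow-matching training example from clean features x1.

    x1 is a list of frames, each a list of floats (e.g. spectrogram bins).
    Returns (x_t, t, target_velocity, masked_condition).
    All defaults are illustrative assumptions, not values from the paper.
    """
    frames, dims = len(x1), len(x1[0])
    # Sample Gaussian noise x0 and a time step t in [0, 1).
    x0 = [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(frames)]
    t = rng.random()
    # Point on the straight-line path between noise (t=0) and data (t=1).
    a = 1.0 - (1.0 - sigma_min) * t
    x_t = [[a * n + t * d for n, d in zip(nr, dr)] for nr, dr in zip(x0, x1)]
    # Velocity of that path (its derivative in t), the regression target.
    target = [[d - (1.0 - sigma_min) * n for n, d in zip(nr, dr)]
              for nr, dr in zip(x0, x1)]
    # Masked condition: a copy of x1 with one contiguous span of frames zeroed.
    span = round(mask_fraction * frames)
    start = rng.randrange(frames - span + 1)
    cond = [[0.0] * dims if start <= i < start + span else list(row)
            for i, row in enumerate(x1)]
    return x_t, t, target, cond


def mse(pred, target):
    """Mean squared error between predicted and target velocities."""
    count = sum(len(row) for row in target)
    total = sum((p - q) ** 2
                for pr, tr in zip(pred, target)
                for p, q in zip(pr, tr))
    return total / count
```

In this sketch a model would receive `x_t`, `t`, and the masked condition, and be trained to minimize `mse` between its output and `target`. Fine-tuning for a downstream task would then amount to replacing the masked condition with a task-specific one (for example, noisy speech for enhancement), though the exact conditioning used by SpeechFlow is described in the paper rather than here.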

AUTHORS

Written by

Alex Liu

Matt Le

Apoorv Vyas

Bowen Shi

Andros Tjandra

Wei-Ning Hsu

Publisher

ICLR

Research Topics

Speech & Audio
