SPEECH & AUDIO

DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

October 22, 2023

Abstract

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR) which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units.

Download the Paper

AUTHORS

Written by

Michael Auli

Wei-Ning Hsu

Alexander Liu

Heng-Jui Chang

James Glass

Publisher

NeurIPS

Research Topics

Speech & Audio

Related Publications

July 23, 2024

HUMAN & MACHINE INTELLIGENCE

CONVERSATIONAL AI

The Llama 3 Herd of Models

Llama team

July 23, 2024

June 25, 2024

SPEECH & AUDIO

NLP

Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation

Min-Jae Hwang, Ilia Kulikov, Benjamin Peloquin, Hongyu Gong, Peng-Jen Chen, Ann Lee

June 25, 2024

June 05, 2024

SPEECH & AUDIO

Proactive Detection of Voice Cloning with Localized Watermarking

Robin San Romin, Pierre Fernandez, Hady Elsahar, Alexandre Deffosez, Teddy Furon, Tuan Tran

June 05, 2024

May 24, 2024

SPEECH & AUDIO

NLP

DOC-RAG: ASR Language Model Personalization with Domain-Distributed Co-occurrence Retrieval Augmentation

Zhe Liu

May 24, 2024

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.