COMPUTER VISION

SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

April 20, 2023

Abstract

Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems with synthetic lip movements. The key idea behind SynthVSR is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech. The speech-driven lip animation model is trained on an unlabeled audio-visual dataset and could be further optimized towards a pre-trained VSR model when labeled videos are available. As plenty of transcribed acoustic data and face images are available, we are able to generate large-scale synthetic data using the proposed lip animation model for semi-supervised VSR training. We evaluate the performance of our approach on the largest public VSR benchmark - Lip Reading Sentences 3 (LRS3). SynthVSR achieves a WER of 43.3% with only 30 hours of real labeled data, outperforming off-the-shelf approaches using thousands of hours of video. The WER is further reduced to 27.9% when using all 438 hours of labeled data from LRS3, which is on par with the state-of-the-art self-supervised AV-HuBERT method. Furthermore, when combined with large-scale pseudo-labeled audio-visual data SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing the recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours). Finally, we perform extensive ablation studies to understand the effect of each component in our proposed method.

Download the Paper

AUTHORS

Written by

Xubo Liu

Egor Lakomkin

Dino Vougioukas

Pingchuan Ma

Honglie Chen

Ruiming Xie

Morrie Doulaty

Niko Moritz

Jachym Kolar

Stavros Petridis

Maja Pantic

Christian Fuegen

Publisher

CVPR

Research Topics

Computer Vision

Related Publications

July 23, 2024

COMPUTER VISION

Imagine yourself: Tuning-Free Personalized Image Generation

Zecheng He, Bo Sun, Felix Xu, Haoyu Ma, Ankit Ramchandani, Vincent Cheung, Siddharth Shah, Anmol Kalia, Ning Zhang, Peizhao Zhang, Roshan Sumbaly, Peter Vajda, Animesh Sinha

July 23, 2024

July 23, 2024

HUMAN & MACHINE INTELLIGENCE

CONVERSATIONAL AI

The Llama 3 Herd of Models

Llama team

July 23, 2024

July 02, 2024

GRAPHICS

COMPUTER VISION

Meta 3D AssetGen: Text-to-Mesh Generation with High-Quality Geometry, Texture, and PBR Materials

Yawar Siddiqui, Tom Monnier, Filippos Kokkinos, Mahendra Kariya, Yanir Kleiman, Emilien Garreau, Oran Gafni, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, David Novotny

July 02, 2024

July 02, 2024

GRAPHICS

COMPUTER VISION

Meta 3D Gen

Raphael Bensadoun, Tom Monnier, Yanir Kleiman, Filippos Kokkinos, Yawar Siddiqui, Mahendra Kariya, Omri Harosh, Roman Shapovalov, Emilien Garreau, Animesh Karnewar, Ang Cao, Idan Azuri, Iurii Makarov, Eric-Tuan Le, Antoine Toisoul, David Novotny, Oran Gafni, Natalia Neverova, Andrea Vedaldi

July 02, 2024

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.