Wei-Ning Hsu

RESEARCH SCIENTIST | NEW YORK CITY, UNITED STATES

Wei-Ning is a research scientist at Meta AI (f.k.a. FAIR). His research focuses on representation learning, self-supervised learning, and structured generative modeling for unimodal and multimodal speech. He is passionate about reducing the supervision required for various speech applications and developing technologies applicable to both written and unwritten languages.

Prior to joining Facebook, Wei-Ning received his Ph.D. and S.M. degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in 2020 and 2018, respectively. He received his B.S. degree in Electrical Engineering from National Taiwan University in 2014.

Twitter

Google Scholar

Personal Website

Research Areas

Computer Vision

Natural Language Processing (NLP)

Speech & Audio

Wei-Ning's Work

Audio-Visual HuBERT

wav2vec-U

data2vec

Textless NLP

Textless Speech-to-Speech Translation

Wei-Ning's Publications

December 16, 2025

SPEECH & AUDIO

COMPUTER VISION

SAM Audio: Segment Anything in Audio

Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollar, Wei-Ning Hsu, Ann Lee

SPEECH & AUDIO

COMPUTER VISION

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

RESEARCH

SPEECH & AUDIO

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

SPEECH & AUDIO

Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

SPEECH & AUDIO

NLP

Toward Joint Language Modeling for Speech Units and Text

SPEECH & AUDIO

Generative Pre-training for Speech with Flow Matching

SPEECH & AUDIO

Audiobox: Unified Audio Generation with Natural Language Prompts

SPEECH & AUDIO

DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

NLP

EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

NLP

MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation

NLP

COMPUTER VISION

Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

SPEECH & AUDIO

NLP

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

SPEECH & AUDIO

NLP

Scaling Speech Technology to 1,000+ Languages

SPEECH & AUDIO

NLP

Cocktail HuBERT: Generalized Self-Supervised Pre-Training for Mixture and Single-Source Speech

NLP

Textless Speech Emotion Conversion using Discrete & Decomposed Representations

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

NLP

Unsupervised Speech Recognition