RESEARCH

SPEECH & AUDIO

Weak-Attention Suppression For Transformer Based Speech Recognition

August 13, 2020

Abstract

Transformers, originally proposed for natural language processing (NLP) tasks, have recently achieved great success in automatic speech recognition (ASR). However, adjacent acoustic units (i.e., frames) are highly correlated, and long-distance dependencies between them are weak, unlike text units. It suggests that ASR will likely benefit from sparse and localized attention. In this paper, we propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities. We demonstrate that WAS leads to consistent Word Error Rate (WER) improvement over strong transformer baselines. On the widely used LibriSpeech benchmark, our proposed method reduced WER by 10% on test-clean and 5% on test-other for streamable transformers, resulting in a new state-of-the-art among streaming models. Further analysis shows that WAS learns to suppress attention of non-critical and redundant continuous acoustic frames, and is more likely to suppress past frames rather than future ones. It indicates the importance of lookahead in attention-based ASR models.

Download the Paper

AUTHORS

Written by

Yangyang Shi

Ching-Feng Yeh

Christian Fuegen

Chunyang Wu

Duc Le

Frank Zhang

Mike Seltzer

Yongqiang Wang

Publisher

Interspeech

Related Publications

December 18, 2025

RESEARCH

COMPUTER VISION

Pixel Seal: Adversarial-only training for invisible image and video watermarking

Tomáš Souček, Pierre Fernandez, Hady Elsahar, Sylvestre Rebuffi, Valeriu Lacatusu, Tuan Tran, Tom Sander, Alexandre Mourachko

December 18, 2025

December 16, 2025

SPEECH & AUDIO

COMPUTER VISION

SAM Audio: Segment Anything in Audio

Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollar, Wei-Ning Hsu, Ann Lee

December 16, 2025

December 16, 2025

SPEECH & AUDIO

COMPUTER VISION

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Bernie Huang, Luya Gao, Julius Richter, Sanyuan Chen, Matt Le, Piotr Dollar, Christoph Feichtenhofer, Ann Lee, Wei-Ning Hsu

December 16, 2025

November 19, 2025

RESEARCH

COMPUTER VISION

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris Coll-Vinent, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollar, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, Christoph Feichtenhofer

November 19, 2025

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.