Using Web Text to Improve Keyword Spotting in Speech

December 8, 2013

Abstract

For low-resource languages, collecting enough training data to build acoustic and language models is time-consuming and often expensive. Yet large amounts of text data, such as online newspapers, web forums, and online encyclopedias, usually exist for languages with a large population of native speakers. This text can be easily collected from the web and used both to expand the recognizer’s vocabulary and to improve the language model. One challenge, however, is normalizing and filtering the web data for a specific task. In this paper, we investigate the use of online text resources to improve speech recognition performance specifically for the task of keyword spotting. For the five languages provided in the base period of the IARPA BABEL project, we automatically collected text data from the web using only LimitedLP resources. We then compared two methods for filtering the web data, one based on perplexity ranking and the other based on out-of-vocabulary (OOV) word detection. By integrating the web text into our systems, we observed significant improvements in keyword spotting accuracy for four of the five languages. The best approach improved actual term-weighted value (ATWV) by 0.0424 over a baseline system trained only on LimitedLP resources. On average, ATWV improved by 0.0243 across the five languages.
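The two filtering ideas named in the abstract can be illustrated with a minimal sketch. The abstract does not say how either filter is implemented, so everything below is an assumption made for illustration: the add-one smoothed unigram model stands in for whatever language model the authors used, the OOV filter uses a simple per-sentence OOV-rate threshold (only one possible reading of "OOV word detection"), and the names `train_unigram`, `rank_by_perplexity`, `filter_by_oov`, `keep`, and `max_oov_rate` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Build an add-one smoothed unigram model from in-domain sentences.

    Assumption: a unigram model is only a stand-in for the (unspecified)
    language model used to score web text.
    """
    counts = Counter(word for s in sentences for word in s.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen words

    def logprob(word):
        return math.log((counts.get(word, 0) + 1) / (total + vocab_size))

    return counts, logprob


def perplexity(sentence, logprob):
    """Per-word perplexity of a sentence under the in-domain model."""
    words = sentence.split()
    if not words:
        return float("inf")
    return math.exp(-sum(logprob(w) for w in words) / len(words))


def rank_by_perplexity(web_sentences, logprob, keep):
    """Keep the `keep` web sentences that look most like in-domain text."""
    return sorted(web_sentences, key=lambda s: perplexity(s, logprob))[:keep]


def filter_by_oov(web_sentences, vocab, max_oov_rate):
    """Keep sentences whose fraction of out-of-vocabulary words is small.

    Assumption: thresholding the OOV rate is a hypothetical criterion,
    not necessarily the one used in the paper.
    """
    kept = []
    for s in web_sentences:
        words = s.split()
        if not words:
            continue
        oov_rate = sum(w not in vocab for w in words) / len(words)
        if oov_rate <= max_oov_rate:
            kept.append(s)
    return kept
```

In this sketch, perplexity ranking favors web sentences that resemble the in-domain training transcripts, while the OOV-rate filter discards sentences dominated by words the recognizer does not know. A real system would likely use a higher-order n-gram model and language-specific text normalization before either step.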
