RESEARCH

SPEECH & AUDIO

Low-Resource Corpus Filtering using Multilingual Sentence Embeddings

August 02, 2019

Abstract

In this paper, we describe our submission to the WMT19 low-resource parallel corpus filtering shared task. Our main approach is based on the LASER toolkit (Language-Agnostic SEntence Representations), which uses an encoder-decoder architecture trained on a parallel corpus to obtain multilingual sentence representations. We then use the representations directly to score and filter the noisy parallel sentences without additionally training a scoring function. We contrast our approach to other promising methods and show that LASER yields strong results. Finally, we produce an ensemble of different scoring methods and obtain additional gains. Our submission achieved the best overall performance for both the Nepali–English and Sinhala–English 1M tasks by a margin of 1.3 and 1.4 BLEU respectively, as compared to the second best systems. Moreover, our experiments show that this technique is promising for low and even no-resource scenarios.

Download the Paper

AUTHORS

Written by

Vishrav Chaudhary

Holger Schwenk

Paco Guzmán

Philipp Koehn

Yuqing Tang

Publisher

WMT ACL

Related Publications

April 14, 2024

SPEECH & AUDIO

NLP

CoLLD: Contrastive Layer-to-Layer Distillation for Compressing Multilingual Pre-Trained Speech Encoders

Heng-Jui Chang, Ning Dong (AI), Ruslan Mavlyutov, Sravya Popuri, Andy Chung

April 14, 2024

March 05, 2024

SPEECH & AUDIO

Generative Pre-training for Speech with Flow Matching

Alex Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu

March 05, 2024

December 11, 2023

SPEECH & AUDIO

Audiobox: Unified Audio Generation with Natural Language Prompts

Wei-Ning Hsu, Akinniyi Akinyemi, Alice Rakotoarison, Andros Tjandra, Apoorv Vyas, Baishan Guo, Bapi Akula, Bowen Shi, Brian Ellis, Ivan Cruz, Jeff Wang, Jiemin Zhang, Mary Williamson, Matt Le, Rashel Moritz, Robbie Adkins, William Ngan, Xinyue Zhang, Yael Yungster, Yi-Chiao Wu

December 11, 2023

November 30, 2023

SPEECH & AUDIO

NLP

Efficient Monotonic Multihead Attention

Xutai Ma, Anna Sun, Siqi Ouyang, Hirofumi Inaguma, Paden Tomasello

November 30, 2023

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.