RESEARCH

SPEECH & AUDIO

Low-Resource Corpus Filtering using Multilingual Sentence Embeddings

August 02, 2019

Abstract

In this paper, we describe our submission to the WMT19 low-resource parallel corpus filtering shared task. Our main approach is based on the LASER toolkit (Language-Agnostic SEntence Representations), which uses an encoder-decoder architecture trained on a parallel corpus to obtain multilingual sentence representations. We then use the representations directly to score and filter the noisy parallel sentences without additionally training a scoring function. We contrast our approach to other promising methods and show that LASER yields strong results. Finally, we produce an ensemble of different scoring methods and obtain additional gains. Our submission achieved the best overall performance for both the Nepali–English and Sinhala–English 1M tasks by a margin of 1.3 and 1.4 BLEU respectively, as compared to the second best systems. Moreover, our experiments show that this technique is promising for low and even no-resource scenarios.

Download the Paper

AUTHORS

Written by

Vishrav Chaudhary

Holger Schwenk

Paco Guzmán

Philipp Koehn

Yuqing Tang

Publisher

WMT ACL

Related Publications

May 14, 2025

RESEARCH

CORE MACHINE LEARNING

UMA: A Family of Universal Models for Atoms

Brandon M. Wood, Misko Dzamba, Xiang Fu, Meng Gao, Muhammed Shuaibi, Luis Barroso-Luque, Kareem Abdelmaqsoud, Vahe Gharakhanyan, John R. Kitchin, Daniel S. Levine, Kyle Michel, Anuroop Sriram, Taco Cohen, Abhishek Das, Ammar Rizvi, Sushree Jagriti Sahoo, Zachary W. Ulissi, C. Lawrence Zitnick

May 14, 2025

May 14, 2025

HUMAN & MACHINE INTELLIGENCE

SPEECH & AUDIO

Emergence of Language in the Developing Brain

Linnea Evanson, Christine Bulteau, Mathilde Chipaux, Georg Dorfmüller, Sarah Ferrand-Sorbets, Emmanuel Raffo, Sarah Rosenberg, Pierre Bourdillon, Jean Remi King

May 14, 2025

May 13, 2025

HUMAN & MACHINE INTELLIGENCE

RESEARCH

Dynadiff: Single-stage Decoding of Images from Continuously Evolving fMRI

Marlène Careil, Yohann Benchetrit, Jean-Rémi King

May 13, 2025

April 25, 2025

RESEARCH

NLP

ReasonIR: Training Retrievers for Reasoning Tasks

Rulin Shao, Qiao Rui, Varsha Kishore, Niklas Muennighoff, Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Scott Yih, Pang Wei Koh, Luke Zettlemoyer

April 25, 2025

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.