December 4, 2014
We introduce two methods to collect additional training data for statistical machine translation systems from public social network content. The first method identifies multilingual content in which authors translated their own posts to reach additional friends, fans or customers. Once such a post is identified, we split it into its language segments and extract translation pairs from them. The second method considers web links (URLs) that users add to their posts to point the reader to a video, article or website. If users of different languages share the same URL, there is a chance that they make the same comment in their respective languages. We use a support vector machine (SVM) as a classifier to identify true translations among all candidate pairs. We collected additional translation pairs using both methods for the language pairs Spanish-English and Portuguese-English. Testing the collected data as additional training data for statistical machine translation on in-domain test sets resulted in significant improvements of up to 5 BLEU.
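The candidate-generation step of the second method can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `(url, lang, text)` record layout, the function name `candidate_pairs`, and the sample posts are all assumptions made for the example, and the SVM filtering step is left out.

```python
from collections import defaultdict
from itertools import product


def candidate_pairs(posts, src_lang, tgt_lang):
    """Pair comments in two languages that were posted with the same URL.

    `posts` is an iterable of (url, lang, text) tuples. This record layout
    is hypothetical and chosen only for illustration.
    """
    # Group comment texts first by URL, then by language.
    by_url = defaultdict(lambda: defaultdict(list))
    for url, lang, text in posts:
        by_url[url][lang].append(text)

    pairs = []
    for url, by_lang in by_url.items():
        sources = by_lang.get(src_lang, [])
        targets = by_lang.get(tgt_lang, [])
        # Every source comment is paired with every target comment on the
        # same URL. Most of these are not translations of each other; a
        # classifier would have to filter them afterwards.
        for src, tgt in product(sources, targets):
            pairs.append((url, src, tgt))
    return pairs


if __name__ == "__main__":
    # Made-up sample posts, not data from the paper.
    sample = [
        ("http://example.com/a", "es", "Gran video"),
        ("http://example.com/a", "en", "Great video"),
        ("http://example.com/a", "en", "Watch this"),
        ("http://example.com/b", "es", "Buen articulo"),
    ]
    for pair in candidate_pairs(sample, "es", "en"):
        print(pair)
```

Pairing every comment with every other comment on the same URL keeps recall high at the cost of many false candidates, which is why a classification step is needed before the pairs can be used as training data.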
Research Topics
November 16, 2022
Kushal Tirumala, Aram H. Markosyan, Armen Aghajanyan, Luke Zettlemoyer
October 31, 2022
Fabio Petroni, Giuseppe Ottaviano, Michele Bevilacqua, Patrick Lewis, Scott Yih, Sebastian Riedel
December 06, 2020
Michael Lewis, Armen Aghajanyan, Gargi Ghosh, Luke Zettlemoyer, Marjan Ghazvininejad, Sida Wang
November 30, 2020
Dhruv Batra, Devi Parikh, Meera Hahn, Jacob Krantz, James Rehg, Peter Anderson, Stefan Lee
April 30, 2018
Yedid Hoshen, Lior Wolf
November 01, 2018
Yedid Hoshen, Lior Wolf
December 02, 2018
Sagie Benaim, Lior Wolf
June 30, 2019
Geng Ji, Dehua Cheng, Huazhong Ning, Changhe Yuan, Hanning Zhou, Liang Xiong, Erik B. Sudderth