Facebook Research at ACL 2020

July 3, 2020

The Association for Computational Linguistics (ACL) conference is taking place online this year from July 5 to July 10. Researchers from Facebook are presenting their work in video spotlights, poster sessions, and other workshop activities. Our researchers, recruiters, and program managers are available during these activities to chat about Facebook AI research and potential career opportunities.

We’re sharing details on the research we’re presenting this year, covering key areas of NLP, including pretraining, cross-lingual, datasets and resources, dialogue, machine translation, probing, and more.

Pretraining

Pretraining is widely recognized as an important step in building the most advanced NLP models, as we’ve seen with BERT (Bidirectional Encoder Representations from Transformers). However, BERT pretrains only an encoder, not a decoder, which limits its use for sequence generation tasks. To fill that gap, we’re excited to share BART, a new model pretrained specifically for sequence-to-sequence problems, which not only matches the performance of RoBERTa on classification tasks but also achieves a new state of the art on text generation tasks. BART is trained by first corrupting text with an arbitrary noising function and then learning a model to reconstruct the original text. We’ve released the code and model.
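The corrupt-then-reconstruct setup described above can be illustrated with a toy example. This is only a minimal sketch, not the released BART code: it assumes random token masking as the noising function (one of several possible choices), and the names `MASK`, `mask_tokens`, and `make_denoising_example` are hypothetical.

```python
import random

MASK = "<mask>"  # placeholder symbol; an assumption made for this sketch


def mask_tokens(tokens, mask_prob, seed=0):
    """Replace each token with MASK independently with probability mask_prob."""
    rng = random.Random(seed)
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def make_denoising_example(text, mask_prob=0.3, seed=0):
    """Build one (corrupted input, reconstruction target) training pair."""
    original = text.split()  # whitespace tokenization, for illustration only
    corrupted = mask_tokens(original, mask_prob, seed)
    return corrupted, original


corrupted, target = make_denoising_example("the cat sat on the mat")
```

A sequence-to-sequence model would then be trained to map `corrupted` back to `target`. Swapping in a different noising function changes only `mask_tokens`.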

Facebook AI is also, to our knowledge, the first to apply pretraining for language modeling on both textual and tabular data. This architecture, dubbed TaBERT, is trained on 26 million tables and their English contexts, and improves question-answering performance. We’ve also released the first monolingual transformer-based language model for French, improving over the state of the art in four downstream tasks.

For those who are interested in the intersection between pretraining and efficiency, we’ve examined downstream text classification tasks and found that there is a point where pretraining may not be necessary. Our findings indicate that pretrained language models such as BERT may have diminishing returns when we increase the amount of training data.

Pretraining papers:

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer

CamemBERT: a Tasty French Language Model
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, Benoît Sagot

TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data
Pengcheng Yin, Graham Neubig, Wen-tau Yih, Sebastian Riedel

To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource Rich Tasks
Sinong Wang, Madian Khabsa, Hao Ma

Cross-lingual

Another core focus area for Facebook AI is cross-lingual research, where models are trained in one language and then used for other languages without additional training data. We’re presenting XLM-R, our state-of-the-art model that performs cross-lingual tasks in 100 different languages, including low-resource languages. As the first multilingual model to outperform traditional monolingual baselines that rely on pretrained models, it achieved state of the art on four benchmarks, including our question answering dataset MLQA, which we’re also presenting at ACL 2020. The XLM-R model is publicly available on GitHub, in HuggingFace Transformers, and in PyText.

We’ve also explored the underlying mechanics of cross-lingual models and how they work. We’ll discuss interesting, counterintuitive findings on how language-universal representations emerge in pretrained models even without a shared vocabulary or domain. Beyond helping push advancements in NLP, research like this can also play a role in providing better experiences for everyone, regardless of what language they speak.

Cross-lingual papers:

Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Edouard Grave, Guillaume Wenzek, Myle Ott, Ves Stoyanov, Luke Zettlemoyer

Emerging Cross-lingual Structure in Pretrained Language Models
Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, Ves Stoyanov

MLQA: Evaluating Cross-lingual Extractive Question Answering
Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, Holger Schwenk

Data sets & resources

Data sets and benchmarks play an essential role in measuring and improving the performance of NLP models. But with the field advancing so rapidly over the past several years, current NLP benchmarks can quickly become saturated and limited. That’s why we’ve developed new resources to help researchers better evaluate and improve their models. We’ve built a new large-scale natural language understanding (NLU) benchmark using a unique adversarial human-and-model-in-the-loop process, which creates a never-ending moving target for NLU models rather than just a static benchmark that will tend to quickly saturate.

We’re also introducing two additional resources to help researchers assess weaknesses in current NLU models. This work includes IMPPRES, a new evaluation dataset for pragmatic inference (e.g., do models know that “John ate some of the cookies” will typically be understood as meaning that he did not eat all of the cookies?), as well as a new sentence simplification dataset.

Data sets papers:

Adversarial NLI: A New Benchmark for Natural Language Understanding
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, Douwe Kiela

Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition
Paloma Jeretic, Alex Warstadt, Suvrat Bhooshan, Adina Williams

ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations
Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, Lucia Specia

Dialogue

We’re presenting progress for both open-domain chit-chat and task-oriented dialogue as well as general neural text generation. We built and open-sourced the largest-ever open-domain chatbot that can blend conversational skills — like empathy, knowledge, and personality — from different dialogue tasks. We also built a system that can converse about images in different styles, and we released a new dataset of 200,000 image-dialogue pairs.

Separately, we introduced a new dialogue evaluation metric that, unlike most tools, doesn’t require human reference utterances or direct human evaluations while still correlating well with human judgments. Also, we studied large-scale generative dialogue models’ ability to generalize across different tasks with the release of a unique set of 12 GLUE-style tasks called the dodecadialogue challenge.

We also introduced a large-scale dataset and data collection framework for semantic parsing in instruction-driven communication from human to virtual or robotic assistants. And we used unlikelihood training to improve dialogue text generation by reducing repetition, overuse of frequent words, logical flaws, and copying from the context.

Outside of dialogue, we also improved general neural text generation with a new retrieve-edit-rerank framework that retrieves candidates, edits them, and reranks edited candidates to produce final output.

Dialogue papers:

Can You Put It All Together: Evaluating Conversational Agents’ Ability to Blend Skills
Eric Smith, Mary Williamson, Kurt Shuster, Jason Weston, Y-Lan Boureau

Don’t Say That! Making Inconsistent Dialogue Unlikely with Unlikelihood Training
Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun Cho, Jason Weston

Image-Chat: Engaging Grounded Conversations
Kurt Shuster, Samuel Humeau, Antoine Bordes, Jason Weston

Learning an Unreferenced Metric for Online Dialogue Evaluation
Koustuv Sinha, Prasanna Parthasarathi, Jasmine Wang, Ryan Lowe, Will Hamilton, Joelle Pineau

The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents
Kurt Shuster, Dexter Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, Jason Weston

Simple and Effective Retrieve-Edit-Rerank Text Generation
Nabil Hossain, Marjan Ghazvininejad, Luke Zettlemoyer

CraftAssist Instruction Parsing: Semantic Parsing for a Minecraft Assistant
Yacine Jernite, Kavya Srinet, Jonathan Gray, Arthur Szlam

Machine translation

Evaluation of machine translation systems remains an open research challenge. For instance, why do the best-performing back-translation systems preferred by human evaluators perform worse on automatic metrics such as BLEU? We study this discrepancy in detail, carefully dissecting the standard evaluation protocol, and provide suggestions on how to improve automatic evaluation, such as complementing BLEU with a language model score to measure fluency. We also propose an entirely new approach to MT evaluation: Instead of using the first-best translation generated by a model, we rely on a diverse set of hypotheses generated from the model’s search space using various sampling techniques. We find that the resulting evaluations are more robust and correlate better with human judgments. This work helps us measure the quality of MT systems better and more efficiently without additional human references.

In other work, we analyze the state of current quality estimation public datasets and discover key shortcomings, such as biases toward fluency and lack of diversity in topics. We present concrete recommendations to improve the quality of datasets, which is an important step toward improving quality estimation models overall. This work guided us to create MLQE, a multilingual dataset extracted from Wikipedia, which is being used by the community for the multilingual quality estimation shared task at WMT.

In separate work, we improve the model by making better use of latent variables, which raises translation quality on several language pairs, especially when the training data are noisy.

Translation papers:

Multi-Hypothesis Machine Translation Evaluation
Marina Fomicheva, Paco Guzmán, Lucia Specia

Are We Estimating or Guesstimating Translation Quality?
Shuo Sun, Paco Guzmán, Lucia Specia

On the Evaluation of Machine Translation Systems Trained with Back-Translation
Sergey Edunov, Myle Ott, Marc'Aurelio Ranzato, Michael Auli

Addressing Posterior Collapse with Mutual Information for Improved Variational Neural Machine Translation
Arya D. McCarthy, Xian Li, Jiatao Gu, Ning Dong

Probing

To improve NLP models, it’s crucial for us to understand where they fail and where they succeed. One common method to try to understand NLP models is to train “probes” — models that are designed to find linguistic structure hidden in the representations output by other models.

We’re presenting a series of papers that shed light on how these probes work and encourage researchers to rethink their approach to training them. For instance, we argue that bigger, more complex probes are better than the more common and intuitive simple approach. Viewed from a novel information-theoretic perspective, the utility of complex probes becomes more immediately apparent.

In other work, we compare a new probe with a traditional parser and show that probes are not fundamentally different from the models they probe. And in complementary research, we create a novel dataset to test whether words have consistent contributions to the meaning of the sentence in our NLP models. We find that these models tend to generalize in ways that allow the meanings of individual words to vary in different contexts.

Probing papers:

A Tale of a Probe and a Parser
Rowan Hall Maudslay, Josef Valvoda, Tiago Pimentel, Adina Williams, Ryan Cotterell

Probing Linguistic Systematicity
Emily Goodwin, Koustuv Sinha, Timothy J. O’Donnell

Information-Theoretic Probing for Linguistic Structure
Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, Ryan Cotterell

Research on emergent languages, disambiguation, and more

In addition to the research detailed above, we’re presenting a range of noteworthy papers that fall outside these categories:

Active Learning for Coreference Resolution Using Discrete Annotation
Belinda Z. Li, Gabriel Stanovsky, Luke Zettlemoyer

We improve upon pairwise annotation for active learning in coreference resolution by asking annotators to identify mention antecedents if a presented mention pair is deemed not coreferent. This simple modification, when combined with a novel mention clustering algorithm for selecting which examples to label, is much more efficient in terms of the performance obtained per annotation budget. In experiments with existing benchmark coreference datasets, we show that the signal from this additional question leads to significant performance gains per human-annotation hour. Future work can use our annotation protocol to effectively develop coreference models for new domains. Our code is publicly available.

Asking and Answering Questions to Evaluate the Factual Consistency of Summaries
Alex Wang, Kyunghyun Cho, Mike Lewis

Practical applications of abstractive summarization models are limited by frequent factual inconsistencies with respect to their input. Existing automatic evaluation metrics for summarization are largely insensitive to such errors. We propose an automatic evaluation protocol called QAGS (pronounced "kags") that is designed to identify factual inconsistencies in a generated summary. QAGS is based on the intuition that if we ask questions about a summary and its source, we will receive similar answers if the summary is factually consistent with the source. To evaluate QAGS, we collect human judgments of factual consistency on model-generated summaries for the CNN/DailyMail (Hermann et al., 2015) and XSUM (Narayan et al., 2018) summarization datasets. QAGS has substantially higher correlations with these judgments than other automatic evaluation metrics. Also, QAGS offers a natural form of interpretability: The answers and questions generated while computing QAGS indicate which tokens of a summary are inconsistent and why. We believe QAGS is a promising tool in automatically generating usable and factually consistent text.
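The intuition behind QAGS (ask the same questions of the summary and the source, then compare the answers) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the question generation and question answering models are left out, their answers are assumed to be supplied as strings, token-level F1 is assumed as the answer-similarity measure, and the function names are hypothetical.

```python
from collections import Counter


def token_f1(answer_a, answer_b):
    """Token-level F1 overlap between two answer strings."""
    a, b = answer_a.lower().split(), answer_b.lower().split()
    if not a or not b:
        # Two empty answers agree; an empty and a non-empty answer do not.
        return float(a == b)
    overlap = sum((Counter(a) & Counter(b)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def consistency_score(answer_pairs):
    """Average agreement between (answer from summary, answer from source) pairs."""
    if not answer_pairs:
        raise ValueError("need at least one answer pair")
    return sum(token_f1(s, d) for s, d in answer_pairs) / len(answer_pairs)


# Hypothetical answers to two questions, for illustration only.
pairs = [("Paris", "Paris"), ("in 2015", "in 2018")]
score = consistency_score(pairs)
```

A low per-question score points to the part of the summary that disagrees with the source, which is the interpretability property the abstract mentions.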

Compositionality and Generalization In Emergent Languages
Rahma Chaabouni, Eugene Kharitonov, Diane Bouchacourt, Emmanuel Dupoux, Marco Baroni

Natural language allows us to refer to novel composite concepts by combining expressions denoting their parts according to systematic rules, a property known as compositionality. In this paper, we study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations, and whether it accomplishes this feat by strategies akin to human-language compositionality. Equipped with new ways to measure compositionality in emergent languages inspired by disentanglement in representation learning, we establish three main results. First, given sufficiently large input spaces, the emergent language will naturally develop the ability to refer to novel composite concepts. Second, there is no correlation between the degree of compositionality of an emergent language and its ability to generalize. Third, while compositionality is not necessary for generalization, it provides an advantage in terms of language transmission: The more compositional a language is, the more easily it will be picked up by new learners, even when the latter differ in architecture from the original agents. We conclude that compositionality does not arise from simple generalization pressure, but if an emergent language does chance upon it, it will be more likely to survive and thrive.

Joint Modeling of Emotion and Abusive Language Detection
Santhosh Rajamanickam, Pushkar Mishra, Helen Yannakoudakis, Ekaterina Shutova

The rise of online communication platforms has been accompanied by some undesirable effects, such as the proliferation of aggressive and abusive behaviour online. Aiming to tackle this problem, the natural language processing (NLP) community has experimented with a range of techniques for abuse detection. While achieving substantial success, these methods have so far only focused on modelling the linguistic properties of the comments and the online communities of users, disregarding the emotional state of the users and how this might affect their language. The latter is, however, inextricably linked to abusive behaviour. In this paper, we present the first joint model of emotion and abusive language detection, experimenting in a multi-task learning framework that allows one task to inform the other. Our results demonstrate that incorporating affective features leads to significant improvements in abuse detection performance across datasets.

Language Models as Fact Checkers?
Nayeon Lee, Belinda Z. Li, Sinong Wang, Wen-tau Yih, Hao Ma, Madian Khabsa

Recent work has suggested that language models (LMs) store both common-sense and factual knowledge learned from pre-training data. In this paper, we leverage this implicit knowledge to create an effective end-to-end fact checker using solely a language model, without any external knowledge or explicit retrieval components. While previous work on extracting knowledge from LMs has focused on the task of open-domain question answering, to the best of our knowledge, this is the first work to examine the use of language models as fact checkers. In a closed-book setting, we show that our zero-shot LM approach outperforms a random baseline on the standard FEVER task, and that our fine-tuned LM compares favorably with standard baselines. Though we do not ultimately outperform methods which use explicit knowledge bases, we believe our exploration shows that this method is viable and has much room for exploration.

Moving Down the Long Tail of Word Sense Disambiguation with Gloss Informed Bi-encoders
Terra Blevins, Luke Zettlemoyer

A major obstacle in Word Sense Disambiguation (WSD) is that word senses are not uniformly distributed, causing existing models to generally perform poorly on senses that are either rare or unseen during training. We propose a bi-encoder model that independently embeds (1) the target word with its surrounding context and (2) the dictionary definition, or gloss, of each sense. The encoders are jointly optimized in the same representation space, so that sense disambiguation can be performed by finding the nearest sense embedding for each target word embedding. Our system outperforms previous state-of-the-art models on English all-words WSD; these gains predominantly come from improved performance on rare senses, leading to a 31.1% error reduction on less frequent senses over prior work. This demonstrates that rare senses can be more effectively disambiguated by modeling their definitions.
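The nearest-sense lookup described in the abstract can be sketched with toy vectors. This is a minimal illustration under stated assumptions: the context and gloss encoders are not implemented, the embeddings below are made-up numbers, the dot product is assumed as the similarity measure, and `disambiguate` is a hypothetical name.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def disambiguate(context_vec, gloss_vecs):
    """Return the sense whose gloss embedding scores highest against the context."""
    return max(gloss_vecs, key=lambda sense: dot(context_vec, gloss_vecs[sense]))


# Made-up gloss embeddings for two senses of "bank".
gloss_vecs = {
    "bank (financial institution)": [0.9, 0.1],
    "bank (side of a river)": [0.1, 0.9],
}

# Made-up embedding of "bank" in "we sat on the bank of the stream".
context_vec = [0.2, 0.8]

predicted = disambiguate(context_vec, gloss_vecs)
```

Because each sense is represented by its definition rather than by training examples alone, a rare sense still gets an embedding to compare against, which is the motivation given in the abstract.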

On the Relationships Between the Grammatical Genders of Inanimate Nouns and Their Co-Occurring Adjectives and Verbs
Adina Williams, Ryan Cotterell, Lawrence Wolf-Sonkin, Damián Blasi, Hanna Wallach

We use large-scale corpora in six different gendered languages, along with tools from NLP and information theory, to test whether there is a relationship between the grammatical genders of inanimate nouns and the adjectives used to describe those nouns. For all six languages, we find that there is a statistically significant relationship. We also find that there are statistically significant relationships between the grammatical genders of inanimate nouns and the verbs that take those nouns as direct objects, as indirect objects, and as subjects. We defer a deeper investigation of these relationships to future work.

Predicting Declension Class from Form and Meaning
Adina Williams, Tiago Pimentel, Arya D. McCarthy, Hagen Blix, Eleanor Chodroff, Ryan Cotterell

The noun lexica of many natural languages are divided into several declension classes with characteristic morphological properties. Class membership is far from deterministic, but the phonological form of a noun and/or its meaning can often provide imperfect clues. Here, we investigate the strength of those clues. More specifically, we operationalize this by measuring how much information, in bits, we can glean about declension class from knowing the form and/or meaning of nouns. We know that form and meaning are often also indicative of grammatical gender (which, as we quantitatively verify, can itself share information with declension class), so we also control for gender. We find for two Indo-European languages (Czech and German) that form and meaning respectively share significant amounts of information with class (and contribute additional information above and beyond gender). The three-way interaction between class, form, and meaning (given gender) is also significant. Our study is important for two reasons: First, we introduce a new method that provides additional quantitative support for a classic linguistic finding that form and meaning are relevant for the classification of nouns into declensions. Second, we show not only that individual declension classes vary in the strength of their clues within a language, but also that these variations themselves vary across languages.
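To make "information, in bits" concrete, the sketch below computes a simple plug-in estimate of mutual information from paired observations. It is an illustration only: the paper's own estimation procedure is not reproduced here, the toy data are invented, and `mutual_information_bits` is a hypothetical name.

```python
from collections import Counter
from math import log2


def mutual_information_bits(pairs):
    """Plug-in estimate of I(X; Y) in bits from observed (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written in terms of counts.
    return sum(
        (c / n) * log2(c * n / (x_counts[x] * y_counts[y]))
        for (x, y), c in joint.items()
    )


# Invented (noun ending, declension class) observations.
fully_predictive = [("a", "I"), ("a", "I"), ("o", "II"), ("o", "II")]
uninformative = [("a", "I"), ("a", "II"), ("o", "I"), ("o", "II")]

mi_high = mutual_information_bits(fully_predictive)
mi_low = mutual_information_bits(uninformative)
```

When the ending fully determines one of two equally likely classes, the estimate is one bit; when ending and class are independent, it is zero. Real data fall between these extremes, which is what "imperfect clues" means quantitatively.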

Facebook AI workshops & tutorials at ACL 2020

Workshops

The Simultaneous Speech Translation Track @ The 17th International Conference on Spoken Language Translation (IWSLT), July 9 & 10
Jiatao Gu, Juan Pino, and Changhan Wang are organizers.

The 2nd Workshop on NLP for Conversational AI, July 9
Antoine Bordes is a keynote speaker, and Anuj Kumar is an organizer.

The 1st Joint Workshop on Narrative Understanding, Storylines, and Events (NUSE), July 9
Angela Fan is an invited speaker.

Workshop on Advances in Language and Vision Research (ALVR), July 9
Xinlei Chen is an organizer.

The 5th Workshop on Representation Learning for NLP (RepL4NLP-2020), July 9
Mike Lewis is a keynote speaker, and Fabio Petroni is an organizer.

The 1st Workshop on Natural Language Interfaces, July 10
Luke Zettlemoyer is an invited speaker, and Scott Wen-tau Yih is an organizer.

The 4th Workshop on Neural Generation and Translation, July 10
Jiatao Gu is an invited speaker, and Xian Li is an organizer.

The 1st Workshop on Automatic Simultaneous Translation, July 10
James Cross is an organizer.

Tutorials

Open-Domain Question Answering (Cutting-edge), July 5
Scott Wen-tau Yih is an organizer.