Facebook Research at ICLR 2019

May 06, 2019

Machine learning researchers and engineers are gathering in New Orleans from May 6 to May 9 at the International Conference on Learning Representations (ICLR) to present and publish cutting-edge research. Facebook researchers will be participating in several activities, including an Expo session entitled AI Research Using PyTorch: Bayesian Optimization, Billion Edge Graphs and Private Deep Learning.

This year, Yann LeCun (Facebook, New York University), Geoffrey Hinton (Google, Vector Institute, and University of Toronto), and Yoshua Bengio (MILA and University of Montreal) were presented with the A.M. Turing Award for their contributions to deep learning and modern AI. In the late '80s, Yann figured out how to train convolutional neural networks (CNNs) using a back-propagation algorithm, and he pioneered several successful vision applications (e.g., check reading, face recognition) using learning-based methods. Yann's most influential publications are Gradient-Based Learning Applied to Document Recognition (1998), Deep Learning (2015), and Backpropagation Applied to Handwritten Zip Code Recognition (1989).

At ICLR 2019, Yann is the co-author of two research papers, The role of Over-parametrization in Generalization of Neural Networks and Model-Predictive Policy Learning with Uncertainty Regularization for Driving in Dense Traffic. Several Facebook researchers are also presenting their research in the form of posters and invited talks.

For those attending ICLR, be sure to stop by the Facebook Research exhibit booth no. 505, where we will showcase some of our latest technologies. We will show updates to PyTorch libraries, and you can demo prototypes of research, such as Recipe Creator and Music Translation. Recruiters and program managers will also be at the booth with job and program information.

Facebook research being presented at ICLR 2019

A Universal Music Translation Network

Noam Mor, Lior Wolf, Adam Polyak, Yaniv Taigman

We present a method for translating music across musical instruments and styles. This method is based on unsupervised training of a multidomain wavenet autoencoder, with a shared encoder and a domain-independent latent space that is trained end-to-end on waveforms. Employing a diverse training dataset and large net capacity, the single encoder also allows us to translate from musical domains that were not seen during training. We evaluate our method on a dataset collected from professional musicians and achieve convincing translations. We also study the properties of the obtained translation and demonstrate translating even from a whistle, potentially enabling the creation of instrumental music by untrained humans.

A Variational Inequality Perspective on GANs

Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, Simon Lacoste-Julien

Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notably difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization problems in the general variational inequality framework. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization and propose to extend techniques designed for variational inequalities to the training of GANs. We apply averaging, extrapolation, and a computationally cheaper variant that we call extrapolation from the past to the stochastic gradient method (SGD) and Adam.
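To make the extrapolation idea concrete, here is a minimal sketch on a toy saddle-point problem rather than a GAN. It assumes the bilinear game f(x, y) = x * y, where x is minimized and y is maximized; the function names, step size and starting point are illustrative choices, not taken from the paper. Only plain extrapolation is shown, not averaging or extrapolation from the past.

```python
import math

def simultaneous_step(x, y, lr):
    # Plain gradient descent on x and gradient ascent on y for f(x, y) = x * y.
    return x - lr * y, y + lr * x

def extragradient_step(x, y, lr):
    # Extrapolation: compute a look-ahead point from the current iterate.
    x_look, y_look = x - lr * y, y + lr * x
    # Update: move the current iterate using gradients taken at the look-ahead point.
    return x - lr * y_look, y + lr * x_look

def run(step, steps=200, lr=0.1):
    # Start at (1, 1) and report the final distance to the saddle point (0, 0).
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = step(x, y, lr)
    return math.hypot(x, y)

dist_simultaneous = run(simultaneous_step)
dist_extragradient = run(extragradient_step)
print(dist_simultaneous, dist_extragradient)
```

On this toy game the simultaneous update spirals away from the saddle point while the extrapolated update contracts toward it, which is the kind of behavior the variational inequality view is meant to explain.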

Adaptive Input Representations for Neural Language Modeling

Alexei Baevski, Michael Auli

We introduce adaptive input representations for neural language modeling that extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity. There are several choices on how to factorize the input and output layers, and whether to model words, characters or subword units. We perform a systematic comparison of popular choices for a self-attentional architecture. Our experiments show that models equipped with adaptive embeddings are more than twice as fast to train than the popular character input CNN while having a lower number of parameters. We achieve a new state of the art on the WIKITEXT-103 benchmark of 20.51 perplexity, improving the next best known result by 8.7 perplexity. On the BILLION WORD benchmark, we achieve 23.02 perplexity.

Algorithmic Framework for Model-based Deep Reinforcement Learning with Theoretical Guarantees

Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, Tengyu Ma

Model-based reinforcement learning (RL) is considered to be a promising approach to reduce the sample complexity that hinders model-free RL. However, the theoretical understanding of such methods has been rather limited. This paper introduces a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees. We design a meta-algorithm with a theoretical guarantee of monotone improvement to a local maximum of the expected reward. The meta-algorithm iteratively builds a lower bound of the expected reward based on the estimated dynamical model and sample trajectories, and then maximizes the lower bound jointly over the policy and the model. The framework extends the optimism-in-face-of-uncertainty principle to nonlinear dynamical models in a way that requires no explicit uncertainty quantification. Instantiating our framework with simplification gives a variant of model-based RL algorithms Stochastic Lower Bounds Optimization (SLBO). Experiments demonstrate that SLBO achieves state-of-the-art performance when only one million or fewer samples are permitted on a range of continuous control benchmark tasks.

Code2seq: Generating Sequences from Structured Representations of Code

Uri Alon, Shaked Brody, Omer Levy, Eran Yahav

The ability to generate natural language sequences from source code snippets has a variety of applications, such as code summarization, documentation, and retrieval. Sequence-to-sequence (seq2seq) models, adopted from neural machine translation (NMT), have achieved state-of-the-art performance on these tasks by treating source code as a sequence of tokens. We present CODE2SEQ, an alternative approach that leverages the syntactic structure of programming languages to better encode source code. Our model represents a code snippet as the set of compositional paths in its abstract syntax tree (AST) and uses attention to select the relevant paths while decoding. We demonstrate the effectiveness of our approach for two tasks, two programming languages, and four datasets of up to 16M examples. Our model significantly outperforms previous models specifically designed for programming languages as well as state-of-the-art NMT models. A demo of our model, along with our code, data, and trained models, is available online.

Efficient Lifelong Learning with A-GEM

Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, Mohamed Elhoseiny

In lifelong learning, the learner is presented with a sequence of tasks, incrementally building a data-driven prior which may be leveraged to speed up learning of a new task. In this work, we investigate the efficiency of current lifelong approaches, in terms of sample complexity and computational and memory cost. Toward this end, we first introduce a new, more realistic evaluation protocol, whereby learners observe each example only once and hyper-parameter selection is done on a small and disjoint set of tasks, which is not used for the actual learning experience and evaluation. Second, we introduce a new metric measuring how quickly a learner acquires a new skill. Third, we propose an improved version of GEM (Lopez-Paz & Ranzato, 2017), dubbed Averaged GEM (A-GEM), which enjoys the same performance as GEM or even better, while being almost as computationally and memory-efficient as EWC (Kirkpatrick et al., 2016) and other regularization-based methods. Finally, we show that all algorithms including A-GEM can learn even more quickly if they are provided with task descriptors specifying the classification tasks under consideration. Our experiments on several standard lifelong learning benchmarks demonstrate that A-GEM has the best trade-off between accuracy and efficiency.
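The abstract does not spell out the A-GEM update, so the following is only a sketch of our understanding of it: the gradient on the current task is compared with a reference gradient averaged over a sample from episodic memory, and if the two conflict (negative inner product) the conflicting component is projected out. The function names and the toy vectors are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def agem_project(grad, ref_grad):
    """Return grad unchanged if it agrees with ref_grad, otherwise project it.

    Assumption: ref_grad is the average gradient on examples drawn from
    memory of earlier tasks, and it is non-zero.
    """
    inner = dot(grad, ref_grad)
    if inner >= 0:
        # No conflict with past tasks: use the gradient as is.
        return list(grad)
    # Remove the component of grad that points against ref_grad.
    scale = inner / dot(ref_grad, ref_grad)
    return [g - scale * r for g, r in zip(grad, ref_grad)]

projected = agem_project([1.0, -2.0], [0.0, 1.0])
unchanged = agem_project([1.0, 2.0], [0.0, 1.0])
print(projected, unchanged)
```

Because only one averaged reference gradient is used, the projection is a closed-form vector operation, which is consistent with the abstract's claim of low computational and memory cost.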

Environment Probing Interaction Policies

Wenxuan Zhou, Lerrel Pinto, Abhinav Gupta

A key challenge in reinforcement learning (RL) is environment generalization: A policy trained to solve a task in one environment often fails to solve the same task in a slightly different test environment. A common approach to improve inter-environment transfer is to learn policies that are invariant to the distribution of testing environments. However, we argue that instead of being invariant, the policy should identify the specific nuances of an environment and exploit them to achieve better performance. In this work, we propose the Environment-Probing Interaction (EPI) policy, which probes a new environment to extract an implicit understanding of that environment’s behavior. Once this environment-specific information is obtained, it is used as an additional input to a task-specific policy that can now perform environment-conditioned actions to solve a task. To learn these EPI policies, we present a reward function based on transition predictability. Specifically, a higher reward is given if the trajectory generated by the EPI policy can be used to better predict transitions. We experimentally show that EPI-conditioned task-specific policies significantly outperform commonly used policy generalization methods on novel testing environments.

Equi-normalization of Neural Networks

Pierre Stock, Benjamin Graham, Rémi Gribonval, Hervé Jégou

Modern neural networks are over-parametrized. In particular, each rectified linear hidden unit can be modified by a multiplicative factor by adjusting input and output weights, without changing the rest of the network. Inspired by the Sinkhorn-Knopp algorithm, we introduce a fast iterative method for minimizing the l2 norm of the weights, equivalently the weight decay regularizer. It provably converges to a unique solution. Interleaving our algorithm with SGD during training improves the test accuracy. For small batches, our approach offers an alternative to batch and group normalization on CIFAR-10 and ImageNet with a ResNet-18.
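The rescaling invariance described above can be checked on a tiny example. The sketch below assumes a two-layer ReLU network without biases and balances each hidden unit in closed form; it is not the paper's iterative algorithm for deep networks, and all names and weights are made up for illustration.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def forward(w1, w2, x):
    # Two-layer network: output = w2 @ relu(w1 @ x).
    return matvec(w2, relu(matvec(w1, x)))

def sq_norm(w1, w2):
    # Sum of squared weights, i.e. the weight decay term without its coefficient.
    return sum(w * w for m in (w1, w2) for row in m for w in row)

def balance_hidden_units(w1, w2):
    """Scale unit i's input weights by d > 0 and its output weights by 1 / d.

    Choosing d = sqrt(out_norm / in_norm) minimizes the unit's squared norm.
    Assumes both norms are non-zero.
    """
    new_w1 = []
    new_w2 = [list(row) for row in w2]
    for i, row in enumerate(w1):
        in_norm = math.sqrt(sum(w * w for w in row))
        out_norm = math.sqrt(sum(r[i] * r[i] for r in w2))
        d = math.sqrt(out_norm / in_norm)
        new_w1.append([w * d for w in row])
        for r in new_w2:
            r[i] /= d
    return new_w1, new_w2

w1 = [[4.0, 0.0], [0.0, 0.5]]  # hidden x input
w2 = [[1.0, 2.0]]              # output x hidden
b1, b2 = balance_hidden_units(w1, w2)
x = [1.0, 3.0]
print(forward(w1, w2, x), forward(b1, b2, x))
print(sq_norm(w1, w2), sq_norm(b1, b2))
```

The network output is the same before and after balancing, while the squared weight norm decreases, because a positive scale factor passes through the ReLU unchanged.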

Fluctuation-Dissipation Relations for Stochastic Gradient Descent

Sho Yaida

The notion of the stationary equilibrium ensemble has played a central role in statistical mechanics. In machine learning as well, training serves as generalized equilibration that drives the probability distribution of model parameters toward stationarity. Here, we derive stationary fluctuation-dissipation relations that link measurable quantities and hyperparameters in the stochastic gradient descent algorithm. These relations hold exactly for any stationary state and can in particular be used to adaptively set training schedule. We can further use the relations to efficiently extract information pertaining to a loss-function landscape such as the magnitudes of its Hessian and anharmonicity. Our claims are empirically verified.

Generative Question Answering: Learning to Answer the Whole Question

Michael Lewis, Angela Fan

Discriminative question answering models can overfit to superficial biases in datasets, because their loss function saturates when any clue makes the answer likely. We introduce generative models of the joint distribution of questions and answers, which are trained to explain the whole question, not just answer it. Our question answering (QA) model is implemented by learning a prior over answers, and a conditional language model to generate the question given the answer — allowing scalable and interpretable many-hop reasoning as the question is generated word-by-word. Our model achieves competitive performance with comparable discriminative models on the SQUAD and CLEVR benchmarks, indicating that it is a more general architecture for language understanding and reasoning than previous work. The model greatly improves generalization both from biased training data and to adversarial testing data, achieving state-of-the-art results on ADVERSARIALSQUAD.

Hierarchical Proprioceptive Controllers for Locomotion in Mazes

Kenneth Marino, Abhinav Gupta, Rob Fergus, Arthur Szlam

In this work we introduce a simple, robust approach to hierarchically training an agent in the setting of sparse reward tasks. The agent is split into a low-level and a high-level policy. The low-level policy accesses only internal, proprioceptive dimensions of the state observation. The low-level policies are trained with a simple reward that encourages changing the values of the non-proprioceptive dimensions. Furthermore, it is induced to be periodic with the use of a phase function. The high-level policy is trained using a sparse, task-dependent reward and operates by choosing which of the low-level policies to run at any given time. Using this approach, we solve difficult maze and navigation tasks with sparse rewards using the Mujoco Ant and Humanoid agents and show improvement over recent hierarchical methods.

Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future

Nan Rosemary Ke, Amanpreet Singh, Ahmed Touati, Anirudh Goyal, Yoshua Bengio, Devi Parikh, Dhruv Batra

In model-based reinforcement learning, the agent interleaves between model learning and planning. These two components are inextricably intertwined. If the model is not able to provide sensible long-term prediction, the executed planner would exploit model flaws, which can yield catastrophic failures. This paper focuses on building a model that reasons about the long-term future and demonstrates how to use this for efficient planning and exploration. To this end, we build a latent-variable autoregressive model by leveraging recent ideas in variational inference. We argue that forcing latent variables to carry future information through an auxiliary task substantially improves long-term predictions. Moreover, by planning in the latent space, the planner’s solution is ensured to be within regions where the model is valid. An exploration strategy can be devised by searching for unlikely trajectories under the model. Our method achieves higher reward faster when compared with baselines on a variety of tasks and environments in both the imitation learning and model-based reinforcement learning settings.

Learning Exploration Policies for Navigation

Tao Chen, Saurabh Gupta, Abhinav Gupta

Numerous past works have tackled the problem of task-driven navigation. But, how to effectively explore a new environment to enable a variety of down-stream tasks has received much less attention. In this work, we study how agents can autonomously explore realistic and complex 3D environments without the context of task rewards. We propose a learning-based approach and investigate different policy architectures, reward functions, and training paradigms. We find that use of policies with spatial memory that are bootstrapped with imitation learning and finally fine-tuned with coverage rewards derived purely from onboard sensors can be effective at exploring novel environments. We show that our learned exploration policies can explore better than classical approaches based on geometry alone and generic learning-based exploration techniques. Finally, we also show how such task-agnostic exploration can be used for down-stream tasks. Videos are available online.

Learning When to Communicate at Scale in Multi-agent Cooperative and Competitive Tasks

Amanpreet Singh, Tushar Jain, Sainbayar Sukhbaatar

Learning when to communicate and doing it effectively is essential in multi-agent tasks. Recent works show that continuous communication allows efficient training with back-propagation in multi-agent scenarios, but have been restricted to fully cooperative tasks. In this paper, we present the Individualized Controlled Continuous Communication Model (IC3Net), which has better training efficiency than the simple continuous communication model, and can be applied to semicooperative and competitive settings along with the cooperative settings. IC3Net controls continuous communication with a gating mechanism and uses individualized rewards for each agent to gain better performance and scalability while fixing credit assignment issues. Using a variety of tasks, including StarCraft BroodWars explore and combat scenarios, we show that our network yields improved performance and convergence rates compared with the baselines as the scale increases. Our results convey that IC3Net agents learn when to communicate based on the scenario and profitability.

M3RL: Mind-aware Multi-agent Management Reinforcement Learning

Tianmin Shu, Yuandong Tian

Most of the prior work on multi-agent reinforcement learning (MARL) achieves optimal collaboration by directly learning a policy for each agent to maximize a common reward. In this paper, we aim to address this from a different angle. In particular, we consider scenarios where there are self-interested agents (worker agents) that have their own minds (preferences, intentions, skills, etc.) and cannot be dictated to perform tasks they do not want to do. For achieving optimal coordination among these agents, we train a superagent (the manager) to manage them by inferring their minds based on both current and past observations, and then initiating contracts to assign suitable tasks to workers and promise to reward them with corresponding bonuses so that they will agree to work together. The objective of the manager is to maximize the overall productivity as well as minimize payments made to the workers for ad-hoc worker teaming. To train the manager, we propose Mind-aware Multi-agent Management Reinforcement Learning (M3RL), which consists of agent modeling and policy learning. We have evaluated our approach in two environments, Resource Collection and Crafting, to simulate multi-agent management problems with various task settings and multiple designs for the worker agents. The experimental results have validated the effectiveness of our approach in modeling worker agents’ minds online and in achieving optimal ad-hoc teaming with good generalization and fast adaptation.

Multiple-Attribute Text Rewriting

Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, Y-Lan Boureau

The dominant approach to unsupervised style transfer in text is based on the idea of learning a latent representation, which is independent of the attributes specifying its style. In this paper, we show that this condition is not necessary and is not always met in practice, even with domain adversarial training that explicitly aims at learning such disentangled representations. We thus propose a new model that controls several factors of variation in textual data where this condition on disentanglement is replaced with a simpler mechanism based on back-translation. Our method allows control over multiple attributes, such as gender, sentiment, and product type, and a more fine-grained control on the trade-off between content preservation and change of style with a pooling operator in the latent space. Our experiments demonstrate that the fully entangled model produces better generations, even when tested on new and more challenging benchmarks comprising reviews with multiple sentences and multiple attributes.

No Training Required: Exploring Random Encoders for Sentence Classification

John Wieting, Douwe Kiela

We explore various methods for computing sentence representations from pretrained word embeddings without any training, i.e., using nothing but random parameterizations. Our aim is to put sentence embeddings on more solid footing by 1) looking at how much modern sentence embeddings gain over random methods (as it turns out, surprisingly little); and by 2) providing the field with more appropriate baselines going forward, which are quite strong. We also make important observations about proper experimental protocol for sentence classification evaluation, along with recommendations for future research.

Pay Less Attention with Lightweight and Dynamic Convolutions

Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, Michael Auli

Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT’14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.

Quasi-Hyperbolic Momentum and Adam for Deep Learning

Jerry Ma, Denis Yarats

Momentum-based acceleration of stochastic gradient descent (SGD) is widely used in deep learning. We propose the quasi-hyperbolic momentum algorithm (QHM) as an extremely simple alteration of momentum SGD, averaging a plain SGD step with a momentum step. We describe numerous connections to and identities with other algorithms, and we characterize the set of two-state optimization algorithms that QHM can recover. Finally, we propose a QH variant of Adam called QHAdam, and we empirically demonstrate that our algorithms lead to significantly improved training in a variety of settings, including a new state-of-the-art result on WMT16 EN-DE. We hope that these empirical results, combined with the conceptual and practical simplicity of QHM and QHAdam, will spur interest from both practitioners and researchers. Code is immediately available.
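As a rough illustration of "averaging a plain SGD step with a momentum step", the sketch below applies a QHM-style update to a one-dimensional quadratic. It reflects our reading of the update rule: an exponentially weighted gradient buffer with discount beta, mixed with the raw gradient using a weight nu. The hyperparameter values and the objective are arbitrary, and the function name is hypothetical.

```python
def qhm_minimize(grad_fn, theta, lr=0.1, beta=0.9, nu=0.7, steps=200):
    """Minimize a scalar function with a quasi-hyperbolic momentum style update."""
    buf = 0.0  # momentum buffer: exponentially weighted average of past gradients
    for _ in range(steps):
        grad = grad_fn(theta)
        buf = beta * buf + (1.0 - beta) * grad
        # Weighted average of the plain SGD direction and the momentum direction.
        # nu = 0 gives plain SGD; nu = 1 gives (normalized) momentum SGD.
        theta -= lr * ((1.0 - nu) * grad + nu * buf)
    return theta

# Gradient of f(t) = (t - 3)^2, whose minimizer is t = 3.
theta_final = qhm_minimize(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_final)
```

The only extra state compared with plain SGD is the single buffer, which is what makes the method simple to add to an existing training loop.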

Selfless Sequential Learning

Rahaf Aljundi, Marcus Rohrbach, Tinne Tuytelaars

Sequential learning, also called lifelong learning, studies the problem of learning tasks in a sequence with access restricted to only the data of the current task. In this paper we look at a scenario with fixed model capacity, and postulate that the learning process should not be selfish, i.e., it should account for future tasks to be added and thus leave enough capacity for them. To achieve Selfless Sequential Learning, we study different regularization strategies and activation functions. We find that imposing sparsity at the level of the representation (i.e., neuron activations) is more beneficial for sequential learning than encouraging parameter sparsity. In particular, we propose a novel regularizer, which encourages representation sparsity by means of neural inhibition. It results in few active neurons which, in turn, leaves more free neurons to be utilized by upcoming tasks. As neural inhibition over an entire layer can be too drastic, especially for complex tasks requiring strong representations, our regularizer inhibits only other neurons in a local neighborhood, inspired by lateral inhibition processes in the brain. We combine our novel regularizer with state-of-the-art lifelong learning methods that penalize changes to important previously learned parts of the network. We show that our new regularizer leads to increased sparsity, which translates into consistent performance improvements on diverse datasets.

See code on GitHub.

Spreading Vectors for Similarity Search

Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Discretizing multidimensional data distributions is a fundamental step of modern indexing methods. State-of-the-art techniques learn parameters of quantizers on training data for optimal performance, thus adapting quantizers to the data. In this work, we propose to reverse this paradigm and adapt the data to the quantizer: we train a neural net whose last layer forms a fixed parameter-free quantizer, such as predefined points of a hyper-sphere. As a proxy objective, we design and train a neural network that favors uniformity in the spherical latent space, while preserving the neighborhood structure after the mapping. We propose a new regularizer derived from the Kozachenko–Leonenko differential entropy estimator to enforce uniformity and combine it with a locality-aware triplet loss. Experiments show that our end-to-end approach outperforms most learned quantization methods, and is competitive with the state of the art on widely adopted benchmarks. Furthermore, we show that training without the quantization step results in almost no difference in accuracy, but yields a generic catalyzer that can be applied with any subsequent quantizer. The code is available online.
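A uniformity term of the kind described above can be sketched in a few lines. The version below assumes the regularizer is the negative mean log distance from each point to its nearest neighbor, which is our reading of how a Kozachenko–Leonenko style estimator turns into a loss; the helper names and the test points on the unit circle are illustrative only, and the triplet loss is not shown.

```python
import math

def koleo_loss(points):
    """Negative mean log distance from each point to its nearest neighbor.

    Lower values mean the points are more spread out.
    Assumes at least two points and that no two points coincide.
    """
    total = 0.0
    for i, p in enumerate(points):
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        total += math.log(nearest)
    return -total / len(points)

def on_circle(angles):
    # Place points on the unit circle, a 2-D stand-in for the hyper-sphere.
    return [(math.cos(a), math.sin(a)) for a in angles]

clustered = on_circle([0.0, 0.1, 0.2, 0.3])
spread = on_circle([0.0, 0.5 * math.pi, math.pi, 1.5 * math.pi])
loss_clustered = koleo_loss(clustered)
loss_spread = koleo_loss(spread)
print(loss_clustered, loss_spread)
```

Minimizing such a term pushes nearest neighbors apart, so evenly spread points score lower than clustered ones.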

Unsupervised Hyper-Alignment for Multilingual Word Embeddings

Jean Alaux, Edouard Grave, Marco Cuturi, Armand Joulin

We consider the problem of aligning continuous word representations, learned in multiple languages, to a common space. It was recently shown that, in the case of two languages, it is possible to learn such a mapping without using supervision. In this paper, we propose to extend one of the proposed methods to the problem of aligning multiple languages to a common space. A simple solution to this problem is to independently map all languages to English. Unfortunately, this leads to poor alignments between languages other than English. We thus propose to add constraints to ensure that the learned mapping can be composed, leading to better alignments. We evaluate our method on the problem of aligning word vectors in eleven languages, showing improvement in word translation requiring the composition of mappings.

Value Propagation Networks

Nantas Nardelli, Gabriel Synnaeve, Zeming Lin, Philip H. S. Torr, Pushmeet Kohli, Nicolas Usunier

We present Value Propagation (VProp), a set of parameter-efficient differentiable planning modules built on Value Iteration that can be successfully trained using reinforcement learning to solve unseen tasks, can generalize to larger map sizes, and can learn to navigate in dynamic environments. We show that the modules enable learning to plan when the environment also includes stochastic elements, providing a cost-efficient learning system to build low-level size-invariant planners for a variety of interactive navigation problems. We evaluate on static and dynamic configurations of MazeBase grid-worlds, with randomly generated environments of several different sizes, and on a StarCraft navigation scenario, with more complex dynamics, and pixels as input.
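For readers unfamiliar with Value Iteration, the classical tabular algorithm that these modules build on can be sketched as follows. This is the textbook procedure on a small grid with deterministic four-neighbor moves, not the differentiable VProp module itself; the reward layout, discount factor and function name are assumptions made for the example.

```python
def value_iteration(rewards, gamma=0.9, sweeps=50):
    """Tabular value iteration: V(s) = R(s) + gamma * max over neighbors V(s')."""
    rows, cols = len(rewards), len(rewards[0])
    values = [[0.0] * cols for _ in range(rows)]
    moves = ((0, 1), (0, -1), (1, 0), (-1, 0))
    for _ in range(sweeps):
        new_values = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                # Best value among the in-bounds neighboring cells.
                best_next = max(
                    values[r + dr][c + dc]
                    for dr, dc in moves
                    if 0 <= r + dr < rows and 0 <= c + dc < cols
                )
                new_values[r][c] = rewards[r][c] + gamma * best_next
        values = new_values
    return values

# 3x3 grid with a single rewarding cell in the bottom-right corner.
rewards = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
values = value_iteration(rewards)
print(values[2][2], values[1][2], values[0][0])
```

Each sweep applies the same local update to every cell, so the procedure does not depend on the grid size, which gives some intuition for why planners built on it can transfer to larger maps.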

Other activities at ICLR 2019

Debugging Machine Learning Models Workshop

Cristian Canton, Sam Corbett-Davies, Albert Gordo, Yannis Kalantidis, Madian Khabsa, program committee
Paper: The Scientific Method in the Science of Machine Learning
Michela Paganini & Jessica Forde (Project Jupyter)

Deep Generative Models for Highly Structured Data Workshop

Kyunghyun Cho, organizer

Deep Reinforcement Learning Meets Structured Prediction Workshop

Yuandong Tian, organizer

ICLR 2019 Expo Session

AI Research Using PyTorch: Bayesian Optimization, Billion Edge Graphs, and Private Deep Learning
Max Balandat, Soumith Chintala, Adam Lerer, Luca Wehrstedt, speakers

Learning Representations Using Causal Invariance

Leon Bottou
2:30 p.m. – 3:15 p.m., invited talk

Representation Learning on Graphs and Manifolds Workshop

Maximilian Nickel & Adriana Romero, organizers

Task Agnostic Reinforcement Learning (TARL) Workshop

Amy Zhang, Roberto Calandra, Alessandro Lazaric, Joelle Pineau, organizers
Paper: Insights on Visual Representations for Embodied Navigation Tasks
Julian Straub, Ari Morcos, Dhruv Batra, Erik Wijmans, Judy Hoffman