Facebook research at NeurIPS 2020

December 2, 2020

Building AI that can teach itself by paraphrasing sample text. Combining visual and tactile learning for 3D understanding. Creating a single algorithm that can excel at both chess and poker. Modeling complex geometries to help build tools to learn earthquake and flood patterns. These are some of the topics that Facebook AI researchers will present new work on in poster sessions, spotlight presentations, and workshops at the virtual Conference on Neural Information Processing Systems (NeurIPS), one of the largest AI conferences of the year. From Sunday, December 6, through Saturday, December 12, attendees can drop by Facebook’s virtual exhibit booth to meet researchers, try demos, and chat with our recruitment team.

At this year's conference, Facebook AI is sharing work that accelerates progress across a wide range of important areas of AI, from natural language processing (NLP) to speech recognition to embodied AI to multimodal understanding. We’re also hosting competition tracks to help accelerate scientific research. The Hateful Memes Challenge invites researchers to build new models that detect hate speech in multimodal memes; the fastMRI 2020 reconstruction challenge centers on brain scans. And we’re leading a tutorial on how to build AI with security and privacy in mind.

You can find a list of our papers here and more details of our participation here. Today we’re sharing details on several of the notable papers that we’ll be presenting at the conference:

Paraphrasing: A new self-supervised learning technique

Most AI systems learn languages using a massive volume of labeled examples for each and every task. How can we reduce this dependency on labels and learn in a more humanlike way? Recently, researchers have found success in training models for one task and then applying the same knowledge for new tasks — also known as pretraining. The most popular pretraining technique today involves masking some words in a passage of text and then teaching systems to fill in the blanks. But this method — learning to fill in missing words in a sentence — is not directly useful for the kinds of tasks we actually want AI to perform, such as translating between languages or understanding and answering a question. For these and other applications, the model still needs additional task-specific training data.
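
As a rough illustration of the fill-in-the-blanks objective described above, the sketch below hides a random subset of words and records what a model would be asked to predict. It is a toy example, not the training code of any particular model: the function name `mask_tokens`, the `[MASK]` symbol, and the masking probability are all assumptions made for illustration.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask symbol.

    Returns the corrupted sequence and a dict mapping each masked
    position to the original token the model would have to predict.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)  # hide the word from the model
            targets[i] = tok              # remember the answer for training
        else:
            corrupted.append(tok)
    return corrupted, targets

sentence = "the model learns to fill in the missing words".split()
corrupted, targets = mask_tokens(sentence, mask_prob=0.3)
```

The training signal comes only from the masked positions, which is why a model trained this way still needs task-specific data before it can translate or answer questions.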

Now we’re introducing an entirely new approach to pretraining, called Multilingual Autoencoder that Retrieves and Generates (MARGE), that works on many tasks without task-specific data. MARGE rivals the performance of the masking method and — best of all — can work for both classification tasks (like question answering) and generation tasks (like summarization or translation) in many languages, making it arguably the most general pretraining method to date. Not only can MARGE do well on some tasks without any additional task-specific data, but it also achieves similar performance to masking when given labeled data.

Instead of masking parts of the documents, MARGE finds related documents and then paraphrases them to reconstruct the original. By finding relevant facts during pretraining, our model can focus on learning to paraphrase rather than memorizing encyclopedic knowledge. MARGE is a step forward in building language models that do more with less training data.
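
To make the retrieval step more concrete, here is a minimal sketch of finding documents related to a target document. It swaps in plain word-count cosine similarity for the learned relevance scoring a model like MARGE would use, and it leaves out the paraphrasing (reconstruction) step entirely. The function names and the tiny corpus are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(target, corpus, k=2):
    """Return the k documents most similar to `target`, with their scores.

    Word counts stand in for learned embeddings; this is an assumption
    for illustration, not the scoring used in the paper.
    """
    target_vec = Counter(target.lower().split())
    scored = [
        (cosine(target_vec, Counter(doc.lower().split())), doc)
        for doc in corpus
        if doc != target
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

corpus = [
    "the storm moved across the ocean",
    "a storm crossed the ocean overnight",
    "chess engines search game trees",
]
evidence = retrieve(corpus[0], corpus, k=1)
```

In a retrieve-then-reconstruct setup, the retrieved documents would then be given to a sequence-to-sequence model that is trained to regenerate the target text from them.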

MARGE builds on several of Facebook AI’s recent breakthroughs in self-supervised learning. XLM-R, for instance, is our powerful multilingual model that learns from data in one language and then executes a task in 100 languages. And the Retrieval Augmented Generation (RAG) architecture (which will also be presented at NeurIPS) makes it possible for NLP models to update internal knowledge without having to retrain the model. Read the full paper here.

Demystifying self-supervision in image recognition

Similar to the way self-supervised techniques like XLM-R and RAG have pushed NLP forward, we’ve seen massive jumps in performance for computer vision as well. Just in the last year or so, there’s been an explosion of papers on self-supervised learning in image recognition, including PIRL and MoCo. At NeurIPS, we’re presenting a new paper that uses convolutional networks to achieve better results than previous methods that used 6x the compute power. While the performance gains in this area have come at a breakneck pace (and the practical applications seem to be close), we need to better understand: How exactly do these techniques achieve such strong results with less labeled data?

Understanding what’s driving these successes can help us continue to push forward in the right direction. We know that in image recognition, a good visual representation needs to be able to adapt to varying angles, lighting, and occlusions that are common to real-world photos. Models usually crop and augment images to be able to adapt to these variances. It turns out that, while this cropping method is useful for handling occlusions, it fails to model other crucial variances, like viewpoint and instance changes. Additionally, object-centric biases in training data play a major role in performance gains on standard tasks. To improve the robustness of these models, we introduce a new approach that leverages raw videos to learn different viewpoints of visual representations. This works more effectively than current self-supervised approaches for image recognition. Read the full paper here.

Understanding patterns of earthquakes, floods, and fires

The learned spherical distributions, shown along with the training and testing datasets. Blue and red dots represent training and testing data points, respectively. We note that, qualitatively, the stereographic distribution is generally more spread out than its Riemannian counterpart.

Probabilistic models have been a driving force in advancing AI in recent decades. For some of the most challenging and impactful questions in science, however, this approach requires accounting for more complex spaces in which the data is located. For instance, storm trajectories in climate science follow paths on a sphere. Similarly, cell development processes in biology can follow treelike trajectories. Because the distribution is not well defined, we have to approximate the distribution of the underlying data.

To better tackle these types of problems, we developed a new method to model probability distributions on spaces with such complex structure (also called manifolds). We built on top of one of the most powerful probabilistic frameworks, called continuous normalizing flows, and gave it the ability to account for the geometry of the space in which the data is located.

In our method, a probability distribution is constructed by:

  • First, parameterizing a vector field on a manifold via a neural network.

  • Next, sampling particles from a base distribution.

  • Finally, approximating their location on the manifold using a numerical solver.
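
The three steps above can be sketched on the unit sphere. This is a toy illustration under strong assumptions, not the method from the paper: a hand-written tangent vector field (`tangent_field`) stands in for the neural network, a simple Euler step followed by re-normalization stands in for the numerical solver, and the density bookkeeping that a normalizing flow needs is left out.

```python
import math
import random

def normalize(v):
    """Scale a 3D vector to unit length, i.e. place it on the sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def tangent_field(p, target=(0.0, 0.0, 1.0)):
    """Hypothetical stand-in for the neural network.

    Takes the direction toward `target` and removes its component along p,
    so the resulting vector lies in the tangent plane of the sphere at p.
    """
    dot = sum(a * b for a, b in zip(p, target))
    return [t - dot * x for t, x in zip(target, p)]

def sample_base(rng):
    """Base distribution: uniform on the sphere (a normalized Gaussian)."""
    return normalize([rng.gauss(0.0, 1.0) for _ in range(3)])

def flow(p, steps=100, dt=0.02):
    """Euler integration of the vector field.

    Re-normalizing after every step is a simple way (assumed here) to keep
    the particle on the manifold despite discretization error.
    """
    for _ in range(steps):
        v = tangent_field(p)
        p = normalize([x + dt * vx for x, vx in zip(p, v)])
    return p

rng = random.Random(0)
particles = [sample_base(rng) for _ in range(200)]
moved = [flow(p) for p in particles]
```

With this particular field, the particles drift toward the north pole while staying on the sphere. A learned field would instead be trained so that the transported particles match the observed data.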

This allows us to model highly complex distributions on a wide class of manifolds such as the sphere, tori (ringlike), and hyperbolic space (treelike). We tested our new model using real-world and synthetic data, and we achieved substantial improvements in predicting manifold data compared with state-of-the-art methods. Read the full paper here.

Combining vision and touch for 3D understanding

If we want to build AI systems that can interact in and learn from the world around us, touch can be as important as sight and speech. If you’re asked about the shape of an object at hand, for instance, you’d typically pick it up and examine it with your hand and eyes at the same time. For AI agents, combining vision and touch could lead to a richer understanding of objects in 3D. But this research area has been underexplored in AI.

We’re introducing a new method to accelerate progress in building AI that leverages two senses together. We simultaneously fuse the strengths of sight and touch to perform 3D shape reconstruction. We did this by creating a new dataset that’s made up of simulated interactions between a robotic hand and 3D objects.

Our approach to 3D shape reconstruction combines a single RGB image with four touch readings. We start by predicting touch charts from touch recordings and projecting the visual signal onto all charts. Then, we feed the charts into an iterative deformation process, where we enforce touch consistency. As a result, we obtain a global prediction of deformed charts.

When a robot hand grasps an object, we receive tactile and pose information of the grasp. We combine this with an image of the object and use a graph convolutional network to predict local charts of information at each touch site. We use the corresponding vision information to predict global charts that close the surface around them in a fill-in-the-blank type of process. This combination of local structure and global context helps us predict 3D shapes with high accuracy.

Our approach outperforms single-modality baselines (vision or touch), and it’s also better than baseline methods for multimodal 3D understanding. We also found that the quality of our 3D reconstruction increases with each additional grasp and relevant touch information. Read the full paper here.

Mastering poker and advancing in bridge

Building practical AI systems also requires learning how to interact with people and agents around them, especially when each agent has access to different information about the world. Games are an important scientific arena to benchmark and make progress in learning how to interact in environments where all agents’ actions affect one another. Imperfect-information games — like Texas Hold’em, Contract Bridge, or Hanabi — have been especially challenging, since the value of an action may depend on what each player believes the true state of the world is rather than what the player observes. We’ve advanced AI capabilities in imperfect-information games in two ways.

We are introducing Recursive Belief-based Learning (ReBeL), a general RL+Search algorithm that can work in all two-player zero-sum games, including imperfect-information games. Unlike previous AIs, ReBeL makes decisions by factoring in the probability distribution of different beliefs each player might have about the current state of the game, which we call a public belief state. For example, ReBeL can assess the chances that its poker opponent thinks it has a pair of aces. ReBeL achieves superhuman performance in heads-up no-limit Texas Hold’em while using far less domain knowledge than any prior poker bot. It extends to other imperfect-information games as well, such as Liar’s Dice, for which we’ve open-sourced our implementation. ReBeL will help us create more general AI algorithms. Read the full paper here.
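
The idea of tracking beliefs about hidden information can be illustrated with a single Bayesian update. The sketch below is not ReBeL and not its open-sourced implementation. It only shows how an observer who assumes a betting policy could revise the probability of each private hand after seeing a public action. The hands, the prior, and the policy numbers are all made up for illustration.

```python
def update_belief(prior, policy, action):
    """Apply Bayes' rule: P(hand | action) is proportional to
    P(action | hand) * P(hand).

    `prior` maps each hand to its probability; `policy` maps each hand to
    a dict of action probabilities. Both are hypothetical inputs.
    """
    unnormalized = {h: prior[h] * policy[h].get(action, 0.0) for h in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("action has zero probability under the assumed policy")
    return {h: p / total for h, p in unnormalized.items()}

# Made-up beliefs about an opponent's private hand before any action.
prior = {"aces": 0.1, "medium pair": 0.3, "weak hand": 0.6}

# Made-up policy: how often each hand is assumed to raise or call.
policy = {
    "aces": {"raise": 0.9, "call": 0.1},
    "medium pair": {"raise": 0.4, "call": 0.6},
    "weak hand": {"raise": 0.1, "call": 0.9},
}

posterior = update_belief(prior, policy, "raise")
```

Under these assumed numbers, seeing a raise lifts the probability of aces from 0.1 to about 0.33. A belief-based search would reason over distributions like this one rather than over a single known game state.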

When learning to play games that involve teamwork and competition, like Contract Bridge, agents often get trapped in poor local minima (or equilibria), where none of the agents are willing to unilaterally change their policies. For example, if speaking one specific language becomes a convention, then switching to a different one is not a good choice for an agent alone, even if the other agent actually knows that language better. We created a novel algorithm that allows multiple agents to change policies more collaboratively. It is proven to never worsen the current policy during the change, and it’s computationally more efficient than brute-force approaches. This algorithm achieved state-of-the-art results in bridge bidding and, more broadly, improves collaboration between AI and humans. Read the full paper here.
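
The language example can be written down as a tiny two-agent game. In the sketch below, the payoff table is hypothetical and the joint search is plain brute force over all action pairs, which is the kind of approach the algorithm above is meant to improve on. It is included only to show why one-agent-at-a-time changes get stuck.

```python
from itertools import product

ACTIONS = ("A", "B")  # two possible languages

# Hypothetical shared payoff: both agents are better off speaking B,
# and any mismatch means they cannot communicate at all.
PAYOFF = {
    ("A", "A"): 1.0,
    ("B", "B"): 2.0,
    ("A", "B"): 0.0,
    ("B", "A"): 0.0,
}

def best_unilateral(joint):
    """Best joint action reachable when at most one agent changes."""
    candidates = [joint]
    for i in range(len(joint)):
        for a in ACTIONS:
            changed = list(joint)
            changed[i] = a
            candidates.append(tuple(changed))
    # max() returns the first maximal candidate, so the current joint
    # action wins whenever no single-agent change strictly improves on it.
    return max(candidates, key=PAYOFF.get)

def best_joint():
    """Brute-force search over every joint action (both agents may change)."""
    return max(product(ACTIONS, repeat=2), key=PAYOFF.get)

stuck = best_unilateral(("A", "A"))
improved = best_joint()
```

Starting from the convention where both agents speak A, no single agent can do better by switching, so unilateral updates stay at (A, A). Only a coordinated change reaches (B, B), and enumerating all joint actions quickly becomes impractical as the number of agents and actions grows.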

Reasoning in a multiple choice test

Just as gaming benchmarks make AI systems more intelligent, IQ tests are another important testbed to improve capabilities in AI. Existing AI systems today can easily be trained to ace multiple-choice IQ tests, but it’s not clear whether machines actually learn the answers or simply memorize statistical patterns observed during training. We’ve built the first neural network that turns the challenge on its head: rather than selecting the answer out of multiple choices, we instead generate the right answer without seeing the choices. Our algorithm not only develops plausible answers but is also competitive with state-of-the-art methods in multiple-choice tests. We believe that the ability to generate a correct answer without seeing the options is the ultimate test of understanding the question. It’s a step forward in building smarter, more capable AI systems.

For our research, we use the multiple-choice test called Raven’s Progressive Matrices, where participants have to fill in the missing location in a 3x3 grid of abstract images. To generate plausible answers, we combine symbolic reasoning and visual understanding in one system. Our neural model combines multiple advances in generative models, including leveraging multiple pathways through the same network and a selective back-propagation procedure. Read the full paper here.

For more papers presented at NeurIPS, visit our event page here.