January 25, 2021
People are highly efficient at learning simple, everyday tasks — we can learn how to pick up a bottle or place it on a table, for instance, just by watching a person demonstrate the task a few times. But to learn how to manipulate such objects, machines typically need hand-programmed rewards for each task. To teach a robot to place a bottle, we’d first have to tailor its reward so it learns to move the bottle upright over the table. Then we’d have to give it a separate reward focused on teaching it to put the bottle down. This slow, tedious, iterative process isn’t conducive to real-world use, and, ultimately, we want to create AI systems that can learn in the real world as efficiently as people can.
As a step toward this goal, we’ve created (and open-sourced) a new technique that teaches robots to learn in this manner — from just a few visual demonstrations. Rather than relying on pure trial and error, we trained a robot to learn a model of its environment, observe human behavior, and then infer an appropriate reward function. This is the first work to apply this method — model-based inverse reinforcement learning (IRL) — to visual demonstrations on a physical robot. Most prior research on IRL has been done in simulation, where the robot already knows its surroundings and understands how its actions will change its environment. It’s a much harder challenge for AI to learn and adapt to the complexities and noise of the physical world, and this capability is an important step toward our goal of building smarter, more flexible AI systems.
This achievement centers on a novel visual dynamics model trained with a mix of learning from demonstration and self-supervision techniques. We also introduce a gradient-based IRL algorithm that optimizes cost functions by minimizing the distance between the execution of a policy and the visual demonstrations.
The objective of IRL is to learn reward functions so that the result of the policy optimization step matches the visual demonstrations well. To achieve this in a sample-efficient manner, model-based IRL utilizes a model both to simulate how the policy will change the environment and to optimize the policy.
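To make the model’s two roles concrete, here is a minimal sketch in Python. Everything in it is a toy stand-in, not the models used in this work: a hypothetical 1-D dynamics f(s, a) = s + a, a bounded greedy policy optimizer, and a single scalar goal position as the reward parameter. It shows how an IRL outer loop can score candidate reward parameters by optimizing a policy under the model and simulating the result against a demonstration — with no real-world rollouts needed.

```python
import numpy as np

# Toy 1-D point mass; the learned dynamics model is assumed to be f(s, a) = s + a.
def model(s, a):
    return s + a

def optimize_policy(goal, horizon=4):
    # Under this toy model and cost, the optimal policy is a bounded
    # step straight toward the goal at each timestep.
    s, actions = 0.0, []
    for _ in range(horizon):
        a = np.clip(goal - s, -0.5, 0.5)
        actions.append(a)
        s = model(s, a)
    return actions

def simulate(actions):
    # Roll the model forward to predict the trajectory a policy produces.
    s, traj = 0.0, []
    for a in actions:
        s = model(s, a)
        traj.append(s)
    return np.array(traj)

demo = np.array([0.5, 1.0, 1.2, 1.2])   # stand-in for a visual demonstration

# IRL outer loop: search over the reward parameter (the goal position),
# using the model both to optimize a policy and to simulate its outcome.
candidates = np.linspace(0.0, 2.0, 41)
errors = [np.sum((simulate(optimize_policy(g)) - demo) ** 2) for g in candidates]
best_goal = candidates[int(np.argmin(errors))]
```

In the actual method, the reward parameters are updated with gradients rather than a grid search; the sketch only illustrates why a learned model makes the loop sample-efficient — every candidate reward is evaluated entirely inside the model.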
One of the biggest challenges in IRL is to find an objective that can be used to optimize the reward function. The effect of changing the reward signal can only be measured indirectly: First, a new policy has to be learned, and then the policy has to be simulated to predict the visual changes of the environment. Only after the second step can we compare the predicted visual changes with the visual demonstration. So, how can we update reward function parameters to bring the predicted visual trajectory closer to the visual demonstration?
To solve this, we view model-based IRL as a bi-level optimization problem. Bi-level optimization problems are characterized by an outer optimization that depends on the results of a nested optimization problem. In our case, the outer optimization step adapts the reward; the inner (nested) optimization step optimizes the policy. Reframing IRL in this way enables us to leverage progress on gradient-based bi-level optimization to learn reward functions by differentiating through the policy optimization step.
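The bi-level structure can be sketched with PyTorch autograd, which lets the outer demonstration-matching loss differentiate through an unrolled inner optimization. All the specifics here are hypothetical simplifications: a 1-D additive dynamics, a squared-distance cost with a single learnable goal parameter, and plain gradient descent as the inner policy optimizer.

```python
import torch

# Toy 1-D "environment": each action displaces the state (a hypothetical
# stand-in for the learned visual dynamics model).
def rollout(actions, s0=0.0):
    states, s = [], torch.as_tensor(s0)
    for a in actions:
        s = s + a
        states.append(s)
    return torch.stack(states)

def cost(states, goal):
    # Learnable cost: squared distance to a goal parameter.
    return ((states - goal) ** 2).sum()

demo = torch.tensor([0.5, 1.0, 1.5])          # stand-in "visual demonstration"
goal = torch.tensor(0.0, requires_grad=True)  # reward/cost parameter (outer)

outer_opt = torch.optim.Adam([goal], lr=0.1)
for _ in range(200):
    # Inner (nested) optimization: gradient descent on the action sequence
    # under the current cost, keeping the graph so the outer step can
    # differentiate through it.
    actions = torch.zeros(3, requires_grad=True)
    for _ in range(50):
        (g,) = torch.autograd.grad(cost(rollout(actions), goal), actions,
                                   create_graph=True)
        actions = actions - 0.1 * g
    # Outer optimization: make the optimized policy's rollout match the demo.
    outer_loss = ((rollout(actions) - demo) ** 2).sum()
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
```

Because the inner updates are built with `create_graph=True`, the outer gradient flows through the entire inner optimization, which is what allows the reward parameter to be updated directly from the demonstration-matching error.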
An important ingredient of our algorithm is a model that can predict changes in visual observations. Since most prior work assumes this dynamics model (of the environment and the robot) is known, we needed to train the robot to learn such a model. To do this, we train keypoint detectors using self-supervised learning techniques, which extract low-dimensional visual features from both the human demonstrations and the robot’s own motions. We then pretrain a model with which the robot can predict how its actions change this low-dimensional feature representation. Using its own visual dynamics model, the robot can now optimize a policy (in our case, an action sequence) that maximizes the current reward function via gradient descent.
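A rough sketch of this two-stage recipe, downscaled to stay self-contained: a 2-D vector stands in for the keypoint-detector output, a small linear network stands in for the visual dynamics model, and the target and dynamics are invented for illustration. The first stage fits the dynamics model from the robot’s own motions (self-supervision); the second optimizes an action sequence by gradient descent through the learned model.

```python
import torch

torch.manual_seed(0)

# Hypothetical ground-truth effect of an action on a 2-D "keypoint" state.
# In the real system the keypoints come from self-supervised detectors;
# here we simulate them so the sketch is self-contained.
def true_step(z, a):
    return z + 0.5 * a

# Stage 1: pretrain a dynamics model on the robot's own random motions
# (self-supervision: predict the next keypoints from keypoints + action).
model = torch.nn.Linear(4, 2)   # input [keypoints, action] -> next keypoints
opt = torch.optim.Adam(model.parameters(), lr=0.05)
for _ in range(500):
    z = torch.randn(32, 2)
    a = torch.randn(32, 2)
    loss = ((model(torch.cat([z, a], dim=-1)) - true_step(z, a)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: optimize an action sequence by gradient descent so the predicted
# keypoint trajectory reaches a (hand-set, hypothetical) target.
for p in model.parameters():
    p.requires_grad_(False)
target = torch.tensor([1.0, -1.0])
actions = torch.zeros(5, 2, requires_grad=True)
act_opt = torch.optim.Adam([actions], lr=0.1)
for _ in range(300):
    z = torch.zeros(2)
    total_cost = torch.tensor(0.0)
    for t in range(5):
        z = model(torch.cat([z, actions[t]]))
        total_cost = total_cost + ((z - target) ** 2).sum()
    act_opt.zero_grad()
    total_cost.backward()
    act_opt.step()
```

Because the dynamics model is differentiable, the action sequence can be optimized directly by backpropagating the cost through the predicted trajectory — no real-world trial and error is required during planning.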
This research brings us closer to building AI that learns a range of tasks from a few visual demonstrations — and with less reliance on labeled data. It’s a step toward improving the way AI learns to learn. Still, many open challenges remain. As a next step, we’re researching ways to make our visual predictive model more robust, especially to changes in viewpoint. We’re also exploring varied starting configurations and ways to generalize our approach from one context to another. With additional research, for instance, we could use model-based IRL to build AI systems that learn a wide range of skills just by observing videos.
Learning from limited demonstrations is among the hardest challenges in AI today. But it’s also one of the most important steps toward building more intelligent AI systems. We believe self-supervised learning is the next frontier of AI. This work on training a visual dynamics model using self-supervision techniques provides an important test bed for pushing self-supervision forward. By combining cutting-edge research in self-supervised learning and gradient-based optimization, we’ve shown that it’s possible for robots to learn how a bottle should move without being explicitly told how to move it. In the future, our new IRL algorithm could be applied beyond robotic manipulation, pushing AI systems more broadly toward greater sample efficiency.