June 16, 2022
To create and interact with immersive new experiences in the metaverse — where people can navigate virtual worlds as well as our physical world with augmented reality — AI systems must learn to move through the complexities of the physical world as people do. AR glasses that show us where we left our keys, for example, require foundational new technologies that help AI understand the layout and dimensions of unfamiliar, ever-changing environments without resource-intensive inputs such as preprovided maps. We humans, after all, don’t need to learn the precise location or length of our coffee table to walk around it without bumping into its corners (most of the time).
Today, we’re announcing new scientific research that helps AI learn to understand the physical world far more flexibly and efficiently. Collectively, this body of work advances visual navigation for embodied AI, a research area focused on training AI systems through interactions in 3D simulations rather than on traditional 2D datasets.
We’ve built a point-goal navigation model that can navigate an entirely new environment without requiring a preprovided map or a GPS sensor. We did this using Habitat 2.0, our state-of-the-art embodied AI platform that runs simulations orders of magnitude faster than real time.
To further improve training without reliance on maps, we’ve created and released Habitat-Web, a training data collection of over 100K different human demonstrations for object-goal navigation methods. For each human demonstration, a paid Mechanical Turk user is provided a task instruction (e.g., “find the chest of drawers”) and teleoperates the virtual robot through a web browser interface on their computer.
We’ve developed the first “plug and play” modular approach that helps robots generalize to a diverse set of semantic navigation tasks and goal modalities, without retraining, in a novel zero-shot experience learning framework.
And we’re continuing to push for efficiency with a novel formulation for object-goal navigation tasks that obtains state-of-the-art results while achieving a 1,600x reduction in training time compared to prior methods.
Meta AI has been committed to long-term investments in the burgeoning field of embodied AI. Since we introduced Habitat three years ago, we’ve effectively solved the task of point-goal navigation using only an RGB-D camera, GPS, and compass data, and we’ve successfully tested our model with tasks in real-world physical settings using our PyRobot platform. Now, our latest advancements use the power of high-speed simulations to build more flexible, efficient navigation systems.
Models achieving high performance at point-goal navigation often require access to GPS sensors in simulation. These models do not transfer well to physical robots, because GPS data can be noisy, unreliable, or unavailable in indoor spaces.
We’ve developed new methods to improve the way AI tracks its location solely from visual inputs, a capability known as visual odometry. Our new data-augmentation technique trains simple but highly effective neural models without human data annotations. We find that robust visual odometry is all that’s needed to push the state of the art from 71.7 percent to 94 percent success on the Realistic PointNav task without GPS or compass data, even under noisy action dynamics.
While our approach does not yet completely solve this task, this research provides evidence that explicit mapping may not be necessary for navigation, even in realistic settings. Read the paper here.
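To make the role of visual odometry concrete, here is a minimal, hypothetical sketch (not our model) of how per-step egomotion estimates — the (dx, dy, dtheta) a visual odometry network might predict from consecutive camera frames — can be integrated into a global pose, replacing a GPS-and-compass sensor:

```python
import math

def integrate_egomotion(start_pose, deltas):
    """Integrate per-step egomotion estimates (dx, dy, dtheta) into a
    global (x, y, heading) pose, as a GPS-free localizer would. Each
    delta is expressed in the agent's current body frame, so we rotate
    it into the world frame before accumulating."""
    x, y, theta = start_pose
    for dx, dy, dtheta in deltas:
        # Rotate the body-frame translation into the world frame.
        x += dx * math.cos(theta) - dy * math.sin(theta)
        y += dx * math.sin(theta) + dy * math.cos(theta)
        theta = (theta + dtheta) % (2 * math.pi)
    return x, y, theta

# Example: forward 0.25 m, turn left 90 degrees, forward 0.25 m.
pose = integrate_egomotion(
    (0.0, 0.0, 0.0),
    [(0.25, 0.0, 0.0), (0.0, 0.0, math.pi / 2), (0.25, 0.0, 0.0)],
)
# → approximately (0.25, 0.25, 1.5708)
```

Under noisy action dynamics, errors in each estimated delta compound over the trajectory, which is why the learned odometry model has to be highly robust for this to reach PointNav-level accuracy.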
In another path toward more efficient, map-free learning methods, we have shown how to scale the paradigm of imitation learning from human demonstrations, which — until now — hasn’t been possible, as there hasn’t been a large enough data collection of human demonstrations. To fill this gap, we built Habitat-Web, a new data collection infrastructure for embodied AI, connecting our Habitat simulator running in a web browser to Mechanical Turk and allowing remote users to teleoperate virtual robots safely and at scale. We’ve collected an order of magnitude larger human demo data than existing datasets in simulation and two orders of magnitude larger than existing datasets on real robots.
Agents trained with imitation learning on this data achieve state-of-the-art results, and more important, learn efficient object-search behavior from humans — peeking into rooms, checking corners for small objects, and turning in place to get a panoramic view.
None of these behaviors is exhibited as prominently by reinforcement learning (RL) agents, and eliciting them from RL agents would require tedious, dense reward engineering. Our results show that for ObjectNav, a single human demonstration appears to be worth at least five agent-gathered RL trajectories. Read the paper here.
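The learning principle behind this result is behavior cloning: treat each human teleoperation step as a labeled (observation, action) pair and train the policy with supervised learning instead of reward signals. The following toy numpy sketch (a linear policy, not our actual Habitat-Web training code) shows one such update:

```python
import numpy as np

def bc_update(W, obs, expert_action, lr=0.1):
    """One behavior-cloning step for a linear policy: nudge the policy's
    action distribution toward the human demonstrator's action using the
    softmax cross-entropy gradient."""
    logits = obs @ W                      # one logit per discrete action
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = np.outer(obs, probs)           # d(loss)/dW for softmax CE
    grad[:, expert_action] -= obs         # subtract the one-hot target term
    return W - lr * grad

rng = np.random.default_rng(0)
W = np.zeros((8, 4))                      # 8-dim features, 4 discrete actions
obs = rng.normal(size=8)
for _ in range(100):                      # repeatedly imitate expert action 2
    W = bc_update(W, obs, expert_action=2)
probs = np.exp(obs @ W)
probs /= probs.sum()
# After training, the demonstrated action dominates the distribution.
```

No reward function appears anywhere above — the demonstrations themselves carry the signal, which is what lets behaviors like peeking into rooms emerge without hand-designed rewards.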
When it comes to training AI to find objects, most embodied AI advancements work well on separate, well-defined tasks based on goal type (e.g., “find an object,” “navigate to a room”) or modality (e.g., text, audio). But to work well in the dynamic real world, agents need to adapt their skills on the fly without resource-intensive maps or lengthy retraining processes. In a first-of-its-kind zero-shot experience learning (ZSEL) framework, our model is trained once to capture the essential skills for semantic visual navigation and then applied to different target tasks without additional retraining in a 3D environment.
ZSEL works using a general-purpose semantic search policy that captures the essential navigation skills. This policy is trained by searching for image-goals, where an agent receives a picture taken from a random location in the environment and must travel to find it. Our approach requires up to 12.5x less training data and has up to a 14 percent better success rate than the state of the art in transfer learning. Over five navigation tasks, our ZSEL method saves more than 500 million training interactions and about six weeks of GPU compute required by the task-specific policies learned from scratch. Read the paper here.
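The structure that makes this zero-shot transfer possible can be sketched as a frozen search policy plus per-task goal encoders. Everything below is a hypothetical toy illustration (the class, encoders, and policy are ours, not the paper's code): each new task only supplies a small encoder that maps its goal modality into the policy's shared embedding space, while the search policy itself is never retrained.

```python
import numpy as np

class SemanticSearchAgent:
    """A fixed image-goal search policy reused zero-shot across tasks.
    New tasks register a goal encoder; the policy stays frozen."""

    def __init__(self, policy, embed_dim=16):
        self.policy = policy           # frozen: (obs, goal_embedding) -> action
        self.encoders = {}
        self.embed_dim = embed_dim

    def register_goal_encoder(self, task, encoder):
        self.encoders[task] = encoder  # e.g., a text, audio, or image encoder

    def act(self, task, obs, goal):
        z = self.encoders[task](goal)  # modality-specific goal -> embedding
        assert z.shape == (self.embed_dim,)
        return self.policy(obs, z)

# Toy frozen policy and two plug-in goal encoders (all hypothetical):
policy = lambda obs, z: int(np.argmax(obs @ z.reshape(4, 4)))
agent = SemanticSearchAgent(policy)
agent.register_goal_encoder("objectnav", lambda name: np.full(16, float(len(name))))
agent.register_goal_encoder("imagenav", lambda img: img.mean(axis=0))
```

The savings follow from this factoring: the expensive navigation skill is learned once from image-goal search, and each additional task costs only a lightweight encoder rather than hundreds of millions of new interactions.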
RL has been the predominant method of training virtual embodied agents, but it notoriously requires significant computational resources and training time. We’ve developed a new method to train object-goal navigation policies more efficiently than prior work, incurring up to 1,600x less computational cost in experiments on Gibson and Matterport3D. We do this using Potential Functions for ObjectGoal Navigation with Interaction-free Learning (PONI), a new paradigm for learning modular ObjectNav policies that disentangles the object search skill (i.e., where to look for an object) from the navigation skill (i.e., how to navigate to (x, y)).
Our key insight is that “where to look?” can be treated purely as a perception problem and learned without interactions. Our network predicts two complementary potential functions conditioned on a semantic map and uses them to decide where to look for an unseen object. PONI not only improves over the state of the art for this task but also makes it significantly cheaper for researchers to train such policies. Read the paper here.
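To show what the “where to look?” decision reduces to at inference time, here is a minimal numpy sketch assuming two already-predicted potential maps over a top-down semantic map (the maps and the frontier mask are stand-ins for network outputs, not PONI's actual code): the agent simply combines the potentials and picks the best frontier cell, with no environment interaction in the loop.

```python
import numpy as np

def pick_long_term_goal(area_potential, object_potential, frontier_mask):
    """Combine two predicted potential maps over a top-down semantic map
    and return the frontier cell maximizing their sum. Because both
    potentials are network predictions from the map alone, choosing a
    goal requires no interaction with the environment."""
    combined = area_potential + object_potential
    combined = np.where(frontier_mask, combined, -np.inf)  # frontiers only
    return np.unravel_index(np.argmax(combined), combined.shape)

# Toy 4x4 maps: the object potential peaks at cell (1, 3), a frontier.
area = np.zeros((4, 4)); area[0, 0] = 0.5
obj = np.zeros((4, 4)); obj[1, 3] = 0.9
frontier = np.zeros((4, 4), bool); frontier[0, 0] = frontier[1, 3] = True
goal = pick_long_term_goal(area, obj, frontier)  # → (1, 3)
```

Since the potential predictor is trained as a supervised perception model on maps rather than through trial-and-error rollouts, the interaction-heavy RL loop drops out of training entirely, which is where the large reduction in compute comes from.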
In the near future, we’ll work on pushing these advancements from navigation to mobile manipulation to build agents that carry out specific tasks, like “find my wallet and bring it back to me.”
We’ll also tackle a host of exciting new challenges: How does this work in simulation carry over to physical robots? How can an embodied agent learn in a self-supervised manner, without any human involvement in the form of reward engineering, demonstrations, or 3D annotations? And how do we scale to the next order of magnitude of simulation and learning speed?