When we think of AI-powered assistants, we often think of web-based chatbots or stationary smart speakers. At FAIR, we’ve been pursuing generally intelligent embodied AI agents that can perceive and interact with their environment, share that environment safely with human partners, and communicate with and assist those partners in both the digital and physical worlds. We’re building toward our future vision of all-day wearable augmented reality (AR) glasses, which would include a contextualized AI-powered interface and assistant to help you throughout your day. We’re also working to improve the technology behind socially intelligent robots that help with everyday chores while adapting and personalizing to the preferences of their human partners. This robotics work is part of our broader push into embodied AI, which also informs the next generation of AR and VR experiences.
Training and testing social embodied AI agents on physical hardware (whether robots or AR glasses) with actual humans has scalability limitations, requires the added complexity of establishing standardized benchmarking procedures, and may pose safety concerns. That’s why we’ve developed a new set of tools for robotics research — simulators, datasets, and an affordable technology stack spanning both hardware and software — that makes this research easier, faster, and cheaper to conduct.
Today, we’re announcing three major advancements toward the development of social embodied AI agents that can cooperate with and assist humans in their daily lives:

- Habitat 3.0, a simulator supporting both robots and humanoid avatars for human-robot collaboration in realistic indoor environments
- The Habitat Synthetic Scenes Dataset (HSSD-200), a high-quality synthetic 3D scene dataset of human-authored interiors
- HomeRobot, an affordable hardware and software platform for reproducible robotics research
To advance robotics capabilities quickly, we develop and test new algorithms and models in simulators and then transfer them to physical robots. We’ve been making strides with the Habitat simulator for several years. Habitat 1.0 trained virtual robots to navigate 3D scans of physical-world houses at speeds exceeding 10,000 robot steps per second (SPS). Habitat 2.0 introduced interactive environments (e.g., objects that could be picked up, drawers that could be opened) and trained virtual robots to clean up the house by rearranging objects.
Habitat 3.0 builds on those advances and supports both robots and humanoid avatars to enable human-robot collaboration on everyday tasks (e.g., tidying up the living room or preparing a recipe in the kitchen). This opens up new avenues for research on human-robot collaboration in diverse, realistic, and visually and semantically rich tasks. Habitat 3.0 also supports human avatars with a realistic appearance, natural gait, and actions that model realistic low- and high-level interactions. These humanoid avatars are controllable both by learned policies and by real humans through a human-in-the-loop interface, which supports different input media, such as keyboard and mouse as well as VR headsets. This cohabitation of humans and robots in the simulation environment allows us, for the first time, to learn robotics AI policies in the presence of humanoid avatars on everyday tasks in home-like environments and to evaluate them with real humans in the loop. This is significant on several fronts:
We also present two highly relevant tasks and a suite of baselines to establish benchmarks in the field of social embodied AI. The first task, Social Rearrangement, involves a robot and a humanoid avatar working collaboratively to perform a set of pick-and-place tasks, like cleaning up a house. In this task, the robot and the human must coordinate their actions to achieve a common goal. This intelligent behavior emerges from large-scale training in simulation. The second task, Social Navigation, involves the robot locating and following a person while maintaining a safe distance.
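To make the Social Navigation objective concrete, here is a minimal sketch of a distance-keeping reward: the robot is rewarded for staying within a "following" band around the person and penalized for getting unsafely close or falling behind. The thresholds and reward shaping below are illustrative assumptions, not Habitat 3.0's actual task definition.

```python
import math

# Illustrative thresholds (assumed, not from Habitat 3.0's task spec)
SAFE_MIN = 1.0    # meters: closer than this is unsafe
FOLLOW_MAX = 2.0  # meters: farther than this counts as losing the person

def distance(robot_xy, human_xy):
    """Euclidean distance between robot and human on the floor plane."""
    return math.hypot(robot_xy[0] - human_xy[0], robot_xy[1] - human_xy[1])

def social_nav_reward(robot_xy, human_xy):
    """Reward the robot for following at a safe distance."""
    d = distance(robot_xy, human_xy)
    if d < SAFE_MIN:
        return -1.0   # too close: safety violation
    if d <= FOLLOW_MAX:
        return 1.0    # inside the following band: success
    return -0.1 * (d - FOLLOW_MAX)  # mild penalty that grows with distance
```

A shaped reward like this is one common way such "follow while keeping a safe distance" behavior is learned at scale in simulation before evaluation with real humans in the loop.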
We believe Habitat 3.0 is the first simulator to support large-scale training on human-robot interaction tasks in diverse, realistic indoor environments. This training results in emergent collaborative behaviors in our learned policy, such as giving way to the human partner in narrow corridors and efficiently splitting up a task for faster completion than the human could accomplish alone.
Habitat Synthetic Scenes Dataset (HSSD-200)
3D scene datasets are critical to training robots in simulated environments. While a number of simulated 3D environment datasets allow us to scale training data, the tradeoffs between dataset scale (number of scenes and total physical scene size) and dataset realism (visual fidelity and correlation to physical-world statistics) are not well understood. HSSD-200 is a synthetic 3D scene dataset that mirrors physical scenes more closely than prior datasets. It consists of 211 high-quality 3D scenes representing actual interiors and contains a diverse set of 18,656 models of physical-world objects drawn from 466 semantic categories.
HSSD-200 is distinguished from prior work along several axes. It offers high-quality, fully human-authored 3D interiors. It includes fine-grained semantic categorization corresponding to WordNet ontology. And its asset compression enables high-performance embodied AI simulation. Its scenes were designed using the Floorplanner web interior design interface. The layouts are predominantly recreations of actual houses. Individual objects are created by professional 3D artists and, in most cases, match specific brands of actual furniture and appliances.
Our experiments show that the smaller-scale but higher-quality HSSD-200 dataset leads to ObjectGoal navigation (ObjectNav) agents that perform comparably to agents trained on significantly larger datasets. We found that we can train navigation agents with comparable or better generalization to physical-world 3D reconstructed scenes using two orders of magnitude fewer scenes from HSSD-200 than from prior datasets. In fact, training on 122 HSSD-200 scenes leads to agents that generalize better to HM3DSem physical-world scenes than agents trained on 10,000 ProcTHOR scenes.
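ObjectNav generalization is typically reported with Success and SPL (Success weighted by Path Length), which rewards agents that succeed via near-optimal paths. Below is a small self-contained implementation of the standard SPL formula; the episode numbers are purely illustrative, not results from HSSD-200 experiments.

```python
def spl(episodes):
    """Success weighted by Path Length (SPL), the standard ObjectNav metric.

    Each episode is (success, shortest_path_length, agent_path_length).
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is success,
    l_i the shortest-path length, and p_i the path the agent actually took.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# Illustrative episodes (hypothetical numbers, not real evaluation data):
episodes = [
    (True, 5.0, 5.0),   # succeeded via the optimal path -> contributes 1.0
    (True, 4.0, 8.0),   # succeeded but took twice the optimal length -> 0.5
    (False, 6.0, 3.0),  # failed episode contributes 0
]
print(spl(episodes))  # prints 0.5
```

Because SPL penalizes inefficient paths, "comparable or better generalization" claims like the one above are stronger than raw success rate alone would indicate.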
Common, shared platforms have been an important part of progress in machine learning in general, but robotics mostly lacks these because of the difficulty of reproducing and scaling hardware results. We identify three goals for a platform for reproducible robotics research:
As a north star, we identify Open-Vocabulary Mobile Manipulation (OVMM)—that is, picking up any object in any unseen environment and placing it in a specified location. This requires very robust long-term perception and scene understanding, which is useful for a wide variety of tasks.
To drive research in this area, we’re introducing the HomeRobot library, which implements navigation and manipulation capabilities supporting Hello Robot’s Stretch. HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments, and a physical-world component, which provides a software stack for the low-cost Hello Robot Stretch as well as the Boston Dynamics Spot, to encourage replication of physical-world experiments across labs.
HomeRobot is designed as a user-friendly software stack, enabling quick setup of the robot for immediate testing. The key features of our software stack include:
We also introduce the HomeRobot OVMM benchmark, in which an agent navigates household environments to grasp novel objects and place them in or on target receptacles. We implemented both reinforcement learning and heuristic (model-based) baselines and showed evidence of simulation-to-physical-world transfer of the navigation and place skills. Our baselines achieve a 20% success rate in the physical world. We’re running a NeurIPS 2023 competition to encourage adoption and grow the community around our new platform.
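An OVMM episode chains several skills, and end-to-end success requires all of them to succeed, which is why overall success rates are much lower than any single skill's. The sketch below models that structure as a simple stage pipeline; the stage names and the `skills` callable interface are hypothetical illustrations, not HomeRobot's actual API.

```python
from enum import Enum, auto

class Stage(Enum):
    """Illustrative OVMM stages: find an object, pick it up,
    find the target receptacle, and place the object."""
    FIND_OBJECT = auto()
    PICK = auto()
    FIND_RECEPTACLE = auto()
    PLACE = auto()
    DONE = auto()

def run_ovmm_episode(skills):
    """Run the stages in order; `skills` maps each stage to a callable
    returning True on success. A failure ends the episode and reports
    the stage where the pipeline broke down."""
    order = [Stage.FIND_OBJECT, Stage.PICK, Stage.FIND_RECEPTACLE, Stage.PLACE]
    for stage in order:
        if not skills[stage]():
            return stage
    return Stage.DONE
```

Returning the failing stage rather than a bare boolean mirrors how benchmark analyses attribute end-to-end failures to individual navigation, grasping, or placement skills.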
In recent years, the field of embodied AI research has primarily focused on the study of static environments—working under an assumption that objects in an environment remain stationary. However, in a physical environment inhabited by humans, that’s simply not true. Our vision for socially intelligent robots goes beyond the current paradigm by considering dynamic environments where humans and robots interact with each other and the environment around them. The interaction between humans and robots opens up new problems—and possibilities—such as collaboration, communication, and future state prediction.
While we’ve made considerable progress toward our vision of socially intelligent robots since open sourcing Habitat 1.0 in 2019, there’s still important work to do. In the next phase of our research, we’ll use the Habitat 3.0 simulator to train our AI models so these robots are able to assist their human partners and adapt to their preferences. We’ll use HSSD-200 in conjunction with Habitat 3.0 to collect data on human-robot interaction and collaboration at scale so we can train more robust models. And we’ll focus on deploying the models learned in simulation into the physical world so we can better gauge their performance.