Embodied AI
Introducing Habitat 3.0: The next milestone on the path to socially intelligent robots
October 20, 2023

When we think of AI-powered assistants, we often think of web-based chatbots or stationary smart speakers. At FAIR, we’ve been pursuing generally intelligent embodied AI agents that can perceive and interact with their environment, share that environment safely with human partners, and communicate with and assist those human partners in both the digital and physical world. We’re building towards our future vision of all-day wearable augmented reality (AR) glasses, which would include a contextualized AI-powered interface and assistant to help you throughout your day. And we’re also working on improving the technology behind socially intelligent robots that will help out with everyday chores while adapting and personalizing to the preferences of their human partners. This work is primarily focused on going deeper into embodied systems to build the best next-generation AR and VR experiences.

Training and testing social embodied AI agents on physical hardware (whether robots or AR glasses) with actual humans has scalability limitations, requires the added complexity of establishing standardized benchmarking procedures, and may pose safety concerns. That’s why we’ve developed a new set of tools for robotics research across simulators, datasets, and an affordable technology stack encompassing both hardware and software that make it easier, faster, and more affordable to conduct this research.

Today, we’re announcing three major advancements toward the development of social embodied AI agents that can cooperate with and assist humans in their daily lives:

  1. Habitat 3.0: The highest-quality simulator that supports both robots and humanoid avatars and allows for human-robot collaboration in home-like environments. AI agents trained with Habitat 3.0 learn to find and collaborate with human partners on everyday tasks like cleaning up a house, thus improving their human partner’s efficiency. These AI agents are evaluated with real human partners using a simulated human-in-the-loop evaluation framework, also provided with Habitat 3.0.
  2. Habitat Synthetic Scenes Dataset (HSSD-200): An artist-authored 3D dataset of over 18,000 objects across 466 semantic categories in 211 scenes. The highest-quality dataset of its kind, HSSD-200 can be used to train navigation agents that generalize to physical-world 3D reconstructed scenes comparably to or better than agents trained on prior datasets, while using two orders of magnitude fewer scenes.
  3. HomeRobot: An affordable home robot assistant hardware and software platform in which the robot can perform open vocabulary tasks in both simulated and physical-world environments.

Habitat 3.0

In order to advance robotics capabilities fast, we develop and test new algorithms and models in simulators and then transfer them to physical robots. We’ve been making strides with the Habitat simulator for several years. Habitat 1.0 trained virtual robots to navigate in 3D scans of physical-world houses at speeds exceeding 10,000 robot steps per second (SPS). Habitat 2.0 introduced interactive environments (e.g., objects that could be picked up, drawers that could be opened) and trained virtual robots to clean up the house by rearranging objects.

Habitat 3.0 builds on those advances and supports both robots and humanoid avatars to enable human-robot collaboration on everyday tasks (e.g., tidying up the living room, preparing a recipe in the kitchen). This opens up new avenues for research on human-robot collaboration in diverse, realistic, and visually and semantically rich tasks. Habitat 3.0 also supports human avatars with a realistic appearance, natural gait, and actions to model realistic low- and high-level interactions. These humanoid avatars are controllable both by learned policies and by real humans using a human-in-the-loop interface. This interface supports different input devices, such as keyboard and mouse or VR headsets. This cohabitation of humans and robots in the simulation environment allows us, for the first time, to learn robotics AI policies in the presence of humanoid avatars in home-like environments on everyday tasks and to evaluate them with real humans in the loop. This is significant on several fronts:

  • Reinforcement learning algorithms typically require millions of iterations to learn something meaningful, so it can take years to do these experiments in the physical world. In simulation, they can be completed in a few days.
  • It’s impractical to collect data in different houses in the physical world, as that requires moving the robots to different places, setting up the environment, etc. In simulation, we can change the environment in a fraction of a second and start experimenting in a new environment.
  • If the model isn’t trained well, there’s a risk that the robot could damage the environment or harm people in the physical world. Simulation allows us to test the methods in a safe environment before deploying them to the physical world to help mitigate those safety concerns.
  • Today’s state-of-the-art AI models require a lot of data for training purposes. Simulation enables us to easily scale up data collection, while in the physical world it can be quite costly and slow.
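
To make the first point concrete, here is a back-of-the-envelope calculation. The 10,000 steps-per-second figure is the Habitat 1.0 throughput quoted earlier in this post; the training budget of one billion steps and the 10 Hz physical control rate are illustrative assumptions, not measurements from this work.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def wall_clock_days(total_steps: int, steps_per_second: float) -> float:
    """Return the days needed to collect total_steps at a fixed step rate."""
    if steps_per_second <= 0:
        raise ValueError("steps_per_second must be positive")
    return total_steps / steps_per_second / SECONDS_PER_DAY


# Assumed values, for illustration only.
TRAINING_STEPS = 1_000_000_000  # hypothetical RL training budget
SIM_SPS = 10_000                # Habitat 1.0 throughput quoted in this post
PHYSICAL_SPS = 10               # hypothetical 10 Hz control loop on one robot

sim_days = wall_clock_days(TRAINING_STEPS, SIM_SPS)
physical_days = wall_clock_days(TRAINING_STEPS, PHYSICAL_SPS)

print(f"simulation:     {sim_days:.1f} days")
print(f"physical robot: {physical_days:.0f} days ({physical_days / 365:.1f} years)")
```

Under these assumptions the simulated run takes a little over a day, while a single physical robot would need roughly three years, before accounting for resets, charging, or repairs.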

We also present two highly relevant tasks and a suite of baselines to establish benchmarks in the field of social embodied AI. The first task, Social Rearrangement, involves a robot and a humanoid avatar working collaboratively to perform a set of pick-and-place tasks, like cleaning up a house. In this task, the robot and the human must coordinate their actions to achieve a common goal. This intelligent behavior emerges from large-scale training in simulation. The second task, Social Navigation, involves the robot locating and following a person while maintaining a safe distance.
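
One way to read the Social Navigation objective is as a per-step check on the distance between robot and human. The sketch below is not the Habitat 3.0 reward or metric; the function names and the 1.0 m and 2.0 m thresholds are hypothetical, chosen only to illustrate "following while maintaining a safe distance."

```python
import math

Point = tuple[float, float]


def following_status(robot_xy: Point, human_xy: Point,
                     min_safe_m: float = 1.0, max_follow_m: float = 2.0) -> str:
    """Classify one timestep by the planar distance between robot and human."""
    distance = math.dist(robot_xy, human_xy)
    if distance < min_safe_m:
        return "too_close"   # violates the (assumed) safety margin
    if distance <= max_follow_m:
        return "following"   # close enough to follow, far enough to be safe
    return "too_far"         # the robot has fallen behind the person


def following_rate(robot_path: list[Point], human_path: list[Point]) -> float:
    """Fraction of timesteps the robot spends inside the 'following' band."""
    if len(robot_path) != len(human_path) or not robot_path:
        raise ValueError("paths must be non-empty and of equal length")
    hits = sum(following_status(r, h) == "following"
               for r, h in zip(robot_path, human_path))
    return hits / len(robot_path)
```

A learned policy would typically be scored on a measure of this kind over whole episodes rather than a single step.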

We believe Habitat 3.0 is the first simulator to support large-scale training on human-robot interaction tasks in diverse, realistic indoor environments. This training results in emergent collaborative behaviors in our learned policy, such as giving way to the human partner in narrow corridors and efficiently splitting up a task for faster completion than the human could accomplish alone.

Habitat Synthetic Scenes Dataset (HSSD-200)

3D scene datasets are critical to training robots in simulated environments. While there are a number of simulated 3D environments datasets that allow us to scale training data, we don’t have an understanding of tradeoffs between dataset scale (number of scenes and total scene physical size) and dataset realism (visual fidelity and correlation to physical-world statistics). HSSD-200 is a synthetic 3D scene dataset that more closely mirrors physical scenes compared to prior datasets. It consists of 211 high-quality 3D scenes representing actual interiors and contains a diverse set of 18,656 models of physical-world objects from 466 semantic categories.

HSSD-200 is distinguished from prior work along several axes. It offers high-quality, fully human-authored 3D interiors. It includes fine-grained semantic categorization corresponding to WordNet ontology. And its asset compression enables high-performance embodied AI simulation. Its scenes were designed using the Floorplanner web interior design interface. The layouts are predominantly recreations of actual houses. Individual objects are created by professional 3D artists and, in most cases, match specific brands of actual furniture and appliances.

Our experiments show that the smaller-scale but higher-quality HSSD-200 dataset leads to ObjectGoal navigation (ObjectNav) agents that perform comparably to agents trained on significantly larger datasets. We found that we can train navigation agents with comparable or better generalization to physical-world 3D reconstructed scenes using two orders of magnitude fewer scenes from HSSD-200 than from prior datasets. In fact, training on 122 HSSD-200 scenes leads to agents that generalize better to HM3DSem [43] physical-world scenes than agents trained on 10,000 ProcTHOR scenes.
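
As a quick check on the "two orders of magnitude" figure, the two scene counts quoted above can be compared directly. This is plain arithmetic on the numbers in this post, not a reproduction of the experiments.

```python
import math

hssd_train_scenes = 122          # HSSD-200 training scenes (from the text)
procthor_train_scenes = 10_000   # ProcTHOR training scenes (from the text)

ratio = procthor_train_scenes / hssd_train_scenes
orders_of_magnitude = math.log10(ratio)

print(f"{ratio:.0f}x fewer scenes, about {orders_of_magnitude:.1f} orders of magnitude")
```

The ratio works out to roughly 82 times fewer scenes, or about 1.9 orders of magnitude.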


HomeRobot

Common, shared platforms have been an important part of progress in machine learning in general, but robotics mostly lacks these because of the difficulty of reproducing and scaling hardware results. We identify three goals for a platform for reproducible robotics research:

  • A motivating north star: It must provide some guiding north-star tasks that can motivate researchers, help shape their work, and allow for comparisons of a variety of methods on interesting, real-world problems;
  • Software capability: It should provide a number of abstract interfaces that make the robot easier to use for a wide variety of tasks, including navigation and manipulation; and
  • Community: We should incentivize people to get involved, use the codebase, and attempt to build up a community around it.

As a north star, we identify Open-Vocabulary Mobile Manipulation (OVMM)—that is, picking up any object in any unseen environment and placing it in a specified location. This requires very robust long-term perception and scene understanding, which is useful for a wide variety of tasks.

To drive research in this area, we’re introducing the HomeRobot library, which implements navigation and manipulation capabilities supporting Hello Robot’s Stretch. HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments, and a physical-world component, which provides a software stack for the low-cost Hello Robot Stretch as well as the Boston Dynamics Spot, in order to encourage replication of physical-world experiments across labs.

HomeRobot is designed as a user-friendly software stack, enabling quick setup of the robot for immediate testing. The key features of our software stack include:

  • Transferability: Unified state and action spaces between simulation and physical-world settings for each task, providing an easy way to control a robot with either high-level action spaces (e.g., pre-made grasping policies) or low-level continuous joint control.
  • Modularity: Perception and action components to support high-level states (e.g., semantic maps, segmented point clouds) and high-level actions (e.g., go to goal position, pick up target object).
  • Baseline Agents: Policies that use these capabilities to provide basic functionality for OVMM, as well as tools to build more complex agents that other teams can build upon.
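
The transferability and modularity points can be illustrated with a small interface sketch. This is not the HomeRobot API: every class, method, and field name below is hypothetical, and the "simulator" is a toy stand-in. The sketch only shows the general pattern of writing an agent once against a shared state and action space, then swapping the backend.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Hypothetical unified state shared by simulated and physical robots."""
    base_xy: tuple[float, float]
    holding: Optional[str] = None


class RobotEnv(ABC):
    """Hypothetical common interface; an agent only ever sees this class."""

    @abstractmethod
    def observe(self) -> Observation: ...

    @abstractmethod
    def go_to(self, goal_xy: tuple[float, float]) -> None: ...

    @abstractmethod
    def pick(self, object_name: str) -> None: ...


class ToySimEnv(RobotEnv):
    """Toy backend that teleports; a physical backend would command motors instead."""

    def __init__(self) -> None:
        self._obs = Observation(base_xy=(0.0, 0.0))

    def observe(self) -> Observation:
        return self._obs

    def go_to(self, goal_xy: tuple[float, float]) -> None:
        self._obs.base_xy = goal_xy

    def pick(self, object_name: str) -> None:
        self._obs.holding = object_name


def fetch(env: RobotEnv, object_name: str,
          object_xy: tuple[float, float]) -> Observation:
    """High-level agent written once against the shared interface."""
    env.go_to(object_xy)
    env.pick(object_name)
    return env.observe()


result = fetch(ToySimEnv(), "cup", (1.0, 2.0))
print(result)
```

Because `fetch` depends only on `RobotEnv`, the same agent code would run unchanged against a second subclass that talks to real hardware.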

We also introduce the HomeRobot OVMM benchmark, where an agent navigates household environments to grasp novel objects and place them in or on target receptacles. We implemented both reinforcement learning and heuristic (model-based) baselines and showed evidence of simulation-to-physical-world transfer of the navigation and placement skills. Our baselines achieve a 20% success rate in the physical world. We’re running a NeurIPS 2023 competition to encourage adoption and grow the community around our new platform.

What’s Next?

In recent years, the field of embodied AI research has primarily focused on the study of static environments—working under an assumption that objects in an environment remain stationary. However, in a physical environment inhabited by humans, that’s simply not true. Our vision for socially intelligent robots goes beyond the current paradigm by considering dynamic environments where humans and robots interact with each other and the environment around them. The interaction between humans and robots opens up new problems—and possibilities—such as collaboration, communication, and future state prediction.

While we’ve made considerable progress toward our vision of socially intelligent robots since open sourcing Habitat 1.0 in 2019, there’s still important work to do. In the next phase of our research, we’ll use the Habitat 3.0 simulator to train our AI models so these robots are able to assist their human partners and adapt to their preferences. We’ll use HSSD-200 in conjunction with Habitat 3.0 to collect data on human-robot interaction and collaboration at scale so we can train more robust models. And we’ll focus on deploying the models learned in simulation into the physical world so we can better gauge their performance.

Written by:
Dhruv Batra
Research Director
Roozbeh Mottaghi
Research Scientist Manager
Akshara Rai
Research Scientist
Christopher Paxton
Research Scientist
Alexander William Clegg
Research Engineer
