Powered by AI: Oculus Insight


To unlock the full potential of virtual reality (VR) and augmented reality (AR) experiences, the technology needs to work anywhere, adapting to the spaces where people live and how they move within those real-world environments. When we developed Oculus Quest, the first all-in-one, completely wire-free VR gaming system, we knew we needed positional tracking that was precise, accurate, and available in real time — within the confines of a standalone headset, meaning it had to be compact and energy efficient.

At last year’s Oculus Connect event we shared some details about Oculus Insight, the cutting-edge technology that powers both Quest and Rift S. Now that both of those products are available, we’re providing a deeper look at the AI systems and techniques that power this VR technology. Oculus Insight marks the first time that fully untethered six-degree-of-freedom (6DoF) headset and controller tracking has shipped in a consumer AR/VR device. Built from the ground up, the Insight stack leverages state-of-the-art computer vision (CV) systems and visual-inertial simultaneous localization and mapping, or SLAM.

Oculus Insight computes an accurate and real-time position for the headset and controllers every millisecond in order to translate your precise movements into VR, so you feel truly present. It uses SLAM to track the user’s headset position, and constellation tracking to track the controller positions. Insight’s CV systems fuse multiple sensor inputs from the headset and controllers to boost the precision, accuracy, and response time of the system’s positional tracking.

Generating real-time maps and position tracking with visual-inertial SLAM

Academic research on SLAM techniques goes back several decades, but the technology has only recently become mature enough for consumer applications, such as driverless cars and mobile AR apps. Facebook previously released a version of SLAM for AR on mobile devices that uses a single camera and inertial measurement unit (IMU) to track a phone’s position and enable world-locked content, meaning content that’s visually anchored to real objects in the world. Oculus Insight is the second generation of this library, and it incorporates significantly more information from a combination of multiple IMUs and ultra-wide-angle cameras, as well as infrared LEDs, to jointly track the 6DoF position of a VR headset and controllers.

The Oculus Insight system uses a custom hardware architecture and advanced computer vision algorithms — including visual-inertial mapping, place recognition, and geometry reconstruction — to establish the location of objects in relation to other objects within a given space. This novel algorithm stack enables a VR device to pinpoint its location, identify aspects of room geometry (such as floor location), and track the positions of the headset and controllers with respect to a 3D map that is generated and constantly updated by Insight. The data used for this process comes from three types of sensors built into the Quest and Rift S hardware:

  1. Linear acceleration and rotational velocity data from IMUs in the headset and controllers are integrated to track the orientation and position of each with low latency.

  2. Image data from cameras in the headset helps generate a 3D map of the room, pinpointing landmarks like the corners of furniture or the patterns on your floor. These landmarks are observed repeatedly, which enables Insight to compensate for drift (a common challenge with IMUs, where even tiny measurement discrepancies build up over time, resulting in inaccurate location tracking).

  3. Infrared LEDs in the controllers are detected by the headset cameras, letting the system bound the controller position drift caused by integrating multiple IMUs.
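
The interplay between IMU integration and camera-based correction in the list above can be illustrated with a small one-dimensional sketch. This is not Insight's algorithm: the function name, the constant accelerometer bias, and the idea of snapping position to an absolute "fix" are all assumptions made for illustration, and a production filter would also estimate velocity and the bias itself.

```python
def integrate_imu(accels, dt, fixes=None, gain=1.0):
    """Dead-reckon a 1D position from acceleration samples.

    `fixes` (hypothetical) maps a sample index to an absolute position
    observed from a visual landmark. When a fix is available, the
    position estimate is pulled toward it by `gain`.
    """
    fixes = fixes or {}
    pos, vel = 0.0, 0.0
    track = []
    for i, accel in enumerate(accels):
        vel += accel * dt   # integrate acceleration into velocity
        pos += vel * dt     # integrate velocity into position
        if i in fixes:
            pos += gain * (fixes[i] - pos)  # visual correction of position only
        track.append(pos)
    return track


# A stationary device whose accelerometer reads a small constant bias.
bias = 0.01                 # m/s^2, illustrative value
dt = 0.001                  # 1 kHz sampling, illustrative value
samples = [bias] * 5000     # 5 seconds of data

free_running = integrate_imu(samples, dt)
landmark_fixes = {i: 0.0 for i in range(99, 5000, 100)}  # one fix every 100 ms
corrected = integrate_imu(samples, dt, fixes=landmark_fixes)
```

With these made-up numbers the uncorrected estimate wanders by more than 10 cm in five seconds, while the corrected one stays within about half a centimeter. Because this sketch corrects position only, the residual error between fixes still grows slowly over time.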

As you move, Oculus Insight detects pixels in images with high contrast, such as the corners of a window. These high-contrast image regions are tracked and associated over time, from image to image. Given a long enough baseline of observations, Oculus Insight is able to triangulate the 3D position of each point in your surroundings. This forms the basis of the system’s 3D environment map.
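
To make the role of the baseline concrete, here is a minimal sketch of two-view triangulation using the midpoint method: each observation of a feature defines a ray from a camera center, and the 3D point is estimated as the midpoint of the shortest segment between the two rays. The midpoint method and all names here are illustrative assumptions, not a description of how Insight triangulates.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _unit(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def triangulate_midpoint(c1, d1, c2, d2, min_sin_sq=1e-6):
    """Estimate a 3D point from two viewing rays (center, direction).

    Raises ValueError when the rays are nearly parallel, which is what
    happens when the baseline between the two views is too short
    relative to the distance of the point.
    """
    d1, d2 = _unit(d1), _unit(d2)
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b          # sin^2 of the angle between the rays
    if denom < min_sin_sq:
        raise ValueError("rays are nearly parallel; baseline too short")
    s = (b * e - c * d) / denom    # distance along ray 1
    t = (a * e - b * d) / denom    # distance along ray 2
    p1 = [o + s * k for o, k in zip(c1, d1)]
    p2 = [o + t * k for o, k in zip(c2, d2)]
    return tuple((u + v) / 2 for u, v in zip(p1, p2))


# Two camera positions 10 cm apart observing the same window corner.
corner = triangulate_midpoint(
    (0.0, 0.0, 0.0), (0.5, 0.2, 2.0),
    (0.1, 0.0, 0.0), (0.4, 0.2, 2.0),
)
```

The `denom` term shrinks toward zero as the two rays become parallel, so small measurement errors are amplified when the camera has barely moved between observations.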

The first major applications of SLAM were in robotics, particularly in early generations of self-driving vehicles, which relied on massive amounts of compute resources (multiple onboard PCs) and used sensors that were both expensive (such as navigation-grade IMUs) and power-hungry (such as 3D Lidar systems). More recently, engineers have introduced techniques that allow SLAM to run on less powerful hardware, including on mobile phones for AR effects in games and camera filters. But while some amount of lag or positional inaccuracy is acceptable for handheld, phone-based AR, untethered VR requires an unprecedented level of speed, accuracy, and precision for a consumer SLAM application. That’s because in handheld AR, the 3D effect occupies a small area of the user’s overall field of view, and content can be time-synchronized to reduce latency. By contrast, content in VR occupies the user’s entire field of view, and the system needs to respond as quickly as the user can move. This makes any potential tracking errors far more noticeable and significantly increases the difficulty for Insight running on Oculus Quest and Rift S.

SLAM addresses these challenges by automatically recognizing features in the environment, letting Oculus Insight incorporate the player’s current position into a VR display. Insight also uses an extrapolation function with dynamic damping to help predict where the user’s head and hands will move in the milliseconds ahead. This provides a number of benefits, including reducing the visual stuttering effect known as jitter, which is the key metric that tracking systems are measured against. To help enable a comfortable VR experience, tracking should be in the submillimeter range, meaning that the system can resolve position more finely than a single millimeter. Insight exceeds this target in most environments.
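
The article does not specify the form of the extrapolation function, so the following is only one plausible sketch of extrapolation with speed-dependent damping. The damping law, the `soft_speed` parameter, and the function name are all assumptions: the idea is to attenuate prediction when the measured velocity is small (and likely dominated by noise) and to extrapolate almost fully during fast motion.

```python
def predict_position(pos, vel, horizon_s, soft_speed=0.05):
    """Extrapolate a 1D position `horizon_s` seconds ahead.

    `soft_speed` (m/s, hypothetical) sets the speed at which the
    prediction is damped by half. The damping factor is 0 at rest and
    approaches 1 as speed grows.
    """
    speed = abs(vel)
    damping = speed / (speed + soft_speed)
    return pos + damping * vel * horizon_s


# Nearly stationary head: velocity noise barely moves the prediction.
slow = predict_position(0.0, 0.001, 0.02)

# Fast head turn: the prediction follows the motion almost fully.
fast = predict_position(0.0, 2.0, 0.02)
```

Damping of this kind trades a little responsiveness at very low speeds for a steadier image when the user is holding still, which is where jitter is most visible.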

Another major factor to avoid in delivering immersive experiences is latency — any lag between physical movements and their VR equivalents can disorient the user and degrade the sense of realism. By using low-latency IMU data and a kinematic model that predicts a user’s motion into the future, Insight is able to effectively eliminate the apparent latency. We’ll go into more detail in the next section about the sensor fusion process that incorporates SLAM data, but reducing both jitter and latency is central to Insight’s ability to deliver a new level of realism within VR.

SLAM also has subtler benefits, such as helping to minimize “swimminess,” the disorienting feeling that can result from physical movements that aren’t correctly translated into corresponding movements in VR, such as a sword swing that travels too fast or extends too far. This issue is distinct from the delayed motions created by latency, and Insight relies on its tracking precision to avoid the swimminess sometimes caused by disparities between real-world movements and VR motion.

Improving accuracy through motion capture and device simulation

Building a system that addresses such a wide range of potential issues, and in a product robust enough for the consumer market, meant tackling a similarly broad array of technical challenges. Most of those related to two goals: accuracy and efficiency.

Some aspects of accuracy are easily quantified, such as the submillimeter-level precision that we achieved to help reduce jitter. But other experiences, such as swimminess, are based at least partially on the user’s subjective point of view. A swing of a tennis racket might appear to travel too fast or too far, for example, without external, sensor-based measurements necessarily registering an error. Quantifying, and ultimately closing, the gap between physical and VR movements required a new approach to measuring those discrepancies.

Our solution was extensive analysis of sensor data measured against data from external motion capture, collected by arrays of OptiTrack cameras. Similar to what’s used in Hollywood VFX productions, these cameras were positioned in Facebook work spaces as well as in employees’ real-world homes. The OptiTrack systems would track the illuminators placed on participants’ HMDs and controllers and throughout each testing environment. This allowed us to compute the exact ground-truth 3D position of the Quest and Rift S users and then compare those measurements to where Oculus Insight’s positional tracking algorithm estimated them to be. We then tuned that algorithm based on the discrepancies between the motion-capture and positional data, improving the system by testing in hundreds of environments that featured different lighting, decorations, and room sizes, all of which can impact the accuracy of Oculus Insight.
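
A comparison of this kind can be sketched as a per-sample position error plus a summary statistic. The snippet below is a simplified illustration, not the evaluation code used for Insight: it assumes both trajectories are already time-synchronized and expressed in the same coordinate frame, and the function names and sample values are invented.

```python
import math


def position_errors(estimated, ground_truth):
    """Euclidean distance between matching 3D samples, in meters."""
    if len(estimated) != len(ground_truth):
        raise ValueError("trajectories must have the same number of samples")
    return [math.dist(e, g) for e, g in zip(estimated, ground_truth)]


def rmse(errors):
    """Root-mean-square of a list of per-sample errors."""
    return math.sqrt(sum(err * err for err in errors) / len(errors))


# Made-up samples: tracker output versus motion-capture ground truth.
tracked = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.001)]
mocap = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

errors = position_errors(tracked, mocap)
summary = rmse(errors)  # about 0.0007 m for these two samples
```

In practice the alignment step (synchronizing clocks and registering the two coordinate frames) is a significant part of the work and is deliberately left out here.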

In addition to using these physical testing environments, we also developed automated systems that replayed thousands of hours of recorded video data and flagged any changes in the system performance while viewing a given video sequence. And because Quest uses a mobile chipset, we built a model that simulates the performance of mobile devices while running on a general server computer, such as the machines in Facebook data centers. This enabled us to conduct large-scale replays with results that were representative of Quest’s actual performance, improving Insight’s algorithms within the constraints of the HMD it would have to operate on.
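
The flagging step of such a replay system could look something like the sketch below, which compares a per-sequence error metric between a baseline build and a candidate build. The metric, the tolerance, the sequence names, and the function name are all hypothetical; the article does not describe the actual tooling.

```python
def flag_changes(baseline_mm, candidate_mm, tolerance_mm=0.2):
    """Return sequences whose error changed by more than the tolerance.

    Both arguments map a recorded sequence name to a tracking error in
    millimeters. Every baseline sequence must be present in the
    candidate results. The result maps each flagged sequence to its
    signed change (positive means the candidate got worse).
    """
    flagged = {}
    for name, base_error in baseline_mm.items():
        change = candidate_mm[name] - base_error
        if abs(change) > tolerance_mm:
            flagged[name] = round(change, 6)
    return flagged


baseline = {"living_room": 0.8, "office": 0.6, "hallway": 0.9}
candidate = {"living_room": 0.8, "office": 1.4, "hallway": 0.5}

report = flag_changes(baseline, candidate)
```

Here `office` is flagged as a regression and `hallway` as an improvement, while `living_room` is unchanged and stays out of the report. Flagging changes in both directions is a design choice: an unexpected improvement on one sequence can be as informative as a regression.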

Designing ultraefficient, device-optimized CV

Though accuracy was important to Insight’s positional tracking, so was the need for an unprecedented degree of CV efficiency. Oculus Quest renders real-time, high-end graphics at resolutions and frame rates comparable to, and in some cases higher than, those of PC and console games — on a compute system that consumes two orders of magnitude less power than a PC or console. At the same time, it also runs a real-time SLAM and controller tracking system. Running all of this on an order of magnitude less compute power, two orders of magnitude less total power, and substantially less memory bandwidth than a modern PC was a significant systems challenge.

There was no single approach to streamlining this compute pipeline but rather a range of multithreaded adjustments designed to run on-device, with many operations happening asynchronously.

Oculus Insight processes multiple threads of data at once, in real time. The mapper thread modifies the map and sends updated copies to the tracker thread, which uses camera frames to estimate poses in the mapper-provided frames, while the IMU thread uses measurements from the IMUs to update the latest SLAM state.

For example, we employed Quest-specific digital signal processing (DSP) optimizations that included asynchronous map updating, allowing the system to refine and update maps in the background based on changes in the user’s environment. Meanwhile, IMU data is processed on its own higher-priority thread, with the output stored in a shared memory buffer to minimize system latency.
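
The division of labor among the mapper, tracker, and IMU threads can be sketched with standard-library threading. This is a toy model under several assumptions: the class and function names are invented, "pose estimation" is replaced by a placeholder that counts visible landmarks, and Python's `threading` module offers no thread priorities, so the higher priority of the IMU thread is not modeled.

```python
import queue
import threading


class Latest:
    """Thread-safe slot that keeps only the newest published value."""

    def __init__(self, value):
        self._lock = threading.Lock()
        self._value = value

    def publish(self, value):
        with self._lock:
            self._value = value

    def read(self):
        with self._lock:
            return self._value


def mapper(keyframes, shared_map):
    landmarks = []
    while True:
        frame = keyframes.get()
        if frame is None:                     # sentinel: no more keyframes
            return
        landmarks.extend(frame)
        shared_map.publish(tuple(landmarks))  # immutable copy for readers


def tracker(frames, keyframes, shared_map, visible_counts):
    for frame in frames:
        snapshot = shared_map.read()          # never waits on map refinement
        visible_counts.append(len(snapshot))  # placeholder for pose estimation
        keyframes.put(frame)
    keyframes.put(None)


def imu_integrator(accels, dt, latest_position):
    pos = vel = 0.0
    for accel in accels:
        vel += accel * dt
        pos += vel * dt
        latest_position.publish(pos)          # shared buffer of the newest state


def run(frames, accels, dt):
    keyframes = queue.Queue()
    shared_map, latest_position = Latest(()), Latest(0.0)
    visible_counts = []
    threads = [
        threading.Thread(target=mapper, args=(keyframes, shared_map)),
        threading.Thread(target=tracker,
                         args=(frames, keyframes, shared_map, visible_counts)),
        threading.Thread(target=imu_integrator, args=(accels, dt, latest_position)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared_map.read(), visible_counts, latest_position.read()


final_map, counts, position = run([["a", "b"], ["c"], ["d", "e"]], [1.0] * 10, 0.1)
```

The tracker always works from whatever map snapshot was last published, so slow map refinement delays map quality rather than pose output. How many landmarks the tracker sees on each frame after the first depends on thread scheduling, which is the point of running the mapper asynchronously.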

The future of spatial AI

Oculus Insight is the foundation for the new, wireless state of the art in AR and VR, giving Quest users untethered, power- and compute-efficient precision that keeps them within the playspace boundaries they have set while they avoid real-world obstacles. This work could have broad implications for researchers exploring SLAM, as well as for any system that benefits from low-resource, high-accuracy room mapping, such as digital assistants and physical robots.

This tracking system is also part of our longer-term vision to incorporate spatial AI technology into all the connected devices and platforms that Facebook is building. So far, our spatial AI applications include Oculus Insight, which demonstrates that this approach is viable in consumer applications, and the photorealistic 3D reconstructions that Facebook Reality Labs created for the research-oriented Replica dataset. But the future for this technology is in all-day wearable AR glasses that are spatially aware. This will require running SLAM with even greater constraints, including further reducing latencies and cutting power consumption down to as little as 2 percent of what’s needed for SLAM on an HMD. Clearing these hurdles will require hardware innovations as well as the development of AI to further optimize the process of synthesizing multiple sensor inputs. The ultimate goal of this work is to deliver AR and VR experiences that are not only more immersive but also integrated into the physical world.

Written By

Joel Hesch

Engineering Manager

Anna Kozminski

Software Program Manager

Oskar Linde

Lead Machine Perception Architect