Imagine a pair of stylish, lightweight glasses that combine contextualized AI with a display to seamlessly give you access to real-time information when you need it and proactively help you as you go about your day. For such a pair of augmented reality (AR) glasses to become a reality, the system must be able to understand the layout of your physical environment and how the world is shaped in 3D. That understanding would let AR glasses tailor content to you and your individual context, like seamlessly blending a digital overlay with your physical space or giving you turn-by-turn directions to help you navigate unfamiliar locations.
However, building these 3D scene representations is a complex task. Current mixed reality (MR) headsets like Meta Quest 3 create a virtual representation of physical spaces based on raw visual data from cameras or 3D sensors. This raw data is converted into a series of shapes that describe distinct features of the environment, like walls, ceilings, and doors. Typically, these systems rely on pre-defined rules to convert the raw data into shapes, but that heuristic approach often leads to errors, especially in spaces with unique or irregular geometries.
Introducing SceneScript
Today, Reality Labs Research is announcing SceneScript, a novel method of generating scene layouts and representing scenes using language.
Rather than using hard-coded rules to convert raw visual data into an approximation of a room’s architectural elements, SceneScript is trained to directly infer a room’s geometry using end-to-end machine learning.
The result is a representation of physical scenes that is compact, reducing memory requirements to only a few bytes; complete, yielding crisp geometry similar to scalable vector graphics; and, importantly, interpretable, meaning we can easily read and edit those representations.
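To make that concrete, a scene description in a SceneScript-style language might look something like the sketch below. The command names and parameters here are illustrative assumptions rather than the model's actual vocabulary; the point is that an entire room layout can live in a few hundred bytes of readable, editable text.

```python
# A hypothetical SceneScript-style description of a simple room.
# Command names and parameters are illustrative, not the model's real vocabulary.
scene = """
make_wall, id=0, x1=0.0, y1=0.0, x2=4.0, y2=0.0, height=2.7
make_wall, id=1, x1=4.0, y1=0.0, x2=4.0, y2=3.0, height=2.7
make_wall, id=2, x1=4.0, y1=3.0, x2=0.0, y2=3.0, height=2.7
make_wall, id=3, x1=0.0, y1=3.0, x2=0.0, y2=0.0, height=2.7
make_door, id=4, wall_id=0, position_x=1.2, width=0.9, height=2.0
make_window, id=5, wall_id=2, position_x=2.0, width=1.5, height=1.2
"""

# The whole layout fits in a few hundred bytes of plain text.
print(f"{len(scene.encode('utf-8'))} bytes")
```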
How is SceneScript trained?
Large language models (LLMs) like Llama operate using a technique called next token prediction, in which the AI model predicts the next word in a sentence based on the words that came before it. For example, if you typed the words, “The cat sat on the...,” the model would predict that the next word is likely to be “mat” or “floor.”
SceneScript leverages the same concept of next token prediction used by LLMs. However, instead of predicting a general language token, the SceneScript model predicts the next architectural token, such as ‘wall’ or ‘door.’
By giving the network a large amount of training data, the SceneScript model learns how to encode visual data into a fundamental representation of the scene, which it can then decode into language that describes the room layout. This allows SceneScript to interpret and reconstruct complex environments from visual data and create text descriptions that effectively describe the structure of the scenes that it analyzes.
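To illustrate the idea, here is a minimal sketch of autoregressive decoding over a toy architectural vocabulary. It assumes an untrained stand-in model and a random "scene code" in place of real encoded video, so it shows the next-token loop rather than the actual SceneScript architecture; in practice the decoder would be conditioned on features extracted from the walkthrough footage and trained end to end.

```python
import torch
import torch.nn as nn

# A toy next-token predictor over an architectural vocabulary.
# This only illustrates the autoregressive decoding idea.
VOCAB = ["<start>", "make_wall", "make_door", "make_window", "<param>", "<stop>"]

class ToySceneDecoder(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, len(VOCAB))

    def forward(self, token_ids, scene_code):
        # scene_code: latent vector that a visual encoder would produce (assumed here).
        h0 = scene_code.unsqueeze(0)          # (1, batch, d_model)
        x = self.embed(token_ids)             # (batch, seq, d_model)
        out, _ = self.rnn(x, h0)
        return self.head(out)                 # next-token logits per position

# Greedy decoding: repeatedly predict the most likely next architectural token.
decoder = ToySceneDecoder()
scene_code = torch.randn(1, 64)               # stand-in for encoded video
tokens = [VOCAB.index("<start>")]
for _ in range(10):
    logits = decoder(torch.tensor([tokens]), scene_code)
    next_id = int(logits[0, -1].argmax())
    tokens.append(next_id)
    if VOCAB[next_id] == "<stop>":
        break
print([VOCAB[t] for t in tokens])
```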
However, the team needed a substantial amount of data to train the network and teach it how physical spaces are typically laid out, and that data had to be gathered in a way that preserved privacy.
This presented a unique challenge.
Training SceneScript in simulation
While LLMs rely on vast amounts of training data, typically drawn from publicly available text sources on the web, no such repository of information yet exists for physical spaces at the scale needed to train an end-to-end model. So the Reality Labs Research team had to find another solution.
Instead of relying on data from physical environments, the SceneScript team created a synthetic dataset of indoor environments called Aria Synthetic Environments. This dataset comprises 100,000 unique interior environments, each described using the SceneScript language and paired with a simulated video walkthrough of the scene.
Each walkthrough video is rendered using the same sensor characteristics as Project Aria, Reality Labs Research's glasses for accelerating AI and ML research. This approach allows the SceneScript model to be trained entirely in simulation, under privacy-preserving conditions. The model can then be validated using physical-world footage from Project Aria glasses, confirming its ability to generalize to actual environments.
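As a rough illustration, a single training example from such a dataset might pair rendered sensor data with its ground-truth scene language. The field names and file layout below are assumptions made for the sketch, not the actual Aria Synthetic Environments format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical structure of one synthetic training example: a simulated
# walkthrough rendered with Aria-like sensor characteristics, paired with
# the ground-truth SceneScript-style description of the environment.
@dataclass
class SyntheticSceneExample:
    video_frames: List[str]            # paths to rendered frames
    trajectory: List[Tuple[float, float, float]]  # simulated device positions
    scene_language: str                # ground-truth scene description

example = SyntheticSceneExample(
    video_frames=["scene_00042/frame_0000.jpg", "scene_00042/frame_0001.jpg"],
    trajectory=[(0.0, 0.0, 1.5), (0.1, 0.0, 1.5)],
    scene_language="make_wall, ...\nmake_door, ...",
)
```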
Last year, we made the Aria Synthetic Environments dataset available to academic researchers, which we hope will help accelerate public research within this exciting area of study.
Extending SceneScript to describe objects, states, and complex geometry
Another of SceneScript’s strengths is its extensibility.
Simply by adding a few additional parameters to the scene language that describes doors in the Aria Synthetic Environments dataset, the network can be trained to accurately predict the degree to which doors in physical environments are open or closed.
Additionally, by adding new features to the architectural language, it’s possible to accurately predict the location of objects and—further still—decompose those objects into their constituent parts.
For example, a sofa could be represented within the SceneScript language as a set of geometric shapes including the cushions, legs, and arms. This level of detail could eventually be used by designers to create AR content that is truly customized to a wide range of physical environments.
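As a sketch of how such extensions might look in the scene language, the snippet below adds a hypothetical open_degree parameter to doors and hypothetical make_bbox and make_part commands for objects and their constituent parts. None of these names are taken from the published vocabulary; they only illustrate how new features can be expressed as extra commands and parameters.

```python
# Hypothetical extensions to a SceneScript-style language.
# Command and parameter names are illustrative assumptions.
extended_scene = """
make_door, id=4, wall_id=0, position_x=1.2, width=0.9, height=2.0, open_degree=0.75
make_bbox, id=7, class=sofa, cx=2.1, cy=1.4, cz=0.45, w=2.0, d=0.9, h=0.9
make_part, id=8, parent_id=7, class=cushion, cx=2.1, cy=1.4, cz=0.65, w=0.6, d=0.6, h=0.2
make_part, id=9, parent_id=7, class=arm, cx=1.2, cy=1.4, cz=0.55, w=0.2, d=0.9, h=0.5
"""
```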
Accelerating AR, pushing LLMs forward, and advancing the state of the art in AI and ML research
SceneScript could unlock key use cases for both MR headsets and future AR glasses, like generating the maps needed to provide step-by-step navigation for people who are visually impaired, as demonstrated by Carnegie Mellon University in 2022.
SceneScript also gives LLMs the vocabulary necessary to reason about physical spaces. This could ultimately unlock the potential of next-generation digital assistants, providing them with the physical-world context necessary to answer complex spatial queries. For example, with the ability to reason about physical spaces, we could pose questions to a chat assistant like, “Will this desk fit in my bedroom?” or, “How many pots of paint would it take to paint this room?” Rather than having to find your tape measure, jot down measurements, and do your best to estimate the answer with some back-of-the-napkin math, a chat assistant with access to SceneScript could arrive at the answer in mere fractions of a second.
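As a toy illustration of the arithmetic such a query involves once a room's layout is known, the sketch below checks whether a desk footprint fits a free patch of floor and roughly estimates paint for the walls. All dimensions and coverage figures are made up for the example, and a real assistant would work from the parsed scene representation rather than hard-coded numbers.

```python
# Toy spatial reasoning over a known room layout.
# Room and desk dimensions below are invented for the example.
ROOM_W, ROOM_D, ROOM_H = 4.0, 3.0, 2.7   # metres, from a parsed scene layout
DESK_W, DESK_D = 1.6, 0.8                # metres, from a product listing

def desk_fits(free_w, free_d, desk_w, desk_d):
    """Check the desk footprint against a free floor patch, allowing rotation."""
    return (desk_w <= free_w and desk_d <= free_d) or \
           (desk_d <= free_w and desk_w <= free_d)

def paint_litres(wall_area_m2, coats=2, coverage_m2_per_litre=10.0):
    """Rough paint estimate: area times coats divided by coverage per litre."""
    return wall_area_m2 * coats / coverage_m2_per_litre

wall_area = 2 * (ROOM_W + ROOM_D) * ROOM_H   # ignores doors and windows
print(desk_fits(ROOM_W, ROOM_D, DESK_W, DESK_D))   # True
print(round(paint_litres(wall_area), 1))           # ~7.6 litres
```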
We believe SceneScript represents a significant milestone on the path to true AR glasses that will bridge the physical and digital worlds. As we dive deeper into this potential at Reality Labs Research, we’re thrilled at the prospect of how this pioneering approach will help shape the future of AI and ML research.
Learn more about SceneScript here.