Project CAIRaoke: Building the assistants of the future with breakthroughs in conversational AI

February 23, 2022

If we could interact with an AI assistant in natural, conversational language, the same way we interact with people, it could make our lives easier in countless ways. But assistants today are underwhelming, whether we’re interacting with them via voice or text. They are easily stumped by routine requests like “Silence all notifications for the rest of the day, unless it’s my mom calling,” let alone questions like “Can I rent the local community center for a private party?” or tasks like “Plan an affordable family beach vacation for the Fourth of July weekend.”

It’s time for better conversational AI.

To get us there, we’re excited to announce Project CAIRaoke. We developed an end-to-end neural model that can power much more personal and contextual conversations than the systems people are familiar with today. We’re already using the model that resulted from Project CAIRaoke in one of our products, Portal, and we aim to integrate it with augmented and virtual reality devices to enable immersive, multimodal interactions with assistants in the future.

Perhaps the largest obstacle to better conversational AI has been the architecture that powers even today’s most advanced assistants. Even though these systems provide one single service, they actually rely on four separate components: natural language understanding (NLU), dialog state tracking (DST), dialog policy (DP) management, and natural language generation (NLG). These distinct AI systems must then be linked together, which makes them difficult to optimize, poor at adapting to new or unfamiliar tasks, and highly dependent on labor-intensive annotated data sets.
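The four-stage pipeline described above can be sketched in miniature. This is an illustrative toy, not any production assistant's code; all function names and rules here are hypothetical. The point it demonstrates is structural: each stage consumes only the previous stage's output, so a mistake made early (say, a missed entity in NLU) is invisible to, and uncorrectable by, every later stage.

```python
# Toy sketch of the canonical assistant pipeline: NLU -> DST -> DP -> NLG.
# Each stage is trained and maintained separately, and each depends on the
# exact output format of the stage before it.

def nlu(utterance: str) -> dict:
    """Map raw text to an intent and entities (toy keyword matcher)."""
    if "reminder" in utterance:
        return {"intent": "create_reminder", "entities": {"time": "6:30"}}
    return {"intent": "unknown", "entities": {}}

def dst(state: dict, nlu_out: dict) -> dict:
    """Fold the newest NLU output into the running dialog state."""
    state = dict(state)
    state.update(nlu_out["entities"])
    state["intent"] = nlu_out["intent"]
    return state

def dialog_policy(state: dict) -> str:
    """Choose the next system action from the tracked state."""
    if state.get("intent") == "create_reminder" and "am_pm" not in state:
        return "request_am_pm"
    return "confirm_reminder"

def nlg(action: str) -> str:
    """Render the chosen action as a scripted response."""
    templates = {
        "request_am_pm": "Is that in the morning or evening?",
        "confirm_reminder": "OK, your reminder is set.",
    }
    return templates.get(action, "Sorry, I didn't get that.")

def respond(state: dict, utterance: str) -> tuple[dict, str]:
    """Run one user turn through all four stages in sequence."""
    new_state = dst(state, nlu(utterance))
    return new_state, nlg(dialog_policy(new_state))

state, reply = respond({}, "Set a reminder for 6:30")
```

Because the stages are chained like this, changing the NLU output schema forces changes in DST, DP, and NLG as well, which is exactly the optimization and maintenance burden described above.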

This is one reason why the digital assistants that power most devices today keep the user in a box with limited options, forget the context of conversation, and follow mostly prescribed dialog flows. You might be able to ask an assistant for the local weather forecast, for example, but it will be flummoxed if you follow up with something simple but unexpected, like “Is it hotter than it was last week?”

With models created with Project CAIRaoke, people will be able to talk naturally with their conversational assistants, so they can refer back to something from earlier in a conversation, change topics altogether, or mention things that rely on understanding complex, nuanced context. They will also be able to interact with them in new ways, such as by using gestures.


We’ve begun using the model on Portal, Meta’s video calling device, to make it easier to create and manage reminders. For example, you can quickly clarify a request like the following without having to repeat it:

👩‍: Set a reminder for 6:30.

✅ : Is that in the morning or evening?

👩‍: In the evening and let’s call it buy eggs.

✅ : OK, your reminder to buy eggs is set for 6:30 PM tomorrow.

Even in this early test, we believe the model outperforms standard approaches. On Portal, it delivered a significant improvement over our existing system in the reminders domain, as measured by the rate at which it successfully completed a set of reminder goals, while keeping the number of conversational turns on par.

But this is just a first step toward leveraging this new technology. We believe that the progress made with Project CAIRaoke will enable us to deliver richer communication between people and AI that will be an essential tool as we build for the metaverse. A Project CAIRaoke-powered assistant built into AR glasses could one day follow along in many new and useful ways. For example, if you asked, “What goes with these pants?” it could respond, “Here’s a shirt in your favorite color, red,” and show an image of an item it found for you. And if you said, “I like it, but the stripes are too broad,” it would show you a pinstriped version instead.

In the future, we hope to leverage models that result from this project in everyday applications like this for millions of people around the world.

Building truly interactive conversational AI

One necessary step in advancing conversational AI is understanding the full scope of the problem. Many people see the numerous recent advances in NLU, such as BART and GPT-3, and think the challenge of understanding and generating human-like text has been solved. To discern why we’re not there yet, we have to tease apart AI for understanding and AI for interaction. The former is well researched and developed across the industry. It’s used to extract meaning from various input modalities, such as automatic speech recognition, image classification, and NLU. The latter is how we use our understanding of the world to interact with other people using technology. This can be sending a text, a voice command, haptic feedback, showing an image, a video, an avatar face, or a combination of all these.

Researchers and engineers across the industry agree that good conversational systems need a solid understanding layer powered by AI models. But many treat interaction as an engineering problem rather than an AI problem: the thinking goes that an engineer who knows the state of the world can write elaborate logic to handle the required interaction. The engineering approach makes it easy to understand how the system works and to quickly debug the logic when necessary. Yet this common belief leads to less robust conversational AI, which is a major reason why you can't easily plan your vacation through such assistants.

A new, unified approach

Sample dialogs like the reminder exchange above illustrate key skills we want assistants to have: not just providing accurate, up-to-date, real-world knowledge but also working multimodally (for example, across vision and speech), working across domains (sending a message and also estimating your time of arrival), and letting you drive the conversation rather than forcing you to conform to a rigid conversational template.

The canonical approach for AI-powered assistants requires four sets of inputs and outputs — one for each layer of the pipeline (NLU, DST, DP, and NLG). And it also requires defined standards for inputs and outputs for each layer. For example, for NLU, a traditional conversational AI system requires defined ontologies (e.g., various intents and entities).

Our model, however, uses a neural network and doesn’t prescribe conversational flow at all. With this model, we need just one set of training data.
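The difference in training data can be made concrete with a small sketch. This is a hypothetical illustration of the data shapes involved, not Project CAIRaoke's actual format: the canonical pipeline needs a separately annotated dataset per module, while the end-to-end approach needs only pairs of dialog history and next response.

```python
# Canonical pipeline: four annotated datasets, one per module, each with
# its own ontology of intents, states, and actions.
pipeline_training_data = {
    "nlu": [("Set a reminder for 6:30", {"intent": "create_reminder"})],
    "dst": [],     # would hold dialog-state-tracking annotations
    "policy": [],  # would hold next-action annotations
    "nlg": [],     # would hold action-to-text pairs
}

# End-to-end model: a single dataset of (dialog history, next response),
# with no per-module labels at all.
end_to_end_training_data = [
    (
        ["User: Set a reminder for 6:30",
         "Assistant: Is that in the morning or evening?",
         "User: In the evening and let's call it buy eggs"],
        "OK, your reminder to buy eggs is set for 6:30 PM tomorrow.",
    ),
]

def train_end_to_end(examples):
    """Placeholder for seq2seq training: history text in, response out.
    In practice this would fine-tune a pretrained transformer on the
    flattened history; here we only show the single input/output shape."""
    return {"num_examples": len(examples)}

model_stats = train_end_to_end(end_to_end_training_data)
```

Because there is only one dataset and one model, there is no hand-defined ontology of intents and entities to maintain and no inter-module format contract to keep in sync.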


Project CAIRaoke reduces the work required to add a new domain. In the canonical approach, expanding to a new domain requires sequentially building and fixing each module before the next one can be trained reliably. In other words, training DP cannot be done effectively if NLU and DST change daily. Changes in one component can break the others, triggering a retraining of all downstream modules and slowing progress across the pipeline. With our end-to-end technique, we remove this dependency on upstream modules, boosting development and training speed, and enabling us to fine-tune other models with less effort and less data.

With our new approach, dialogs are much more robust because they’re able to make decisions by looking at the full range of information in a single place. Previously, even a small error in one component could propagate to other components in unexpected, difficult-to-address ways. For example, current rule-based assistants are explicitly programmed to look for specific words or phrases — “p.m.” after a number to indicate afternoon — whereas Project CAIRaoke leverages advanced pretrained language models that better understand context and can recognize different ways to say the same thing.
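The "p.m." example above can be shown as a toy. This is illustrative code, not anything shipped in a real assistant: a rule-based parser that only recognizes explicit meridiem markers succeeds on the phrasing it was programmed for and fails on an everyday paraphrase, which is precisely the gap a pretrained language model closes by understanding context.

```python
import re

def rule_based_meridiem(utterance: str):
    """Return 'AM'/'PM' only if an explicit marker follows a number.

    A hard-coded rule: look for 'a.m.'/'p.m.' (in a few spellings)
    immediately after a numeric time. Anything else is a miss.
    """
    match = re.search(r"\d[\d:]*\s*(a\.?m\.?|p\.?m\.?)", utterance, re.I)
    if not match:
        return None  # no hand-written pattern matched
    return "AM" if match.group(1).lower().startswith("a") else "PM"

# The phrasing the rule anticipates works fine...
explicit = rule_based_meridiem("Set a reminder for 6:30 p.m.")

# ...but a natural paraphrase of the same intent slips through.
paraphrase = rule_based_meridiem("Set a reminder for 6:30 in the evening")
```

Covering every paraphrase with more rules quickly becomes unmanageable; a pretrained model instead learns that "in the evening" and "p.m." convey the same thing.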

Finally, Project CAIRaoke fuses the technology supporting Meta AI’s latest conversational bot, BlenderBot 2.0, into task-oriented dialogs. This means that assistants built using our model could exhibit empathetic language, relay knowledge found by searching the internet in real time, and exhibit a consistent personality.

When systems generate natural language, it’s essential to address the potential safety and privacy challenges. Most NLG components today are scripted so that content moderators ensure assistants don’t provide objectionable responses to users. But by connecting the assistant directly to the user, there’s a danger of mistakes or offensive interactions, as has been widely and infamously seen.

Importantly, we’ve incorporated safeguards built into BlenderBot that will help reduce instances of offensive responses. We are also building assistant technology with privacy in mind. For example, with both Ray-Ban Stories and Portal, the use of voice commands is optional, you can view and delete your transcripts of your voice commands, and you always have the option to turn off voice storage.

To mitigate the risk of generating objectionable responses to users, the first milestone of Project CAIRaoke was to generate both dialog actions and natural language. In the short term, we generate dialog actions and rely on a tested and tightly constrained NLG system to provide the user response. In the long term, we'll expose the generated sentences after ensuring the end-to-end integrity of our model.
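The short-term arrangement described above can be sketched as follows. All names here are hypothetical, not Project CAIRaoke's actual API: the model predicts a structured dialog action, and only a pre-vetted template layer, never free-form generation, produces the text the user sees.

```python
# Every string a user can see is written and reviewed in advance,
# so content moderators can vet the full response surface.
SAFE_TEMPLATES = {
    "confirm_reminder": "OK, your reminder to {title} is set for {time}.",
    "request_time_of_day": "Is that in the morning or evening?",
}

def render(action: str, slots: dict) -> str:
    """Render a model-predicted dialog action with vetted templates only."""
    template = SAFE_TEMPLATES.get(action)
    if template is None:
        # An unrecognized action falls back to a safe default instead of
        # exposing raw generated text to the user.
        return "Sorry, I can't help with that yet."
    return template.format(**slots)

reply = render("confirm_reminder", {"title": "buy eggs", "time": "6:30 PM"})
```

Exposing model-generated sentences directly would remove this safety layer, which is why that step is deferred until the end-to-end model's outputs can be trusted.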

Another issue, which is shared by other kinds of NLP systems, is hallucination, which is when a model confidently states information that isn’t correct. This is a big challenge for end-to-end techniques, as models may be prone to introduce or alter entities in the dialog based on training data. For example, if you ask an assistant to “set a reminder to call Ankita,” it may set up a reminder to call Ankit, since Ankita is a less common name. We used various data augmentation techniques and attention networks to add robustness to Project CAIRaoke and leveraged our work with BlenderBot 2.0 to reduce hallucination.
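One flavor of the data augmentation mentioned above can be sketched simply. This is a toy illustration, not Meta's actual augmentation pipeline: by substituting many different entity values (here, contact names) into the same training utterance, the model sees rare names like "Ankita" as often as common ones like "Ankit" and learns to copy the entity it was given rather than drift toward a more frequent one.

```python
def augment_with_names(utterance: str, placeholder: str, names: list[str]):
    """Expand one templated utterance into one example per entity value."""
    return [utterance.replace(placeholder, name) for name in names]

# Each generated example keeps the same intent but varies the entity,
# discouraging the model from "correcting" an unusual name at inference.
examples = augment_with_names(
    "set a reminder to call {name}",
    "{name}",
    ["Ankita", "Ankit", "Priya", "Rahul"],
)
```

Entity substitution is only one technique; attention mechanisms that encourage the model to copy entities from the input serve the same anti-hallucination goal.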

Using voice for myriad everyday tasks

While our short-term implementation of the Project CAIRaoke model is for reminders on Portal, we hope soon to apply it to much larger domains, which will help personalize people's shopping experiences, enable assistants to maintain context across numerous chats, and let people drive the flow of conversation.

We also think this advancement is particularly useful for building AI-driven dialog capabilities for augmented reality. In the not-too-distant future, people will regularly use voice assistants on their AR glasses as they do today with smart speakers, watches, and other devices. With that in mind, we are working to reduce the size of end-to-end models like this one so they fit on-device, since on-device models also offer additional security, privacy, and performance benefits. We are also working to make the model easier to debug, a complicated challenge because in this new framework, information is represented in the embedding space, whereas in the canonical model, it's explicit. To fully realize our vision for Project CAIRaoke, we will also need to scale it to many languages and find ways to run the model efficiently for billions of people.

We can imagine that in a few years, the technology from Project CAIRaoke will underlie next-generation interaction between people and devices. On devices like VR headsets and AR glasses, we expect this type of communication to eventually be the ubiquitous, seamless method for navigation and interaction, much as how touchscreens replaced keypads on smartphones. Our current model is an important step forward, but we have more work to do to fully realize this vision. We are excited by both the progress we’ve made so far and the challenges ahead.

Watch the Meta AI Inside the Lab event here.

This work is being undertaken by a multidisciplinary team that includes Chinnadhurai Sankar, Maryam Daneshi, Nicholas Flores, Ahmad Beirami, David Whitney, Kaushik Ram Sadagopan, Ankita De, Stephen Roller, Paul Crook, Xiaoyu Zhai, Jeet Shah, Moya Chen, Eric Smith, Melanie Kambadur, Mary Williamson, Tushar Agarwal, Zhe Zhang, Shahin Shayandeh, Christopher Lin, Zhi Liu, Pooyan Amini, Jeremiah Chung, Satwik Kottur, Alexander Zotov, Paul Lee, Kyle Archie, Gabrielle Moskey, Soumya Banerjee, Piyush Khemka, Zhiqi Wang, John Bourassa, Yang Liu, Gerald Demeunynck, and Becka Silvert.
