Takeaways:
- Communication between people is like a dance, with each person continuously adjusting what they say, how they say it, and how they gesture. Modeling two-party, or dyadic, conversation dynamics entails understanding the multimodal relationship between vocal, verbal, and visual social signals, as well as the interpersonal behaviors between people, such as listening, visual synchrony, and turn-taking. As virtual agents become important helpers in our daily lives, these systems need to be able to display these natural patterns of conversation.
- Today, the Meta Fundamental AI Research (FAIR) team, together with Meta’s Codec Avatars lab and Core AI lab, is introducing a family of Dyadic Motion Models that explore new frontiers of social AI. These models render human or language-model-generated speech between two individuals into diverse, expressive full-body gestures and active listening behaviors, allowing the creation of fully embodied avatars in 2D video and as 3D Codec Avatars.
- The models process audio and visual inputs to capture nuanced conversational dynamics, with the potential to ultimately create more natural, interactive virtual agents that can engage in human-like social interactions across a variety of immersive settings.
The individual on the left side actively listens, nodding and maintaining eye contact while backchanneling.
Watch as the individual on the right side uses hand gestures in sync with their words, like when saying “chill”.
Visual rendering in 2D photorealistic style
Visual rendering in 3D
The models are enabled by the Seamless Interaction Dataset, which we’re publicly sharing to help the research community advance their work. The Seamless Interaction Dataset is the largest known video dataset of in-person, two-person conversation-based interactions and represents a crucial stepping stone to understanding and modeling how people communicate and behave when they’re together. In addition to the dataset, we’re publishing a technical report that details our methodology and findings and can serve as a blueprint for future research on audiovisual behavioral interaction modeling. Given the importance of evaluating progress in this emerging field, we’re also proposing an evaluation methodology based on subjective and objective metrics informed by this dataset. The modeling capabilities built from the dataset will help transform social virtual agents, telepresence technologies in VR and AR settings, and multimodal video content analysis.
Leveraging the Seamless Interaction Dataset, we built a family of Dyadic Motion research models that demonstrate what the dataset makes possible and pave the way for future audiovisual behavioral modeling research.
The Audio-Visual (AV) Dyadic Motion models can jointly generate facial expressions and body gestures. They take audio as input, either from two people or from LLM speech output, and produce the corresponding behavior. Imagine visualizing a previously recorded podcast between two people speaking: generating the full spectrum of emotions, gestures, and movements implied by their speech. The models generate the gestures and expressions of one specific speaker while taking the audio from both people into account, which allows them to visualize speaking gestures, listening gestures, and turn-taking cues. They can go one step further by also taking the visual input of the other party into consideration, enabling them to learn visual synchrony cues, such as smile mirroring or joint gaze attention.
The AV Dyadic Motion models can be used to animate avatar behavior, generating the facial expressions and body gestures of two people from pre-recorded audio of their voices. Our video results of this capability show the base case, where only audio is used as input, and an enhanced case where visual features are also available for one of the two people. Including visual features lets us show the visual synchrony learned by the audiovisual models. We’ve further developed these models by incorporating extra controllability parameters, providing greater flexibility and control over the models’ behavior. This can be particularly useful when users or designers want to adjust the expressivity of the avatar while speaking or listening. These controllability parameters can also be defined implicitly by the output of a speech LLM, providing visual guidance to the motion model.
Additionally, we designed our audiovisual behavioral research models to output intermediate codes for face and body motion, unlocking a wide range of possibilities for their application. This approach enables us to adapt these models for use in various contexts, including 2D video generation and the animation of 3D Codec Avatars, which can be used in immersive VR and AR experiences. Meta’s Codec Avatars lab has provided the research community with datasets and baseline reference implementations for Codec Avatars to support the advancement of metric telepresence research. More information about Codec Avatars can be found here.
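To make the interface described above concrete, here is a minimal sketch of how such a model might be invoked and how its intermediate motion codes could then be decoded. All names in this snippet are hypothetical and for illustration only; they are not a released API.

```python
import torch

# Hypothetical interface for a dyadic motion model (illustrative names, not a released API).
# The model conditions on audio from both parties, optional visual features from the
# conversation partner, and an expressivity control, and emits intermediate motion codes.
def generate_motion(
    model,                                       # a trained dyadic motion model (assumed)
    speaker_audio: torch.Tensor,                 # (num_samples,) waveform of the avatar being animated
    partner_audio: torch.Tensor,                 # (num_samples,) waveform of the other party
    partner_visual: torch.Tensor | None = None,  # optional per-frame visual features of the partner
    expressivity: float = 0.5,                   # controllability parameter, set by a user or a speech LLM
) -> torch.Tensor:
    """Returns a sequence of intermediate face and body motion codes for the speaker."""
    conditioning = {
        "audio_self": speaker_audio,
        "audio_other": partner_audio,
        "visual_other": partner_visual,          # None in the audio-only base case
        "expressivity": torch.tensor([expressivity]),
    }
    with torch.no_grad():
        return model(conditioning)               # shape: (num_frames, code_dim)

# Because the output is a motion-code sequence rather than pixels, the same codes can,
# in principle, drive different back ends (again, hypothetical decoder objects):
# video_frames = video_decoder(motion_codes)          # 2D photorealistic video
# avatar_params = codec_avatar_decoder(motion_codes)  # 3D Codec Avatar for VR and AR
```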
The Seamless Interaction Dataset is the largest high-quality dataset that captures a diverse range of in-person, embodied interactions between two individuals, with facial and body signals recorded simultaneously. Our work is anchored in contemporary psychological theory, which provides a roadmap for us to collect a diversity of conversational topics, interpersonal stances, and emotions.
All conversations were recorded with participants in the same location to preserve the essential characteristics of embodied interaction and to avoid the drawbacks of remote, video-based communication. Of these recordings, one-third are interactions between two people who are familiar with each other, such as family, friends, or colleagues. This familiarity not only offers a fascinating look at how relationships shape behavior, it also allows participants to be much more interactive from the start, avoiding the awkwardness that can sometimes arise when two strangers meet for the first time. It was also important that the dataset capture a wide range of human emotions and stances, such as surprise, disagreement, determination, and regret: in other words, the long tail of human face-to-face behavior. These types of interactions are difficult to capture in naturalistic data, so we recruited professional actors with improvisational experience to portray a range of roles and emotions. These conversations account for approximately one-third of the dataset.
We designed the Seamless Interaction Dataset to enable training and evaluation of audiovisual behavioral AI models. However, this work can also serve as a resource for a range of disciplines interested in language, behavior, and face-to-face interaction. As part of this release, we’re sharing a methodology to objectively and subjectively evaluate audiovisual behavioral model generations. We explore a series of objective metrics typically used by the research community and propose a comprehensive methodology for subjective evaluations, which can help assess the progress of future research. We present a comparative evaluation protocol with criteria that focus on speaking, listening, and turn-taking behavior. We hope this methodology will help the community advance their research as we work toward building better social technologies for the benefit of everyone.
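As one concrete example of the kind of objective metric used in this literature, the sketch below computes a Fréchet distance between Gaussian fits of real and generated motion features, in the spirit of the commonly used Fréchet Gesture Distance. It is an illustration of the metric family, not necessarily the exact metric set in our report.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two motion-feature sets,
    each of shape (num_clips, feature_dim)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Example with random stand-in features: a lower score means the generated
# motion distribution is closer to the real one.
rng = np.random.default_rng(0)
real = rng.normal(size=(256, 32))
gen = rng.normal(loc=0.1, size=(256, 32))
print(frechet_distance(real, gen))
```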
We prioritize privacy, ethics, and quality in how we collect and process data. This approach guides our AI research and its applications.
Privacy and ethics
We implemented several measures to protect the privacy of the people who allowed us to record their conversations for the purpose of building the Seamless Interaction Dataset. During the creation of the dataset, participants consented to the collection of their recorded conversations and were advised to avoid sharing personally identifiable information. To further protect participant anonymity, about one third of the conversations were scripted, minimizing the risk of disclosing personal details. A post-collection quality assurance process was also established to analyze every video for private or sensitive material.
Quality assurance processes
A quality assurance process was implemented to identify occurrences of sensitive material and personally identifiable information. Flagged content was then removed from the dataset. Our process took a multi-stage approach: an initial human quality review covering hundreds of hours of video, including samples from every recorded session, followed by model-based analyses covering the entire dataset.
We employed a conservative filtering strategy by combining the flags from all three methodologies, ultimately removing hundreds of hours of interactions flagged by any one system. This approach allowed us to maintain high standards of quality and integrity in our data collection efforts.
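To make the "conservative" part concrete: an interaction is removed if any review stage flags it, so the removed set is the union of the individual flag sets. The sketch below uses hypothetical data structures and is not our production pipeline.

```python
# Illustrative union-based filtering; the IDs and flag sets are hypothetical.
def filter_interactions(interaction_ids: list[str], flag_sets: list[set[str]]) -> list[str]:
    """Keep only interactions that no review methodology flagged."""
    flagged = set().union(*flag_sets)  # a single flag from any stage removes the interaction
    return [iid for iid in interaction_ids if iid not in flagged]

# Example: human review plus two model-based analyses.
kept = filter_interactions(
    ["s001", "s002", "s003"],
    [{"s002"}, set(), {"s003"}],
)
assert kept == ["s001"]
```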
Watermarking
We use AudioSeal and VideoSeal to watermark content generated by our audiovisual behavioral models. These tools embed hidden messages into the generated audio and video frames, which can then be extracted by dedicated detectors. This allows us to verify the authenticity and origin of content, even after processing or manipulation. By implementing watermarking, we aim to provide an additional layer of security, transparency, and accountability.
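As a sketch of what this looks like in practice, the snippet below uses the publicly released AudioSeal package to watermark a waveform and then detect the watermark; the model-card names follow AudioSeal's documentation, and VideoSeal exposes a similar embed/detect workflow for video. Exact signatures may differ across library versions, so treat this as illustrative.

```python
import torch
from audioseal import AudioSeal  # pip install audioseal

# Load the watermark generator and detector (model cards from the AudioSeal release).
generator = AudioSeal.load_generator("audioseal_wm_16bits")
detector = AudioSeal.load_detector("audioseal_detector_16bits")

# A placeholder mono waveform of shape (batch, channels, samples) at 16 kHz,
# standing in for generated speech audio.
sample_rate = 16000
wav = torch.randn(1, 1, sample_rate)

# Embed the watermark additively, then check that the detector finds it.
watermark = generator.get_watermark(wav, sample_rate)
wav_marked = wav + watermark
score, message = detector.detect_watermark(wav_marked, sample_rate)
print(f"Watermark confidence: {score:.3f}")
```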
Our research models have the potential to transform future social technologies that help enhance our daily lives, entertain, and bring us closer together. By prioritizing responsible AI practices, we hope to continue to build trust in our models and create technology that benefits everyone. We look forward to seeing how the community uses the dataset and technical report we’re sharing today to advance their work.