June 24, 2022
Whether it’s mingling at a party in the metaverse or watching a home movie in your living room while wearing augmented reality (AR) glasses, acoustics play a role in how these moments will be experienced. We are building for mixed reality and virtual reality experiences like these, and we believe AI will be core to delivering sound quality that realistically matches the settings people are immersed in.
Today, Meta AI researchers, in collaboration with an audio specialist from Meta’s Reality Labs and researchers from the University of Texas at Austin, are open-sourcing three new models for audio-visual understanding of human speech and sounds in video that are designed to push us toward this reality at a faster rate.
We need AI models that understand a person’s physical surroundings based on both how they look and how things sound. For example, there’s a big difference between how a concert would sound in a large venue versus in your living room. That’s because the geometry of a physical space, the materials and surfaces in the area, and the proximity of where the sounds are coming from all factor into how we hear audio.
The research we are sharing today with the AI community focuses on three audio-visual tasks, with models that outperform existing methods. For our Visual Acoustic Matching model, we can input an audio clip recorded anywhere, along with an image of a target environment, and transform the clip to make it sound as if it were recorded in that environment. For example, the model could take an image of a dining room in a restaurant, together with the audio of a voice recorded in a cave, and make that voice sound instead like it was recorded in the pictured restaurant. The second model, Visually-Informed Dereverberation, does the opposite. Using observed sounds and the visual cues of a space, it focuses on removing reverberation, which is the echo a sound makes based on the environment where it is recorded. Imagine a violin concert in a busy train station. This model can distill the essence of the violinist’s music without the reverberations bouncing around the massive train station. The third model, VisualVoice, uses visual and audio cues to separate speech from other background sounds and voices, which will be beneficial for human and machine understanding tasks, such as creating better subtitles or mingling at a party in VR.
All three works tie into the body of AI research we are doing at Meta AI around audio-visual perception. We envision a future where people can put on AR glasses and relive a holographic memory that looks and sounds the exact way they experienced it from their vantage point, or feel immersed by not just the graphics but also the sounds as they play games in a virtual world. These models are bringing us even closer to the multimodal, immersive experiences we want to build in the future.
Anyone who has watched a video where the audio isn’t consistent with the scene knows how disruptive this can feel to human perception. However, getting audio and video from different environments to match has previously been a challenge. Acoustic simulation models can generate a room impulse response to re-create the acoustics of a room, but only if the geometry of the space (often in the form of a 3D mesh) and its material properties are known, and in most cases this information isn’t available. Acoustic properties can also be estimated from just the audio captured in a particular room, but the reverberation in a single audio sample reveals only limited information about the target space. Neither approach on its own reliably reproduces how audio would sound in a new environment.
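Concretely, the effect of a room on a sound source is captured by its room impulse response: convolving a dry recording with that impulse response yields audio that sounds as if it were recorded in the room. The short sketch below illustrates the idea; the file names are placeholders, and it is not tied to any particular simulator.

```python
# Minimal sketch (file names are placeholders): applying a room impulse
# response (RIR) to a dry recording by convolution makes it sound as if it
# were recorded in that room.
import soundfile as sf
from scipy.signal import fftconvolve

dry, sr = sf.read("dry_speech.wav")       # anechoic ("dry") speech
rir, sr_rir = sf.read("room_rir.wav")     # measured or simulated impulse response
assert sr == sr_rir, "resample first so the sample rates match"
if dry.ndim > 1:                          # keep the sketch mono for simplicity
    dry = dry.mean(axis=1)
if rir.ndim > 1:
    rir = rir.mean(axis=1)

reverberant = fftconvolve(dry, rir)                  # impose the room's acoustics
reverberant /= max(abs(reverberant).max(), 1e-8)     # normalize to avoid clipping
sf.write("reverberant_speech.wav", reverberant, sr)
```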
To address these challenges, we created a self-supervised Visual Acoustic Matching model, called AViTAR, which adjusts audio to match the space of a target image. We use a cross-modal transformer model, where the inputs consist of both images and audio, allowing the transformer to perform intermodality reasoning and generate a realistic audio output that matches the visual input. The self-supervised training objective learns acoustic matching from in-the-wild web videos, even though those videos are unlabeled and contain no acoustically mismatched audio to train on.
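The full AViTAR architecture is described in the paper linked below. Purely as a hypothetical sketch of the cross-modal idea, image patches and audio spectrogram frames can be embedded into one shared token sequence and processed by a single transformer encoder, so attention can flow between the two modalities; the layer sizes here are illustrative assumptions, not the released model.

```python
import torch
import torch.nn as nn

class CrossModalAcousticMatcher(nn.Module):
    """Hypothetical sketch, not the AViTAR implementation: fuse image and
    audio tokens in one transformer so it can reason across modalities."""
    def __init__(self, dim=256, n_heads=8, n_layers=4):
        super().__init__()
        self.img_proj = nn.Linear(3 * 16 * 16, dim)   # flattened 16x16 RGB patches
        self.aud_proj = nn.Linear(257, dim)           # spectrogram frames (257 freq bins)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, 257)               # predict target-room spectrogram frames

    def forward(self, img_patches, audio_frames):
        # img_patches: (B, P, 768), audio_frames: (B, T, 257)
        tokens = torch.cat([self.img_proj(img_patches),
                            self.aud_proj(audio_frames)], dim=1)
        fused = self.encoder(tokens)                     # intermodality reasoning
        audio_tokens = fused[:, img_patches.shape[1]:]   # keep the audio positions
        return self.head(audio_tokens)                   # waveform re-synthesis happens downstream
```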
We built this task with two datasets. For our first dataset, we built on the work we did with SoundSpaces, the audio-visual platform for AI that we open-sourced in 2020. Built on top of AI Habitat, SoundSpaces makes it possible to insert high-fidelity, realistic simulations of any sound source into various real-world scanned environments from the open-source Replica and Matterport3D datasets. The second dataset consists of three- to 10-second clips of people speaking across 290,000 publicly available English-language videos.
For both datasets, we focused on speech in indoor settings, given their relevance to many of the possible future use cases and because human listeners have strong prior knowledge about how reverberation should affect speech. We filtered the datasets down to clips that met our problem formulation criterion: The microphone and camera needed to be located together and away from the sound source. This was important because sounds may be heard differently depending on where the source of the sound is and where the person or microphone is located.
One challenge we had to overcome for the web videos was that we had only audio that already matched the acoustics of the target environment. Because of this, we introduced the idea of mismatches: we first performed dereverberation to strip away the original room’s effects, then convolved the audio with the impulse response of another environment to randomize the acoustics, and added noise, producing audio with the same content but different acoustics.
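As a rough illustration only (not the exact pipeline used in the paper), this style of augmentation can be sketched as follows; `dereverberate` is a placeholder for any off-the-shelf dereverberation step, and the noise is assumed to be at least as long as the audio.

```python
import numpy as np
from scipy.signal import fftconvolve

def make_mismatched(audio, rir_other_room, noise, snr_db=20.0, dereverberate=None):
    """Sketch: same speech content, different acoustics.

    `dereverberate` is a placeholder for any dereverberation model; it is an
    illustrative assumption, not part of the released code.
    """
    dry = dereverberate(audio) if dereverberate is not None else audio
    # Impose another environment's acoustics.
    reverberant = fftconvolve(dry, rir_other_room)[: len(dry)]
    # Add background noise at a chosen signal-to-noise ratio.
    noise = noise[: len(reverberant)]
    sig_pow = np.mean(reverberant ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverberant + scale * noise
```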
We validated our model on both datasets and measured the quality of the generated audio on three criteria: closeness to the ground-truth audio (when available), the correctness of the room acoustics, and how well speech quality was preserved in the synthesized output. We also wanted to see how it performed with human listeners, whom we asked to evaluate whether the acoustics matched the reference image. The results show that our model successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional audio-only acoustic matching and more heavily supervised baselines.
For Visual Acoustic Matching, one future use case we are interested in involves reliving past memories. Imagine being able to put on a pair of AR glasses and see an object with the option to play a memory associated with it, such as picking up a tutu and seeing a hologram of your child’s ballet recital. The audio strips away reverberation and makes the memory sound just like the time you experienced it, sitting in your exact seat in the audience.
While there are many cases where adding reverberation with visual acoustic matching is helpful, there are also settings where we need to do the opposite, removing reverberation in order to enhance hearing and understanding.
Reverberation, produced as sound reflects off surfaces and objects in the environment, degrades the quality of speech for human perception and severely affects the accuracy of automatic speech recognition. By removing reverberation, we strip away these environmental effects so that speech can be more easily recognized and enhanced, helping automatic speech recognition create more accurate subtitles for people with hearing loss, for example.
Prior approaches have tried to remove reverberation based solely on the audio modality, but this does not inform us of the complete acoustic characteristics of the environment. Blind dereverberation relies on prior knowledge of human speech to remove the reverberation, without accounting for the surrounding environment. This is why we need visual observations.
The Visually-Informed Dereverberation of Audio (VIDA) model learns to remove reverberation based on both the observed sounds and the visual stream, which reveals cues about room geometry, materials, and speaker locations — factors that influence the reverberation effects heard in the audio stream.
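The released VIDA model is described in the paper linked below. Purely as an illustrative sketch of the idea, a visual embedding of the room can condition a time-frequency mask that keeps the direct speech and suppresses the reverberant tail; the feature sizes and the ResNet-style image feature are assumptions for the example.

```python
import torch
import torch.nn as nn

class VisuallyInformedDereverb(nn.Module):
    """Hypothetical sketch, not the released VIDA code: predict a
    time-frequency mask for the dry speech, conditioned on a visual
    embedding that summarizes room geometry, materials, and speaker location."""
    def __init__(self, freq_bins=257, dim=256):
        super().__init__()
        self.audio_enc = nn.GRU(freq_bins, dim, batch_first=True, bidirectional=True)
        self.visual_proj = nn.Linear(512, 2 * dim)    # e.g., a 512-dim image feature (assumption)
        self.mask_head = nn.Sequential(nn.Linear(2 * dim, freq_bins), nn.Sigmoid())

    def forward(self, reverberant_spec, visual_feat):
        # reverberant_spec: (B, T, F) magnitude spectrogram; visual_feat: (B, 512)
        h, _ = self.audio_enc(reverberant_spec)               # (B, T, 2*dim)
        h = h + self.visual_proj(visual_feat).unsqueeze(1)    # inject room cues
        mask = self.mask_head(h)                              # (B, T, F) in [0, 1]
        return mask * reverberant_spec                        # estimate of the dry speech
```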
In this case, we want to take the reverberant audio from a specific place and strip away the room’s acoustic effects. To do this, we built on our work with SoundSpaces and developed a large-scale training dataset that uses realistic acoustic renderings of speech.
We demonstrated our approach on simulated and real imagery for speech enhancement, speech recognition, and speaker identification. Our results show that VIDA achieves state-of-the-art performance and is a substantial improvement over traditional audio-only methods. This will be important as we build realistic experiences for mixed and virtual reality.
A third model, VisualVoice, understands speech by looking as well as hearing. This is important for improving human and machine perception.
One reason people are better than AI at understanding speech in complex settings is that we use not just our ears but also our eyes. For example, we might see someone’s mouth moving and intuitively know the voice we’re hearing must be coming from that person. That’s why Meta AI is working on new conversational AI systems that, like humans, can recognize the nuanced correlations between what they see and what they hear in conversation.
VisualVoice learns in a way that’s similar to how people master new skills: multimodally, by learning visual and auditory cues from unlabeled videos to achieve audio-visual speech separation. For machines, this creates better perception, which can improve areas of accessibility, such as creating more accurate captions. Human perception benefits, too. For example, imagine attending a group meeting in the metaverse with colleagues from around the world: instead of everyone talking over one another, the reverberation and acoustics would adjust as people moved around the virtual space and joined smaller groups. VisualVoice generalizes well to challenging real-world videos of diverse scenarios.
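As with the other models, the actual VisualVoice architecture is described in its paper. A hypothetical sketch of the underlying idea is that features of the visible speaker, their facial identity and lip motion, condition a separation mask over the audio mixture; the feature dimensions and fusion scheme below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AudioVisualSeparator(nn.Module):
    """Hypothetical sketch, not the released VisualVoice code: keep the voice
    that matches the visible speaker by conditioning a separation mask on
    face-identity and lip-motion features."""
    def __init__(self, freq_bins=257, dim=256, face_dim=512, lip_dim=128):
        super().__init__()
        self.audio_enc = nn.GRU(freq_bins, dim, batch_first=True, bidirectional=True)
        self.face_proj = nn.Linear(face_dim, 2 * dim)   # who is speaking (identity)
        self.lip_proj = nn.Linear(lip_dim, 2 * dim)     # when they speak (motion)
        self.mask_head = nn.Sequential(nn.Linear(2 * dim, freq_bins), nn.Sigmoid())

    def forward(self, mixture_spec, face_feat, lip_feats):
        # mixture_spec: (B, T, F); face_feat: (B, face_dim);
        # lip_feats: (B, T, lip_dim), assumed time-aligned with the spectrogram
        h, _ = self.audio_enc(mixture_spec)                             # (B, T, 2*dim)
        h = h * torch.sigmoid(self.face_proj(face_feat)).unsqueeze(1)   # gate by speaker identity
        h = h + self.lip_proj(lip_feats)                                # align with lip motion over time
        mask = self.mask_head(h)                                        # (B, T, F) in [0, 1]
        return mask * mixture_spec      # spectrogram estimate of the target speaker's speech
```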
Together, these models could one day enable smart assistants to hear what we’re telling them, no matter the circumstances — whether at a concert, at a crowded party, or in any other noisy place.
Existing AI models do a good job understanding images, and are getting better at video understanding. However, if we want to build new, immersive experiences for AR and VR, we need AI models that are multimodal — models that can take audio, video, and text signals all at once and create a much richer understanding of the environment.
This is an area we will continue exploring. AViTAR and VIDA are currently based on only a single image. In the future, we want to explore using video and other dynamics to capture the acoustic properties of a space. This will help bring us closer to our goal of creating multimodal AI that understands real-world environments and how people experience them.
We are excited to share this research with the open source community. We believe AI that understands the world around us can help unlock exciting new possibilities to benefit how people experience and interact in mixed and virtual reality.
Download our research papers and explore the project pages to see our models in action.
Visual Acoustic Matching
Research Paper | Project Page

Visually-Informed Dereverberation
Research Paper | Project Page

VisualVoice
Research Paper | Project Page

The research in this blog post reflects the contributions of Kristen Grauman and Changan Chen. We'd also like to acknowledge Paul Calamia of Meta’s Reality Labs and Ruohan Gao of Stanford.