Generating chit-chat including laughs, yawns, 'ums,' & other nonverbal cues from raw audio

March 31, 2022

In any given conversation, people exchange a wealth of nonverbal signals, such as intonation, emotional expression, pauses, accents, and rhythm, all of which are important to human interactions. But today’s AI systems fail to capture these rich, expressive signals because they learn only from written text, which captures what we say but not how we say it.

Last year, we introduced the Generative Spoken Language Model (GSLM), a breakthrough natural language processing (NLP) model that breaks free of the traditional dependence on text. GSLM discovers structured content directly from raw audio signals, without any labels or text, much as a person would. It enables NLP models to capture the expressivity of oral language, and it can be used as a form of pretraining for downstream applications or as a generative tool, producing possible continuations from a given input audio prompt.

Today, we’re announcing three milestones toward more expressive NLP models:

First, we’ve open-sourced the Textless Python Library, which machine learning practitioners can use to quickly build experiments on top of GSLM components (encoder, language model, and decoder). Check out the library here.
Second, we can now model expressive vocalizations, like laughter, yawning, and cries. These expressions are essential to understanding the context of an interaction the way a person would, because they carry nuances about a speaker’s communicative intention or sentiment, whether that’s irony, irritation, or boredom. Check out samples here and below.
Third, we can model spontaneous, real-time chit-chat between two AI agents. The agents factor in behavior, like occasional overlaps or pauses, which will be important for building agents like virtual assistants that can understand nuanced social cues and signals, like interruptions, as well as positive or negative feedback when chatting with someone. Check out samples here and below.

As the world becomes more digital, and as we leverage AI in the metaverse, AI-powered applications will create new experiences that go beyond typing text toward more fluid ways of interaction, like voice and gesture. All these advancements using representation and self-supervised learning have the potential to help researchers break free of traditional text-based models and build more natural, engaging AI systems of the future.

Beyond lacking expressivity, traditional NLP applications, which rely on massive text resources, are available in only a handful of languages in the world. In the long term, we believe the advancement of textless NLP systems will also help make AI more inclusive of more people, particularly people who speak languages and dialects without standardized writing systems, such as dialectal Arabic or Swiss German.

Rendering realistic audio of emotive expression, from amused to sleepy

Neutral to amused

Neutral to angry

Neutral to sleepy

Demonstration of textless NLP model generating emotion from raw audio only.

One of the technical challenges of capturing emotional expressiveness in speech is that such expressions typically affect many aspects of language at once. When people, for instance, shift from expressing happiness to anger, they may use different vocabulary and insert cries, grunts, and other nonverbal vocalizations. They may modify prosody, like intonation and rhythm, and they may change voice quality due to stress. Likewise, each aspect of language contributes to the perception of the apparent emotional state of the speaker, where the nonverbal can completely change the meaning.

Text-only systems represent these layers poorly, often leaving utterances ambiguous without context and leading to misunderstandings. To make things worse, all these manifestations of emotional expression are complex and resource-intensive to annotate, making it difficult to augment text-based systems with extra features derived from audio.

We took a radically different approach. Instead of trying to fix text-based systems, we holistically model all these layers from raw audio at the same time — achieving realistic audio rendering of emotive expressions for the first time. The key to this achievement is GSLM’s ability to capture generic audio events irrespective of whether they are verbal, in particular nonverbal vocalizations, like laughter or yawning, which inform the expression and perception of emotional states or intentions that can meaningfully influence conversations.

An illustration of the proposed system. We use pink to denote models and green to denote representations. The input signal is first encoded as a discrete sequence of units (representing speech content). Next, a sequence-to-sequence translation model (conditioned by the target emotion tokens) is applied over the discrete units to translate between emotions. This is followed by a duration prediction and F0 estimation for each of the units (also conditioned by target emotion). Finally, the speech signal is synthesized based on the translated units and predicted prosodic features, together with the speaker features and the target emotional expressions.

We illustrated this in an emotion conversion task, where a sentence is presented in one of five emotional expressions (neutrality, anger, amusement, sleepiness, or disgust) and converted to another one of these emotional expressions. We treated this problem as a sequence-to-sequence translation problem, enabling the easy insertion, deletion, or substitution of nonverbal vocalizations. In addition, we conditioned the generation of rhythm (duration), intonation (F0), and voice on the target emotion, which made it possible, for the first time, to cover all of the layers of the expression of emotion.
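
To make the intermediate representation more concrete, here is a minimal Python sketch of one way to separate which units are spoken from how long each lasts: a frame-level unit sequence is collapsed into (unit, duration) pairs and later expanded back. The collapsing step, the function names, and the toy unit IDs are illustrative assumptions, not the model's actual code. In the system described above, target-side durations come from a learned predictor conditioned on the target emotion; the hard-coded list below only stands in for that predictor.

```python
from itertools import groupby
from typing import List, Tuple


def collapse_units(frames: List[int]) -> Tuple[List[int], List[int]]:
    """Run-length encode frame-level unit IDs into (units, durations)."""
    units, durations = [], []
    for unit, run in groupby(frames):
        units.append(unit)
        durations.append(len(list(run)))
    return units, durations


def expand_units(units: List[int], durations: List[int]) -> List[int]:
    """Repeat each unit for its duration (in frames) to rebuild a frame-level sequence."""
    if len(units) != len(durations):
        raise ValueError("units and durations must have the same length")
    frames: List[int] = []
    for unit, dur in zip(units, durations):
        frames.extend([unit] * dur)
    return frames


# Toy frame-level unit IDs (hypothetical values).
source_frames = [7, 7, 7, 12, 12, 3, 3, 3, 3]
units, durations = collapse_units(source_frames)

# Stand-in for a duration predictor conditioned on the target emotion:
# here we simply pretend it asked for slower, "sleepy" timing.
predicted_durations = [5, 3, 6]
target_frames = expand_units(units, predicted_durations)
```

Keeping content (the units) apart from timing (the durations) is what would let a model change rhythm without touching what is said; F0 and speaker features could be handled as further parallel tracks in the same spirit.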

Upon evaluation, our model achieved vastly higher quality compared with previous best emotional voice conversion models. In fact, results are very close in quality to the original audio, illustrated in light green in the chart below:

We compared the proposed approach to a state-of-the-art text-based emotional voice conversion model, Seq2seq-EVC. We also evaluated an expressive text-to-speech system based on Tacotron2. We report Mean-Opinion Scores (MOS) and Emotion Mean-Opinion Classification (eMOS) in the figures above. Results suggest that the proposed approach (dark green) is vastly superior to past approaches (dark purple and light purple) in terms of speech generation quality (MOS) and in capturing the target emotion (eMOS), and in fact is very close in quality to the original audio (light green).
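
For readers unfamiliar with the metric, a mean opinion score is the average of listener ratings for a system, commonly collected on a 1-to-5 scale. The sketch below computes it together with a normal-approximation 95% confidence interval. The ratings are made-up placeholder values, not results from this work, and the rating scale and interval method are assumptions.

```python
import math
import statistics
from typing import List, Tuple


def mean_opinion_score(ratings: List[int]) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 times the standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Placeholder ratings on an assumed 1-5 scale (not data from the study).
ratings = [4, 5, 4, 3, 5, 4, 4, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```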

Generating chit-chat with pauses, ‘ums,’ and overlapping speech

Unlike written exchanges, spoken dialogues take place in real time. That means even small perturbations in timing, such as transmission delays in a video call, can disrupt the smoothness and naturalness of spontaneous exchanges.

Research shows that in any given verbal conversation, speakers spontaneously interpret timing between speech units. People generally time their own speech with utmost precision. Gaps and pauses are informative, and overlapping speech or nonverbals can signal agreement, disagreement, or a willingness to take the floor.

Demonstration of annotated data sets with rich expressivity of oral language.

Until now, such richness has been very difficult to address with AI, and required highly complex, carefully annotated data sets like the one pictured above.

Our approach was to model speech content, nonverbal vocalization, and timing in a holistic way. The key idea is simple: We modeled a dialogue as two parallel streams of speech units automatically derived as in GSLM. We used two identical transformers — one per stream, trained to predict its own future sets of units but informed about the states of the other transformer through cross-attention.
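
As a rough illustration of the cross-attention step, the sketch below lets each time step of one stream attend over the states of the other stream up to the same time step. It is a toy single-head, pure-Python version with no learned projections. The function names, the causal restriction, and the two-dimensional toy states are assumptions made for illustration, not the architecture's actual implementation.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_cross_attention(own: List[Vector], other: List[Vector]) -> List[Vector]:
    """For each step t, mix the other stream's states 0..t, weighted by
    scaled dot-product similarity to this stream's state at t."""
    if len(own) != len(other):
        raise ValueError("both streams must cover the same time steps")
    dim = len(own[0])
    outputs = []
    for t, query in enumerate(own):
        visible = other[: t + 1]  # no peeking at the other speaker's future
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        )
    return outputs


# Toy hidden states for two speakers over three time steps (hypothetical values).
speaker_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speaker_b = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]

a_informed_by_b = causal_cross_attention(speaker_a, speaker_b)
b_informed_by_a = causal_cross_attention(speaker_b, speaker_a)
```

Running the same function in both directions mirrors the idea of two identical models, one per speaker, each predicting its own stream while staying aware of the other.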

Animation of the architecture of the textless chit-chat bots, including a demonstration of chit-chat dialogue.

A few examples of generations are found here. The model is prompted by 10 seconds of real conversation and continues with its own version of the chit-chat. The model is able to reproduce naturalistic distributions of gaps, turn durations, and overlaps.
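
One simple way to inspect such timing statistics is to compare per-frame speech activity across the two channels. The sketch below is an illustrative assumption about how these events could be counted; the frame-level boolean encoding, the function names, and the toy data are hypothetical, and the evaluation used in this work may define the events differently. Frames where both speakers are active count as overlap, frames where neither is active count as joint silence (covering both gaps and pauses), and each maximal run of activity in a channel is treated as one stretch of speech.

```python
from itertools import groupby
from typing import Dict, List


def run_lengths(flags: List[bool]) -> List[int]:
    """Lengths (in frames) of each maximal run of True values."""
    return [len(list(run)) for value, run in groupby(flags) if value]


def timing_stats(a_active: List[bool], b_active: List[bool]) -> Dict[str, List[int]]:
    """Summarize overlaps, joint silences, and per-speaker speech runs."""
    if len(a_active) != len(b_active):
        raise ValueError("channels must have the same number of frames")
    overlap = [a and b for a, b in zip(a_active, b_active)]
    silence = [not a and not b for a, b in zip(a_active, b_active)]
    return {
        "overlaps": run_lengths(overlap),
        "silences": run_lengths(silence),
        "a_speech": run_lengths(a_active),
        "b_speech": run_lengths(b_active),
    }


# Toy activity tracks, one flag per frame (hypothetical data).
a = [True, True, True, False, False, False, True, True]
b = [False, False, True, True, False, False, False, True]
stats = timing_stats(a, b)
```

Comparing histograms of these run lengths between real and generated conversations would give a rough sense of how naturalistic the generated timing is.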

What’s next for textless NLP?

In the near future, we will focus on applying textless techniques to build useful downstream applications, such as question answering (e.g., “How’s the weather?”), without requiring either resource-intensive text labels or automatic speech recognition (ASR) systems. We believe prosody in speech can help better parse a sentence, which in turn facilitates understanding the intent and improves the performance of question answering.

One of these applications is speech-to-speech translation, or dubbing. Because traditional AI systems are so resource-intensive, dubbing is typically done by going through text: converting audio to text, translating the text, and converting it back to audio. This makes the entire system complicated and difficult to train. It also misses the expressivity of oral language, not just because intonations and nonverbal expressions are lost through text but also because language models are trained on text that is missing the idiomatic expressions specific to oral language. Because self-supervised speech representation approaches are able to learn discrete units from raw audio, it’s now possible to remove the need for text and replace it with the pseudo text extracted from each of the target and source languages.

We believe that textless NLP can outperform traditional composite systems (ASR+NLP) because of the possibility of integrating nonverbal vocalizations and prosodic information, which convey rich semantic and pragmatic information on top of phonemes and are typically not available in text.

Stay tuned for more on our efforts toward the new era of textless NLP.

The work discussed in this blog post reflects the contributions from Meta AI researchers Yossi Adi, Jade Copet, Emmanuel Dupoux, Wei-Ning Hsu, Evgeny Kharitonov, Felix Kreuk, Abdelrahman Mohamed, and Tu Anh Nguyen (listed in alphabetical order).