March 31, 2022
In any given conversation, people exchange a wealth of nonverbal signals, such as intonation, emotional expression, pauses, accents, and rhythm, all of which are important to human interaction. But today’s AI systems fail to capture these rich, expressive signals because they learn only from written text, which captures what we say but not how we say it.
Last year, we introduced the Generative Spoken Language Model (GSLM), a breakthrough natural language processing (NLP) model that breaks free of the traditional dependence on text. GSLM discovers structured content by working directly on raw audio signals, without any labels or text, much as a person would. It enables NLP models to capture the expressivity of oral language, and it can be used as a form of pretraining for downstream applications or as a generative tool, producing possible continuations from a given input audio prompt.
Today, we’re announcing three milestones toward more expressive NLP models:
First, we’ve open-sourced the Textless Python Library, which machine learning practitioners can use to quickly build experiments on top of GSLM components (encoder, language model, and decoder). Check out the library here.
Second, we can now model expressive vocalizations, like laughter, yawning, and cries. These expressions are essential to understanding the context of an interaction the way a person would, because they carry nuances about a speaker’s communicative intention or sentiment, whether that’s irony, irritation, boredom, or something else. Check out samples here and below.
Third, we can model spontaneous, real-time chit-chat between two AI agents. The agents factor in behavior, like occasional overlaps or pauses, which will be important for building agents like virtual assistants that can understand nuanced social cues and signals, like interruptions, as well as positive or negative feedback when chatting with someone. Check out samples here and below.
As the world becomes more digital, and as we leverage AI in the metaverse, AI-powered applications will create new experiences that go beyond typing text toward more fluid ways of interaction, like voice and gesture. All these advancements using representation and self-supervised learning have the potential to help researchers break free of traditional text-based models and build more natural, engaging AI systems of the future.
Beyond lacking expressivity, traditional NLP applications, which rely on massive text resources, are available in only a handful of languages in the world. In the long term, we believe the advancement of textless NLP systems will also help make AI more inclusive of more people, particularly people who speak languages and dialects without standardized writing systems, such as dialectal Arabic or Swiss German.
Neutral to amused
One of the technical challenges of capturing emotional expressiveness in speech is that such expressions typically affect many aspects of language at once. When people, for instance, shift from expressing happiness to anger, they may use different vocabulary and insert cries, grunts, and other nonverbal vocalizations. They may modify prosody, like intonation and rhythm, and they may change voice quality due to stress. Likewise, each aspect of language contributes to the perception of the apparent emotional state of the speaker, where the nonverbal can completely change the meaning.
Text-only systems represent these layers poorly, so an utterance stripped of its context is often ambiguous and easily misunderstood. To make things worse, all these manifestations of emotional expression are complex and resource-intensive to annotate, making it difficult to augment text-based systems with extra features derived from audio.
We took a radically different approach. Instead of trying to fix text-based systems, we holistically model all these layers from raw audio at the same time, achieving realistic audio rendering of emotive expressions for the first time. The key to this achievement is GSLM’s ability to capture generic audio events whether or not they are verbal. This includes nonverbal vocalizations, like laughter or yawning, which inform the expression and perception of emotional states or intentions and can meaningfully influence conversations.
We illustrated this in an emotion conversion task, where a sentence is presented in one of five emotional expressions (neutrality, anger, amusement, sleepiness, or disgust) and converted to another one of these emotional expressions. We treated this as a sequence-to-sequence translation problem, which makes it easy to insert, delete, or substitute nonverbal vocalizations. In addition, we conditioned the generation of rhythm (duration), intonation (f0), and voice on the target emotion, which made it possible, for the first time, to cover all of the layers of the expression of emotion.
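To make the sequence-to-sequence framing concrete, here is a minimal sketch of how an utterance might be represented as discrete units with durations. The unit IDs, the LAUGH token, and the hand-written "amused" sequence are hypothetical stand-ins for what a trained model would produce, and collapsing repeated frames into (unit, duration) pairs is an assumption made for illustration, not a description of our exact pipeline.

```python
from itertools import groupby

def collapse(frame_units):
    """Run-length encode frame-level units into (unit, duration) pairs."""
    return [(unit, len(list(group))) for unit, group in groupby(frame_units)]

def expand(unit_durations):
    """Inverse of collapse: repeat each unit for its duration in frames."""
    return [unit for unit, duration in unit_durations for _ in range(duration)]

LAUGH = 99  # hypothetical unit id standing in for a laughter event

# Frame-level units for a neutral rendition (made-up values).
neutral_frames = [12, 12, 12, 7, 7, 40, 40, 40, 40]
neutral = collapse(neutral_frames)

# Hand-written stand-in for a translation model's output: one laughter
# unit is inserted and the durations (rhythm) are changed.
amused = [(12, 2), (LAUGH, 3), (7, 2), (40, 5)]
amused_frames = expand(amused)

print(neutral)
print(amused_frames)
```

Because the content is a plain token sequence, inserting, deleting, or substituting a vocalization is an ordinary edit on that sequence, while durations can be generated separately for the target emotion.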
Upon evaluation, our model achieved vastly higher quality compared with previous best emotional voice conversion models. In fact, results are very close in quality to the original audio, illustrated in light green in the chart below:
Unlike written exchanges, spoken dialogues take place in real time. That means even small perturbations in timing, such as those caused by transmission delays in a video call, can disrupt the smoothness and naturalness of spontaneous exchanges.
Research shows that in any given verbal conversation, speakers spontaneously interpret timing between speech units. People generally time their own speech with utmost precision. Gaps and pauses are informative, and overlapping speech or nonverbals can signal agreement, disagreement, or a willingness to take the floor.
Until now, such richness has been very difficult to address with AI, and required highly complex, carefully annotated data sets like the one pictured above.
Our approach was to model speech content, nonverbal vocalization, and timing in a holistic way. The key idea is simple: We modeled a dialogue as two parallel streams of speech units automatically derived as in GSLM. We used two identical transformers — one per stream, trained to predict its own future sets of units but informed about the states of the other transformer through cross-attention.
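The two-stream idea can be sketched in a few lines. The following is a simplified illustration, not our implementation: it uses random vectors in place of learned unit embeddings, omits learned projections, and assumes each stream attends only to the current and past states of the other stream. The function names are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    """Scaled dot-product attention of one query over a list of vectors.
    Keys and values are the memory vectors themselves (no learned projections)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, vec)) / scale for vec in memory]
    weights = softmax(scores)
    return [sum(w * vec[i] for w, vec in zip(weights, memory))
            for i in range(len(query))]

def cross_stream_step(own, other, t):
    """New state for one stream at step t: its own state, plus self-attention
    over its own past, plus cross-attention over the other stream's past."""
    query = own[t]
    self_ctx = attend(query, own[: t + 1])
    cross_ctx = attend(query, other[: t + 1])
    return [q + s + c for q, s, c in zip(query, self_ctx, cross_ctx)]

random.seed(0)
DIM, STEPS = 4, 6
speaker_a = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(STEPS)]
speaker_b = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(STEPS)]

# The same function serves both streams, mirroring the two identical transformers.
new_a = [cross_stream_step(speaker_a, speaker_b, t) for t in range(STEPS)]
new_b = [cross_stream_step(speaker_b, speaker_a, t) for t in range(STEPS)]
print(len(new_a), len(new_a[0]))
```

In this sketch, each speaker's prediction depends on what the other speaker has done so far, which is what lets overlaps and pauses emerge from the model rather than from hand-written turn-taking rules.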
A few examples of generations are found here. The model is prompted by 10 seconds of real conversation and continues with its own version of the chit-chat. The model is able to reproduce naturalistic distributions of gaps, turn durations, and overlaps.
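As an illustration of what these distributions measure, here is a small sketch that extracts turn durations, overlaps, and gaps from two per-frame voice-activity tracks. The activity tracks are made up, and the function names are hypothetical; this is not the evaluation code used in our experiments.

```python
from itertools import groupby

def runs(flags):
    """Lengths (in frames) of each consecutive run of truthy values."""
    return [len(list(group)) for value, group in groupby(flags) if value]

def turn_taking_stats(speaker_a, speaker_b):
    """Summarize a two-speaker dialogue given per-frame activity (1 = speaking)."""
    overlap = [bool(a and b) for a, b in zip(speaker_a, speaker_b)]
    silence = [not a and not b for a, b in zip(speaker_a, speaker_b)]
    return {
        "turns_a": runs(speaker_a),   # how long speaker A holds the floor
        "turns_b": runs(speaker_b),   # how long speaker B holds the floor
        "overlaps": runs(overlap),    # both speaking at once
        "gaps": runs(silence),        # nobody speaking
    }

# Made-up activity tracks, one value per frame.
speaker_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
speaker_b = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0]

stats = turn_taking_stats(speaker_a, speaker_b)
print(stats)
```

Comparing histograms of these run lengths for generated and real conversations is one simple way to check whether the timing looks natural.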
In the near future, we will focus on applying textless techniques to build useful downstream applications, such as question answering (e.g., “How’s the weather?”), without requiring either resource-intensive text labels or automatic speech recognition (ASR) systems. We believe prosody in speech can help better parse a sentence, which in turn facilitates understanding the intent and improves the performance of question answering.
One of these applications is speech-to-speech translation, or dubbing. Because traditional AI systems are so resource-intensive, dubbing is typically done by going through text: audio is converted to text, the text is translated, and the translated text is converted back to audio. This makes the entire system complicated and difficult to train. It also misses the expressivity of oral language, not just because intonations and nonverbal expressions are lost in text but also because language models are trained on text that is missing the idiomatic expressions specific to oral language. Because self-supervised speech representation approaches are able to learn discrete units from raw audio, it’s now possible to remove the need for text and replace it with the pseudo text extracted from each of the source and target languages.
We believe that textless NLP can outperform traditional composite systems (ASR+NLP) because of the possibility of integrating nonverbal vocalizations and prosodic information, which convey rich semantic and pragmatic information on top of phonemes and are typically not available in text.
Stay tuned for more on our efforts toward the new era of textless NLP.
The work discussed in this blog post reflects the contributions from Meta AI researchers Yossi Adi, Jade Copet, Emmanuel Dupoux, Wei-Ning Hsu, Evgeny Kharitonov, Felix Kreuk, Abdelrahman Mohamed, and Tu Anh Nguyen (listed in alphabetical order).