July 16, 2021
Facebook AI Research has built and open-sourced BlenderBot 2.0, the first chatbot that can simultaneously build long-term memory it can continually access, search the internet for timely information, and have sophisticated conversations on nearly any topic. It’s a significant update to the original BlenderBot, which we open-sourced in 2020 and which broke ground as the first to combine several conversational skills — like personality, empathy, and knowledge — into a single system.
When talking with people, BlenderBot 2.0 held longer, more knowledgeable, and more factually consistent conversations over multiple sessions than its predecessor, the existing state-of-the-art chatbot.
The model takes pertinent information gleaned during conversation and stores it in a long-term memory so it can then leverage this knowledge in ongoing conversations that may continue for days, weeks, or even months. The knowledge is stored separately for each person it speaks with, which ensures that no new information learned in one conversation is used in another.
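The per-person isolation described above can be sketched in a few lines. This is an illustrative Python sketch, not the actual BlenderBot 2.0 memory module; the class and method names are invented for illustration.

```python
# Illustrative sketch (not the BlenderBot 2.0 implementation): a long-term
# memory store partitioned by speaker, so facts learned from one person
# are never surfaced in another person's conversation.
from collections import defaultdict


class LongTermMemory:
    """Stores conversation-derived facts, keyed by speaker ID."""

    def __init__(self):
        self._store = defaultdict(list)  # speaker_id -> list of memory strings

    def write(self, speaker_id, memory):
        """Record a new fact learned during this speaker's conversation."""
        self._store[speaker_id].append(memory)

    def read(self, speaker_id, query=None):
        """Return this speaker's memories, optionally filtered by keyword.

        Only the given speaker's partition is consulted, mirroring the
        per-person isolation described in the post.
        """
        memories = self._store[speaker_id]
        if query is None:
            return list(memories)
        return [m for m in memories if query.lower() in m.lower()]


memory = LongTermMemory()
memory.write("alice", "Alice is a fan of Tom Brady.")
memory.write("bob", "Bob watches WandaVision.")
print(memory.read("alice"))           # only Alice's memories
print(memory.read("bob", "wanda"))    # keyword-filtered lookup
```

In the real system the stored memories are generated by a neural module and retrieved by a learned component rather than a keyword match; the point here is only the per-speaker partitioning.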
During conversation, the model can generate contextual internet search queries, read the results, and incorporate that information when responding to people’s questions and comments. This means the model stays up-to-date in an ever-changing world.
Today we’re releasing the complete model, code, and evaluation setup, as well as two new conversational data sets — human conversations bolstered by internet searches, and multisession chats with people that reference previous sessions — used to train the model, so other researchers can reproduce this work and advance conversational AI research.
Current language-generation models such as GPT-3 and Facebook AI’s first version of BlenderBot can articulately express themselves, at least in the context of ongoing conversations, and generate realistic-looking text. But they suffer from very short “goldfish memory,” and any long-term memory they do have is static — it’s limited to what they’ve been previously taught. They can never gain additional knowledge, which is why GPT-3 and BlenderBot believe that NFL superstar Tom Brady is still on the New England Patriots and don’t know that he won the 2021 Super Bowl with the Tampa Bay Buccaneers. Similarly, they know about past popular TV shows and movies but aren’t aware of new series, like WandaVision.
If you told GPT-3 or BlenderBot 1.0 something yesterday, it will have forgotten it by today. Worse, because of deficiencies in their algorithms, those models infamously hallucinate knowledge — that is, they confidently state information that isn’t correct.
Chatbots need not be hamstrung by these limitations, which is why we’re excited to announce that we’re releasing a new open source chatbot, BlenderBot 2.0, through our research platform ParlAI. With its ability to access memory and reduce hallucination, BlenderBot 2.0 builds on the original version of BlenderBot, the first chatbot to blend a diverse set of conversational skills — including empathy, knowledge, and personality — together in one system.
Research into language model generation is moving quickly, and as an industry, we have better tools than ever before for significantly expanding chatbots’ conversational abilities. While existing systems can ask and answer basic questions about things like food, movies, or bands, they typically struggle with more complex or freeform conversations, like, for example, discussing Tom Brady’s career in detail.
But technology based on BlenderBot 2.0 could one day become a useful part of everyday life: it can hold multisession conversations on any topic that last days, weeks, or even months, adding to what it knows and can talk about as the conversation evolves. That’s because it’s the first chatbot capable of generating internet search queries, using and building knowledge over time, and referring back to previous ideas. These advances, including the ability to build long-term memory and augment conversations with information from the internet, overcome some shortcomings of current systems. In testing, we found that BlenderBot 2.0 outperforms the conversational abilities of the best existing systems.
During conversations, BlenderBot 2.0 can query the internet using any search engine for relevant new knowledge and can both read and write to its long-term local memory store. In our research, we tested two approaches — a dump of the internet accessed via nearest neighbor lookup, and the Bing Search API. After searching the world’s information, it generates an appropriate conversational response based on what it has discovered. By accessing the internet and thus the ever-changing world, BlenderBot 2.0 is always up-to-date. This means our model can potentially incorporate into conversation the latest sports scores, movies, or TV shows that are playing right now, as well as the latest reviews, among the vast range of topics available on the internet.
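The search-then-respond loop can be sketched as below. Everything here is a toy stand-in: the query generator is a trivial keyword heuristic rather than the learned neural module, and the retrieval function stubs out both backends we tested (nearest-neighbor lookup over a web dump, or a search API).

```python
# Toy sketch of the search-augmented response loop, under the assumptions
# stated above. None of these functions are the real BlenderBot 2.0 modules.
def generate_search_query(context):
    """Stand-in for the learned query generator: keep capitalized terms."""
    words = [w.strip(".,?!") for w in context.split()]
    keywords = [w for w in words if w[:1].isupper()]
    return " ".join(keywords) or context


def search(query, index):
    """Stub retrieval: return documents sharing any query term."""
    terms = set(query.lower().split())
    return [doc for doc in index if terms & set(doc.lower().split())]


def respond(context, index):
    query = generate_search_query(context)
    docs = search(query, index)
    # In BlenderBot 2.0, the retrieved documents condition a seq2seq
    # generator; here we simply surface the top hit.
    return docs[0] if docs else "I couldn't find anything on that."


web_index = [
    "Tom Brady won the 2021 Super Bowl with the Tampa Bay Buccaneers.",
    "WandaVision is a 2021 Marvel television series.",
]
print(respond("Did you hear about Tom Brady?", web_index))
```

Because the index is consulted at response time rather than baked into model weights, refreshing the index (or pointing the stub at a live search API) is what keeps answers current.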
BlenderBot 2.0 also remembers the context of previous discussions. So, for example, if you talked about Tom Brady with it weeks ago, it could potentially bring up the NFL in future conversations, as it knows that’s a relevant topic to you. Similarly, if you’d talked about movies with it prior to this year’s Academy Awards, it might subsequently bring up the Oscar-winning Nomadland. Plus, because of BlenderBot 2.0’s ability to leverage knowledge, it is less likely than other systems (as measured in our experimental evaluations) to hallucinate.
BlenderBot 2.0 uses a model based on Facebook’s Retrieval Augmented Generation — an approach that enables generating dialogue responses that incorporate knowledge beyond that contained in the conversation itself. During conversation, the model, which combines an information retrieval component with a seq2seq generator, seeks relevant information both in its long-term memory and from documents it finds by searching the internet. To do this, we augment the traditional encoder-decoder architecture with an additional neural network module that generates relevant search queries given the conversational context. BlenderBot 2.0 then prepends the resulting knowledge to the conversational history, which is encoded using the Fusion-in-Decoder method. Finally, taking this encoded knowledge into account, the chatbot generates a response. Our method both pulls from the chatbot’s long-term memory store and decides what to add to it. This is achieved by using an additional neural module which generates the memory to be stored based on the conversational context.
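The Fusion-in-Decoder pattern above can be shown structurally: each retrieved passage is prepended to the conversational history and encoded independently, then the encodings are concatenated so a single decoder can attend over all of them jointly. The toy functions below stand in for the transformer encoder and decoder; only the data flow matches the description.

```python
# Structural sketch of Fusion-in-Decoder (FiD), with toy components
# standing in for transformer encoders/decoders.
def encode(text):
    """Toy encoder: lowercase word tokens (stands in for a transformer)."""
    return text.lower().split()


def fusion_in_decoder(history, passages):
    """Encode each (passage + history) pair separately, then concatenate.

    The decoder then attends over the fused sequence, so knowledge from
    every retrieved passage is available when generating the response.
    """
    encoded = [encode(p + " " + history) for p in passages]
    return [tok for enc in encoded for tok in enc]


history = "who won the super bowl"
passages = ["Brady won in 2021", "Buccaneers beat the Chiefs"]
fused = fusion_in_decoder(history, passages)
print(len(fused))  # one joint sequence spanning both passages
```

Encoding passages independently is what makes the approach scale: encoder cost grows linearly with the number of passages, while cross-passage reasoning happens only in the decoder.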
A current trend in machine learning models is to concentrate on training ever-larger models, which requires substantial computational resources. Those models attempt to store what they learn in their model weights. But storing the entire internet — which is always growing and changing — would seem to be next to impossible. Our method instead accesses the internet on the fly.
In order to train our neural networks, we collected training data specially for this purpose on a crowdsourcing platform. And today, we are also releasing the resulting conversational data sets, known as Wizard of the Internet and Multi-Session Chat:
Human conversations augmented with new information from internet searches (Wizard of the Internet)
Multisession, long-context chat with humans referencing knowledge from conversation sessions (Multi-Session Chat)
The first data set provides supervision for BlenderBot 2.0 on how to generate relevant search engine queries, as well as supervision of relevant responses based on the search results. The second data set provides supervision for the chatbot on which fresh knowledge to store in long-term memory, and supervision for relevant responses given those memories. We can thus perform multitask training combining the data sets, which enables BlenderBot 2.0 to act simultaneously with all these skills.
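One simple way to realize the multitask training described above is to interleave batches from the two data sets so a single model sees supervision for both skills. The sketch below uses a round-robin scheduler; the data set names mirror the release, but the example items and the scheduling policy are illustrative assumptions, not the actual training recipe.

```python
# Illustrative round-robin multitask scheduler over the two released
# data sets. Items are placeholders; a real loop would yield tensor batches.
import itertools

wizard_of_internet = ["search-query + response example 1",
                      "search-query + response example 2"]
multi_session_chat = ["memory-write + response example 1",
                      "memory-write + response example 2"]


def multitask_batches(*datasets):
    """Alternate between tasks, cycling each so no task runs out early."""
    iters = [itertools.cycle(ds) for ds in datasets]
    for it in itertools.cycle(iters):
        yield next(it)


stream = multitask_batches(wizard_of_internet, multi_session_chat)
first_four = [next(stream) for _ in range(4)]
print(first_four)  # alternates between the two tasks
```

Any balanced sampling scheme would serve the same purpose; the essential point is that one set of model weights receives gradients from both the search-supervision and memory-supervision tasks.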
We want our new chatbot to build on its predecessor’s abilities. BlenderBot 1.0 was trained on the Blended Skill Talk tasks — engaging use of personality, knowledge, and empathy — and blends all three seamlessly. So BlenderBot 2.0 was trained with all these resources as well.
BlenderBot 1.0 was already shown to outperform other chatbots such as Meena and DialoGPT. To evaluate our new model, we thus benchmarked it against BlenderBot 1.0, evaluating its long-term conversation performance over multisession chat as well as its ability to successfully employ knowledge in conversation.
We find that our system outperforms BlenderBot 1.0 when it comes to picking up where previous conversation sessions left off, with a 17 percent improvement in engagingness score, and 55 percent improvement in use of previous conversation sessions, according to human evaluators. These results show that the new system’s long-term memory component enables it to sustain better conversation over a longer period of time.
When testing for ability to use knowledge, we find that BlenderBot 2.0 reduces hallucinations from 9.1 percent to 3.0 percent, and is factually consistent across a conversation 12 percent more often. The new chatbot’s ability to proactively search the internet enables these performance improvements.
We take the safety of conversational agents seriously, particularly because they’re known to sometimes generate unsafe or offensive utterances. This can happen when a person deliberately feeds an agent prompts designed to elicit an objectionable response. We have conducted a large study of available techniques to mitigate this issue and developed two new methods: baked-in safety and robustness to difficult prompts. We are also co-organizing a series of dialogue safety workshops to help the research community make progress together. Our research shows that these approaches outperform existing techniques: implementing our safety recipes reduced offensive responses, as measured by an automated classifier, by 90 percent, while increasing the number of safe responses in conversations with real people by 74.5 percent. We include these safety recipes in our release. Further, an AI system that hallucinates something damaging or deceptive can cause real-world harm. In our tests with human evaluators, we determined that our methods alleviated this risk to some extent compared with previous methods.
But we know that safety issues are not yet solved, and BlenderBot 2.0’s approach of utilizing the internet and long-term memory to ground conversational responses brings new safety challenges. As a research community, we need to address them, and we believe reproducible research on safety, made possible by releases like this, will help the community make important new progress in this area together.
The model innovations we introduce here make important advances over current state-of-the-art systems, and we’re excited to see how other researchers build on BlenderBot 2.0’s new abilities to form long-term memory and add to its knowledge by searching the internet. Of course, there’s still work to be done. While we’ve reduced model hallucinations, some remain. Until models have deeper understanding, they will sometimes contradict themselves. Similarly, our models cannot yet fully understand what is safe or not. And while they build long-term memory, they don’t truly learn from it, meaning they don’t improve on their mistakes, though we’ve made some small strides in that area. Finally, we look forward to a day soon when agents built to communicate and understand as humans do can see as well as talk, a direction we’ve explored with Multimodal BlenderBot. Obvious next steps are to continue blending these latest techniques into a single AI system, which is the goal of our BlenderBot research. We think these improvements in chatbots can advance the state of the art in applications such as virtual assistants and digital friends. We hope this release and the corresponding data sets will help the community collectively make further progress in these and many other directions.