There is increasing interest in open-domain chatbots, which are built to communicate with humans on any topic, task, or domain. This interest has been fueled by fictional systems in the entertainment business (e.g., the movie “Her”), as well as by the media attention received by chatbots developed in the research labs of large tech companies, such as Google’s LaMDA and Facebook’s Blender.
The term “open-domain” suggests that these chatbots can converse on any topic, a goal assumed to be more challenging than earlier attempts at building task-specific systems. However, the boundaries of this “openness” and the criteria for evaluating such conversations are not well defined.
Typically, a human tester is given an empty prompt and asked to “just chat with the system.” This is a highly unusual setting for human communication: we do not talk about just anything with just anybody, anywhere, at random. Instead, we are selective about what we talk about, choosing topics according to our conversation partners and the context (e.g., at work or at school). Building a truly “open-domain” chatbot is therefore perhaps unrealistic, and not even necessary, since human-human conversations are not that “open-domain” or random either.
In a recent paper, we argue that the term “open-domain” might not be very useful, and that the way current open-domain chatbots are evaluated does not really test whether they can engage in all the various forms of conversation that humans engage in.
When we communicate as humans, we assume some form of common ground, that is, that we have some things in common with each other. Apart from cultural norms and (perhaps) shared experiences, one of the things we assume is some form of joint activity, or purpose of the conversation. Even when we strike up small talk with a stranger while waiting for the bus, we both know that this is the type of activity we are engaged in, which guides us in what might be appropriate to talk about in that context. When asked to “just chat” with a computer, there is no common ground or joint activity that we can assume.
One way to categorize joint activities in conversations is the notion of “speech events” introduced by Goldsmith & Baxter (1996), who recorded students’ everyday conversations over a few weeks and identified 39 speech events. These can be roughly grouped into Informal/Superficial Talk (e.g., “small talk”, “joking around”, “sports talk”, “gossip”, “getting to know someone”), Involving Talk (e.g., “making up”, “love talk”, “relationships talk”, “complaining”), and Goal-directed Talk (e.g., “group discussion”, “persuading conversation”, “decision-making conversation”, “interrogation”, “asking a favor”).
What kind of speech events do users of open-domain chatbots actually engage in when asked to “just chat”? To answer this question, we had two annotators label a random sample of the publicly available conversations with Google’s “open-domain” Meena chatbot according to their speech event category. The majority of the conversations (~88%) turned out to belong to the “small talk” category, despite the fact that the human testers were instructed to talk about anything, without any limitations on the topic.
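For intuition, here is a minimal sketch of what computing such a distribution could look like in Python. The label strings and counts below are purely illustrative placeholders, not data from the actual annotation study.

```python
from collections import Counter

# Hypothetical per-conversation speech-event labels; in the real study,
# two annotators labeled a random sample of the released Meena transcripts.
annotations = [
    "small talk", "small talk", "joking around", "small talk",
    "getting to know someone", "small talk", "small talk", "gossip",
]

counts = Counter(annotations)
total = len(annotations)

# Share of conversations per speech-event category, most frequent first.
for event, n in counts.most_common():
    print(f"{event}: {n / total:.0%}")
```

Running an analysis like this over the sampled conversations is what revealed that one category (“small talk”) dominates the supposedly unrestricted chats.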
As already mentioned, actual small talk also assumes some form of common ground, but it is perhaps the speech event most likely to take place given such minimal instructions. If this is the only speech event that takes place in those evaluations, how do we know that the chatbots are truly “open-domain”? Would current open-domain chatbots be able to engage in other speech activities?
To address this question, we performed a preliminary experiment with Facebook’s Blender chatbot. A (human) tester interacted with the chatbot in 16 of the speech event categories listed above. To set up a comparable context, the same tester also chatted with another human on the same topics. The two humans (i.e., the tester and the interlocutor) did not know each other in advance and were unaware of each other’s identities.
The resulting conversations (human-human vs. human-system) were compared and evaluated by third-party human judges. In general, the judges rated the human-human conversations higher on a number of evaluation criteria, explaining that the human-human conversations were more coherent and had a better flow than the human-chatbot conversations. This is in stark contrast to the evaluation presented in the paper describing Facebook’s Blender, where the judges could not clearly decide whether they preferred human-human or human-chatbot transcripts. That evaluation, however, relied on the “just chat” setup, which, as we have seen, mostly gives rise to small talk. Thus, what their evaluation really shows is that the Blender chatbot is fairly good at small talk, not that it is good at “open-domain” dialog.
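As a rough sketch of how such a side-by-side comparison can be summarized, the snippet below averages judge ratings per criterion for each condition. The criterion names, rating scale, and scores are hypothetical, not the ones used in our study.

```python
from statistics import mean

# Hypothetical third-party judge ratings (1-5) per evaluation criterion.
ratings = {
    "human-human":   {"coherence": [5, 4, 5], "flow": [4, 5, 4]},
    "human-chatbot": {"coherence": [3, 4, 3], "flow": [3, 3, 4]},
}

# Average each criterion per condition to compare the two types of transcripts.
for condition, criteria in ratings.items():
    summary = ", ".join(
        f"{criterion}: {mean(scores):.1f}" for criterion, scores in criteria.items()
    )
    print(f"{condition} -> {summary}")
```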
Since we did our study in 2021, new “open-domain” chatbots, such as Google’s LaMDA, have emerged, with many more parameters and trained on more data. We have not tested to what extent they can handle other forms of speech events, or whether they are truly “open-domain”, but as we have shown, current evaluations cannot answer this question.
A perhaps even more important question is whether the idea of an “open-domain” chatbot makes sense for us as humans at all. Instead, we should perhaps focus on conversational systems which are situated in human activities in a meaningful way, and where the user can assume some form of common ground and joint activity.
Source: bdtechtalks.com