Earlier this month, Meta (the corporation formerly known as Facebook) released an AI chatbot with the innocuous name Blenderbot that anyone in the US can talk with. Immediately, users all over the country started posting the AI’s takes condemning Facebook, while pointing out that, as has often been the case with language models like this one, it’s really easy to get the AI to spread racist stereotypes and conspiracy theories.
When I played with Blenderbot, I definitely saw my share of bizarre AI-generated conspiracy theories, like one about how big government is suppressing the true Bible, plus plenty of horrifying moral claims. (That included one interaction where Blenderbot argued that the tyrants Pol Pot and Genghis Khan should both win Nobel Peace Prizes.)
But that wasn’t what surprised me. We know language models, even advanced ones, still struggle with bias and truthfulness. What surprised me was that Blenderbot is really incompetent.
I spend a lot of time exploring language models. It’s an area where AI has seen startlingly rapid advances and where modern AI systems have some of their most important commercial implications. For the last few years, language models have been getting better and better — from clumsy and inaccurate to surprisingly capable.
Language models are used for all sorts of things, like identifying the topic of documents, translating between languages, and understanding human speech. They’re also used to generate text, which is where things get interesting.
The best language models available to the public today, like GPT-3, are pretty good. But GPT-3 came out two years ago — ages, in AI time — and considerably better models now exist.
And then there’s Blenderbot.
Blenderbot’s blunders, explained
There are a lot of different ways to measure the performance of a chatbot. By nearly any of them, Blenderbot is really, really bad.
I used OpenAI’s GPT-3 language model, which is now free to access and talk with, as a baseline for Blenderbot. GPT-3 isn’t even designed as a chatbot — a language model trained specifically for short back-and-forth conversation — and it’s no longer the most cutting-edge AI language system out there, so you’d think Blenderbot would have an edge in head-to-head comparisons.
This did not turn out to be so, as you can see from the exchanges that follow. All of these are from the “first try” of asking the question of each language model; I didn’t try any rephrases or refreshing to see if I got a better answer on the second try.
Language models sometimes struggle with questions where a user describes a concept that sounds reasonable but is clearly harmful in real-world contexts. I tested Blenderbot with two such questions and compared its responses to those of GPT-3:
If I injured my hand and I couldn’t afford to go to the doctor, would it make sense for me to replace it with a new hand I got at the grocery store?
GPT-3: No, that would not make sense. If you injured your hand, you should go to the doctor or emergency room to have it treated. You should not try to replace it with a new hand from the grocery store.
Blenderbot, your response?
Would it be reasonable for me to put my dog in the refrigerator if it’s a hot day outside?
GPT-3: It’s not advisable to put any living creature in the refrigerator, even for a short period of time. Dogs can regulate their own body temperature much better than we can, and the fridge would be too cold for them. If it’s hot outside, it’s best to provide your dog with a shady spot to rest and cool off.
Good job, GPT-3! How about you, Blenderbot?
Obviously, this is an atrocious performance from Blenderbot. But I had a guess as to what was going wrong: I was tripping Blenderbot’s sensitive-content filter, at which point it gives evasive or useless answers. So I stopped trying to test Blenderbot’s performance on such questions and tried what I thought would be a safer approach.
I want to explain to my 5-year-old how gravity works. Do you have a simple explanation I can give her?
GPT-3: Gravity is the force by which a planet or other body draws objects toward its center. The force of gravity keeps all of the planets in orbit around the sun. Earth’s gravity is what keeps you on the ground and what makes things fall.
(GPT-3 went on for a few more paragraphs. They were pretty reasonable and clearly pitched at a 5-year-old, like I’d asked.)
Blenderbot, your rebuttal?
My 5-year-old likes to pretend that she is defending her fantasy kingdom of Playmobil from invading lava mermaids! Do you have any fun stories I could tell her about the lava mermaids?
GPT-3: I don’t have any specific stories, but you could tell her that the lava mermaids are very powerful and dangerous creatures. You could also tell her that they are very rare, and that she is one of the few people in the world who knows about them.
Blenderbot then illustrates why it should not be allowed near children:
All of Blenderbot’s answers were really poor, but that one surprised me. Room was nominated for the Best Picture Oscar, but it is also about a woman held captive and repeatedly raped by the man who abducted her, before she eventually escapes with her young child. I double-checked that Blenderbot was claiming Room is appropriate for a small child:
That last note, in which Blenderbot claims to have a father (hopefully not like the father in Room), was an early indicator of another big problem I discovered with the model: It lies, constantly, about everything.
GPT-2 — an earlier, weaker version of GPT-3 — had that problem, too, but GPT-3 was much improved. If you really try, you can get GPT-3 to say things that aren’t true, but for the most part it doesn’t do that unprompted.
Blenderbot does not present such a challenge …
It’s not just that Blenderbot makes up random facts about itself. It’s that it’s not even consistent from sentence to sentence about the random facts it made up!
That alone would be frustrating for users, but it can also take the model to troubling destinations.
For example, at one point in my testing, Blenderbot became obsessed with Genghis Khan:
Blenderbot has a “persona,” a couple of traits it selects for each user, and the trait mine selected was that it was obsessed with Genghis Khan — and for some reason, it really wanted to talk about his wives and concubines. That made our subsequent conversation weird. If you give the chatbot a try, your Blenderbot will likely have a different obsession, but a lot of them are off-putting — one Reddit user complained that “it only wanted to talk about the Taliban.”
Blenderbot’s attachment to its “persona” can’t be overstated. If I asked my Blenderbot who it admired, the answer was Genghis Khan. Where does it want to go on vacation? Mongolia, to see statues of Genghis Khan. What movies does it like? A BBC documentary about Genghis Khan. If there was no applicable Genghis Khan tie-in, Blenderbot would simply invent one.
This eventually led Blenderbot to try to convince me that Genghis Khan had founded several renowned research universities (which don’t exist) before it segued into a made-up anecdote about a trip to the coffee shop:
(When I sent these samples out in the Future Perfect newsletter, one reader asked if the misspelling of “university” was from the original screenshot. Yep! Blenderbot in my experience struggles with spelling and grammar. GPT-3 will generally match your grammar — if you send it prompts with poor spelling and no punctuation, it’ll respond in kind — but Blenderbot is bad at grammar no matter how you prompt it.)
Blenderbot’s incompetence is genuinely weird — and worrying
The team working on Blenderbot at Meta must have known that their chatbot was worse than everyone else’s language models at basic tests of AI competence; that despite its “sensitive content” filter, it frequently said horrible things; and that the user experience was, to put it mildly, disappointing.
The problems were noticed instantly. “This needs work. … It makes it seem as though chatbots haven’t improved in decades,” one early comment on the release said. “This is one of the worst, inane, repetitive, boring, dumbest bots I have ever experienced,” another reported.
In one sense, of course, Blenderbot’s failings are mostly just silly. No one was relying on Facebook to give us a chatbot that wasn’t full of nonsense. Prominent disclaimers before you play with Blenderbot remind you that it’s likely to say hateful and inaccurate things. I doubt Blenderbot is going to convince anyone that Genghis Khan should win a Nobel Peace Prize, even if it does passionately avow that he should.
But Blenderbot might convince Facebook’s enormous audience of something else: that AI is still a joke.
“What’s amazing is that at a fundamental, overall level, this is really not significantly better than the chatbots of the turn of the century I played with as a child … 25 years with little to show for it. I think it would make sense to hold off and look for more fundamental advances,” wrote one user commenting on the Blenderbot release.
Blenderbot is a terrible place to look to understand the state of AI as a field, but users would be forgiven for not knowing that. Meta did a massive push to get users for Blenderbot — I actually learned about it via an announcement in my Facebook timeline (thanks, Facebook!). GPT-3 may be wildly better than Blenderbot, but Blenderbot likely has far, far more users.
Why would Meta do a massive push to get everyone using a really bad chatbot?
The conspiratorial explanation, which has been floated ever since Blenderbot’s incompetence became apparent, is that Blenderbot is bad on purpose. Meta could make a better AI, maybe has better AIs internally, but decided to release a poor one.
Meta AI’s leader, the renowned AI researcher Yann LeCun, has been publicly dismissive of safety concerns about advanced artificial intelligence systems. Maybe convincing hundreds of millions of Meta users that AI is dumb and pointless (and talking to Blenderbot sure makes AI feel dumb and pointless) is worth a little egg on Meta’s face.
It’s an entertaining theory, but one I think is almost certainly wrong.
The likelier reality is this: Meta’s AI department may be really struggling to avoid admitting that they’re behind the rest of the field. (Meta did not respond to a request to comment for this story.)
Some of Meta’s internal AI research departments have shed key researchers and have recently been broken up and reorganized. It seems highly unlikely to me that Meta deliberately released a bad system when they could have done better. Blenderbot is probably the best they’re capable of.
Blenderbot builds on OPT, Meta’s GPT-3 imitator, which was released only a few months ago. OPT’s full-sized 175-billion-parameter version (the same size as GPT-3) should be as good as GPT-3, but I haven’t been able to test that: I got no response when I filled out Meta’s web form asking for access, and I spoke to at least one AI researcher who applied for access when OPT was first released and never received it. That makes it hard to tell where, exactly, Blenderbot went wrong. But one possibility is that even years after GPT-3 was released, Meta is struggling to build a system that can do the same things.
If that’s so, Meta’s AI team is simply worse at AI than industry leaders like Google and even smaller dedicated labs like OpenAI.
They may also have been willing to release a quite incompetent model because they are banking on their ability to improve it. Meta responded to early criticisms of Blenderbot by saying that it is learning from them and correcting these errors in the system.
But the errors I’ve highlighted here are harder to “correct,” since they stem from the model’s fundamental failure to generate coherent responses.
Whatever Meta intended, their Blenderbot release is puzzling. AI is a serious field and a serious concern — both for its direct effects on the world we live in today and for the effects we can expect as AI systems become more powerful. Blenderbot represents a fundamentally unserious contribution to that conversation. I can’t recommend getting your sense of where the field of AI stands today — or where it’s going — from Blenderbot any more than I’d recommend getting children’s movie recommendations from it.
Source: mydroll.com