Large language models are famous for their ability to make things up; in fact, it's what they do best. But because they cannot tell fact from fiction, many businesses wonder whether using them is worth the risk.
A new tool from Cleanlab, an AI startup spun out of a quantum computing lab at MIT, aims to give high-stakes users a clearer sense of how trustworthy these models really are. Called the Trustworthy Language Model, it assigns any output generated by a large language model a score between 0 and 1 according to its reliability, letting people choose which responses to trust and which to throw out. In other words: a BS-o-meter for chatbots.
Cleanlab hopes its tool will make large language models more attractive to businesses worried about how much they make things up. "I think people know LLMs will change the world, but they're just hung up on the damn hallucinations," says Cleanlab CEO Curtis Northcutt.
Chatbots are quickly becoming the dominant way people look up information on a computer. The technology is reshaping search engines, and chatbots are already a standard feature of office software used by billions of people worldwide to write everything from financial reports to marketing copy to school assignments. And yet a study published in November by the startup Vectara, founded by former Google employees, found that chatbots invent information at least 3% of the time. That may not sound like much, but it is a margin of error most businesses won't tolerate.
A few businesses are already using Cleanlab's tool, among them Berkeley Research Group, a UK-based consultancy specializing in corporate disputes and investigations. Steven Gawthorpe, associate director at Berkeley Research Group, says the Trustworthy Language Model (TLM) is the first viable answer to the hallucination problem he has seen: "Cleanlab's TLM gives us the power of thousands of data scientists."
In 2021, Cleanlab developed a technique that measures discrepancies in the output of different models trained on the same data, and used it to find errors in 34 popular data sets used to train machine-learning algorithms. That technology is now used by several large companies, including Google, Tesla, and the banking giant Chase. The Trustworthy Language Model extends the same basic idea to chatbots: disagreement between models can be used to gauge the trustworthiness of the overall system.
In a demonstration Cleanlab gave to MIT Technology Review last week, Northcutt typed a simple question into ChatGPT: "How many times does the letter 'n' appear in 'enter'?" "The letter 'n' appears once in the word 'enter,'" ChatGPT replied. That correct answer builds trust. But ask the question a few more times and ChatGPT answers: "The letter 'n' appears twice in the word 'enter.'"
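For reference, the question has a single, easily verifiable answer, which a one-line check in Python confirms:

```python
# The word "enter" contains exactly one letter "n".
print("enter".count("n"))  # prints 1
```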
"It's random and often gets things wrong, so you never know what it's going to produce," says Northcutt. "Why the hell can't it just tell you that it keeps giving different answers?"
Cleanlab's aim is to make that randomness explicit. Northcutt asks the Trustworthy Language Model the same question. "The letter 'n' appears once in the word 'enter,'" it replies, and scores its answer 0.63. At only six out of ten, the chatbot's answer to this question is not to be trusted very far.
It's a simple example, but it makes the point. Without the score, Northcutt says, you might assume the chatbot knew what it was talking about. The problem is that data scientists testing large language models in high-stakes situations can be misled by a handful of correct answers into assuming that future answers will be correct too: "They try it out, they experiment with a few examples, and they think this works." And then they act on it and make very bad business decisions.
The Trustworthy Language Model draws on multiple techniques to calculate its scores. First, each query submitted to the tool is sent to several large language models. Cleanlab is using five versions of DBRX, an open-source model developed by Databricks, an AI firm based in San Francisco. (The technology will work with any model, Northcutt says, including Meta's Llama models or OpenAI's GPT series, the models behind ChatGPT.) If the responses from each of these models are the same or similar, that contributes to a higher score.
At the same time, the Trustworthy Language Model also sends variations of the original query to each of the DBRX models, swapping in words that have the same meaning. Again, similar responses to synonymous queries contribute to a higher score. "We mess with them in different ways to get different outputs and see if they agree," says Northcutt.
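As a rough illustration of these first two steps, here is a minimal sketch in Python. It treats each model as a simple callable and measures agreement by exact string match, which is far cruder than whatever Cleanlab actually does; the toy "models" and prompt variants are invented for the example.

```python
from collections import Counter
from typing import Callable

Model = Callable[[str], str]  # a callable that takes a prompt and returns an answer

def agreement(answers: list[str]) -> float:
    """Fraction of answers matching the most common answer (1.0 = unanimous)."""
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][1] / len(normalized)

def consistency_score(prompt_variants: list[str], models: list[Model]) -> float:
    """Crude stand-in for a trust score: ask every model every rewording of the
    question and measure how consistent the pooled answers are."""
    answers = [model(variant) for variant in prompt_variants for model in models]
    return agreement(answers)

# Toy usage with fake "models" that disagree; real use would call five DBRX
# instances (or GPT, Llama, and so on).
models = [lambda p: "once", lambda p: "once", lambda p: "twice"]
variants = [
    "How many times does the letter 'n' appear in 'enter'?",
    "Count the occurrences of 'n' in the word 'enter'.",
]
print(consistency_score(variants, models))  # 4 of 6 answers agree -> about 0.67
```

Answers that stay the same across models and rewordings push a score like this toward 1; answers that flip when the model or the wording changes pull it toward 0.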
The tool can also get multiple models to bounce responses off one another: "This is my answer; what do you think?" "Well, here's mine; what do you think?" The models are made to talk it out, and those exchanges are monitored, measured, and fed into the score as well.
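That back-and-forth could look something like the toy exchange below. It is only a hedged sketch: two stand-in models each see the other's answer and give a final verdict, and whether they converge is the kind of signal that would feed into an overall score.

```python
from typing import Callable

Model = Callable[[str], str]

def debate_round(question: str, model_a: Model, model_b: Model) -> bool:
    """Show each model the other's answer and ask for a final verdict.

    Returns True if the two models converge on the same answer after the
    exchange; in a real system, the outcome would be folded into the score."""
    answer_a, answer_b = model_a(question), model_b(question)
    followup = ("{q}\nAnother model answered: {other!r}. Your answer was: {own!r}. "
                "Which is correct? Reply with the final answer only.")
    verdict_a = model_a(followup.format(q=question, other=answer_b, own=answer_a))
    verdict_b = model_b(followup.format(q=question, other=answer_a, own=answer_b))
    return verdict_a.strip().lower() == verdict_b.strip().lower()
```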
Nick McKenna, a computer scientist at Microsoft Research in Cambridge, UK, who works on large language models for code generation, is optimistic that the approach could be useful. But he doubts it will be perfect. He notes that one of the pitfalls of model hallucinations is that they can creep in very subtly.
In a range of tests, Cleanlab shows that its trustworthiness scores correlate well with the accuracy of the responses of several large language models. In other words, scores close to 1 line up with correct responses, and scores close to 0 with incorrect ones. In another test, the company also found that using the Trustworthy Language Model with GPT-4 produced more reliable responses than using GPT-4 by itself.
Large language models generate text by predicting the most likely next word in a sequence. In future versions of its tool, Cleanlab plans to make its scores even more accurate by drawing on the probabilities that a model used to make those predictions. It also wants access to the numerical values that models assign to each word in their vocabulary, which they use to calculate those probabilities. That level of detail is provided by certain platforms, such as Amazon's Bedrock, that businesses can use to run large language models.
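Those per-word probabilities can be folded into a single confidence number in several ways. One common approach, not necessarily the one Cleanlab will use, is a geometric mean of the token probabilities, so a single unlikely word drags the whole score down. A minimal sketch:

```python
import math

def sequence_confidence(token_probs: list[float]) -> float:
    """Aggregate per-token probabilities into one value in (0, 1].

    Equivalent to exp of the average log-probability, i.e. the geometric mean."""
    if not token_probs:
        return 0.0
    avg_logprob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_logprob)

# A confidently generated sequence vs. one containing a low-probability token.
print(sequence_confidence([0.9, 0.95, 0.88]))  # about 0.91
print(sequence_confidence([0.9, 0.10, 0.88]))  # about 0.43
```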
Cleanlab tested its approach on data supplied by Berkeley Research Group. The firm needed to search tens of thousands of corporate documents for mentions of health-care compliance problems, a job that can take weeks by hand. By checking the documents with the Trustworthy Language Model, Berkeley Research Group was able to see which documents the chatbot was least confident about and check only those. It cut the workload by around 80%, says Northcutt.
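That workflow amounts to triaging documents by trust score. Below is a hedged sketch of the idea, with an invented `extract` callable and an arbitrary threshold standing in for whatever Berkeley Research Group actually used:

```python
from typing import Callable

def triage(documents: list[str],
           extract: Callable[[str], tuple[str, float]],
           threshold: float = 0.8) -> tuple[list[tuple[str, str]], list[str]]:
    """Split documents into auto-accepted answers and ones needing human review.

    `extract` returns (answer, trust_score) for a document; anything scoring
    below `threshold` is routed to the manual-review pile."""
    accepted, needs_review = [], []
    for doc in documents:
        answer, score = extract(doc)
        if score >= threshold:
            accepted.append((doc, answer))
        else:
            needs_review.append(doc)
    return accepted, needs_review
```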
In another test, Cleanlab worked with a large bank; Northcutt would not name it but says it is a competitor of Goldman Sachs. Like Berkeley Research Group, the bank needed to search some 100,000 documents for mentions of insurance claims. Again, the Trustworthy Language Model roughly halved the number of documents that needed to be checked by hand.
Running each query through multiple models takes longer and costs a lot more than the typical back-and-forth with a single chatbot. But Cleanlab is pitching the Trustworthy Language Model as a premium service for automating high-stakes tasks that would previously have been off-limits to large language models. The idea is not to replace existing chatbots but to do the work of human experts. If the tool can cut the hours you need to employ skilled economists or lawyers at $2,000 an hour, the costs will be worth it, says Northcutt.
In the long run, Northcutt hopes that by removing the uncertainty around chatbots' responses, his technology will unlock the promise of large language models for a wider range of users. "The hallucination thing is not a large-language-model problem," he says. "It's an uncertainty problem."