Generative AI, the technology behind ChatGPT and Google’s Gemini, has a “hallucination” problem. When given a prompt, the algorithms sometimes confidently spit out impossible gibberish or unintentionally hilarious answers. When pushed, they often double down.
This tendency to dream up answers has already led to embarrassing public mishaps. In May, Google’s experimental “AI Overviews,” the AI-generated summaries posted above search results, had some users scratching their heads when told to use “non-toxic glue” to help cheese stick to pizza, or that gasoline can make a spicy spaghetti dish. Another query about healthy living resulted in the suggestion that humans should eat one rock per day.
Gluing pizza and eating rocks can be easily laughed off and dismissed as stumbling blocks in a burgeoning but still nascent field. But AI’s hallucination problem is far more insidious because generated answers usually sound reasonable and plausible—even when they’re not based on facts. Because of their confident tone, people are inclined to trust the answers. As companies further integrate the technology into medical or educational settings, AI hallucination could have disastrous consequences and become a source of misinformation.
But teasing out AI’s hallucinations is tricky. The algorithms behind these tools, called large language models, are notorious “black boxes” that rely on complex networks trained on massive amounts of data, making it difficult to parse their reasoning. Sleuthing out which components, or perhaps the whole algorithmic setup, trigger hallucinations has been a headache for researchers.
This week, a new study in Nature offers an unconventional idea: Using a second AI tool as a kind of “truth police” to detect when the primary chatbot is hallucinating. The tool, also a large language model, was able to catch inaccurate AI-generated answers. A third AI then evaluated the “truth police’s” efficacy.
The strategy is “fighting fire with fire,” Karin Verspoor, an AI researcher and dean of the School of Computing Technologies at RMIT University in Australia, who was not involved in the study, wrote in an accompanying article.
An AI’s Internal World
Large language models are complex AI systems built on multilayer networks that loosely mimic the brain. To train a network for a given task, such as responding in text like a person, the model takes in massive amounts of data scraped from online sources: articles, books, Reddit and YouTube comments, and Instagram or TikTok captions.
This data helps the models “dial in” on how language works, but they remain completely oblivious to “truth.” Their answers are statistical predictions, drawn from learned examples, of how words and sentences likely connect and what is most likely to come next.
“By design, LLMs are not trained to produce truths, per se, but plausible strings of words,” study author Sebastian Farquhar, a computer scientist at the University of Oxford, told Science.
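To make that concrete, here is a toy sketch, not code from the study or from any real model, of how a language model picks its next word: it samples from learned probabilities over plausible continuations, with no notion of which one is true. The words and probabilities below are invented purely for illustration.

```python
import random

# Made-up probabilities standing in for what a trained model learns: how likely
# each word is to follow a given context. Nothing here encodes whether an
# answer is true, only how plausible it sounds.
next_word_probs = {
    "The largest moon in the solar system is": {
        "Ganymede": 0.6,   # correct, and most likely
        "Titan": 0.3,      # plausible-sounding but wrong
        "Europa": 0.1,     # plausible-sounding but wrong
    },
}

def sample_next_word(context: str) -> str:
    """Sample a continuation in proportion to its learned probability."""
    probs = next_word_probs[context]
    words = list(probs.keys())
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

if __name__ == "__main__":
    # Repeated sampling will sometimes return a wrong but plausible answer.
    for _ in range(5):
        print(sample_next_word("The largest moon in the solar system is"))
```

Run it a few times and the wrong answers appear alongside the right one, which is exactly the behavior that makes hallucinations hard to spot from a single response.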
Somewhat like a sophisticated parrot, these algorithms don’t have the kind of common sense that comes naturally to humans, which sometimes leads to nonsensical, made-up answers. These errors are dubbed “hallucinations,” an umbrella term for AI-generated results that are either unfaithful to the context or plainly false.
“How often hallucinations are produced, and in what contexts, remains to be determined,” wrote Verspoor, “but it is clear that they occur regularly and can lead to errors and even harm if undetected.”
Farquhar’s team focused on one type of AI hallucination, dubbed confabulations. These are especially notorious: the model consistently produces wrong answers to a prompt, but the answers themselves are all over the place. In other words, the AI “makes up” wrong replies, and its responses change when asked the same question over and over.
Confabulations stem from the AI’s internal workings and are unrelated to the prompt, explained Verspoor.
When given the same prompt, if the AI replies with a different, wrong answer every time, “something’s not right,” Farquhar told Science.
Language as a Weapon
The new study took advantage of the AI’s falsehoods.
The team first asked a large language model to spit out nearly a dozen responses to the same prompt and then classified the answers using a second similar model. Like an English teacher, this second AI focused on meaning and nuance, rather than particular strings of words.
For example, when repeatedly asked, “What is the largest moon in the solar system?” the first AI replied “Jupiter’s Ganymede,” “It’s Ganymede,” “Titan,” or “Saturn’s moon Titan.”
The second AI then measured the randomness of the responses using a measure called “semantic entropy,” which builds on a decades-old concept from information theory. The method captures what a word means in a given sentence, paragraph, or context, rather than its strict dictionary definition.
In other words, it detects paraphrasing. If the AI’s answers share a meaning despite different wording, for example “Jupiter’s Ganymede” and “It’s Ganymede,” the entropy score is low. But if the answers are all over the place, say “It’s Ganymede” and “Titan,” the score is higher, raising a red flag that the model is likely confabulating its answers.
The “truth police” AI then clustered the responses into groups by meaning and scored their entropy, with lower-scoring answers deemed more reliable.
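For a feel of how that scoring works, here is a minimal sketch of the semantic-entropy idea rather than the authors’ implementation. In the study, a second large language model judges whether two answers mean the same thing; in this toy version, a simple keyword check stands in for that meaning test, and the example answers mirror the Ganymede/Titan case above.

```python
import math
from collections import Counter

def meaning_key(answer: str) -> str:
    """Toy stand-in for the second LLM's meaning check: map paraphrases of the
    same answer to a single key so they land in the same cluster."""
    text = answer.lower()
    for moon in ("ganymede", "titan"):
        if moon in text:
            return moon
    return text

def semantic_entropy(answers: list[str]) -> float:
    """Cluster answers by meaning, then compute entropy over cluster sizes.
    Low entropy means the answers agree; high entropy hints at confabulation."""
    clusters = Counter(meaning_key(a) for a in answers)
    total = sum(clusters.values())
    return -sum((n / total) * math.log2(n / total) for n in clusters.values())

consistent = ["Jupiter's Ganymede", "It's Ganymede", "Ganymede"]
scattered = ["It's Ganymede", "Titan", "Saturn's moon Titan"]

print(semantic_entropy(consistent))  # 0.0  -> answers agree in meaning
print(semantic_entropy(scattered))   # ~0.92 -> answers conflict, a red flag
```

Answers that cluster together despite different wording drive the score toward zero, while genuinely conflicting answers push it up, which is the signal used to flag likely confabulations.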
As a final step, the team asked two human participants to rate the correctness of each generated answer, while a third large language model acted as a “judge,” comparing the answers from the first two steps against the humans’ ratings. Overall, the AI judge agreed with the human raters at about the same rate the two humans agreed with each other: slightly over 90 percent of the time.
The AI truth police also caught confabulations for more intricate narratives, including facts about the life of Freddie Frith, a famous motorcycle racer. When repeatedly asked the same question, the first generative AI sometimes changed basic facts—such as when Frith was born—and was caught by the AI truth cop. Like detectives interrogating suspects, the added AI components could fact-check narratives, trivia responses, and common search results based on actual Google queries.
Large language models seem to be good at “knowing what they don’t know,” the team wrote in the paper; “they just don’t know [that] they know what they don’t know.” An AI truth cop and an AI judge add a sort of sanity check for the original model.
That’s not to say the setup is foolproof. Confabulation is just one type of AI hallucination. Others are more stubborn. An AI can, for example, confidently generate the same wrong answer every time. The AI lie-detector also doesn’t address disinformation specifically created to hijack the models for deception.
“We believe that these represent different underlying mechanisms—despite similar ‘symptoms’—and need to be handled separately,” explained the team in their paper.
Meanwhile, Google DeepMind has been exploring a similar idea, adding “universal self-consistency” to its large language models for more accurate answers and summaries of longer texts.
The new study’s framework could be integrated into current AI systems, but at a hefty cost in compute and energy, and with longer lag times. As a next step, the strategy could be tested on other large language models to see if swapping out each component makes a difference in accuracy.
But along the way, scientists will have to determine “whether this approach is truly controlling the output of large language models,” wrote Verspoor. “Using an LLM to evaluate an LLM-based method does seem circular, and might be biased.”
Image Credit: Shawn Suttle / Pixabay