Date: 2025-10-10

Despite their usefulness, large language models still have a reliability problem. A new study shows that a team of AIs working together can score up to 97 percent on US medical licensing exams, outperforming any single AI.

While recent progress in large language models (LLMs) has led to systems capable of passing professional and academic tests, their performance remains inconsistent. They’re still prone to hallucinations (plausible-sounding but incorrect statements), which has limited their use in high-stakes areas like medicine and finance.

Nonetheless, LLMs have scored impressive results on medical exams, suggesting the technology could be useful in this area if its inconsistencies can be controlled. Now, researchers have shown that getting a “council” of five AI models to deliberate over their answers, rather than working alone, can lead to record-breaking scores on the US Medical Licensing Examination (USMLE).

“Our study shows that when multiple AIs deliberate together, they achieve the highest-ever performance on medical licensing exams,” Yahya Shaikh, from Johns Hopkins University, said in a press release. “This demonstrates the power of collaboration and dialogue between AI systems to reach more accurate and reliable answers.”

The researchers’ approach takes advantage of a quirk in the models, rooted in the non-deterministic way they come up with responses. Ask the same model the same medical question twice, and it might produce two different answers—sometimes correct, sometimes not.

In a paper in PLOS Medicine, the team describes how they harnessed this characteristic to create their AI “council.” They spun up five instances of OpenAI’s GPT-4 and prompted them to discuss answers to each question in a structured exchange overseen by a facilitator algorithm.

When their responses diverged, the facilitator summarized the differing rationales and got the group to reconsider the answer, repeating the process until consensus emerged.
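The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's implementation: the council members here are hypothetical stub functions standing in for GPT-4 instances, and the names `facilitate` and `make_stub`, the `max_rounds` cap, and the majority-vote fallback are all invented for the example.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Reply = Tuple[str, str]  # (answer choice, rationale)
# Hypothetical member interface: (question, facilitator summary) -> reply
Member = Callable[[str, Optional[str]], Reply]


def facilitate(question: str, council: List[Member],
               max_rounds: int = 5) -> Tuple[str, int]:
    """Ask every member, and re-ask with a summary until all answers agree."""
    summary: Optional[str] = None
    replies: List[Reply] = []
    for round_no in range(1, max_rounds + 1):
        replies = [member(question, summary) for member in council]
        answers = {answer for answer, _ in replies}
        if len(answers) == 1:  # consensus reached
            return answers.pop(), round_no
        # Summarize the differing rationales for the next round.
        summary = "\n".join(f"{a}: {r}" for a, r in replies)
    # Assumption: fall back to a majority vote if consensus never emerges.
    majority = Counter(a for a, _ in replies).most_common(1)[0][0]
    return majority, max_rounds


def make_stub(initial: str) -> Member:
    """Stub member: gives a fixed first answer, then sides with the answer
    that appears most often in the facilitator's summary."""
    def member(question: str, summary: Optional[str]) -> Reply:
        if summary is None:
            return initial, f"initial reasoning for {initial}"
        votes = Counter(line.split(":")[0] for line in summary.splitlines())
        choice = votes.most_common(1)[0][0]
        return choice, f"persuaded toward {choice}"
    return member


council = [make_stub(a) for a in ["B", "B", "A", "B", "C"]]
answer, rounds = facilitate("Which drug is first-line?", council)
print(answer, rounds)
```

In a real system each stub would be replaced by a call to a language model, and the summary would be written by the facilitator rather than built by joining strings. The round cap is there because nothing guarantees that independent models will ever converge.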