AI and Scientists Face Off to See Who Can Come Up With the Best Ideas

Scientific breakthroughs rely on decades of diligent work and expertise, sprinkled with flashes of ingenuity and, sometimes, serendipity.

What if we could speed up this process?

Creativity is crucial when exploring new scientific ideas. It doesn’t come out of the blue: Scientists spend decades learning about their field. Each piece of information is like a puzzle piece that can be reshuffled into a new theory—for example, insights into how different anti-aging treatments converge, or how the immune system influences dementia or cancer, can point the way to new therapies.

AI tools could accelerate this. In a preprint study, a team from Stanford pitted a large language model (LLM)—the type of algorithm behind ChatGPT—against human experts in the generation of novel ideas over a range of research topics in artificial intelligence. Each idea was evaluated by a panel of human experts who didn’t know if it came from AI or a human.

Overall, ideas generated by AI were more out-of-the-box than those from human experts. They were also rated as less feasible. That’s not necessarily a problem. New ideas always come with risks. In a way, the AI reasoned like human scientists willing to pursue high-risk, high-reward ideas: grounded in previous research, but just a bit more creative.

The study, almost a year long, is one of the biggest yet to vet LLMs for their research potential.

The AI Scientist

Large language models, the AI algorithms taking the world by storm, are galvanizing academic research.

These algorithms scrape data from the digital world, learn patterns in the data, and use these patterns to complete a variety of specialized tasks. Some algorithms are already aiding research scientists. Some can solve challenging math problems. Others are “dreaming up” new proteins to tackle some of our worst health problems, including Alzheimer’s and cancer.

Although helpful, these tools only assist in the later stages of research—that is, once scientists already have ideas in mind. What about having an AI generate new ideas in the first place?

AI can already help draft scientific articles, generate code, and search scientific literature. These steps are akin to when scientists first begin gathering knowledge and form ideas based on what they’ve learned.

Some of these ideas are highly creative, in the sense that they could lead to out-of-the-box theories and applications. But creativity is subjective. One way to gauge the potential impact, and other qualities, of research ideas is to call in a human judge blinded to the experiment.

“The best way for us to contextualize such capabilities is to have a head-to-head comparison” between AI and human experts, study author Chenglei Si told Nature.

The team recruited over 100 computer scientists with expertise in natural language processing to come up with ideas, act as judges, or both. These experts are especially well-versed in how computers can communicate with people using everyday language. The team pitted 49 participants against a state-of-the-art LLM based on Anthropic’s Claude 3.5. The scientists earned $300 per idea plus an additional $1,000 if their idea scored in the top 5 overall.

Creativity, especially when it comes to research ideas, is hard to evaluate. The team used two measures. First, they looked at the ideas themselves. Second, they asked AI and participants to produce writeups simply and clearly communicating the ideas—a bit like a school report.

They also tried to reduce AI “hallucinations”—when a bot strays from the factual and makes things up.

The team trained their AI on a vast catalog of research articles in the field and asked it to generate ideas in each of seven topics. To sift through the generated ideas and choose the best ones, the team engineered an automatic “idea ranker” based on past review scores and publication-acceptance data from a popular computer science conference.

The Human Critic

To make it a fair test, the judges didn’t know which responses were from AI. To disguise them, the team translated submissions from humans and AI into a generic tone using another LLM. The judges evaluated ideas on novelty, excitement, and—most importantly—if they could work.

After aggregating reviews, the team found that, on average, ideas generated by human experts were rated less exciting than those from AI, but more feasible. As the AI generated more ideas, however, they became less novel, with duplicates appearing more and more often. Digging through the AI’s nearly 4,000 ideas, the team found around 200 unique ones that warranted more exploration.

But many weren’t reliable. Part of the problem stems from the fact that the AI made unrealistic assumptions. It hallucinated ideas that were “ungrounded and independent of the data” it was trained on, wrote the authors. The LLM generated ideas that sounded new and exciting but weren’t necessarily practical for AI research, often because of latency or hardware problems.

“Our results indeed indicated some feasibility trade-offs of AI ideas,” wrote the team.

Novelty and creativity are also hard to judge. The study tried to keep judges from telling AI and human submissions apart by rewriting both with an LLM, but, like a game of telephone, that rewriting may have subtly changed length or wording in ways that influenced how the judges perceived submissions—especially when it came to novelty. Also, the human researchers were given limited time to come up with ideas, and they admitted their submissions were about average compared to their past work.

The team agrees there’s more to be done when it comes to evaluating AI generation of new research ideas. They also suggested AI tools carry risks worthy of attention.

“The integration of AI into research idea generation introduces a complex sociotechnical challenge,” they said. “Overreliance on AI could lead to a decline in original human thought, while the increasing use of LLMs for ideation might reduce opportunities for human collaboration, which is essential for refining and expanding ideas.”

That said, new forms of human-AI collaboration, including AI-generated ideas, could be useful for researchers as they investigate and choose new directions for their research.

Image Credit: Calculator Land / Pixabay

Shelly Fan
Shelly Xuelai Fan is a neuroscientist-turned-science writer. She completed her PhD in neuroscience at the University of British Columbia, where she developed novel treatments for neurodegeneration. While studying biological brains, she became fascinated with AI and all things biotech. Following graduation, she moved to UCSF to study blood-based factors that rejuvenate aged brains. She is the co-founder of Vantastic Media, a media venture that explores science stories through text and video, and runs the award-winning blog NeuroFantastic.com. Her first book, "Will AI Replace Us?" (Thames & Hudson) was published in 2019.