AI and Scientists Face Off to See Who Can Come Up With the Best Ideas

Scientific breakthroughs rely on decades of diligent work and expertise, sprinkled with flashes of ingenuity and, sometimes, serendipity.

What if we could speed up this process?

Creativity is crucial when exploring new scientific ideas. It doesn’t come out of the blue: Scientists spend decades learning about their field. Each piece of information is like a puzzle piece that can be reshuffled into a new theory—for example, insights into how different anti-aging treatments converge, or how the immune system influences dementia or cancer, can point the way to new therapies.

AI tools could accelerate this. In a preprint study, a team from Stanford pitted a large language model (LLM)—the type of algorithm behind ChatGPT—against human experts in the generation of novel ideas over a range of research topics in artificial intelligence. Each idea was evaluated by a panel of human experts who didn’t know if it came from AI or a human.

Overall, ideas generated by AI were more out-of-the-box than those from human experts. They were also rated as less feasible. That’s not necessarily a problem. New ideas always come with risks. In a way, the AI reasoned like human scientists willing to pursue high-risk, high-reward ideas: grounded in previous research, but just a bit more creative.

The study, almost a year long, is one of the biggest yet to vet LLMs for their research potential.

The AI Scientist

Large language models, the AI algorithms taking the world by storm, are galvanizing academic research.

These algorithms scrape data from the digital world, learn patterns in the data, and use these patterns to complete a variety of specialized tasks. Some algorithms are already aiding research scientists. Some can solve challenging math problems. Others are “dreaming up” new proteins to tackle some of our worst health problems, including Alzheimer’s and cancer.

Although helpful, these tools only assist in the later stages of research—that is, once scientists already have ideas in mind. What about having an AI generate new ideas in the first place?

AI can already help draft scientific articles, generate code, and search scientific literature. These steps are akin to when scientists first begin gathering knowledge and form ideas based on what they’ve learned.

Some of these ideas are highly creative, in the sense that they could lead to out-of-the-box theories and applications. But creativity is subjective. One way to gauge the potential impact, and other qualities, of research ideas is to call in a human judge blinded to the experiment.

“The best way for us to contextualize such capabilities is to have a head-to-head comparison” between AI and human experts, study author Chenglei Si told Nature.

The team recruited over 100 computer scientists with expertise in natural language processing to come up with ideas, act as judges, or both. These experts are especially well-versed in how computers can communicate with people using everyday language. The team pitted 49 participants against a state-of-the-art LLM based on Anthropic’s Claude 3.5. The scientists earned $300 per idea plus an additional $1,000 if their idea scored in the top 5 overall.

Creativity, especially when it comes to research ideas, is hard to evaluate. The team used two measures. First, they looked at the ideas themselves. Second, they asked AI and participants to produce writeups simply and clearly communicating the ideas—a bit like a school report.

They also tried to reduce AI “hallucinations”—when a bot strays from the factual and makes things up.

The team trained their AI on a vast catalog of research articles in the field and asked it to generate ideas in each of seven topics. To sift through the generated ideas and choose the best ones, the team engineered an automatic “idea ranker” based on past review scores and publication-acceptance data from a popular computer science conference.

The Human Critic

To make it a fair test, the judges didn’t know which responses were from AI. To disguise them, the team translated submissions from humans and AI into a generic tone using another LLM. The judges evaluated ideas on novelty, excitement, and—most importantly—if they could work.

After aggregating reviews, the team found that, on average, ideas generated by human experts were rated less exciting than those from AI, but more feasible. As the AI generated more ideas, however, they became less novel, with duplicates appearing more and more often. Digging through the AI’s nearly 4,000 ideas, the team found around 200 unique ones that warranted more exploration.

But many weren’t reliable. Part of the problem stems from the fact that the AI made unrealistic assumptions. It hallucinated ideas that were “ungrounded and independent of the data” it was trained on, wrote the authors. The LLM generated ideas that sounded new and exciting but weren’t necessarily practical for AI research, often because of latency or hardware problems.

“Our results indeed indicated some feasibility trade-offs of AI ideas,” wrote the team.

Novelty and creativity are also hard to judge. The study tried to keep judges from telling AI and human submissions apart by rewriting both with an LLM, but, like a game of telephone, that rewriting may have subtly changed length or wording in ways that influenced how the judges perceived submissions—especially when it came to novelty. Also, the human researchers were given limited time to come up with ideas, and they admitted their submissions were about average compared to their past work.

The team agrees there’s more to be done when it comes to evaluating AI generation of new research ideas. They also suggested AI tools carry risks worthy of attention.

“The integration of AI into research idea generation introduces a complex sociotechnical challenge,” they said. “Overreliance on AI could lead to a decline in original human thought, while the increasing use of LLMs for ideation might reduce opportunities for human collaboration, which is essential for refining and expanding ideas.”

That said, new forms of human-AI collaboration, including AI-generated ideas, could be useful for researchers as they investigate and choose new directions for their research.

Image Credit: Calculator Land / Pixabay

Shelly Fan
Shelly Xuelai Fan is a neuroscientist-turned-science writer. She completed her PhD in neuroscience at the University of British Columbia, where she developed novel treatments for neurodegeneration. While studying biological brains, she became fascinated with AI and all things biotech. Following graduation, she moved to UCSF to study blood-based factors that rejuvenate aged brains. She is the co-founder of Vantastic Media, a media venture that explores science stories through text and video, and runs the award-winning blog NeuroFantastic.com. Her first book, "Will AI Replace Us?" (Thames & Hudson) was published in 2019.