Date: 2026-01-27

Creativity is a trait that AI critics say is likely to remain the preserve of humans for the foreseeable future. But a large-scale study finds that leading generative language models can now exceed average human performance on linguistic creativity tests.

The question of whether machines can be creative has gained new salience in recent years thanks to the rise of AI tools that can generate text and images with both fluency and style. While many experts say true creativity is impossible without lived experience of the world, the increasingly sophisticated outputs of these models challenge that idea.

In an effort to take a more objective look at the issue, researchers at the Université de Montréal, including AI pioneer Yoshua Bengio, conducted what they say is the largest comparative evaluation of machine and human creativity to date. The team compared outputs from leading AI models against responses from 100,000 human participants using a standardized psychological test for creativity and found that the best models now outperform the average human, though they still trail top performers by a significant margin.

“This result may be surprising—even unsettling—but our study also highlights an equally important observation: even the best AI systems still fall short of the levels reached by the most creative humans,” Karim Jerbi, who led the study, said in a press release.

The test at the heart of the study, published in Scientific Reports, is known as the Divergent Association Task and involves participants generating 10 words with meanings as distinct from one another as possible. The higher the average semantic distance between the words, the higher the score.
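
To make the scoring idea concrete, here is a minimal sketch of a score based on average semantic distance. It assumes each word is represented by a vector and that distance is measured as cosine distance; the function names and the tiny three-dimensional vectors are made up for illustration and are not the study's actual embeddings or scoring code.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_semantic_distance(words, embeddings):
    """Average the cosine distance over every pair of words in the list."""
    vectors = [embeddings[word] for word in words]
    distances = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
    return sum(distances) / len(distances)


# Hypothetical toy vectors (assumption): real word embeddings have
# hundreds of dimensions and are learned from large text corpora.
toy_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.9, 0.1, 0.0],
    "galaxy": [0.0, 1.0, 0.0],
    "justice": [0.0, 0.0, 1.0],
}

related = average_semantic_distance(["cat", "dog"], toy_embeddings)
unrelated = average_semantic_distance(["cat", "galaxy", "justice"], toy_embeddings)
print(related < unrelated)
```

In this toy setup, a list of closely related words such as "cat" and "dog" scores lower than a list of unrelated words, which is the behavior the test rewards.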

Performance on this test in humans correlates with other well-established creativity tests that focus on idea generation, writing, and creative problem solving. But crucially, it is also quick to complete, which allowed the researchers to test a much larger cohort of humans over the internet.

What they found was striking. OpenAI’s GPT-4, Google’s Gemini Pro 1.5, and Meta’s Llama 3 and Llama 4 all outperformed the average human. However, when the researchers measured the average performance of the top 50 percent of human participants, it exceeded all tested models. The gap widened further when they took the average of the top 25 percent and top 10 percent of humans.
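
The comparison against progressively narrower slices of top performers can be sketched as follows. The scores here are invented purely to illustrate the arithmetic, and `mean_of_top_fraction` is a hypothetical helper, not code or data from the study.

```python
def mean_of_top_fraction(scores, fraction):
    """Average the highest-scoring `fraction` of a list of scores."""
    ranked = sorted(scores, reverse=True)
    # Always keep at least one score, even for very small fractions.
    keep = max(1, int(len(ranked) * fraction))
    top = ranked[:keep]
    return sum(top) / len(top)


# Hypothetical numbers (assumption), chosen only to show the pattern.
human_scores = [60, 65, 70, 75, 80, 85, 90, 95]
model_score = 80

overall_mean = mean_of_top_fraction(human_scores, 1.0)
top_half_mean = mean_of_top_fraction(human_scores, 0.5)
top_quarter_mean = mean_of_top_fraction(human_scores, 0.25)

# The model beats the overall average but not the top half.
print(model_score > overall_mean, model_score < top_half_mean)
```

With these made-up numbers, the model sits above the overall human mean yet below the mean of the top half, and the shortfall grows as the slice narrows to the top quarter.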

The researchers wanted to see if these scores would translate to more complex creative tasks, so they also got the models to generate haikus, movie plot synopses, and flash fiction. They analyzed the outputs using a measure called Divergent Semantic Integration, which estimates the diversity of ideas integrated into a narrative. While the models did relatively well, the team found that human-written samples were still significantly more creative than AI-written ones.