There are already hundreds of thousands of large language models (LLMs) in existence with a few dozen commercial systems dominating the market. Between options such as GPT-4, Claude and Gemini, many people have their favorite, especially when it comes to creative tasks such as writing.
Those preferences, however, are likely entirely in the eye of the beholder. According to new research from Duke University, the creative outputs of commercial LLMs are more similar to one another than users might hope. When challenged with three standard tasks assessing creativity, answers from commercial LLMs were far more alike than answers from humans.
The results appeared online March 24 in the journal Proceedings of the National Academy of Sciences Nexus.
“People might wonder if different LLMs will take them in different directions with the same prompts for creative projects,” said Emily Wenger, the Cue Family Assistant Professor of Electrical and Computer Engineering at Duke. “This paper basically says no. LLMs are less creative as a population than humans.”
According to a 2024 survey by Adobe, over half of Americans have already used LLMs as creative partners for brainstorming, writing, creating images or writing code. Because so many users trust these tools to help them be more creative, researchers have been trying to find out whether that trust is misplaced.
One seminal study in this emerging field, conducted by Anil Doshi and Oliver Hauser, found that writers who used GPT-4 produced more creative stories than humans working alone. However, the same study showed that those LLM-aided stories were more similar to each other than were stories from human writers working solo.
That research, and other papers like it, looked only at people using a single LLM. Wenger, who studies how data gets into AI models, was curious whether these results would hold across different LLMs.
“Commercial LLMs have all been trained on the same dataset—the entirety of the internet—and they all have the same goal,” Wenger said. “It seemed likely to me that this would limit the amount of diversity we’d see in their creativity, so I decided to find out.”
To explore her hunch, Wenger turned to Yoed Kenett, a cognitive neuroscientist and associate professor of data and decision sciences at the Technion – Israel Institute of Technology. Together, they settled on three standard tasks used to assess creativity levels and put 22 LLMs to the test against over 100 people.
One test, called the Alternative Uses Test (AUT), challenges participants to name ways an object could be used beyond its intended purpose—for example, using a book as a doorstop, fly swatter or kindling for a fire. The second test, called the Divergent Association Task (DAT), asks participants to name 10 words, each as different as possible from all the others in every sense. Lastly, the Forward Flow (FF) test provides a starting word and asks participants to write down whatever word naturally follows the previous one in their mind, continuing for up to 20 words. For example: fire, candle, wax, hair, comb, honey, bee, stripes, zebra, and so on.
Together, these tests seek to measure the divergent and dissociative thinking abilities that facilitate creativity.
“Significant empirical research over the past few decades highlights how much human creativity depends on variability,” said Yoed Kenett. “The problem, as we and others are increasingly showing, is that while LLMs appear to generate extremely original outputs, they are overly homogenized and not variable in their responses. This could have a detrimental long-term impact on human creative thinking and thus must be addressed.”
The results, which aimed to measure the variability and originality in responses between LLMs and people, were clear. While individual LLMs might outperform individual people in levels of creativity, as a whole, the algorithms’ responses were much more similar to each other than the people’s. Importantly, altering the LLM system prompt to encourage higher creativity only slightly increased their variability—and human responses still won out.
“This work has broad implications as people continue adopting and integrating LLMs into their daily lives,” Wenger said. “Overreliance on these tools will smooth the world’s work toward the same underlying set of words or grammar, tending to make writing all look the same.”
“If you’re trying to come up with an original concept or product to stand out from the crowd,” Wenger continued, “this work strongly suggests you should bring together a diverse group of people to brainstorm rather than relying on AI.”
CITATION: “Large language models are homogeneously creative.” Emily Wenger and Yoed N. Kenett. PNAS Nexus, 2026, 5, pgag042. DOI: 10.1093/pnasnexus/pgag042