PULLMAN, Wash. — Again and again, Washington State University professor Mesut Cicek and his colleagues fed hypotheses from scientific papers into ChatGPT and asked it to determine whether the statements had been upheld by research — whether they were true or false.
They did this with more than 700 hypotheses, repeating each query 10 times.
AI answered correctly 76.5% of the time when the experiment was run in 2024. When it was repeated in 2025, accuracy improved to 80%. After adjusting for random guessing, however, AI's performance was only about 60%, closer to a low D grade than to high reliability.
It struggled most to identify hypotheses as false, getting those answers correct just 16.4% of the time. ChatGPT was also inconsistent: across 10 identical prompts, it answered consistently for only 73% of the statements.
“We're not just talking about accuracy, we're talking about inconsistency, because if you ask
the same question again and again, you come up with different answers,” said Cicek, an associate professor in the Department of Marketing and International Business in WSU’s Carson College of Business and lead author of the new publication.
“We used 10 prompts with the same exact question. Everything was identical. It would answer true. Next, it says it’s false. It’s true, it’s false, false, true. There were several cases where there were five true, five false.”
The findings, published in the Rutgers Business Review, reinforce the need for skepticism and caution when using AI for critical tasks, especially those that involve nuance or complicated reasoning. They show that generative AI's linguistic fluency is not yet matched by conceptual intelligence, and suggest the much-touted arrival of artificial general intelligence that can truly “think” is farther off than some are predicting, Cicek said.
“Current AI tools don't understand the world the way we do — they don't have a ‘brain,’” Cicek said. “They just memorize, and they can give you some insight, but they don't understand what they’re talking about.”
Cicek’s co-authors were Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University.
The researchers used 719 hypotheses from scientific papers published in business journals since 2021 to challenge the ability of free, commonly available generative AI tools to answer questions that involve nuance and complexity. Whether research supports a given hypothesis is often a complicated question, with different factors that may qualify or balance the findings. Boiling it down to a simple true-or-false answer requires reasoning.
Cicek and his colleagues ran the experiment with the free version of ChatGPT-3.5 in 2024, and the free, updated ChatGPT-5 mini in 2025. Overall, accuracy remained similar between the versions. When the responses were adjusted for random chance (a wild true-or-false guess has a 50% likelihood of being correct), accuracy was only about 60% in both years.
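The article does not spell out the adjustment formula; one common way to discount guessing on a 50/50 true-or-false task is a kappa-style rescaling, sketched below. The function name and the choice of this particular correction are assumptions, though it yields roughly the 60% figure for the 2025 run.

```python
def chance_adjusted(raw_accuracy, chance=0.5):
    """Rescale raw accuracy so 0.0 means pure guessing and 1.0 means perfect.

    A kappa-style chance correction (assumed here, not taken from the paper):
    on a binary true-or-false task, a wild guess already scores 0.5,
    so only performance above that baseline counts.
    """
    return (raw_accuracy - chance) / (1.0 - chance)

# 2025 run: 80% raw accuracy shrinks to about 60% once guessing is discounted.
print(chance_adjusted(0.80))
# 2024 run: 76.5% raw accuracy shrinks to about 53%.
print(chance_adjusted(0.765))
```

The rescaling explains why a seemingly respectable 80% raw score earns only a "low D" once the free 50% from coin-flipping is subtracted out.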
The results highlight a key gap in large language model AI tools: while they can produce fluent, convincing language, their ability to reason through complex questions often falls short, sometimes leading them to deliver persuasive explanations for incorrect answers, Cicek said.
The researchers concluded that business managers should emphasize the need to verify AI results, treat them with skepticism and provide training in what AI can, and can’t, do well.
In the current paper, Cicek focused only on results with ChatGPT, but he has run similar tests with other AI tools and found comparable results. The study also builds upon past work of Cicek’s that raises reasons to be cautious about AI hype. A paper published in 2024 reported results of a national survey that found consumers were less likely to want to buy products when they were marketed with an emphasis on AI.
“Always be skeptical,” he said. “I'm not against AI. I’m using it. But you need to be very careful.”
Method of Research: Experimental study
Article Title: Unstable Intelligence: GenAI Struggles with Accuracy and Consistency
Article Publication Date: 19-Dec-2025
COI Statement: No conflicts declared.