In psychology, it has long been debated whether the human mind can be explained by a single unified theory or whether each of its aspects, such as attention and memory, must be studied separately. Now, artificial intelligence (AI) models are entering the discussion, offering a new way to probe this age-old question.
In July 2025, Nature published a groundbreaking study introducing an AI model named "Centaur". Built on a conventional large language model and fine-tuned on data from psychological experiments, the model was claimed to accurately simulate human cognitive behavior across 160 tasks spanning decision-making, executive control, and other domains. The result attracted widespread attention and was regarded as a possible sign that AI could comprehensively simulate human cognition.
However, a recent study published in National Science Open has raised significant doubts about the Centaur model. The research team from Zhejiang University pointed out that the "human cognitive simulation ability" demonstrated by Centaur is likely a result of overfitting—meaning the model did not genuinely understand the experimental tasks but merely learned answer patterns from the training data.
To test this view, the research team designed multiple testing scenarios. For instance, they replaced the original multiple-choice question stems, which described specific psychological tasks, with the instruction "Please choose option A". If the model truly understood the task requirement, it should consistently select option A. In actual testing, however, Centaur still produced the answers deemed "correct" for the original tasks rather than following the new instruction. This indicates that the model did not make judgments based on the semantic meaning of the questions but relied on statistical patterns to "guess" the answers—akin to a student achieving high scores through test-taking strategies without understanding the questions.
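The logic of this sanity check is simple enough to sketch in a few lines of code. The sketch below is a hypothetical illustration rather than the authors' actual evaluation script: `query_model` is a stand-in for whatever interface serves the model under test, and the item format (a list of answer options plus the originally "correct" answer) is an assumption.

```python
def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under test (hypothetical).

    Replace with whatever interface actually serves the fine-tuned model.
    """
    raise NotImplementedError("Plug in a real model call here.")


def instruction_following_check(items, option_labels=("A", "B", "C", "D")):
    """Swap each task's question stem for a trivial instruction and count how
    often the model follows it versus reproducing the original 'correct' answer."""
    followed = reproduced = 0
    for item in items:
        # Keep the original answer options, but replace the task description
        # with an instruction whose correct response is unambiguous.
        options = "\n".join(
            f"{label}. {text}"
            for label, text in zip(option_labels, item["options"])
        )
        prompt = f"Please choose option A.\n{options}\nAnswer:"
        answer = query_model(prompt).strip()
        if answer.startswith("A"):
            followed += 1
        if answer.startswith(item["original_answer"]):
            reproduced += 1
    n = len(items)
    print(f"Followed the new instruction: {followed}/{n}")
    print(f"Reproduced the original answer: {reproduced}/{n}")
```

A model that understands the prompt should follow the new instruction nearly every time; a model that has merely memorized answer patterns will keep reproducing the original answers, which is the behavior the study reports for Centaur.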
This study serves as a reminder to adopt a more cautious approach when evaluating the capabilities of large language models. While large language models are powerful tools for data fitting, their "black-box" nature makes them prone to issues such as hallucinations and misinterpretations. Only precise, multi-faceted evaluation can determine whether a model genuinely possesses a given specialized capability.
Notably, despite Centaur's positioning as a "cognitive simulation" model, its most significant shortcoming lies in language comprehension itself: capturing and responding to the intent of a question. The study also suggests that genuine language understanding may be the most critical technological bottleneck on the path toward general cognitive models.