AI fails classic attention test

06.02.26 | PNAS Nexus

Dissociation between task recognition and task execution in Claude 3.5 Sonnet without an explicit prompt. (a) Screenshot of the unprompted conversation (January 10, 2025) in which the model identifies the Stroop paradigm and generates word-color relationship mappings, yet achieves only 70% accuracy (7 of 10 correct) on an incongruent list. (b) The 10-word incongruent stimulus image provided as the sole input, without accompanying task instructions. This dissociation suggests that recognition of ...

Credit: Suketu Chandrakant Patel, Hongbin Wang, and Jin Fan

Giving AI a classic psychological test reveals an inherent weakness in LLM decision-making abilities. Suketu Patel and colleagues explored how transformer-based machine attention differs from human attention by testing AI models on the “Stroop task,” in which color words are printed in colored ink and participants must name the ink color of each word while ignoring its meaning. The task is used clinically to assess executive control, especially a person’s ability to inhibit an automatic response. Humans generally take longer to answer correctly when word and ink color are mismatched than when they match, but their accuracy remains high and stable even on long word lists.

The authors found that when word and ink color did not match, LLMs performed well on a list of five words. But as the list grew longer, AI performance degraded dramatically. GPT-4o dropped from 91% accuracy at 5 words to 57% at 10 words and 15% at 40 words. Claude 3.5 Sonnet was stable through 20 words but crashed to 24% accuracy at 40 words. In trials that mixed words in matching and mismatched colors, LLM performance was even worse, dropping to near 0% accuracy on the mismatched items. Similar results were found with GPT-5, Claude Opus 4.1, and Gemini 2.5. LLMs struggled to stay focused on naming the color rather than defaulting to word reading. Like humans, LLMs are better trained on word reading than on color naming, yet humans can suppress word reading in long lists and maintain focus on the task at hand. According to the authors, the performance collapse of LLMs suggests fundamental limitations compared with biological attention.
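For readers who want to see how such a list could be built and scored, here is a minimal Python sketch. It is an illustration only, not the authors' stimulus generator or scoring procedure: the four-color palette, the function names make_incongruent_list and score, and the example responses are all assumptions made for this sketch. It builds a list in which the ink color never matches the word, then reports accuracy and how many errors were word-reading intrusions (answering with the word instead of the ink color).

```python
import random

# Hypothetical palette; the study's actual color set is not given in the release.
COLORS = ["red", "green", "blue", "yellow"]


def make_incongruent_list(n, seed=0):
    """Return n (word, ink) pairs in which the ink never matches the word."""
    rng = random.Random(seed)
    items = []
    for _ in range(n):
        word = rng.choice(COLORS)
        ink = rng.choice([c for c in COLORS if c != word])
        items.append((word, ink))
    return items


def score(items, responses):
    """Return (accuracy, word_reading_errors) for one list of responses.

    A response is correct when it names the ink color. It counts as a
    word-reading error when it is wrong and equals the printed word.
    """
    correct = 0
    word_reading_errors = 0
    for (word, ink), response in zip(items, responses):
        if response == ink:
            correct += 1
        elif response == word:
            word_reading_errors += 1
    return correct / len(items), word_reading_errors


# Made-up responses: ink color named on the first 7 items, word read on the last 3.
items = make_incongruent_list(10)
responses = [ink for _, ink in items[:7]] + [word for word, _ in items[7:]]
accuracy, intrusions = score(items, responses)
print(accuracy, intrusions)  # 0.7 3
```

Counting word-reading intrusions separately from other errors is a design choice of this sketch; it distinguishes a response that defaults to reading the word from one that is simply wrong.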

Article Information

Journal

PNAS Nexus

Article Publication Date

2026-06-02

Article Title

Deficient executive control in transformer attention

Contact Information

Jin Fan

The City University of New York

jin.fan@qc.cuny.edu

Source

This article is based on a news release from PNAS Nexus. BrightSurf curates and republishes science news from research institutions worldwide; the original release is linked below.

Original Source

https://academic.oup.com/pnasnexus/article-lookup/doi/10.1093/pnasnexus/pgag149
