Mr. Aayush Bhatt
June 18, 2026 · 12 min read
AI Has a Major Attention Problem — New Research Finds Models Collapse on Long, Complex Tasks
A psychology attention test exposed a critical AI flaw. Models scored over 90% on short tasks — then collapsed to near zero as lists grew longer. Here's what it means.
Introduction: A Test Designed to Find Exactly This
Psychologists have used the Stroop task for nearly a century. The test is deceptively simple: you are shown a list of words — color names like RED, BLUE, or GREEN — but each word is printed in a different ink color. Your job is to name the ink color, not read the word. When the word says RED but is printed in blue ink, your brain has to suppress its automatic tendency to read the word and instead report what it actually sees. It sounds easy. It is not. The harder your brain has to fight its own strongest impulse, the more errors accumulate and the slower you go.
Researchers led by Suketu Patel recently adapted this classic test for a different kind of subject: large language models. Their goal was not to see whether AI could name colors. Their goal was to test whether AI could maintain an instruction consistently across a long, conflicting list — the same cognitive demand the Stroop test places on human attention and executive control. The study was published in the journal PNAS Nexus and featured on ScienceDaily on June 10, 2026. The results revealed something that every person who uses AI tools for real work should understand before their next long-form task.
What the Researchers Did
The setup was direct. Suketu Patel and colleagues tested two leading AI models, OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, using images of color-word lists presented under the standard Stroop conditions. They ran three conditions: congruent (where the word and the ink color matched), incongruent (where they conflicted), and a neutral control. The key experimental manipulation was a progressive increase in list length, from one word all the way up to forty items. The models were asked to report either the ink colors or the words themselves, depending on the condition being tested.
The question being asked was not whether the models could pass the test on a single item. It was whether they could maintain the required instruction — name the ink color, not the word — consistently as the list grew longer and the cumulative conflict between what the word said and what the ink showed mounted over many items in sequence. That sustained attention demand, maintained against a competing automatic response, is exactly what the Stroop task is designed to measure in humans.
The results were unambiguous. When word and ink color did not match and the list contained only five words, the models performed well, correctly naming the ink colors in the vast majority of cases. As the list grew longer, performance deteriorated sharply. Some leading systems fell from over 90 percent accuracy on short lists to nearly complete failure on longer ones. By the time the list reached its maximum length, accuracy for mismatched items dropped to near zero in some cases. The models had effectively stopped following the instruction to name ink colors and had defaulted to reading the words — the very response the task requires them to suppress.
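To make the scoring concrete, here is a minimal sketch of how an incongruent list could be built and graded at each list length. It is an illustration only: the study used images rather than text pairs, and the palette, the function names, and the `word_reader` stand-in (a responder that always reads the word) are assumptions, not the researchers' materials.

```python
import random

COLORS = ["red", "blue", "green", "yellow"]  # illustrative palette, not the study's


def make_incongruent_list(length, rng):
    """Return (word, ink) pairs in which the ink never matches the word."""
    items = []
    for _ in range(length):
        word = rng.choice(COLORS)
        ink = rng.choice([c for c in COLORS if c != word])
        items.append((word, ink))
    return items


def ink_accuracy(items, responses):
    """Fraction of responses that name the ink color of the matching item."""
    if not items:
        return 0.0
    correct = sum(
        1 for (_, ink), r in zip(items, responses) if r.strip().lower() == ink
    )
    return correct / len(items)


def word_reader(items):
    """Hypothetical worst-case responder: reads the word and ignores the ink."""
    return [word for word, _ in items]


rng = random.Random(0)
for length in (1, 5, 10, 20, 40):
    items = make_incongruent_list(length, rng)
    print(length, ink_accuracy(items, word_reader(items)))
```

A responder that has fully defaulted to reading the words scores zero on every incongruent list, which is the floor the reported accuracy approached at the longest lengths.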
Why This Happened — and What It Reveals About How AI Attention Works
The finding makes more sense when you understand the fundamental difference between how human attention and transformer-based machine attention work.
Human attention is managed by what neuroscientists call executive control — a cognitive system that can prioritize competing demands, suppress automatic responses, and maintain task-relevant goals even under conflict. When you look at a list of thirty mismatched color-words, your executive control system actively holds the instruction "name the ink color" in working memory and deploys it to override your reading reflex for each item in sequence. It degrades under pressure — humans slow down and make more errors on mismatched items than matched ones — but it remains functional across long lists because it is a dynamic, active system, not a passive pattern matcher.
Transformer-based AI models work on a fundamentally different principle. They process language through attention mechanisms that distribute focus across the tokens in a sequence — the individual words and word-pieces that make up any input. The word "attention" in the AI context is a technical term describing how a model weights different parts of its input when generating each output. It is not the same thing as the sustained, goal-directed executive attention that the Stroop task measures in humans. As the input list grows longer, the model's attention becomes distributed across more and more tokens. The instruction given at the beginning of the task — name the ink color, not the word — competes with the statistical patterns the model was most heavily trained to produce. Reading words is what these models have done across trillions of examples. Suppressing that response in favor of a contextual instruction is not their natural default, and they cannot maintain it consistently as the task scales.
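The idea that attention gets spread across more tokens can be shown with a toy calculation. The sketch below assumes a single softmax over one instruction token and n list tokens with fixed, made-up scores. Real models use many heads and layers with learned scores, and the study did not report an analysis like this, so treat it as intuition rather than a description of GPT-4o or Claude.

```python
import math


def instruction_weight(n_items, instruction_score=2.0, item_score=1.0):
    """Softmax weight on one instruction token competing with n_items item tokens.

    The scores are hypothetical constants chosen only for illustration.
    """
    inst = math.exp(instruction_score)
    others = n_items * math.exp(item_score)
    return inst / (inst + others)


for n in (1, 5, 10, 20, 40):
    print(n, round(instruction_weight(n), 3))
```

Even though the instruction token is given a higher score than any single list item, its share of the total weight shrinks steadily as the list grows, because the denominator grows with every added item.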
The researchers described this precisely: the AI systems appeared unable to consistently suppress the response they had been most heavily trained to produce. That phrasing is important. It is not that the models forgot the instruction. It is that the automatic response their training built — read and process written words — eventually won out against the explicit contextual instruction as the task grew longer and the conflict mounted.
What This Means for Real-World AI Use
The Stroop task is a laboratory test. The implications of its results are not confined to a laboratory.
Every time you ask an AI model to perform a long, multi-step task involving conflicting demands, you are placing the same cognitive pressure on the system that the Stroop task reveals it cannot consistently handle at scale. Consider the kinds of tasks where this matters most.
Reviewing a long legal document requires an AI to maintain a specific instruction (flag any clause that creates liability) across dozens of pages while ignoring hundreds of irrelevant but verbally prominent provisions. As the document grows longer, the model faces the same cumulative conflict the Stroop task creates: the instruction must be maintained against the model's default behavior of processing and responding to whatever is most salient in each section. The study's findings suggest this is exactly the condition under which AI performance deteriorates.
Summarizing a long research report while excluding a specific category of information — "summarize the findings but do not include the discussion of statistical limitations" — requires sustaining an exclusion instruction across a large and conflicting input. As the report grows longer and the excluded material appears repeatedly, the model's ability to maintain that exclusion degrades in the same way its ability to maintain the ink-color instruction degraded in the Stroop test.
Multi-step coding projects, long-context analysis tasks, document comparison across multiple sources, and any workflow where an AI must hold a complex set of constraints in place across extended output are all affected. The failure mode is not random error. It is systematic drift toward default behavior as the task grows longer and the required instruction competes more intensively with what the model's training most strongly predicts.
This is also why AI models sometimes seem to lose track of instructions given early in a conversation when that conversation has grown very long. The instruction is not lost in any simple sense. The model's ability to weight it appropriately against competing inputs deteriorates as the context window fills up with material that the model's training associates with different responses.
Which Specific AI Tasks Are Most Affected
The tasks most vulnerable to this attention failure share a common structure: they require the model to maintain a consistent instruction across a long input, while the input itself contains competing material that the model's default behavior would process differently.
Long document review is the clearest example, whether legal, financial, medical, or technical. Typical cases include contract analysis that extracts specific clause types across a hundred-page document, financial report review that applies one evaluative criterion consistently across many sections, and medical record analysis that focuses on one category of information while ignoring others. In all of these, the combination of length and competing material creates the Stroop-like conflict that the study found the tested models could not sustain.
Multi-step coding tasks across large codebases face the same vulnerability. The instruction "do not modify the authentication module" must be maintained across thousands of lines of code that include that module and many references to it. As the task grows, the model's attention to that constraint degrades relative to its default behavior of completing the task by the most direct route.
Editing tasks with specific style constraints — "rewrite this in plain English without technical jargon" — become progressively harder to maintain consistently across a long document, as the model's training toward producing technically accurate language competes with the plain-English constraint. The beginning of a long document will typically be edited more accurately than the end.
What Developers Are Doing to Address the Problem
The AI research and engineering community has been aware of this category of problem, often described as context length degradation or instruction following at scale, and is actively developing countermeasures, though none of them fully solves the underlying issue yet.
Chunking is the most common practical mitigation: breaking long tasks into shorter segments and processing each segment independently, then synthesizing the results. This directly addresses the length dimension of the Stroop-like failure by preventing the accumulation of conflicting tokens that causes attention drift. The limitation is that it requires the user or the application to handle the chunking logic, and some tasks genuinely cannot be decomposed without losing the context that makes them coherent.
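A minimal sketch of the chunking approach follows, assuming the document has already been split into paragraphs. The `ask_model` argument is a hypothetical callable standing in for whatever model client you use. No real API is implied, and the chunk size of five is an arbitrary choice.

```python
def chunk_items(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def build_prompts(instruction, paragraphs, chunk_size=5):
    """Build one self-contained prompt per chunk, each restating the instruction."""
    prompts = []
    for chunk in chunk_items(paragraphs, chunk_size):
        prompts.append(instruction + "\n\n" + "\n\n".join(chunk))
    return prompts


def review_in_chunks(instruction, paragraphs, ask_model, chunk_size=5):
    """Run each chunk independently and collect the per-chunk results in order."""
    prompts = build_prompts(instruction, paragraphs, chunk_size)
    return [ask_model(prompt) for prompt in prompts]


# Stub model for demonstration: reports how long each prompt was.
paragraphs = [f"Paragraph {i}" for i in range(1, 13)]
results = review_in_chunks(
    "Flag any clause that creates liability.",
    paragraphs,
    ask_model=lambda prompt: f"reviewed {len(prompt)} characters",
)
print(len(results), "chunks processed")
```

Because every prompt begins with the instruction and contains only a few paragraphs, no single run has to hold the constraint across the whole document. Synthesizing the per-chunk results is left to the caller, which is where tasks that cannot be decomposed cleanly run into trouble.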
Explicit instruction reinforcement — repeating the key constraint at regular intervals within a long prompt — is a technique that some prompt engineers have found effective. The equivalent in the Stroop test would be reminding the subject after every fifth word that they should name the ink color, not read the word. It helps, but it also reveals the underlying problem: a system with robust executive control would not need to be reminded.
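As a rough illustration of instruction reinforcement, the sketch below restates the constraint before every block of five items when building a prompt. The `[Reminder]` label, the numbering format, and the interval are assumptions made for the example. Whether a given interval helps would need to be measured for your model and task.

```python
def with_reminders(instruction, items, every=5):
    """Return a prompt with the instruction restated before each block of items."""
    if every < 1:
        raise ValueError("every must be at least 1")
    lines = []
    for index, item in enumerate(items):
        if index % every == 0:
            # Restate the constraint at the start and after every `every` items.
            lines.append(f"[Reminder] {instruction}")
        lines.append(f"{index + 1}. {item}")
    return "\n".join(lines)


words = ["RED", "BLUE", "GREEN", "RED", "BLUE", "GREEN", "RED"]
print(with_reminders("Name the ink color, not the word.", words))
```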
Architectural research into how transformer attention handles long sequences is ongoing. Some newer architectures are designed to maintain instruction salience more consistently across long contexts, but these remain research-stage rather than production-deployed at the scale of GPT-4o or Claude. The fundamental tension between training-driven default behavior and contextual instruction following is a property of the current dominant architecture, not a bug that a configuration change will fix.
What Every AI User Should Know Before Trusting AI With a Complex Task
The practical implication of this study is not that AI models are useless for long or complex tasks. It is that they are not equally reliable across the full length and complexity range of tasks you might assign them, and the failure is systematic and predictable rather than random.
For short, focused tasks — summarizing a single document section, answering a specific question about a limited set of information, generating a short piece of content within defined constraints — AI models perform well and the attention failure the Stroop study reveals is not meaningfully triggered. The problems the study documents emerge specifically as task length and instruction-content conflict increase simultaneously.
For longer tasks, the most important practical step is verification: not a cursory check at the end, but systematic spot-checking across the full length of the output. If you asked an AI to review a fifty-page document for a specific category of issue, check its output in the first five pages, the middle twenty pages, and the last five pages separately. The beginning is likely to be more accurate than the end, and the middle and end of a long output are where Stroop-like drift is most likely to show up.
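One way to make spot-checking systematic is to pick the pages to verify before reading the output, so the check is not biased toward the sections that look fine. The sketch below is a hypothetical helper, not an established tool. It treats everything between the first and last `edge` pages as the middle, which is a simplifying assumption, and samples a few pages from each region.

```python
import random


def spot_check_pages(total_pages, per_region=2, edge=5, seed=0):
    """Pick pages to verify by hand from the start, middle, and end of an output."""
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    rng = random.Random(seed)
    edge = min(edge, total_pages)
    regions = {
        "start": list(range(1, edge + 1)),
        "middle": list(range(edge + 1, total_pages - edge + 1)),
        "end": list(range(max(total_pages - edge + 1, 1), total_pages + 1)),
    }
    # Sample without replacement; short regions simply yield fewer pages.
    return {
        name: sorted(rng.sample(pages, min(per_region, len(pages))))
        for name, pages in regions.items()
    }


print(spot_check_pages(50))
```

Comparing the error rate found in each region gives a rough picture of whether accuracy is holding steady or drifting as the output gets longer.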
Breaking long tasks into structured shorter tasks and recombining the results gives AI models the best chance of maintaining instruction accuracy across the full scope of the work. This requires more effort from the user than a single long prompt, but it is currently more reliable than trusting a single long-context run to maintain consistent instruction following across every section.
Most importantly, do not calibrate your trust in an AI model based exclusively on its performance on a short demonstration. A model that correctly summarizes three paragraphs with a specific constraint does not thereby prove it can apply the same constraint across thirty pages. The Stroop study makes explicit what experienced AI users have observed empirically: the same model that scores over 90 percent on the short version can approach zero on the long version of the same task.
Conclusion: AI Attention Is Not Human Attention — And the Difference Matters
The Stroop study published in PNAS Nexus and reported on ScienceDaily on June 10, 2026 is not a finding that AI is broadly unreliable. It is a finding that AI attention and human executive control are fundamentally different systems, and that the difference becomes consequential in a specific and identifiable set of conditions: long inputs, conflicting demands, and instructions that must override training-dominant responses.
Humans have a specific cognitive architecture for managing sustained attention against competing impulses. We are slow at it. We get tired. We make more errors under high conflict than low conflict. But we maintain functional accuracy across long lists because we have executive control mechanisms that actively enforce the task goal across the full sequence. Current transformer-based AI models do not have an equivalent mechanism. Their attention is distributed and contextual, not goal-directed and sustained in the same sense. The Stroop task is the simplest possible demonstration of that difference.
For the tens of millions of people now using AI tools for real work — reviewing documents, writing code, analyzing data, managing information across complex projects — the most valuable takeaway is this: AI is not uniformly reliable across task length. Verify long outputs. Break complex tasks into shorter segments. And treat a model's performance on the first section of a long task as a starting estimate, not a guarantee of what you will find at the end.
The model did not lose the instruction. It lost the fight against its own defaults. That is a problem worth understanding before your next long task.
Written by
Mr. Aayush Bhatt
Software Engineer interested in how models work and where they fail.