
Chroma Research Warns of ‘Context Rot’ as LLMs Falter with Long Inputs

Created on July 25 | Last edited on July 25
A new report from Chroma Research highlights a creeping issue in today’s large language models: performance decay as input length increases. The phenomenon, which Chroma terms “Context Rot,” describes how models like GPT-4.1, Claude 3.5, and Gemini 2.5 degrade in reliability as they process more tokens, even when the task itself remains simple. This problem affects both closed and open models and could have serious implications for developers who rely on these systems for production tasks.

The False Promise of Uniform Context Understanding

Modern LLMs advertise massive context windows, often in the hundreds of thousands or even millions of tokens. But Chroma’s findings contradict the assumption that these models handle each part of their context equally well. Accuracy and reliability often drop as relevant information gets pushed deeper into the input. Chroma tested 18 models and consistently found that token position within the prompt has a strong impact on whether the model can use that information correctly.
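As a rough illustration of how a positional sweep like this can be run, the sketch below plants a known fact at varying depths inside filler text and checks whether the model's answer contains the expected string. This is not Chroma's published harness; `call_model` is a placeholder for whatever chat-completion client you use, and the needle, filler, and depths are made up for the example.

```python
# Hypothetical sketch: measure whether answer accuracy depends on where the
# "needle" fact sits inside a long filler context. `call_model` is a stub
# standing in for any real LLM client.

NEEDLE = "The access code for the archive room is 7241."
QUESTION = "What is the access code for the archive room?"
EXPECTED = "7241"
FILLER = "The quarterly report was filed on time. " * 2000  # irrelevant padding


def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call (OpenAI, Anthropic, local model, ...)."""
    raise NotImplementedError("wire up your own client here")


def build_prompt(depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    cut = int(len(FILLER) * depth)
    context = FILLER[:cut] + " " + NEEDLE + " " + FILLER[cut:]
    return f"{context}\n\nQuestion: {QUESTION}\nAnswer:"


def positional_sweep(depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict[float, bool]:
    """Return, per depth, whether the model's answer contained the expected code."""
    return {depth: EXPECTED in call_model(build_prompt(depth)) for depth in depths}
```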

Challenging Standard Retrieval Benchmarks

Traditional retrieval tests like Needle in a Haystack are limited, rewarding only surface-level pattern matching. Chroma constructed harder versions that included semantically similar but incorrect distractors or ambiguous phrasing. These revealed deeper issues. As context grew, models failed not just to retrieve information, but to understand it. Small shifts in distractor wording or document structure dramatically affected performance, pointing to weaknesses in the model’s internal reasoning over long input.
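To make the contrast with a plain needle-in-a-haystack test concrete, a harder variant might bury the true fact among distractors that are topically close but factually wrong, so keyword matching alone no longer works. The sketch below is illustrative only; the sentences and the scoring rule are invented for this example, not Chroma's actual test items.

```python
import random

# Hypothetical "hard" NIAH item: the correct fact is mixed with semantically
# similar but incorrect distractors inside a long run of filler sentences.

NEEDLE = "The best thing to do in Zurich in winter is to visit the Christmas markets."
DISTRACTORS = [
    "The best thing to do in Geneva in winter is to visit the Christmas markets.",
    "The best thing to do in Zurich in summer is to swim in the lake.",
    "Many tourists say the best thing to do in Zurich is to tour the old town in spring.",
]
QUESTION = "According to the document, what is the best thing to do in Zurich in winter?"
FILLER_SENTENCE = "The committee reviewed the agenda and adjourned without objections."


def build_hard_niah(n_filler: int = 500, seed: int = 0) -> str:
    """Shuffle the needle and its distractors into a long block of filler sentences."""
    rng = random.Random(seed)
    sentences = [FILLER_SENTENCE] * n_filler + [NEEDLE] + DISTRACTORS
    rng.shuffle(sentences)
    return " ".join(sentences) + f"\n\nQuestion: {QUESTION}\nAnswer:"


def score(answer: str) -> bool:
    """Crude check: the answer must name the markets and not drift to a distractor city."""
    return "christmas markets" in answer.lower() and "geneva" not in answer.lower()
```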

Semantic Complexity and Confounding Input

The report also examined how distractor passages and semantic ambiguity degrade LLM output. Models performed worse when the distractors were closely related in topic to the target information. Counterintuitively, logically structured input, such as a narrative or well-organized documentation, caused more confusion than jumbled or randomly ordered text. This suggests that internal attention mechanisms may be misled by the surface coherence of the input.

Task Simplicity Doesn’t Prevent Decay

To further test the limits of context retention, Chroma used trivial instruction-following tasks like asking the model to repeat a word that appears late in the context. Even on these straightforward challenges, performance dropped as input length increased. In many cases, models responded with unrelated words, made up answers, or gave no answer at all. These breakdowns occurred long before models reached their token limits, showing that sheer capacity doesn’t equate to utility.
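A minimal version of this kind of probe, following the article's description rather than Chroma's exact setup, could look like the sketch below: a target word is placed at the end of a growing block of repeated filler, and the model is asked to echo it back, so the only variable is input length. `call_model`, the word choices, and the lengths are all assumptions for illustration.

```python
# Hypothetical sketch: a trivial instruction-following task whose only variable
# is context length. Any failure is attributable to length, not task difficulty.

TARGET = "lantern"


def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError("wire up your own client here")


def build_repeat_prompt(n_filler_words: int, filler: str = "apple") -> str:
    body = " ".join([filler] * n_filler_words + [TARGET])
    return (
        "Below is a list of words. Reply with only the last word in the list.\n\n"
        f"{body}\n\nLast word:"
    )


def length_sweep(lengths=(100, 1_000, 10_000, 50_000)) -> dict[int, bool]:
    """For each input length, record whether the model echoed the target exactly."""
    return {
        n: call_model(build_repeat_prompt(n)).strip().lower() == TARGET
        for n in lengths
    }
```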

Implications for Developers and Researchers

These findings carry major weight for those building tools with LLMs. Relying on a model’s stated context size alone is misleading. High scores on benchmarks like NIAH may mask deeper fragility when tasks involve real reasoning or understanding. Developers must treat prompt construction, input order, and token placement as core design issues, not just formatting choices. For researchers, the results signal that today’s evaluation frameworks may not reflect actual user scenarios.
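As one hedged illustration of treating placement as a design choice: instead of concatenating retrieved documents in arbitrary order, a prompt builder can rank chunks by relevance, trim to a conservative budget, and keep the most relevant material closest to the question. The function below is a generic sketch; the scoring input and character budget are assumptions, not a recommendation from the Chroma report.

```python
# Hypothetical prompt-assembly sketch: keep only the highest-relevance chunks and
# place them nearest the question, rather than trusting the model to use a huge
# context uniformly. `scored_chunks` is assumed to come from your retriever as
# (score, text) pairs.

def build_prompt(question: str, scored_chunks: list[tuple[float, str]],
                 max_chars: int = 12_000) -> str:
    # Highest-scoring chunks first, then trim to a conservative character budget.
    ranked = [text for _, text in sorted(scored_chunks, key=lambda p: p[0], reverse=True)]
    kept, used = [], 0
    for chunk in ranked:
        if used + len(chunk) > max_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    # Reverse so the most relevant chunk sits closest to the question at the end.
    context = "\n\n".join(reversed(kept))
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```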

The Open Question Behind the Decay

Despite documenting the pattern clearly, Chroma offers limited answers as to why context rot occurs. The likely culprits are model architecture and attention behavior, but detailed mechanisms remain unknown. This opens new areas for interpretability and architecture research, especially as developers seek models that not only remember more but use that memory effectively.

Chroma Urges New Evaluation Methods

The report closes with a direct challenge to the industry: stop assuming bigger token windows equal better results. New evaluations must test real-world tasks, where understanding matters more than retrieval. Chroma is calling for improved benchmarks and deeper investigations into how structure, semantics, and distractors influence outcomes. Without that, the promise of long-context models risks collapsing under its own weight.
Tags: ML News