Large Language Models (LLMs) are typically presumed to process context uniformly; that is, the model should handle the 10,000th token just as reliably as the 100th. However, in practice, this assumption does not hold. We observe that model performance varies significantly as input length changes, even on simple tasks. In this report, we evaluate 18 LLMs, including the state-of-the-art GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 models.

