The Context Window
An introduction to AI’s working memory, and how it affects the price of RAM.
Last week, Anthropic and OpenAI released new flagship models within minutes of each other, and buried in the technical specs was a number that matters more than most of the benchmark scores the companies like to tout: Anthropic’s Claude Opus 4.6 now offers a one-million-token context window, the first time its top-tier model has reached that threshold. If that sentence means nothing to you, this article is for you. Context windows are one of those under-discussed technical details that shape what AI can and cannot do in practice, and their relentless expansion is contributing to a global memory shortage that is about to make a lot of everyday electronics more expensive.
What even is a context window?
A context window is the amount of information an AI model can hold in its head at one time. It is, for practical purposes, the model’s working memory.
When you type a message into ChatGPT, Claude, or Gemini, the model does not “remember” previous messages the way a person might. Instead, every time it generates a response, it reprocesses the entire conversation from the beginning, plus whatever documents, code, or other material have been loaded into the session. The context window is the hard limit on how much of that material it can consider at once. Exceed it, and the model starts losing information, either by dropping older parts of the conversation or by compressing them into something less precise.
The unit of measurement is the “token,” which is roughly three-quarters of a word in English. A 1,000-word email runs about 1,300 tokens. A 300-page book is roughly 100,000 tokens. So when Anthropic says Opus 4.6 has a one-million-token context window, it means the model can hold approximately 750,000 words in working memory simultaneously: several novels, an entire codebase, or a thick stack of legal filings.
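To make the arithmetic concrete, here is a minimal Python sketch. It uses the rough three-quarters rule above and a crude word-count estimator rather than a real tokenizer, and the window size and turn-dropping behavior are illustrative, not any particular vendor’s implementation:

```python
# Rough token arithmetic using the ~0.75 words-per-token rule of thumb above
# (real tokenizers vary by language and content), plus the hard-limit behavior
# described earlier: once a conversation exceeds the window, old turns go first.

WORDS_PER_TOKEN = 0.75  # rule-of-thumb ratio, not a real tokenizer

def estimate_tokens(text: str) -> int:
    return round(len(text.split()) / WORDS_PER_TOKEN)

def fit_to_window(turns: list[str], window_tokens: int) -> list[str]:
    kept = list(turns)
    while len(kept) > 1 and sum(estimate_tokens(t) for t in kept) > window_tokens:
        kept.pop(0)  # drop the oldest turn first
    return kept

print(estimate_tokens("word " * 1_000))    # a 1,000-word email -> ~1,333 tokens
print(round(1_000_000 * WORDS_PER_TOKEN))  # a 1M-token window  -> 750,000 words
```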
How we got here
The growth has been extraordinary. When GPT-3 launched in 2020, its context window was 2,048 tokens, about 1,500 words, enough for a short conversation and not much else. GPT-3.5, which powered the original ChatGPT in late 2022, doubled that to 4,096. GPT-4 arrived in early 2023 at 8,192 tokens, then expanded to 128,000 in its Turbo variant later that year. Anthropic’s Claude 3 launched in early 2024 with 200,000. Google’s Gemini 1.5 Pro pushed to one million soon after.
Today, the major models sit roughly here: Google’s Gemini offers up to two million tokens, Anthropic’s Opus 4.6 offers one million, and OpenAI’s GPT-5.2 sits at 400,000. The context window has gone from “a few paragraphs” to “a small library” in about three years.
Bigger is not automatically better
This is where things get interesting, and where the gap between marketing and engineering is worth paying attention to.
A model’s advertised context window tells you how much text can be loaded in. It does not tell you how well the model actually uses all that information. Researchers have known for some time that AI models tend to pay the most attention to material near the beginning and end of their context, while information in the middle gets neglected. A 2023 Stanford study documented this pattern, which the researchers called “lost in the middle,” and found that performance could degrade by more than 30% when critical information was buried in the center of a long input. The models displayed a distinctive U-shaped curve: strong recall at the start, strong recall at the end, a valley in between.
This matters because “having a million-token context window” and “reliably finding and using information across a million tokens” are two very different claims. For a long time, they were very different realities. You could load a massive document into a model, and it would technically accept it, but asking a question about something on page 47 of a 200-page report might yield a hallucination rather than an answer.
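A simplified version of the kind of probe behind those findings is easy to sketch: plant one known fact at a chosen depth in a pile of filler text, ask the model to retrieve it, and sweep the depth from start to finish. The filler, the “needle,” and the depths below are placeholders; the published benchmarks are considerably more careful.

```python
# Schematic needle-in-a-haystack probe: one known fact buried at a chosen
# depth inside long filler text. Scoring a model's answers across depths
# traces out the U-shaped recall curve described above.

def build_probe(filler_paragraphs: list[str], needle: str, depth: float) -> str:
    """depth runs from 0.0 (start of the context) to 1.0 (end)."""
    docs = list(filler_paragraphs)
    docs.insert(round(depth * len(docs)), needle)
    context = "\n\n".join(docs)
    return context + "\n\nQuestion: what is the secret code mentioned above?"

filler = ["Quarterly results were broadly in line with expectations."] * 400
needle = "The secret code is 7319."
prompts = [build_probe(filler, needle, d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)]
# Each prompt would be sent to the model; the answers are graded against 7319.
```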
What changed last week
This is why one specific number in Anthropic’s announcement deserves attention. On the MRCR v2 test, which measures a model’s ability to find multiple pieces of information buried across vast amounts of text (researchers call it a “needle-in-a-haystack” test), Opus 4.6 scored 76% at the full one-million-token mark. Its predecessor scored just 18.5% under the same conditions. Anthropic called it “a qualitative shift in how much context a model can actually use while maintaining peak performance.”
The claim comes with caveats. The score is self-reported and awaits independent verification. Google’s Gemini 3 Pro, which advertises an even larger context window, scored 77% on the same benchmark at 128,000 tokens but dropped to just 26.3% when tested at the full million-token level, according to its own evaluation card. Developers have reported Gemini’s performance degrading well before its advertised capacity is reached.
The context window race is no longer just about raw size. It is increasingly about retrieval quality at scale, and the gap between “accepts this many tokens” and “reliably uses this many tokens” is where the real competition is playing out.
What this actually means for ordinary use
For most people using AI through a chat interface, context windows impose constraints that are invisible until they aren’t. That moment when a chatbot seems to “forget” what you said ten minutes ago, or contradicts something it stated earlier, or produces a vaguely confused summary of a long document? That is often the context window at work, either because the conversation has exceeded it or because the model is struggling to use the space it has.
The practical difference that better context windows make is easier to see through examples than through token counts. A lawyer can load an entire contract into the model rather than feeding it in pieces and hoping the AI remembers the indemnification clause by the time it reaches the liability section. A developer can give the model an entire codebase rather than isolated files, so the AI understands how different components interact. A researcher can paste in ten journal articles and ask the model to synthesize findings across all of them, rather than summarizing each one separately and losing the cross-references.
These scenarios represent the difference between AI as a tool that processes fragments and AI as a tool that processes wholes.
The price everyone else pays
All of this takes memory, and not just inside the AI model. The physical infrastructure required to run these systems, the data centers packed with GPUs and high-bandwidth memory chips, is consuming an increasingly enormous share of the world’s semiconductor production. And that is driving up prices for everything else that uses memory, which is to say, nearly every electronic device you can name.
Context windows are not the sole driver of AI’s memory appetite; training models and running inference both demand vast quantities of specialized chips. But the expansion of context windows is part of the same underlying dynamic: AI systems that process more information simultaneously require more memory to do it. Processing a million tokens in a single pass is not a trivial hardware task. The attention mechanism at the heart of these models scales roughly with the square of the context length, so doubling the window can quadruple the computational work and the memory required to hold it.
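A quick back-of-the-envelope calculation shows why, at least for a naive implementation of attention. Deployed systems use heavily optimized kernels that avoid materializing all of this at once, so treat the sketch below as intuition rather than a memory forecast:

```python
# Naive self-attention compares every token with every other token, so the
# number of pairwise scores grows with the square of the context length.
# (Production systems avoid materializing the full matrix; the scaling
# intuition from the paragraph above is the point here.)

def pairwise_scores(context_len: int) -> int:
    return context_len * context_len

for n in (4_096, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {pairwise_scores(n):,} pairwise scores")

print(pairwise_scores(2 * 128_000) / pairwise_scores(128_000))  # doubling -> 4.0x
```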
The consequences of this hunger are rippling through the global supply chain. TrendForce projects that AI data centers will consume 70% of all high-end DRAM production in 2026. Memory prices rose roughly 50% in the last quarter of 2025, and are forecast to climb another 70% this year. DRAM prices surged 172% through 2025, and Micron has said it is “sold out for 2026.” The knock-on effects are already visible: PC manufacturers are warning of 15-20% price increases, smartphone makers are either raising prices or quietly downgrading specifications to hit familiar price points, and retailers in Japan’s Akihabara electronics district have started rationing memory purchases.
The specialized memory that AI accelerators need, called High Bandwidth Memory or HBM, is particularly demanding to produce: manufacturing one bit of HBM requires forgoing three bits of conventional memory. As Samsung, SK Hynix, and Micron, which together produce about 90% of the world’s DRAM, shift production toward AI infrastructure, in what analysts at IDC have called a “permanent reallocation” of capacity, the supply of memory for consumer devices contracts. TrendForce senior research VP Avril Wu told the Wall Street Journal she has tracked the memory sector for nearly 20 years, and “this time really is different. It really is the craziest time ever.”
Where this goes
Anthropic also introduced something called “compaction” in the Opus 4.6 release, an API feature that automatically summarizes older parts of a conversation when the context window fills up. Rather than hard-cutting the conversation at the limit, the model compresses earlier exchanges into condensed summaries, preserving the gist while freeing up room. It is, essentially, an artificial version of how human memory works: sharp detail on what just happened, fuzzier recollections of what came before, a sense of the overall thread.
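Here is a conceptual sketch of that idea, with made-up thresholds and a stand-in for the summarization call. Anthropic’s actual feature is its own design, so this shows the shape of the technique rather than their implementation:

```python
# Conceptual compaction loop: when the conversation nears a (hypothetical)
# budget, older turns are replaced with a condensed summary while the most
# recent turns are kept verbatim.

BUDGET_TOKENS = 200_000   # illustrative budget, not a real model's limit
KEEP_RECENT = 10          # recent turns preserved word for word

def compact(turns: list[str], count_tokens, summarize) -> list[str]:
    """count_tokens and summarize stand in for a tokenizer and a model call."""
    if sum(count_tokens(t) for t in turns) <= BUDGET_TOKENS:
        return turns
    older, recent = turns[:-KEEP_RECENT], turns[-KEEP_RECENT:]
    digest = summarize("\n".join(older))   # condensed gist of the earlier thread
    return ["[Summary of earlier conversation] " + digest] + recent
```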
This points toward where context windows are likely headed. The raw numbers will keep growing, but the more important innovation is in how models manage and prioritize information within those windows. The goal is not just a bigger bucket, but a smarter one: a model that knows what to hold onto, what to compress, and what to release.
For now, the practical picture is this: AI’s working memory is getting dramatically bigger and, at last, meaningfully more reliable. That is genuinely useful if one works with long documents, complex projects, or extended conversations. It is also genuinely expensive, not just for the companies running the models, but for anyone buying a laptop, a phone, or a TV in 2026. The insatiable memory of the machine turns out to be everyone’s bill to pay.