The million token models
Context window size is no longer a distinguishing feature.
Context windows have been a headline specification in AI model releases since GPT-4 pushed past 32K tokens in 2023. The race to one million ended in early 2026, with five frontier models reaching that threshold in a single quarter. How well a model uses its context window now matters more than how large that window is, and the differences between providers are larger than the spec sheets suggest.
Sticker numbers
Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, Qwen 3.6 Plus, and Llama 4 Maverick all offer one-million-token context windows. Meta's Llama 4 Scout claims ten million, roughly equivalent to 15,000 pages of text, though independent testing suggests that Scout's effective recall degrades past one million tokens. In early 2024, a 128K context window was exceptional. The ceiling has grown by an order of magnitude in roughly two years.
Advertised capacity and effective capacity are different measurements. Anthropic's MRCR v2 benchmark tests whether a model can track entities and relationships across a full context window. Opus 4.6 scores 78.3% at one million tokens. Gemini 3.1 Pro scores 26.3% on the same test, a threefold gap between models with identical advertised windows. Princeton NLP's HELMET benchmark found that most models degrade on summarization tasks past 32K tokens, well below the numbers on the label.
Before generating its first output token, a model must process the entire input. This prefill phase can exceed two minutes at maximum context. That latency suits batch processing and overnight pipelines, and rules out anything interactive. Models also underweight content in the middle of the window relative to the beginning and end, making prompt structure an architectural decision for production systems.
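The "underweight the middle" finding turns into a simple layout rule: keep instructions and the question at the edges of the window, where recall is strongest, and put bulk reference material in the middle. A minimal sketch, with hypothetical function and variable names (this is not any provider's API):

```python
def build_long_context_prompt(instructions: str, documents: list[str], question: str) -> str:
    """Place instructions first, bulk documents in the middle, and
    restate the task last, so the weakest-recall region of the window
    holds only reference material."""
    middle = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents)
    )
    return (
        f"{instructions}\n\n"
        f"{middle}\n\n"
        f"Reminder of the task: {question}"
    )

prompt = build_long_context_prompt(
    "Answer using only the documents below.",
    ["First source text...", "Second source text..."],
    "Which source mentions pricing?",
)
```

The restated question at the end costs a few dozen tokens and sits in the high-recall region of the window, which is cheap insurance at million-token scale.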
Flat rates and fine print
Pricing has started to reflect these differences. Anthropic eliminated its long-context surcharge on March 13, making a 900K-token request cost the same per-token rate as a 9K one: $5 per million input tokens for Opus 4.6, $3 for Sonnet 4.6, across the full window. Gemini 3.1 Pro offers flat pricing at $2 per million input tokens with no tiered structure.
GPT-5.4 takes a different approach. OpenAI charges $2.50 per million input tokens for the first 272K tokens and doubles to $5.00 beyond that threshold. The premium applies retroactively to the full session once a request crosses the limit. Uniform rates signal that a provider expects production workloads at full context and trusts its model to perform there. Tiered pricing hedges against that expectation, steering developers toward shorter requests.
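The retroactive tier makes the cost curve discontinuous at the threshold. A sketch of the two structures using the rates quoted above; the retroactive reading of GPT-5.4's tier (every token rebilled at the higher rate once a request crosses 272K) follows the article's description, and actual billing details may differ:

```python
def cost_flat(tokens: int, rate_per_m: float) -> float:
    """Flat per-token input pricing (e.g. Opus 4.6 at $5/M, Gemini 3.1 Pro at $2/M)."""
    return tokens * rate_per_m / 1_000_000

def cost_gpt54_input(tokens: int) -> float:
    """Tiered input pricing: $2.50/M up to 272K tokens, with $5.00/M
    applied retroactively to every token once a request exceeds that."""
    if tokens <= 272_000:
        return tokens * 2.50 / 1_000_000
    return tokens * 5.00 / 1_000_000

# A 272K request costs $0.68; adding one more thousand tokens
# more than doubles the bill because the premium is retroactive.
print(cost_gpt54_input(272_000))       # 0.68
print(cost_gpt54_input(273_000))       # 1.365
print(cost_flat(900_000, 5.00))        # 4.5 (Opus 4.6 at 900K)
```

That cliff at 272K is the steering mechanism the paragraph describes: a developer who can split a 300K request into two smaller ones pays roughly half as much.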
After the arms race
The million-token window became a commodity faster than any prior model capability. Recall accuracy across the full window, prefill latency at scale, and pricing structure now separate the leaders from the field. A model’s MRCR score and pricing curve reveal more about its long-context viability than the token count on its spec sheet.



Thank you for sharing the drop-off. That makes so much sense. The annoying thing with the Claude $20 plan is that once you start hitting those large context windows, you can maybe get 4-5 questions before you hit the limits.
That's a major tell for Gemini and something I noticed as well day-to-day.
It will become irrelevant once LLMs introduce automatic conformal compaction and graph-based exact network anchoring.
Like you restart the window but paste a short bootstrap that the same LLM generated as a summary at the end of the previous thread. Perhaps link to the exact thread, with the caveat not to treat it as tokens but as a search opportunity for details via agents.
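A toy sketch of that bootstrap pattern, with hypothetical names throughout (the `summarize` callable stands in for whatever LLM call produces the summary; this is not a real API): the old thread collapses to a short summary plus a link, and the link stays a retrieval pointer instead of being expanded into tokens.

```python
def compact_thread(messages: list[str], thread_url: str, summarize) -> str:
    """Build a short bootstrap message to seed a fresh context window."""
    summary = summarize("\n".join(messages))  # caller supplies the LLM call
    return (
        f"Bootstrap summary of prior thread:\n{summary}\n"
        f"Full transcript (fetch via agent only if details are needed, "
        f"do not paste inline): {thread_url}"
    )

# Example with a stand-in summarizer:
bootstrap = compact_thread(
    ["user: compare flat vs tiered pricing", "assistant: flat wins past 272K"],
    "https://example.com/thread/123",
    summarize=lambda text: "Discussed long-context pricing tradeoffs.",
)
```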
This is how Penrose's Conformal Cyclic Cosmology (CCC) works in the big picture. Use the previous runs' databases (evaporated black holes) to seed a new beginning, in compressed form with optional exact backlinks, without expanding them immediately into the current aeon.
The Cosmic Microwave Background (CMB) we see is the conclusion of the previous aeon's developments and progress.