Cerebras: Giant Chips for Lightning-Fast AI
How OpenAI is betting $10 billion on the first real alternative to GPUs.
If one uses ChatGPT or Claude today, the experience feels roughly the same regardless of the question: type something, wait a moment, get an answer. What’s invisible is the enormous industrial apparatus behind that moment of waiting, and the fact that the hardware optimized for building AI models turns out to be poorly suited for running them at the speed users increasingly expect. That gap is now large enough to be worth $10 billion, and the company collecting the check is Cerebras Systems, a Sunnyvale outfit most people outside the chip industry have never encountered.
Last month, OpenAI announced a multi-year deal with Cerebras to deploy 750 megawatts of computing power through 2028. Then, last Wednesday, the first tangible product of that partnership arrived: GPT-5.3-Codex-Spark, a lightweight coding model designed for real-time development, generating over 1,000 tokens per second. It is the first time OpenAI has run a production model on non-Nvidia silicon. The diplomatic language from OpenAI was carefully chosen: “GPUs remain foundational across our training and inference pipelines.” But the subtext was clear. For certain workloads, Nvidia’s chips are no longer the best tool for the job.
To understand why, one needs to understand a distinction that rarely makes it into mainstream AI coverage: the difference between training and inference.
Training vs. inference
Training is the process of building an AI model. It involves feeding enormous datasets through a neural network, adjusting billions of parameters over weeks or months until the model learns to generate useful outputs. This is computationally brutal work: matrix multiplications at staggering scale, requiring thousands of processors working in concert. Nvidia’s GPUs, originally designed for rendering video game graphics, turned out to be spectacularly good at this kind of parallel number-crunching. The company’s dominance in AI training is well-earned and, for now, largely unchallenged.
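To make that concrete, here is a minimal NumPy sketch of a single training step for one linear layer. The sizes, data, and learning rate are made up purely for illustration; a real training run does the same arithmetic with billions of parameters across thousands of accelerators.

```python
# A toy illustration of why training is "matrix multiplications at staggering
# scale": one gradient step for a single linear layer on a batch of examples.
# Sizes and data are invented for illustration; real models have billions of
# parameters and thousands of layers, spread across thousands of accelerators.
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 256, 1024, 1024          # illustrative sizes
X = rng.standard_normal((batch, d_in))        # a batch of training inputs
Y = rng.standard_normal((batch, d_out))       # target outputs
W = rng.standard_normal((d_in, d_out)) * 0.01 # the parameters being learned

for step in range(10):
    pred = X @ W                              # forward pass: a big matmul
    err = pred - Y
    loss = (err ** 2).mean()
    grad = X.T @ err / batch                  # backward pass: another big matmul
    W -= 1e-3 * grad                          # parameter update
    print(step, round(float(loss), 4))
```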
Inference is different. Inference is what happens when a trained model actually responds to a query, when one asks ChatGPT a question and it generates an answer, token by token. The computational profile is fundamentally unlike training: instead of massive parallel operations on huge batches of data, inference involves sequential generation, producing one token at a time, each dependent on the last. The bottleneck here is not raw computational power but memory speed, specifically how fast the processor can access the model’s parameters for each step in the sequence.
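To see why inference is inherently sequential, here is a schematic decode loop. The `next_token_logits` callable is a hypothetical stand-in for a real model's forward pass, not any actual API; the point is the data dependency between steps.

```python
# A schematic autoregressive decode loop. The key property is the dependency:
# token t+1 cannot be computed until token t exists, and every step has to
# read the model's parameters again. `next_token_logits` is a hypothetical
# stand-in for a real model's forward pass, not an actual library call.
def generate(prompt_tokens, next_token_logits, max_new_tokens, eos_id):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)    # a full pass over the weights
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens.append(next_id)                # the next step depends on this one
        if next_id == eos_id:
            break
    return tokens
```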
This is where the economics get interesting. Nvidia’s GPUs use a type of memory called HBM (High Bandwidth Memory), which stacks DRAM chips vertically to move large amounts of data quickly. HBM is excellent for training workloads, where the system needs to process massive batches of data in parallel. But for inference, where the model needs to fetch parameters rapidly for each individual token, HBM introduces latency that SRAM (Static Random Access Memory) does not. SRAM handles individual data lookups roughly 1,000 times faster than even the latest HBM. The tradeoff is that SRAM takes up vastly more physical space per gigabyte and costs far more to produce.
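A back-of-envelope calculation shows why memory speed dominates: generating one token means streaming essentially the whole set of model parameters through the processor, so memory bandwidth puts a ceiling on token rate for a single stream. The bandwidth figures below are rough published numbers and the model size is illustrative, not a measurement of any real deployment.

```python
# Back-of-envelope: during decode, each generated token requires streaming
# roughly all of the model's weights through the processor, so token rate is
# bandwidth-bound. The bandwidth figures are rough published numbers and the
# model size is illustrative; this is not a benchmark.
params = 20e9                  # a hypothetical 20B-parameter model
bytes_per_param = 2            # 16-bit weights -> ~40 GB, which would roughly
weight_bytes = params * bytes_per_param  # fit in the WSE-3's 44 GB of SRAM

hbm_bw = 3.35e12               # ~3.35 TB/s, roughly an H100's HBM bandwidth
sram_bw = 21e15                # ~21 PB/s, Cerebras's advertised WSE-3 on-chip bandwidth

for name, bw in [("HBM", hbm_bw), ("on-chip SRAM", sram_bw)]:
    seconds_per_token = weight_bytes / bw
    print(f"{name}: ~{1 / seconds_per_token:,.0f} tokens/s upper bound "
          f"for a single stream, ignoring compute and batching")
```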
For decades, this tradeoff made SRAM impractical for anything beyond tiny processor caches. Cerebras decided to change the terms of the equation by making its chip very, very large.
Don’t cut the wafer!
In conventional chip manufacturing, processors are printed onto a silicon wafer (a circular disc about 12 inches across) in a grid pattern, then cut apart into individual chips. Defective chips get discarded; working ones get packaged and sold. The entire semiconductor industry has operated this way for half a century.
Cerebras, founded in 2016 by a team of engineers who had previously built and sold a server company called SeaMicro to AMD, asked a simple question: what if you didn’t cut the wafer? What if the entire thing, all 46,000-plus square millimeters of it, became a single chip?
The idea was not new. Companies had attempted wafer-scale integration as far back as the 1980s, including a high-profile effort by Gene Amdahl that collapsed after burning through hundreds of millions in investment. The problem was always yield: silicon manufacturing inevitably produces defects, and the larger the chip, the more certain it is that some portion will be faulty. When each chip is small, one simply discards the broken ones. When the chip is the entire wafer, every wafer ships with defects baked in.
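The arithmetic behind the yield problem is unforgiving. A standard back-of-envelope yield model makes the point; the defect density used here is an assumed, illustrative value, not a figure from any foundry or from Cerebras.

```python
# A standard back-of-envelope yield model: if defects land randomly on the
# wafer with density D per square centimetre, the chance that a die of area A
# is completely defect-free falls off exponentially (a Poisson model).
# D is an assumed, illustrative value, not a real foundry figure.
import math

D = 0.1                                  # assumed defects per cm^2
gpu_die_cm2 = 8.0                        # roughly the area of a large GPU die
wafer_cm2 = 460.0                        # roughly the area of the WSE-3

for name, area in [("large GPU die", gpu_die_cm2), ("whole wafer", wafer_cm2)]:
    p_perfect = math.exp(-D * area)
    print(f"{name}: {p_perfect:.1%} chance of zero defects")

# Under these assumptions, roughly 45% of big GPU dies come out clean, and the
# rest are simply discarded. A defect-free wafer is essentially impossible
# (around 1e-18 percent), which is why Cerebras routes around bad cores
# instead of hoping for a perfect wafer.
```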
Cerebras solved this by engineering redundancy into the design, building the chip so that defective cores could be bypassed without compromising the whole system. The resulting product, the Wafer-Scale Engine, is roughly the size of a dinner plate. The current third-generation version, the WSE-3, contains 4 trillion transistors, 900,000 AI-optimized cores, and 44 gigabytes of on-chip SRAM. For comparison, an Nvidia H100 GPU contains about 80 billion transistors and roughly 50 megabytes of on-chip SRAM, relying instead on 80 gigabytes of external HBM.
The 44 gigabytes of SRAM is the key number. Because the memory lives directly on the chip, right next to the compute cores, there is almost no latency when the processor needs to fetch model parameters during inference. The data is already there, positioned within fractions of a millimeter of where it needs to be used. On a GPU-based system, inference requires constantly shuttling data between external HBM and the processor; fast as that link is, the delays it introduces accumulate across billions of operations.
The practical result, according to Cerebras, is inference speeds up to 15 times faster than GPU-based systems for certain workloads. OpenAI’s Codex-Spark, the first model to take advantage of this, delivers over 1,000 tokens per second, fast enough to make code generation feel essentially instantaneous.
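Simple arithmetic shows what that speed means for a single coding request; the GPU-serving figure below is an assumed ballpark for comparison, not a benchmark.

```python
# What 1,000 tokens/second means in practice for one code-generation request.
# The GPU-serving figure is an assumed ballpark for comparison, not a benchmark.
response_tokens = 400    # e.g. a medium-sized function plus a short explanation

for name, tok_per_s in [("typical GPU serving (assumed)", 80),
                        ("Codex-Spark on Cerebras", 1000)]:
    print(f"{name}: {response_tokens / tok_per_s:.1f} s for {response_tokens} tokens")

# ~5 s versus ~0.4 s: the difference between watching text stream in and
# having the answer appear more or less at once.
```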
Why now?
Cerebras has been around since 2016, and the WSE has been available in various forms since 2019. So why is the company only now landing a deal of this magnitude?
The answer lies in how AI usage patterns have shifted. For years, training was the headline cost: companies spent hundreds of millions building models, and inference was treated as a relatively cheap afterthought. That ratio has inverted. OpenAI reportedly spent nearly $4 billion on inference alone in 2024, and as products like ChatGPT, Claude, and their competitors scale to hundreds of millions of users, inference costs are becoming the dominant line item. A Google research paper from January put it starkly: memory and interconnect, not compute power, are now the primary bottlenecks for serving large language models.
There is also a product experience argument. The emergence of agentic coding tools that work alongside developers in real time, the kind of workflow Codex-Spark targets, demands response times measured in milliseconds, not seconds. When a developer is iterating on code, even a two-second delay breaks the flow. At 1,000 tokens per second, Cerebras hardware makes that delay vanish. “Just as broadband transformed the internet, real-time inference will transform AI,” Cerebras CEO Andrew Feldman told CNBC when the deal was announced. That’s a CEO-grade soundbite, but the underlying point is reasonable: there are entire categories of AI application that only become viable when responses are near-instantaneous.
The tradeoffs
None of this means Nvidia is in trouble, and the careful language from both OpenAI and Cerebras makes that plain. GPUs remain indispensable for training, where their combination of massive HBM capacity and flexible architecture has no real competitor. At CES in January, Nvidia CEO Jensen Huang acknowledged that SRAM-based designs can be “insanely fast” for some workloads, while arguing that Nvidia’s flexibility across a range of workloads remains more economically sound in shared data center environments.
He has a point. Cerebras’s 44 gigabytes of SRAM, while blazingly fast, is a fraction of the memory available on a modern GPU system. Nvidia’s upcoming Rubin GPUs will ship with 288 gigabytes of HBM4. For very large models, the memory capacity constraint means Cerebras hardware requires pipeline parallelism across multiple wafers, which works but adds complexity. And SRAM’s space requirements mean its cost per gigabyte remains vastly higher than HBM’s.
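For the curious, here is a minimal sketch of what pipeline parallelism means in practice: consecutive layers get assigned to different devices, and activations are handed from stage to stage. Everything here is schematic; the stage count and the `Layer` stand-in are hypothetical, not Cerebras's actual software stack.

```python
# A minimal sketch of pipeline parallelism: when a model's weights exceed one
# device's memory, consecutive layers are assigned to different devices and
# activations are handed from stage to stage. Schematic only; the stage count
# and `Layer` stand-in are hypothetical, not Cerebras's software stack.
from typing import Callable, List

Layer = Callable[[list], list]            # a stand-in for a real layer

def split_into_stages(layers: List[Layer], num_devices: int) -> List[List[Layer]]:
    """Assign consecutive layers to devices as evenly as possible."""
    per_stage = -(-len(layers) // num_devices)   # ceiling division
    return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]

def pipelined_forward(stages: List[List[Layer]], activations: list) -> list:
    # Each stage lives on its own device; the output of one stage becomes the
    # input of the next, which is where the added latency and complexity come from.
    for stage in stages:
        for layer in stage:
            activations = layer(activations)
    return activations

# Example: 8 identity "layers" spread over 3 hypothetical wafers.
layers = [lambda x: x for _ in range(8)]
stages = split_into_stages(layers, num_devices=3)
print([len(s) for s in stages])           # -> [3, 3, 2]
print(pipelined_forward(stages, [1, 2, 3]))
```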
The broader picture is not Cerebras versus Nvidia but rather the beginning of hardware specialization in AI. For years, GPUs served as the universal tool for everything: training, inference, large models, small models, batch processing, real-time interaction. What the Cerebras deal signals is that this era of one-size-fits-all is ending. Different workloads have different performance profiles, and the hardware stack is starting to fragment accordingly.
OpenAI’s broader chip strategy reflects this. The company now has deals with Nvidia, AMD, Broadcom, and Cerebras, each optimized for different parts of the pipeline. GPUs for training. Cerebras for latency-sensitive inference. Custom ASICs, reportedly in development with Broadcom, for whatever comes next. It is less a shift away from Nvidia than an acknowledgment that no single chip architecture can optimize for everything.
What this means for the rest of us
For people who are not chip designers or data center architects, the practical implication is straightforward: the AI tools one uses are about to get noticeably faster, and the reason is not smarter models but faster hardware. Codex-Spark is the first product to benefit, but OpenAI has indicated it plans to bring larger frontier models onto Cerebras hardware as capacity scales up.
The deeper significance is structural. AI’s shift from a training-dominated to an inference-dominated industry changes which companies matter, which technical problems are hardest, and where the money flows. For a decade, the AI hardware story was essentially the Nvidia story. That is still substantially true, but the Cerebras deal suggests the next chapter will be more crowded, more specialized, and defined less by who can build the biggest models than by who can run them at speeds that make new applications possible.
Cerebras raised $1 billion at a $23 billion valuation earlier this month and is reportedly preparing to refile for an IPO. Its flagship chip is the size of a dinner plate and contains 4 trillion transistors. Whether the company becomes a fixture of the AI landscape or a fascinating footnote will depend on whether the inference-speed thesis holds as models continue to evolve. But for now, in an industry where the infrastructure seems to shift by the week, the Cerebras deal is one of the most interesting moves yet.


