2026-03-10

Three weeks ago, a company named Taalas emerged from stealth with a simple proposition: take a finished AI model, encode its weights permanently into silicon, and ship the result as a PCI-Express card. The resulting hardware requires no HBM, no liquid cooling, no advanced packaging, and no CUDA, because the model doesn’t run on the chip. The model is the chip.

The company’s first product, the HC1, hardwires Meta’s Llama 3.1 8B into an 815 mm² die fabricated on TSMC’s 6nm process. Taalas claims that it generates roughly 17,000 tokens per second per user, a 74x improvement over Nvidia’s H200 and an 8–9x improvement over Cerebras, the previous speed leader for that model. The card draws about 200 watts. Ten of them fit in a standard air-cooled two-socket x86 server consuming 2,500 watts total, a figure that would barely register as a rounding error in the megawatt GPU clusters that currently define the inference landscape.
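The claimed figures can be cross-checked with a few lines of arithmetic. The sketch below is a back-of-envelope calculation using only the numbers Taalas reports; the variable names are illustrative, and the split between card power and host power is inferred from the stated totals rather than published by the company.

```python
# Back-of-envelope check of Taalas' self-reported HC1 figures.
# All inputs are the article's claimed numbers; nothing here is measured.

HC1_TOKENS_PER_SEC = 17_000      # claimed, per user
SPEEDUP_VS_H200 = 74             # claimed ratio
CARD_WATTS = 200                 # claimed, approximate
CARDS_PER_SERVER = 10
SERVER_WATTS = 2_500             # claimed total for the full server

# Per-user H200 throughput implied by the 74x claim.
implied_h200_tokens_per_sec = HC1_TOKENS_PER_SEC / SPEEDUP_VS_H200

# Power drawn by the cards alone, and what is left for CPUs, fans, etc.
cards_watts = CARD_WATTS * CARDS_PER_SERVER
host_overhead_watts = SERVER_WATTS - cards_watts

# Energy per generated token for one card serving one user stream.
joules_per_token = CARD_WATTS / HC1_TOKENS_PER_SEC

print(f"implied H200 rate: {implied_h200_tokens_per_sec:.0f} tokens/s per user")
print(f"cards: {cards_watts} W, host overhead: {host_overhead_watts} W")
print(f"energy per token: {joules_per_token * 1000:.1f} mJ")
```

Under these inputs the implied H200 rate is roughly 230 tokens per second per user, and the ten cards account for 2,000 of the server’s 2,500 watts. The energy-per-token figure assumes a single user stream per card, so it is best read as a pessimistic estimate if a card can serve several users at once.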

These benchmarks are self-reported; independent testing has not yet been published. The architectural novelty deserves attention regardless of whether the HC1’s specific numbers survive external evaluation.

Eliminating the Wall

Modern AI accelerators devote roughly 90% of their energy to moving data rather than processing it. The energy and time spent shuttling model weights between memory and processor is the core constraint on the performance of inference hardware, a limit known as the “memory wall.” HBM stacked DRAM exists to mitigate this bottleneck, but it introduces its own costs: expensive packaging, power-hungry I/O, supply constraints, and the liquid cooling infrastructure required to manage the resulting thermal output. Each generation of GPU and accelerator design represents an increasingly elaborate workaround for a problem that Taalas claims to have eliminated by refusing to accept its premise.
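To see why weight movement dominates, consider a rough ceiling on single-stream decoding speed: each generated token requires reading every model weight once, so the rate cannot exceed memory bandwidth divided by the size of the weights. The sketch below assumes figures that are not in this article: about 8 billion parameters, and the 4.8 TB/s of HBM bandwidth that Nvidia lists for the H200. It ignores KV-cache traffic, batching, and compute time, so it is an illustrative bound, not a benchmark.

```python
def max_tokens_per_sec(params: float, bits_per_param: float,
                       bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on batch-1 decode rate when weight reads dominate."""
    weight_bytes = params * bits_per_param / 8
    return bandwidth_bytes_per_sec / weight_bytes

# Assumed inputs (not from the article).
PARAMS = 8e9              # approximate parameter count of an 8B model
H200_BANDWIDTH = 4.8e12   # bytes per second, from Nvidia's published H200 spec

for bits in (16, 8, 4):
    bound = max_tokens_per_sec(PARAMS, bits, H200_BANDWIDTH)
    print(f"{bits:>2}-bit weights: at most ~{bound:,.0f} tokens/s per stream")
```

At 16-bit precision the ceiling is about 300 tokens per second for a single stream, the same order of magnitude as the H200 rate implied by Taalas’ 74x claim (17,000 / 74 ≈ 230). Batching raises aggregate throughput by amortizing each weight read across many users, but it does not lift the per-user rate past this bound.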

Readers who followed our coverage of Cerebras will recognize the memory wall as the same constraint that wafer-scale architecture addresses, but the two companies attack it from opposite ends of the design space. Cerebras solved the problem by making the chip enormous: a wafer-sized die with 44 gigabytes of on-chip SRAM, fast enough to keep model weights within fractions of a millimeter of the compute cores that need them. The weights are still loaded as software onto a general-purpose chip capable of running any model that fits, including training workloads. Taalas removes that loading step entirely. Its weights are not stored in fast memory near the compute; they are the compute, encoded at the transistor level with no separation between storage and processing.

Cerebras delivers roughly 1,900 tokens per second per user on Llama 3.1 8B, whereas Taalas claims approximately 17,000. Cerebras dramatically reduced the memory-compute distance, while Taalas claims to have collapsed it to zero. The trade-off is correspondingly extreme: Cerebras can run any compatible model and retrain on new data, while the HC1 can execute only the single model etched into its silicon, albeit at blinding speed. The intended workload will determine which approach is more practical.
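What those per-user rates mean in practice is easier to judge as latency. The sketch below converts the two throughput figures quoted above into time per token and time per response; the 1,000-token response length is a hypothetical chosen for illustration, and prompt processing and network time are ignored.

```python
RESPONSE_TOKENS = 1_000  # hypothetical response length

# Per-user decode rates in tokens per second, as quoted in the article.
rates = {
    "Cerebras (reported)": 1_900,
    "Taalas HC1 (claimed)": 17_000,
}

for name, tokens_per_sec in rates.items():
    ms_per_token = 1_000 / tokens_per_sec
    seconds_per_response = RESPONSE_TOKENS / tokens_per_sec
    print(f"{name}: {ms_per_token:.3f} ms/token, "
          f"{seconds_per_response:.2f} s per {RESPONSE_TOKENS}-token response")

speedup = rates["Taalas HC1 (claimed)"] / rates["Cerebras (reported)"]
print(f"ratio: {speedup:.1f}x")
```

The ratio works out to just under 9x, consistent with the 8–9x range Taalas cites. Both systems would already return a response of this length in well under a second, so the gap matters most for workloads that chain many generations together.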

Taalas’ architecture, which the company calls Hard Coded Inference, pairs a mask ROM recall fabric (where model weights are permanently etched) with SRAM (which handles the KV cache, LoRA adapters for fine-tuning, and a configurable context window). CEO Ljubisa Bajic, speaking to Timothy Prickett Morgan at The Next Platform, described the key innovation as the ability to store four bits and perform the associated multiply operation within a single transistor. The entire transistor-level design was done from scratch with no off-the-shelf IP blocks, in what Bajic calls a throwback to 1970s hand-layout methodology. Just one Taalas engineer works part-time on software.

This simplification extends to the manufacturing process. When the underlying model changes, only two of the chip’s roughly 100 metal interconnect layers need to be modified while the silicon base remains unchanged. Working with TSMC, Taalas claims that it can begin shipping deployable inference cards approximately two months after receiving a new model’s weights. The company describes this rhythm as a “seasonal hardware cycle” analogous to fashion rather than the multi-year cadence of traditional chip development.

The Flexibility Question

Locking a model into silicon sounds like a fatal inflexibility in a field where new models arrive weekly. Taalas’ answer rests on the distinction between the pace of model release and the pace of model deployment. Research labs constantly publish new architectures, but production workloads tend to settle on specific model versions for months or years. Organizations that have fine-tuned a model, built applications around its behavior, and validated its outputs against their requirements do not switch on every release cycle. Morgan noted in his analysis that users complained when OpenAI moved them from GPT-4.5 to GPT-5 because they preferred what they had. A two-month turnaround, if it holds at scale, could plausibly match the deployment cadence of organizations that update their production models quarterly or semiannually.

The HC1 carries limitations beyond model lock-in. Its proprietary 3-bit base format (combined with 6-bit parameters) was adopted because standardized low-precision formats were unavailable when the design began, and the aggressive quantization introduces quality degradation relative to the same model running at higher precision on GPUs. Heise’s evaluation flagged this as a concern for tasks more demanding than casual conversation. The second-generation HC2 chip, expected in late 2026 or early 2027, adopts standard 4-bit floating-point formats to address this. Finally, the Llama 3.1 8B model itself is approaching two years old; a quantized version runs (slowly) on a Raspberry Pi 5. The real proof of concept will arrive when Taalas can ship frontier-scale models across multi-chip configurations using pipeline parallelism, which is a categorically more difficult engineering challenge involving reticle limits and interconnect bottlenecks that a mere 8B-parameter model on a single large die simply does not encounter.
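The link between bit width and quality loss can be illustrated with a generic example. The sketch below applies plain symmetric uniform quantization to random stand-in values; it is not Taalas’ proprietary format, whose details are not public, and the weight distribution, seed, and function names are all assumptions made for illustration.

```python
import random

def quantize(values, bits):
    """Symmetric uniform quantization onto 2**bits - 1 evenly spaced levels."""
    levels = 2 ** (bits - 1) - 1          # e.g. 3 for 3-bit, 31 for 6-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]

def rms_error(original, approx):
    """Root-mean-square difference between two equal-length sequences."""
    squared = sum((a - b) ** 2 for a, b in zip(original, approx))
    return (squared / len(original)) ** 0.5

random.seed(0)
# Stand-in "weights": Gaussian noise, not values from any real model.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

errors = {bits: rms_error(weights, quantize(weights, bits))
          for bits in (3, 4, 6, 8)}
for bits, err in errors.items():
    print(f"{bits}-bit: RMS error {err:.5f}")
```

Each bit removed roughly doubles the rounding error on the stored values. Production schemes reduce that error with per-group scales and non-uniform levels, and error on the weights is only a proxy for output quality, so this shows the direction of the trade-off rather than its size on the HC1.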

The Competitive Landscape

Taalas enters a market that has shifted dramatically in recent months. Nvidia’s acquisition of Groq’s intellectual property in December for approximately $20 billion, along with the absorption of much of Groq’s design team, validated the market for specialized inference hardware while simultaneously eliminating one of Taalas’ closest competitors as an independent entity. SambaNova has drawn acquisition interest from Intel, and SoftBank absorbed Graphcore for $600 million in mid-2024. The category of independent inference-chip startups is rapidly consolidating.

Meanwhile, the scale of investment in general-purpose inference continues to grow. The AMD-Meta agreement announced on February 24 commits up to 6 gigawatts of custom MI450-based GPU infrastructure in a deal estimated at up to $100 billion over several years. Meta struck a separate deal with Nvidia the week prior for millions of Blackwell and Rubin GPUs. These organizations are spending nation-state levels of capital on general-purpose accelerators, and the gravitational pull of that investment (software tooling, talent pipelines, supply chain priority) creates ecosystem effects that specialized startups cannot replicate.

Taalas’ innovation brings a fresh perspective on the entire paradigm of AI hardware. The company has coined the term “programmability tax” to describe the overhead of flexibility that general-purpose accelerators are required to carry and that inference workloads by their very definition have no need for. Training demands the ability to modify weights, explore architectures, and iterate on gradients, whereas inference merely repeats the same computation billions of times after the model has been finalized. The cost of unnecessary flexibility is increasingly compounded as the industry shifts toward inference-dominated workloads, and the economics suggest that trend is well underway by now. At Taalas’ claimed cost of 0.75 cents per million tokens, the company would undercut even the most aggressive mainstream pricing. Nvidia’s Blackwell platform achieves roughly two cents per million tokens on comparable open-source models, and Groq’s LPU-based inference prices Llama 3.1 8B at six cents.
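To put the per-token figures in budget terms, the sketch below scales the three quoted prices to a hypothetical workload of one billion generated tokens per day. The workload size and 30-day month are assumptions for illustration, and the quoted figures mix a vendor’s claimed cost, a platform cost, and a list price, so the comparison is only indicative.

```python
# US cents per million tokens, as quoted in the article.
CENTS_PER_MILLION = {
    "Taalas HC1 (claimed)": 0.75,
    "Nvidia Blackwell": 2.0,
    "Groq LPU": 6.0,
}

DAILY_TOKENS = 1_000_000_000   # hypothetical workload
DAYS = 30

baseline = CENTS_PER_MILLION["Taalas HC1 (claimed)"]
monthly_dollars = {}
for name, cents in CENTS_PER_MILLION.items():
    millions_of_tokens = DAILY_TOKENS / 1_000_000 * DAYS
    monthly_dollars[name] = cents / 100 * millions_of_tokens
    print(f"{name}: ${monthly_dollars[name]:,.2f}/month, "
          f"{cents / baseline:.1f}x Taalas' claimed figure")
```

At this volume the monthly bills come to $225, $600, and $1,800 respectively. The absolute sums are small for an 8B model; the ratios, roughly 2.7x and 8x, are what would matter if they carried over to larger models.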

If that cost gap proves real and scalable, the consequences would extend beyond datacenter procurement. The inference market’s dependence on GPUs and HBM is a primary driver of demand and prices for both. Specialized inference silicon handling a growing share of production workloads would allow GPU demand to be increasingly concentrated on training, and the HBM supply crunch that has constrained the industry for two years could begin to ease. Nvidia’s Groq acquisition positions the company to offer specialized inference alongside its general-purpose platform, hedging against this exact bifurcation.

The Team and the Budget

The pedigree of the team behind Taalas lends more weight to these claims than a typical stealth-mode announcement might otherwise warrant. Ljubisa Bajic founded Tenstorrent before departing when Jim Keller took the CEO role in late 2022; prior to that he served as architect and senior manager of hybrid CPU-GPU designs at AMD, with a stint as senior architect at Nvidia. COO Lejla Bajic (his wife) and CTO Drago Ignjatovic have each worked alongside Bajic since AMD, following him to Tenstorrent and again to Taalas. VP of Products Paresh Kharya spent three years as senior director of product management for Nvidia’s datacenter business before moving to Google Cloud to manage GPU and TPU infrastructure. The full team numbers 25 members drawn from AMD, Apple, Google, Nvidia, and Tenstorrent. Three funding rounds totaling over $219 million (led by Quiet Capital, with participation from Fidelity, Sumitomo, and veteran semiconductor investor Pierre Lamond) have left the company with more than $170 million unspent, having reached launch on $30 million of R&D expenditure.

That $30 million figure itself stands out in the field of modern semiconductor development. It suggests that a small team of experienced engineers, unconstrained by the organizational overhead and legacy commitments of a large company, can innovate faster while spending less than the prevailing wisdom about chip development costs would predict.

ENIAC and After

Bajic compares current GPU-based AI infrastructure to ENIAC: a room-filling, power-devouring prototype that proved the concept of electronic computation while simultaneously representing everything that would need to change before computing could become ubiquitous. While entertaining, the analogy is probably premature. ENIAC’s successors scaled for decades along a well-understood physical trajectory, while Taalas is proposing a parallel track whose scaling properties are unproven. The company has demonstrated that an 8-billion-parameter model can be hardwired onto a single die with extraordinary throughput. It has yet to prove that a 70-billion or 200-billion-parameter model can be distributed across multiple chips with comparable efficiency, that the two-month manufacturing turnaround can survive production volume, or that output quality at aggressive quantization levels can satisfy demanding enterprise workloads.

Taalas has made the case that the memory wall is not a law of physics but rather a consequence of using general-purpose processors in place of dedicated inference hardware. Whether Taalas captures the new market this insight will create, or merely forces the incumbents to absorb its lesson, a plausible path now exists to eliminating 90% of the energy demands of everyday AI activity.