Understanding Cerebras: Computing Power Drives AI Thinking, Memory Empowers Agents to Act

Author: Ben Thompson

Computing power lets AI learn to think; memory lets agents learn to work.

This week, Cerebras went public, and Ben Thompson’s latest article explains why it matters: as AI evolves from “chatting” to “autonomous task execution,” the bottleneck in chip architecture changes with it.

When you chat with Doubao, you are waiting on speed; when Kimi Claw runs a task for you for five hours, it doesn’t care whether an answer comes 3 seconds faster or 30 seconds slower. What matters is whether it can remember the context and keep working. With every step, working memory (the KV cache) grows. GPUs are designed for “waiting in front of the screen”: memory bandwidth sits largely idle during prefill, compute sits largely idle during decode, so half the time something is just waiting.

The real bottleneck isn’t how fast you can compute, but how much you can store and how quickly you can read it out. More fundamentally, long-running agents turn KV Cache from a temporary buffer into persistent working memory. Whoever can make this memory last longer, reuse more efficiently, and cost less, holds the key to the agent economy.
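
To make the “working memory” framing concrete, here is a rough sketch of how the KV cache grows with context length. The model dimensions (layers, heads, head size, precision) are illustrative assumptions, not the specs of any particular model.

```python
# Back-of-envelope sketch of KV cache growth. All model dimensions below are
# illustrative assumptions, not figures for any specific model.

def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size for one sequence: a K and a V tensor per layer, per token."""
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem

for ctx in (8_000, 128_000, 1_000_000):
    print(f"{ctx:>9,} tokens of context -> ~{kv_cache_bytes(ctx) / 1e9:.0f} GB of KV cache")
```

Even under these modest assumptions, an agent that accumulates hundreds of thousands of tokens of context is carrying tens to hundreds of gigabytes of working memory, which is exactly the resource the rest of this piece is about.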

This is far more important than benchmark scores.

As for the timing: taking a chip company public in May 2026 is almost ideal. Reuters reported over the weekend:

Two insiders told Reuters on Sunday that, driven by sustained market demand for this AI chip company’s stock, Cerebras Systems is likely to raise its IPO size and price on Monday. The sources said the company is considering raising the offering price range from the originally planned $115–125 per share to $150–160, and increasing the number of shares from 28 million to 30 million; both sources requested anonymity as the information has not been publicly disclosed.

The ongoing rise in semiconductor stocks is fundamentally driven by AI, especially as the market comes to realize that intelligent agents will consume vast amounts of compute. But Cerebras points to a broader proposition: until now, the AI compute narrative has been almost entirely about GPUs, which mostly means Nvidia; going forward, the landscape will become increasingly heterogeneous.

GPU Era

The story of how GPUs became central to AI is well known. Briefly:

  • Just as rendering pixels on a screen is a parallel process—more processing units mean faster graphics rendering—AI computation is similar: the number of processing units directly determines speed.

  • Nvidia has capitalized on this “dual purpose”: making graphics processors programmable (with CUDA), and building a complete software ecosystem that brings this programmability to all developers.

  • The fundamental difference between graphics and AI lies in problem scale—models are far larger than textures in video games. This has led to two linked evolutions: a dramatic expansion of high-bandwidth memory (HBM) capacity on a single GPU; and major breakthroughs in chip-to-chip interconnects, enabling multiple chips to work as an addressable system. Nvidia leads in both.

  • The primary use case for GPUs has always been training, which is especially demanding on the third point. Each training step is highly parallel internally, but serial between steps: before moving to the next, each GPU must synchronize its results with all others. That’s why a trillion-parameter model must fit into tens of thousands of GPUs’ total memory—and these GPUs must communicate as a single machine. Nvidia has dominated both challenges: securing HBM supply ahead of the industry, and investing heavily in networking tech.
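
As a rough illustration of the memory side of that last point (my own arithmetic, using a commonly cited per-parameter rule of thumb rather than any specific training setup):

```python
# Why a trillion-parameter model has to be spread across many GPUs acting as one
# machine. The 16 bytes/parameter figure is a commonly cited rule of thumb for
# mixed-precision training (weights + gradients + optimizer state), used here as
# an assumption.

PARAMS = 1e12                   # one trillion parameters
BYTES_PER_PARAM_TRAINING = 16   # weights + gradients + optimizer state
GPU_MEMORY_GB = 80              # H100-class HBM per GPU

state_tb = PARAMS * BYTES_PER_PARAM_TRAINING / 1e12
gpus_for_state = state_tb * 1e3 / GPU_MEMORY_GB
print(f"~{state_tb:.0f} TB of training state -> ~{gpus_for_state:.0f} GPUs just to hold it")
# Activations, long sequences, and data-parallel replicas for throughput are what
# push the real count into the tens of thousands.
```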

Of course, training isn’t the only AI workload; another is inference. Inference involves three main parts:

1. Prefill: Encoding everything the large language model (LLM) needs to understand (the prompt and its accumulated context) into the model’s internal state; highly parallel and compute-intensive.

2. Decode Part 1: Reading the KV Cache—which stores context, including outputs from the prefill phase—for attention calculations. This is a bandwidth-critical serial step, with variable and growing memory demands.

3. Decode Part 2: Forward computation on model weights; also bandwidth-critical and serial, with memory needs determined by model size.

These two decoding steps alternate across model layers (interleaved rather than purely sequential), meaning decoding is serial and memory bandwidth-bound. Each token generated requires reading from two different memory pools: the KV cache (growing with each token, storing context) and the model weights. Both must be fully read to produce a single output token.
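
A simple way to see why this makes decode bandwidth-bound is to treat the bytes that must be streamed per token as the binding constraint. This is my own simplification with assumed numbers, not a measurement:

```python
# Rough model of single-stream decode speed: every output token requires reading
# the model weights plus the entire KV cache, so memory bandwidth sets a ceiling.

def decode_tokens_per_sec_ceiling(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Upper bound on tokens/s if memory traffic were the only constraint."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_cache_bytes)

# Assumed example: a 70B-parameter model at 2 bytes/parameter, 40 GB of KV cache,
# on an H100-class GPU with ~3.35 TB/s of HBM bandwidth.
ceiling = decode_tokens_per_sec_ceiling(70e9 * 2, 40e9, 3.35e12)
print(f"~{ceiling:.0f} tokens/s ceiling")   # roughly 19 tokens/s, regardless of FLOPs
```

Note that the GPU’s arithmetic throughput never appears in this bound; batching many requests together is how GPUs recover efficiency, at the cost of per-request latency.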

GPUs handle these needs well: high compute for prefill, ample HBM for the KV cache and weights, and memory pooling via interconnects when a single GPU’s memory is insufficient. In other words, architectures suited for training are also suitable for inference, as evidenced by Anthropic’s deal with xAI:

“We have signed an agreement to use all compute capacity at xAI’s Colossus 1 data center. This gives us over 300 MW of new capacity (more than 220k Nvidia GPUs). It will directly enhance the service for Claude Pro and Claude Max users.”

xAI retains Colossus 2, presumably for future model training alongside current inference. It can do both in the same data center partly because xAI’s models are currently modest in size, but more critically because both training and inference can run on GPUs. The GPUs Anthropic signed up for belonged to Colossus 1 and were originally used for training; that flexibility is a huge advantage.

Decoding Cerebras

Cerebras’ approach is entirely different. Although silicon wafers are 300mm in diameter, the “reticle limit”—the maximum area that lithography tools can expose on a wafer—is about 26mm x 33mm. This is the effective size limit for a chip; exceeding it requires connecting multiple chips via an interposer layer, as Nvidia does with B200. Cerebras invented a method to route across “scribe lines” (the boundaries between exposure fields), making an entire wafer into a single chip—no slow inter-chip links needed.

The result is a chip with staggering compute, a large pool of on-chip SRAM, and astonishing access speeds. For comparison: Cerebras’ latest WSE-3 has 44 GB of on-chip SRAM at 21 PB/s of bandwidth; Nvidia’s H100 has 80 GB of HBM at 3.35 TB/s. In other words, the WSE-3 has a bit more than half the H100’s memory capacity, but roughly 6,000 times the bandwidth.
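
For what it’s worth, the quoted comparison checks out using the vendor-published peak figures:

```python
# Quick sanity check on the comparison above, using the quoted peak figures.
wse3_sram_gb, wse3_bw = 44, 21e15       # 44 GB SRAM, 21 PB/s
h100_hbm_gb, h100_bw = 80, 3.35e12      # 80 GB HBM, 3.35 TB/s

print(f"capacity ratio:  {wse3_sram_gb / h100_hbm_gb:.2f}x")   # 0.55x
print(f"bandwidth ratio: ~{wse3_bw / h100_bw:,.0f}x")          # ~6,269x
```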

The reason for comparing WSE-3 to H100 is that H100 is the most widely used inference chip today, and inference is clearly Cerebras’ strength. You can train with Cerebras, but its inter-chip networking story isn’t compelling, meaning most of its compute and on-chip memory are idle; the real value is its ability to generate token streams far faster than GPUs.

However, the same limitation that constrains training also applies to inference: as long as everything fits in on-chip memory, Cerebras’ speed is unmatched; once memory demands exceed that limit (larger models or longer KV caches), Cerebras becomes impractical, especially once cost is considered. Treating a whole wafer as a single chip brings serious yield challenges, which raise costs significantly.
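
A quick illustration of where that limit bites, with assumed model sizes and the same illustrative KV-cache arithmetic as above:

```python
# Illustrative check of what fits in 44 GB of on-chip SRAM. Model sizes, context
# lengths, and KV-cache dimensions are assumptions chosen to show the crossover.

SRAM_GB = 44

def footprint_gb(params, ctx_tokens, layers, kv_heads=8, head_dim=128, bytes_per_elem=2):
    weights = params * bytes_per_elem
    kv = ctx_tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem
    return (weights + kv) / 1e9

for params, ctx, layers in [(8e9, 32_000, 32), (70e9, 128_000, 80)]:
    gb = footprint_gb(params, ctx, layers)
    print(f"{params/1e9:.0f}B params, {ctx:>7,} tokens -> ~{gb:.0f} GB "
          f"({'fits' if gb <= SRAM_GB else 'does not fit'} in {SRAM_GB} GB)")
```

Small models with modest context fly; the moment the weights or the cache spill past the wafer, the advantage evaporates, or you pay for more wafers.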

Meanwhile, I do believe Cerebras-style chips will find a market: the company emphasizes speed as a practical benefit for programming, since inference involves generating large numbers of tokens, and vastly increasing tokens per second amounts to faster thinking. But I see this as a temporary use case, for reasons I will explain shortly. What really matters is how long humans have to wait for answers; as AI wearables and similar products become common, interaction speed (especially for voice, which depends on token generation speed) will have a tangible impact on user experience.

Agentic Inference

I previously identified three inflection points in the era of LLMs:

1. ChatGPT proved token prediction is practical.

2. o1 introduced reasoning—more tokens mean better answers.

3. Opus 4.5 and Claude Code introduced the first practical agents, capable of completing tasks by combining reasoning models with scaffolding for tool use, verification of their own work, and so on.

While all of these fall under “inference,” I believe the boundary between providing answers (what I call “answer inference”) and executing tasks (what I call “agentic inference”) is becoming clearer. Cerebras’ target market is answer inference; in the long run, I believe agentic inference architectures will diverge sharply from both the Cerebras and GPU paths.

I previously mentioned that fast inference for programming is a temporary use case. Currently, human involvement remains essential: defining tasks, reviewing code, submitting pull requests. But it’s easy to foresee a future where all this is fully automated. This will broadly apply to agent work: the true power of agents isn’t assisting humans but working independently without human intervention.

By extension, solving agentic inference will look very different from solving answer inference. Answer inference values token speed; agentic inference values memory. Agents need context, state, and history. Some of this lives in active KV caches, some in host memory or on SSDs, and more in databases, logs, embeddings, and object storage. The key point: agentic inference will no longer be just a GPU answering a question, but a complex, layered memory system built around models.
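
Here is a minimal sketch of what such a layered memory might look like in code. It is my own illustration of the general pattern (hot KV segments on the accelerator, warm segments in host DRAM or on SSD, cold history in persistent storage), not any vendor’s design:

```python
from collections import OrderedDict

class TieredAgentMemory:
    """Toy model of an agent's memory hierarchy: hot KV cache, warm host memory,
    cold persistent storage. Capacities are counted in 'segments' for illustration."""

    def __init__(self, hot_capacity=4, warm_capacity=64):
        self.hot = OrderedDict()    # on-accelerator KV cache segments (fastest, smallest)
        self.warm = OrderedDict()   # host DRAM / SSD segments (slower, larger)
        self.cold = {}              # databases, logs, object storage (slowest, unbounded)
        self.hot_capacity = hot_capacity
        self.warm_capacity = warm_capacity

    def write(self, key, segment):
        """New context always lands hot; older segments cascade down the tiers."""
        self.hot[key] = segment
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            k, v = self.hot.popitem(last=False)    # evict least recently used
            self.warm[k] = v
        while len(self.warm) > self.warm_capacity:
            k, v = self.warm.popitem(last=False)
            self.cold[k] = v

    def read(self, key):
        """A read promotes the segment back to the hot tier, wherever it lives."""
        for tier in (self.hot, self.warm, self.cold):
            if key in tier:
                segment = tier.pop(key)
                self.write(key, segment)
                return segment
        raise KeyError(key)
```

The point of the sketch is the shape, not the policy: most of an agent’s state spends most of its time in the cheap tiers, and the expensive tier only has to hold what the model is actively attending to.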

A crucial implication: a memory hierarchy dedicated to agents implies a trade-off between speed and capacity. Moreover, if no human is in the loop in real time, low latency becomes less critical. An agent running overnight doesn’t care about user-perceived delay; it only cares about completing the task. If new memory approaches make complex tasks feasible, some latency can be tolerated.
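
Some rough, round-number arithmetic (my own, with assumed order-of-magnitude bandwidths) shows why that tolerance matters:

```python
# Time to stream a 1 TB working set from different memory tiers, versus an agent's
# overnight time budget. Bandwidth figures are rough order-of-magnitude assumptions.

WORKING_SET_TB = 1.0
TIERS_TB_PER_S = {
    "HBM on a GPU":   3.0,
    "DDR5 host DRAM": 0.3,
    "NVMe SSD":       0.01,
}

for name, bw in TIERS_TB_PER_S.items():
    print(f"{name:16s} ~{WORKING_SET_TB / bw:7.1f} s per full pass over 1 TB")

# A user waiting on a chat reply feels every one of those seconds; an agent with an
# eight-hour (~28,800 s) budget barely notices even the slowest tier.
```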

Meanwhile, if latency ceases to be the primary concern, the pursuit of maximum compute and high-bandwidth memory (HBM) becomes less relevant: if delay isn’t a hard constraint, slower, cheaper memory (like traditional DRAM) becomes more attractive. If the entire system mainly waits on memory responses, chips don’t need to use the most advanced process nodes. This could trigger a fundamental architectural shift, but it doesn’t mean current architectures will vanish:

  • Training will remain important; Nvidia’s current architecture (high compute, high bandwidth, fast networking) will continue to dominate.

  • Answer inference will be a significant but relatively small market, where ultra-fast solutions (like Cerebras or Groq) are highly valuable.

  • Agentic inference will gradually decouple from GPUs. The shortcomings of GPUs (memory bandwidth wasted during prefill, compute wasted during decode) will become more apparent. Instead, systems dominated by high-capacity, low-cost memory, with “good enough” compute, will emerge. In fact, CPU-based tool invocation speeds might surpass GPU speeds.

Furthermore, these categories differ in scale and importance. Agentic inference will be the largest future market because it is not limited by the number of humans or the hours they have. Today’s intelligent agents are just fancy answer inference; true agentic inference will be computers executing instructions from other computers, with market size driven by the expansion of compute rather than by population growth.

Implications of Agentic Inference for Compute

So far, “scaling with compute” has often implied Nvidia’s advantage. But Nvidia’s edge has largely depended on latency: its chips are extremely fast, but to keep compute from idling, huge investments in HBM and networking are needed. If latency ceases to be a core constraint, Nvidia’s premium might no longer be justified.

Nvidia has recognized this shift: it launched Dynamo, an inference framework that decomposes inference into separate stages, and has introduced products like standalone memory and CPU racks to enable larger KV caches and faster tool invocation while keeping expensive GPUs busy. But ultimately, large cloud providers may seek cheaper, simpler alternatives for agentic inference workloads that are not GPU-limited.
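
Schematically (and this is a sketch of the general disaggregation idea, not Dynamo’s actual API), separating the two phases looks like this:

```python
# Schematic sketch of disaggregated inference: prefill runs on a compute-heavy pool,
# the resulting KV cache is handed off, and decode runs on a memory/bandwidth-heavy
# pool. Names and structure here are illustrative, not any framework's API.

from dataclasses import dataclass

@dataclass
class KVCacheHandle:
    request_id: str
    location: str       # e.g. "prefill-pool", "host-dram", "cxl-memory"
    num_tokens: int

def prefill(request_id: str, prompt_tokens: list) -> KVCacheHandle:
    # Compute-bound: process the whole prompt in parallel, once.
    # (Placeholder for the actual model forward pass.)
    return KVCacheHandle(request_id, location="prefill-pool", num_tokens=len(prompt_tokens))

def decode(handle: KVCacheHandle, max_new_tokens: int) -> list:
    # Bandwidth-bound: one token at a time, re-reading weights and the growing KV cache.
    outputs = []
    for _ in range(max_new_tokens):
        outputs.append(0)           # placeholder for the model's next token
        handle.num_tokens += 1      # the cache grows with every generated token
    return outputs

handle = prefill("req-1", prompt_tokens=list(range(1_000)))
completion = decode(handle, max_new_tokens=16)
```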

Meanwhile, China, despite lacking top-tier compute, has everything needed for agentic inference: fast enough GPUs, CPUs, DRAM, and storage. The challenge remains in training compute; additionally, answer inference might be more critical for national security (especially military applications).

Another interesting perspective is space: slower chips could make “space data centers” more feasible. First, if memory can be externalized, chips can be simpler and run cooler. Second, older process nodes, with larger physical sizes, are more resistant to space radiation. Third, older nodes consume less power and generate less heat. Fourth, non-leading-edge processes are more reliable—crucial for unmanned satellites.

Nvidia CEO Jensen Huang often says “Moore’s Law is dead”; he means future speedups will rely on system-level innovation. But when agents can operate independently of humans, perhaps the deepest insight is: Moore’s Law no longer matters. The way we gain more compute is by realizing that our current compute is already “good enough.”
