a16z: Large-scale model deployment leads to forgetfulness. Can "continual learning" break this vicious cycle?
Author: Malika Aubakirova, Matt Bornstein
Compilation: Deep Tide TechFlow
Deep Tide Guide: Large language models are “frozen” immediately after training; once deployed, they can only be maintained through external patches like context windows, RAG, and others. Essentially, they are like amnesiac patients in “Memento”—they can retrieve information but cannot truly learn new things. Two partners at a16z systematically review the frontier research direction of “continual learning,” dissecting this technical track that could redefine the ceiling of AI capabilities through three pathways: context, modules, and weight updates.
In Christopher Nolan’s “Memento,” the protagonist Leonard Shelby lives in a shattered present. A brain injury has left him with anterograde amnesia: he cannot form new memories. Every few minutes his world resets, trapping him in an eternal “now,” forgetting what just happened and never knowing what comes next. To survive, he tattoos notes on his body and takes Polaroids, relying on these external tools to stand in for the memory his brain can no longer provide.
Large language models also live in a similar eternal present. After training, vast knowledge is frozen in their parameters; they cannot form new memories or update their parameters based on new experiences. To compensate, we build scaffolds: chat history acts as short-term notes, retrieval systems serve as external notebooks, and prompt engineering is like tattoos on the body. But the model itself has never truly internalized this new information.
A growing number of researchers believe this is insufficient. In-context learning (ICL) can solve problems only if the answers (or fragments of them) already exist somewhere in the world. But for questions requiring genuine discovery (like new mathematical proofs), adversarial scenarios (such as security attacks), or knowledge that is too implicit or inexpressible in language, there are strong reasons to believe that models need a way to write new knowledge and experience directly into their parameters after deployment.
In-context learning is temporary. True learning requires compression. Until we let models compress continually, we may all stay stuck in the eternal present of “Memento.” Conversely, if we can train models to learn their own memory architectures instead of relying on external tools, we could unlock a whole new dimension of scaling.
This research area is called continual learning. The concept isn’t new (see McCloskey and Cohen, 1989), but we believe it is one of the most important directions in current AI research. The explosive growth in model capabilities over the past two or three years has made the gap between “what the model knows” and “what it can know” increasingly apparent. The purpose of this article is to share what we’ve learned from top researchers in the field, clarify different paths of continual learning, and promote its development within the entrepreneurial ecosystem.
Note: The formation of this article benefited from in-depth exchanges with a group of excellent researchers, PhD students, and entrepreneurs, who generously shared their work and insights in continual learning. Their perspectives, from theoretical foundations to engineering realities of post-deployment learning, made this piece much more solid than what we could have written alone. Thank you for your time and ideas!
Let’s start with context
Before defending parameter-level learning (i.e., updating model weights), it’s necessary to acknowledge a fact: in-context learning works. And there is a compelling argument that it will continue to work.
The essence of transformers is sequence-based conditional next-token prediction. Given the correct sequence, you can achieve surprisingly rich behaviors without touching weights. That’s why methods like context management, prompt engineering, instruction fine-tuning, and few-shot examples are so powerful. Intelligence is encapsulated in static parameters, but its capabilities vary dramatically depending on what you feed into the window.
Recently, Cursor published an in-depth article on scaling autonomous programming agents, exemplifying this: model weights are fixed, but the system runs effectively through carefully orchestrated context—what to put in, when to summarize, how to maintain coherence during hours of autonomous operation.
OpenClaw is another example. Its success isn’t due to special model permissions (since the underlying model is accessible to everyone), but because it transforms context and tools into a highly efficient work state: tracking what you’re doing, structuring intermediate artifacts, deciding when to re-inject prompts, and maintaining persistent memory of previous work. OpenClaw elevates agent “shell design” to an independent discipline.
When prompt engineering first emerged, many researchers doubted that “just prompts” could become a proper interface. It looked like a hack. But prompting is native to the transformer architecture: it requires no retraining and improves automatically as models advance. As models get stronger, prompts get stronger. “Simple but native” interfaces often win because they couple directly with the underlying system rather than fighting against it. So far, the trajectory of LLM development has followed this pattern.
State-space models: context on steroids
As mainstream workflows shift from raw LLM calls to agent loops, the pressure on in-context learning increases. Historically, context windows rarely filled up: LLMs were asked to complete sequences of discrete tasks, and the application layer could trim and compress chat history between them. But for agents, a single task can consume a large portion of the total available context, and each step in the agent loop depends on the context passed from previous iterations. Agents often fail after 20 to 100 steps because they “lose the thread”: the context fills up, coherence degrades, and convergence becomes impossible.
Therefore, major AI labs are now investing heavily (large-scale training runs) to develop models with ultra-long context windows. This is a natural path: it builds on a method that already works (in-context learning) and aligns with the industry trend of shifting compute toward inference time. The most common architecture interleaves fixed-size memory layers with standard attention layers—namely, state-space models (SSMs) and linear attention variants (collectively called SSMs below). SSMs provide fundamentally better scaling curves in long-context scenarios.
Figure caption: Comparison of scaling between SSM and traditional attention mechanisms
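The scaling advantage comes from the recurrence itself. Below is a minimal numpy sketch of a linear-attention-style update (a deliberate simplification: real SSM layers add learned projections, decay, and gating, and all names here are illustrative). The running state is a fixed-size matrix, so per-token memory stays constant, whereas softmax attention’s KV cache grows with sequence length.

```python
import numpy as np

def linear_attention_step(state, k, v, q):
    """One recurrent step of (unnormalized) linear attention.

    The running state is a fixed d_k x d_v matrix. Memory stays
    constant no matter how many tokens have been processed, unlike
    softmax attention, whose KV cache grows with sequence length.
    """
    state = state + np.outer(k, v)   # write: fold the new token into the state
    y = q @ state                    # read: query the compressed history
    return state, y

rng = np.random.default_rng(0)
d_k, d_v, seq_len = 4, 3, 1000

state = np.zeros((d_k, d_v))
for _ in range(seq_len):
    k, v, q = rng.normal(size=d_k), rng.normal(size=d_v), rng.normal(size=d_k)
    state, y = linear_attention_step(state, k, v, q)

# After 1000 tokens the memory footprint is still a 4x3 matrix.
assert state.shape == (d_k, d_v)
```

The write step is lossy by construction: a thousand tokens are folded into twelve numbers, which is exactly the compression-versus-retrieval tradeoff the rest of this article discusses.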
The goal is to help agents extend coherent operation steps from around 20 to about 20,000, without losing the broad skills and knowledge provided by traditional transformers. If successful, this would be a major breakthrough for long-running agents. You can even see this as a form of continual learning: although weights are not updated, an external memory layer is introduced that requires almost no resets.
Thus, these non-parametric methods are real and powerful. Any evaluation of continual learning must start here. The question isn’t whether current context systems are useful—they are. The question is whether we’ve already hit a ceiling, and whether new methods can take us further.
What context misses: “file cabinet fallacy”
“AGI and pretraining are, in a sense, over-tuned… humans are not AGI. Yes, humans have a skill base, but they lack vast amounts of knowledge. We rely on continual learning. If I create a super-smart 15-year-old, they know nothing. A good student, eager to learn. You could say, go be a programmer, go be a doctor. Deployment itself involves some form of learning, trial and error. It’s a process, not just releasing a finished product.” — Ilya Sutskever
Imagine a system with infinite storage space. The world’s largest file cabinet, where every fact is perfectly indexed and instantly retrievable. It can find anything. Has it learned?
No. It has never been forced to compress.
This is the core of our argument, referencing a point made earlier by Ilya Sutskever: LLMs are essentially compression algorithms. During training, they compress the internet into parameters. Compression is lossy, and this lossy nature is what makes them powerful. Compression forces models to find structure, generalize, and build representations that transfer across contexts. A model that memorizes all training samples rigidly is less effective than one that extracts underlying patterns. Lossy compression itself is learning.
Ironically, the mechanism that makes LLMs so powerful during training—compressing raw data into compact, transferable representations—is precisely what we refuse to let them do after deployment. We stop compression at the moment of release, replacing it with external memory. Of course, most agent shells compress context in some custom way. But the bitter lesson is that the model itself should learn this compression directly, on a large scale.
Yu Sun shares an example illustrating this debate: mathematics. Consider Fermat’s Last Theorem. For over 350 years, no mathematician could prove it—not because they lacked the relevant literature, but because the solution was highly novel. The conceptual gap between existing mathematical knowledge and the final proof was enormous. When Andrew Wiles finally cracked it in the 1990s, he worked in near-isolation for seven years, inventing entirely new techniques to reach the answer. His proof relied on bridging two different branches of mathematics: elliptic curves and modular forms. Although Ken Ribet had previously shown that establishing this connection would automatically prove Fermat’s Last Theorem, before Wiles, no one had the theoretical tools to build that bridge. Similarly, Grigori Perelman’s proof of the Poincaré conjecture can be argued along these lines.
The core question: do these examples prove that LLMs lack some kind of capacity—some prior for update, some ability for genuine creative thinking? Or do they, conversely, demonstrate that all human knowledge is just data for training and recombination, and that Wiles and Perelman simply show what LLMs can do at a larger scale?
This is an empirical question, and the answer remains uncertain. But we do know that many categories of problems where context learning fails today could be addressed by parameter-level learning. For example:
Figure caption: Categories of problems where context learning fails but parameter learning might succeed
More importantly, in-context learning can only handle concepts that can be expressed in language, whereas weights can encode ideas that prompts cannot convey. Some patterns are too high-dimensional, too implicit, or too deeply structured to fit into context: the visual textures distinguishing benign artifacts from tumors in medical scans, or the subtle audio micro-fluctuations defining a speaker’s rhythm. These patterns do not decompose into precise words; language can only approximate them. They live in the latent space of learned representations, not in text. No matter how large the context window, some knowledge cannot be described in words and can only be carried by parameters.
This may explain why explicit “memory” functions—like ChatGPT’s memory—often make users uncomfortable rather than delighted. What users truly want isn’t “recall,” but “capability.” A model that internalizes your behavior patterns can generalize to new scenarios; a model that merely recalls your history cannot. The gap between “this is what you wrote last time” (verbatim recitation) and “I understand your thinking well enough to anticipate your needs” is the gap between retrieval and learning.
Introduction to continual learning
Continual learning encompasses multiple pathways. The dividing line isn’t whether there is “memory,” but where compression occurs. These pathways span a spectrum—from no compression (pure retrieval, frozen weights) to full internal compression (weight-level learning, making the model smarter). An important middle ground involves modules.
Figure caption: Three pathways of continual learning—context, modules, weights
Context
On the context side, teams build smarter retrieval pipelines, agent shells, and prompt orchestration. This is the most mature category: the infrastructure is validated and the deployment paths are clear. The main limitation is depth—everything is ultimately bounded by context length.
A noteworthy new direction: multi-agent architectures as a scaling strategy for context itself. If a single model is limited to a 128K token window, a coordinated group of agents—each holding its own context, focusing on a problem slice, communicating results—can approximate infinite working memory overall. Each agent performs context learning within its window; the system aggregates. Karpathy’s recent autoresearch projects and Cursor’s web browsing examples are early cases. This is purely non-parametric (no weight updates), but it greatly raises the upper limit of what context systems can achieve.
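The aggregation idea can be sketched in a few lines. In the toy below (all names hypothetical), `run_agent` stands in for an LLM call; the system partitions the input so no single “agent” ever sees more than its window, then merges the per-agent results:

```python
def run_agent(chunk):
    # Stand-in for one agent's in-context pass over its slice;
    # in a real system this would be an LLM call with its own window.
    return [line for line in chunk if "ERROR" in line]

def multi_agent_scan(log_lines, window=3):
    # Partition the input so no agent sees more than `window` lines,
    # then aggregate the per-agent findings into one result.
    chunks = [log_lines[i:i + window] for i in range(0, len(log_lines), window)]
    findings = []
    for chunk in chunks:
        findings.extend(run_agent(chunk))
    return findings

logs = ["ok", "ERROR disk full", "ok", "ok", "ERROR timeout", "ok", "ok"]
assert multi_agent_scan(logs) == ["ERROR disk full", "ERROR timeout"]
```

The pattern is map-reduce over context windows: each agent compresses its slice, and only the compressed results compete for space downstream.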
Modules
In the module space, teams develop plug-and-play knowledge modules (compressed KV caches, adapter layers, external memory stores) that let general models specialize without retraining. An 8B model with suitable modules can match a 109B model on target tasks, with only marginal memory overhead. The appeal is compatibility: modules plug into existing transformer architectures and can be swapped or updated independently, at a much lower experimental cost than retraining.
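One well-known instance of such a module is a low-rank adapter in the LoRA style. The sketch below is a minimal illustration under simplifying assumptions (a single linear layer, numpy instead of a training framework); it shows why the module is cheap to swap: the base weight stays frozen and only the small factors would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                         # hidden size, adapter rank (r << d)

W = rng.normal(size=(d, d))          # frozen base weight (never updated)
A = rng.normal(size=(r, d)) * 0.01   # small trainable down-projection
B = np.zeros((d, r))                 # up-projection, zero-initialized so
                                     # the adapter starts as a no-op

def forward(x):
    # Base path plus low-rank adapter path: W x + B (A x).
    # Only A and B (2*d*r parameters) would be trained; W stays frozen,
    # so swapping specializations means swapping (A, B) pairs.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
# Zero-initialized B means the adapted model initially matches the base.
assert np.allclose(forward(x), W @ x)
```

Here the adapter adds 2·64·4 = 512 trainable parameters against 4,096 frozen ones, which is the source of the “specialize without retraining” economics.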
Weight updates
On the weight update side, researchers pursue genuine parameter-level learning: sparse memory layers that update only relevant parameters, reinforcement learning loops that optimize models from feedback, and test-time training that compresses context into weights during inference. These are the deepest, most challenging methods but also the most capable of fully internalizing new information or skills.
Several research directions in weight-level learning include:
Figure caption: Overview of research directions in weight-level learning
Weight-level research runs in parallel along multiple routes. Regularization and weight-space methods are the oldest: EWC (Kirkpatrick et al., 2017) penalizes parameter changes in proportion to their importance to previous tasks, and weight interpolation (Kozal et al., 2024) blends old and new weights in parameter space, but both are fragile at scale. Test-time training, pioneered by Sun et al. (2020), later evolved into architectural primitives (TTT layers, TTT-E2E, TTT-Discover) that perform gradient descent on test data to compress new information into weights on demand. Meta-learning asks: can we train models that “know how to learn”? The line runs from MAML’s few-shot-friendly initialization (Finn et al., 2017) to Behrouz et al.’s nested learning (2025), which structures models as layered optimization problems with fast adaptation modules and slow updates, inspired by biological memory consolidation.
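The EWC penalty mentioned above is simple enough to state in a few lines. A sketch (in practice `fisher` would be a diagonal Fisher-information estimate computed on the old task, and `lam` a tuned hyperparameter):

```python
import numpy as np

def ewc_loss(new_task_loss, theta, theta_star, fisher, lam=100.0):
    """Elastic Weight Consolidation (Kirkpatrick et al., 2017), sketched:
    total = L_new(theta) + (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.
    fisher estimates how important each parameter was to the old task,
    so important parameters are anchored while unimportant ones stay free
    to absorb the new task."""
    penalty = 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)
    return new_task_loss + penalty

theta_star = np.array([1.0, -2.0, 0.5])    # weights after the old task
fisher     = np.array([10.0, 0.0, 10.0])   # per-parameter importance
theta      = np.array([1.0, 3.0, 0.5])     # drifted only on the unimportant dim

# Moving an unimportant parameter costs nothing; anchored ones would.
assert ewc_loss(0.0, theta, theta_star, fisher) == 0.0
```

The fragility at scale follows from the same formula: with billions of parameters serving many tasks, a single diagonal importance estimate anchors too much or too little.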
Distillation retains knowledge of previous tasks by matching student models to frozen teacher checkpoints. LoRD (Liu et al., 2025) makes distillation efficient enough for continual operation by pruning models and replay buffers simultaneously. Self-distillation (SDFT, Shenfeld et al., 2026) reverses the source: it uses the model’s own outputs under expert conditioning as training signals, sidestepping catastrophic forgetting in sequential fine-tuning. Recursive self-improvement operates along similar lines: STaR (Zelikman et al., 2022) bootstraps reasoning from self-generated rationales; AlphaEvolve (DeepMind, 2025) has discovered improvements to algorithms that stood for decades; and Silver and Sutton’s “Age of Experience” (2025) frames agent learning as an endless stream of continual experience.
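The teacher-matching term at the heart of these distillation methods is a KL divergence on temperature-softened distributions. A minimal sketch (the temperature and logits here are illustrative, not from any specific paper):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled, numerically stable softmax.
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions: the matching
    term that keeps a student close to a frozen teacher checkpoint,
    preserving old-task behavior while new data is being learned."""
    p = softmax(teacher_logits, T)   # frozen teacher's soft targets
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# A student that matches the teacher exactly pays zero distillation loss.
logits = [2.0, 0.5, -1.0]
assert abs(distill_loss(logits, logits)) < 1e-12
```

Self-distillation methods like SDFT keep the same loss but change where `teacher_logits` come from: the model’s own outputs under expert conditioning rather than a separate frozen network.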
These pathways are converging. TTT-Discover already integrates test-time training with RL-driven exploration. HOPE nests fast and slow learning cycles within a single architecture. SDFT turns distillation into a core self-improvement operation. The boundaries between columns are blurring. The next generation of continual learning systems will likely combine multiple strategies: regularization for stability, meta-learning for acceleration, and self-improvement for compounding gains. An increasing number of startups are betting on different layers of this stack.
Continual learning startup landscape
The non-parametric end of the spectrum is the most familiar. Agent-shell companies (Letta, mem0, Subconscious) build orchestration layers and scaffolds that manage what goes into context windows. External storage and RAG infrastructure (like Pinecone, xmemory) provide the retrieval backbone. The data exists; the challenge is putting the right slices in front of the model at the right time. As context windows grow, the design space for these companies expands, especially on the shell side, where a wave of new startups is emerging to manage increasingly complex context strategies.
The parameter side is earlier-stage and more diverse. These companies attempt some form of “post-deployment compression,” enabling models to internalize new information in their weights. The approaches roughly fall into different bets about how models should learn after release:
Partial compression: learning without retraining. Some teams build plug-and-play knowledge modules (compressed KV caches, adapters, external memory) that allow general models to specialize without touching core weights. The common argument: you can achieve meaningful compression (not just retrieval) while controlling the stability-plasticity tradeoff, since learning is isolated rather than scattered across the entire parameter space. An 8B model with suitable modules can match a much larger model on target tasks, with only minor memory overhead. The advantage is composability: modules plug into existing transformer architectures and can be swapped or updated independently, at a much lower experimental cost than full retraining.
Reinforcement learning and feedback loops: learning from signals. Other teams bet on the idea that the richest signals for post-deployment learning already exist within the deployment cycle—user corrections, task success or failure, real-world reward signals. The core idea: models should treat every interaction as a potential training signal, not just inference requests. This closely mirrors how humans improve at work: doing tasks, receiving feedback, internalizing effective methods. The engineering challenge is turning sparse, noisy, sometimes adversarial feedback into stable weight updates without catastrophic forgetting. But a truly deployable, continually learning model would generate compound value in ways that context systems cannot.
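As a sketch of the mechanics (not any particular company’s system), the loop below treats every interaction as a training signal for a toy linear model and rehearses from a bounded replay buffer to soften forgetting. All names are hypothetical:

```python
import random

class OnlineLearner:
    """Deployment-time feedback loop, sketched: each interaction becomes
    a gradient step, and a replay buffer of past examples is mixed into
    every update to soften catastrophic forgetting."""

    def __init__(self, dim, buffer_size=100):
        self.w = [0.0] * dim
        self.buffer = []
        self.buffer_size = buffer_size

    def _step(self, x, y, lr=0.1):
        # One gradient step on squared error for a linear model.
        pred = sum(wi * xi for wi, xi in zip(self.w, x))
        err = pred - y
        self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]

    def feedback(self, x, y):
        self._step(x, y)                       # learn from the fresh signal
        replay = random.sample(self.buffer, min(4, len(self.buffer)))
        for xb, yb in replay:
            self._step(xb, yb)                 # rehearse old experience
        self.buffer.append((x, y))
        del self.buffer[:-self.buffer_size]    # keep memory bounded

random.seed(0)
learner = OnlineLearner(dim=1)
for _ in range(100):
    learner.feedback([1.0], 2.0)   # repeated user signal: f([1.0]) should be 2
```

The hard part the sketch omits is exactly what the paragraph above names: real feedback is sparse, noisy, and sometimes adversarial, so the naive per-interaction gradient step must be heavily filtered and weighted.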
Data-centric: learning from the right signals. A related but distinct bet is that bottlenecks are not in the learning algorithm but in training data and surrounding systems. These teams focus on filtering, generating, or synthesizing high-quality, well-structured data to drive continual updates: assuming a model with high-quality, structured signals can improve meaningfully with far fewer gradient steps. This naturally connects with feedback loop companies but emphasizes upstream issues: whether the model can learn at all, what it should learn, and to what extent.
New architectures: learning ability from the ground up. The most radical bet is that the transformer architecture itself is a bottleneck. Continual learning requires fundamentally different primitives—architectures with continuous-time dynamics and built-in memory mechanisms. The argument is structural: if you want a system capable of continual learning, embed the learning mechanism into the core infrastructure.
Figure caption: Startup landscape for continual learning
All major labs are actively exploring these categories. Some are developing better context management and reasoning chains; others are experimenting with external memory modules or sleep-time compute pipelines; a few stealth startups are pursuing new architectures. The field is still early; no single approach has yet dominated, and given the broad use cases, there probably won’t be only one winner.
Why naive weight updates fail
In production, updating model parameters triggers a series of failure modes that are still largely unresolved at scale.
Figure caption: Failure modes of naive weight updates
The engineering issues are well documented. Catastrophic forgetting: a model plastic enough to learn from new data will overwrite existing representations—the stability-plasticity dilemma. Time decoupling: immutable rules and mutable state are compressed into the same set of weights, so updating one damages the other. Failed logical integration: fact updates do not propagate to their logical consequences—changes are bound to token sequences, not semantic concepts. Unlearning remains impossible: there is no differentiable subtraction operation, so false or toxic knowledge cannot be precisely excised.
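Catastrophic forgetting is easy to reproduce in miniature. The sketch below trains a logistic regression on task A, then naively continues training on a task B that conflicts on the same feature (a deliberately extreme setup to make the interference visible); accuracy on task A collapses because nothing guards the old knowledge.

```python
import numpy as np

rng = np.random.default_rng(1)

def train(w, X, y, lr=0.5, steps=300):
    # Naive gradient descent on logistic loss: no regularization,
    # no replay, nothing protecting previously learned structure.
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w = w - lr * (X.T @ (p - y)) / len(y)
    return w

def acc(w, X, y):
    return float(np.mean(((X @ w) > 0) == (y > 0.5)))

# Task A and task B impose opposite rules on the same shared feature.
X_a = rng.normal(size=(300, 2)); y_a = (X_a[:, 0] > 0).astype(float)
X_b = rng.normal(size=(300, 2)); y_b = (X_b[:, 0] < 0).astype(float)

w = train(np.zeros(2), X_a, y_a)   # learn task A
acc_before = acc(w, X_a, y_a)      # high: A is mastered
w = train(w, X_b, y_b)             # then learn task B, no safeguard
acc_after = acc(w, X_a, y_a)       # task A is destroyed

assert acc_before > 0.9 and acc_after < 0.5
```

Methods like EWC, replay buffers, and distillation (discussed earlier) are all attempts to keep `acc_before`-level performance while still absorbing task B.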
A second, less-discussed problem is that the separation of training and deployment is not just an engineering convenience but a boundary for safety, auditability, and governance. Opening this boundary introduces multiple issues. Safety alignment may unpredictably degrade: even narrow fine-tuning on benign data can cause widespread misalignment. Continual updates create an attack surface for data poisoning—a slow, persistent prompt injection vector embedded in weights. Auditability collapses: a continually updated model is a moving target, making version control, regression testing, or one-time certification impossible. When user interactions are baked into parameters, privacy risks increase, and sensitive information becomes baked into representations, harder to filter than in retrieval contexts.
These are open problems, not insurmountable ones. Solving them is part of the ongoing research agenda for continual learning.
From “Memento” to genuine memory
Leonard’s tragedy in “Memento” isn’t that he can’t operate—he’s resourceful and even excellent in any given scene. His tragedy is that he can never compound his gains. Every experience remains external—Polaroids, tattoos, notes. He can retrieve, but cannot compress new knowledge.
As Leonard navigates this self-constructed maze, the boundary between reality and belief blurs. His condition not only deprives him of memory but forces him to constantly reconstruct meaning, making him both detective and unreliable narrator of his own story.
Today’s AI faces the same constraints. We’ve built powerful retrieval systems: longer context windows, smarter shells, coordinated multi-agent groups. They work. But retrieval is not learning. A system that can find any fact is not forced to seek structure. It’s not compelled to generalize. The mechanism that makes training so powerful—transforming raw data into transferable representations—is precisely what we turn off at deployment.
The path forward is likely not a single breakthrough but a layered system. Context learning remains the first line of defense: native, validated, and continually improving. Modular mechanisms can handle intermediate zones of personalization and domain specialization. But for truly hard problems—discovery, adversarial adaptation, implicit knowledge beyond words—we may need models that continue compressing experience into parameters after training. This requires advances in sparse architectures, meta-learning objectives, and self-improvement loops. It may also force us to redefine what a “model” is: not a fixed set of weights, but an evolving system that includes its memory, its update algorithms, and its capacity for abstraction from experience.
The file cabinet grows larger. But even the largest file cabinet remains a file cabinet. The breakthrough lies in enabling models to train themselves after deployment—compress, abstract, learn. We stand at the cusp of moving from forgetful models to ones with a flicker of experience. Otherwise, we risk being trapped in our own “Memento.”