a16z: Why Large Models Need Continual Learning in the AI Era

Author: Malika Aubakirova, Matt Bornstein; Source: a16z; Translation: Shaw, Golden Finance

In Christopher Nolan’s film “Memento,” Leonard Shelby lives in a shattered present. After a traumatic brain injury, he develops anterograde amnesia and cannot form new memories. Every few minutes his world resets, trapping him in an eternal now, disconnected from what just happened and unable to predict what comes next. To survive, he tattoos clues on his body and takes Polaroid photos, using these external tools to hold information his brain cannot retain.

Large Language Models (LLMs) also live in a similar eternal present. They acquire vast knowledge during training, which becomes fixed in their parameters, yet cannot form new memories—unable to update themselves based on new experiences. To compensate, we build various auxiliary frameworks: treating dialogue history as short-term notes, using retrieval systems as external notebooks, and system prompts as guiding tattoos. But the model itself has never truly internalized new information.

More and more researchers believe this is far from enough. In-context learning (ICL) is sufficient for questions whose answers, or fragments of answers, already exist somewhere in the world. But for questions requiring genuinely original discovery (such as entirely new mathematical problems), adversarial settings (like cybersecurity), or tacit knowledge that is difficult to articulate in words, there is strong reason to believe models need the ability to directly update their parameters with new knowledge and experience after deployment.

In-context learning is fleeting. True learning requires information compression. If we cannot enable models to continuously perform compression-based learning, we may forever be trapped in a “Memento”-like eternal present. Conversely, if we can teach models to build their own memory architectures—rather than relying on external tools—we might unlock a new dimension of scalable improvement.

This research area is called Continual Learning. Although not entirely new (tracing back to McCloskey and Cohen, 1989), we believe it is one of the most important directions in current AI research. Over the past two or three years, model capabilities have grown astonishingly, and the gap between what models “know” and “can know” has become increasingly evident. Therefore, this article aims to share insights from top researchers in the field, clarify different technical approaches to continual learning, and promote its development within the entrepreneurial ecosystem.

Let’s start with context

Before discussing parameterized learning (learning by updating model weights), we must acknowledge that in-context learning is genuinely effective, and there are strong reasons to believe it will continue to hold an advantage.

Transformers are fundamentally next-word prediction models conditioned on sequences. As long as the input sequence is appropriate, they can exhibit a rich array of behaviors without changing weights. This is why prompt engineering, instruction fine-tuning, few-shot learning, and context management are so powerful. Intelligence is embedded in static parameters, but the model’s external performance can change dramatically with the input window content.

Cursor’s recent deep analysis on scaling autonomous programming agents illustrates this well: “The system’s performance ultimately depends on how we design prompts for the agent. Frameworks and models are important, but prompts are even more critical.”

Model weights are fixed. What truly makes the system run effectively is meticulous orchestration of context: what information to include, when to summarize, how to maintain coherence over hours of autonomous operation.

OpenClaw is another excellent example. Its success does not rely on special model permissions (the underlying model is open to all), but on its ability to efficiently convert context and tools into a deployable state: tracking your actions, structuring intermediate artifacts, deciding which content needs to be re-injected into prompts, and maintaining persistent memory of past work. OpenClaw elevates agent framework design into an independent technical discipline.

When prompts first appeared, many researchers doubted whether “prompting alone” could become a formal interaction method; it looked more like a hack. But this approach naturally aligns with the Transformer architecture, requires no retraining, and can scale automatically as model performance improves. The more powerful the model, the better the prompt effects. “Simple but native” interaction methods often outperform more complex ones because they directly collaborate with the underlying system rather than oppose it. So far, this is precisely the state of the art in large language models.

State Space Models: Enhanced Context Capacity

As mainstream workflows shift from direct LLM calls to agent loops, the in-context learning paradigm comes under increasing pressure. Previously, fully saturating the context was rare—usually only when executing long independent tasks—and the application layer could easily prune or compress dialogue history. But in agent scenarios, a single task can occupy a large portion of the available context. Each step in an agent loop depends on the preceding context, and after 20 to 100 steps performance often degrades—context is exhausted, logical coherence declines, and the loop fails to converge.

Therefore, major AI labs are investing heavily in models with ultra-large context windows. This is a natural progression: it builds on in-context learning, which is already validated, and aligns with the industry’s shift toward inference-time compute. The most common architecture interleaves fixed-size memory layers with standard attention—the family of State Space Models (SSMs) and linear-attention variants. Because their state does not grow with sequence length, these layers scale to very long contexts fundamentally better than standard attention’s ever-growing key-value cache.
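
The architectural difference can be caricatured in a few lines of Python. This is a deliberately toy sketch—not any production SSM—showing only that an attention-style cache grows with the sequence, while a linear-recurrence state stays fixed-size and lossily compresses history:

```python
# Toy contrast between attention-style memory (grows with sequence length)
# and an SSM-style recurrent state (fixed size). Names are illustrative.

def attention_memory(tokens):
    """A KV-cache stand-in: memory grows linearly with the sequence."""
    cache = []
    for t in tokens:
        cache.append(t)          # every token is kept verbatim
    return cache                 # len(cache) == len(tokens)

def ssm_memory(tokens, decay=0.9):
    """A linear-recurrence stand-in: one fixed-size state, h = decay*h + x.
    Old tokens are lossily compressed into the state rather than stored."""
    h = 0.0
    for t in tokens:
        h = decay * h + t
    return h                     # constant size, regardless of sequence length

tokens = [1.0, 2.0, 3.0, 4.0]
assert len(attention_memory(tokens)) == len(tokens)  # storage scales with input
print(ssm_memory(tokens))                            # single scalar state
```

The trade-off is exactly the compression theme of this article: the recurrent state cannot reproduce any individual token verbatim, but its cost per step is constant.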

The goal is to help agents maintain logical coherence over longer cycles, increasing effective steps from around 20 to about 20,000, without sacrificing the broad skills and knowledge of traditional Transformers. If feasible, this would be a major breakthrough for long-duration autonomous agents. You might even see this as a form of continual learning: although weights are not updated, an external memory layer is introduced that requires almost no resetting.

Thus, these non-parametric methods are real and effective. Any evaluation of continual learning must start here. The question is not whether context-based systems are effective—they are—but whether we have reached a ceiling, and whether new methods can take us further.

Limitations of Context: The File Cabinet Fallacy

“AGI and pretraining are, in a sense, overachieving their goals… Humans are not AGI. Sure, humans have foundational skills, but lack vast knowledge. Instead, we rely on continual learning. If I create a super-smart 15-year-old, they actually know very little. They are an excellent student, full of curiosity. You can tell them: ‘Become a programmer, become a doctor.’ Deploying a model itself requires a process of learning and trial-and-error. It’s a gradual process, not a one-shot delivery of a finished product.” — Ilya Sutskever

Imagine a system with infinite storage: the largest file cabinet in the world, perfectly indexed, capable of instant retrieval. It can find any fact. But has it learned?

No. It has never been asked to perform information compression.

This is the core of our argument, inspired by Ilya Sutskever’s view: the essence of large language models is compression algorithms. During training, they compress the internet into parameters. This compression is lossy—and that is precisely what makes them powerful. Compression forces models to discover structure, achieve generalization, and build transferable representations. Models that only memorize training samples are far less capable than those that can extract underlying patterns. Lossy compression itself is a form of learning.

Ironically, the mechanism that makes large language models powerful—compressing raw data into compact, transferable representations during training—is exactly what we refuse to let them do after deployment. We stop compression at release, replacing it with external memory. Of course, most agent frameworks perform some form of context compression, but given the Bitter Lesson, shouldn’t we let the model itself learn this compression directly and at scale?

Yu Sun shared a mathematical example to illustrate this debate. Take Fermat’s Last Theorem: for over 350 years, no mathematician could prove it—not because of a lack of related literature, but because the solution was highly innovative. The conceptual gap between existing mathematical frameworks and the final proof was enormous. In the 1990s, British mathematician Andrew Wiles, after nearly seven years of isolated research, finally cracked it, creating powerful new methods to bridge the gap—building a bridge between elliptic curves and modular forms. Although Ken Ribet had previously shown that proving this connection would solve Fermat’s Last Theorem, no one before Wiles had the theoretical tools to build that bridge. Similarly, Grigori Perelman’s proof of the Poincaré Conjecture followed the same pattern.

**The key question:** do these examples prove that large language models lack some ability—a capacity to update priors and engage in truly creative thinking? Or do they in fact demonstrate the opposite—that all human knowledge is just data for training and recombination, and that Wiles’s and Perelman’s achievements are simply effects a sufficiently large model could reproduce?

This is an empirical question, with no definitive answer yet. But it is clear that many problems are currently unsolvable by in-context learning alone, while parameterized learning might succeed.

Moreover, in-context learning is limited to what can be expressed in language, whereas model weights can encode concepts that cannot be conveyed through prompts alone. Some patterns are too high-dimensional, too implicit, or too structural to fit into a context window: the visual textures distinguishing benign from malignant lesions in medical images, or the subtle rhythmic patterns unique to a speaker’s voice. These are hard to decompose into precise textual descriptions; language can only approximate them. Such knowledge lives in the latent space of learned representations, not in words—no matter how large the context window, some of it can only be embedded in parameters.

This may explain why features like ChatGPT’s memory—“remembering you”—often make users uncomfortable rather than delighted. Users don’t just want recall; they want capability. A model that internalizes your behavior patterns can generalize to new scenarios; a model that merely retrieves past records cannot. The difference between “this is your reply to this email” (verbatim) and “I understand your style well enough to anticipate what you need” is the fundamental difference between retrieval and learning.

Getting Started with Continual Learning

There are multiple approaches to continual learning. The core distinction is not whether “memory” exists, but where compression occurs. These methods form a continuum:

  • No compression (pure retrieval, frozen weights)

  • Fully internal compression (weight-level learning, truly smarter models)

  • An important middle ground: modular approaches

Context

On the context side, research teams are building smarter retrieval processes, agent frameworks, and prompt orchestration systems. This is currently the most mature direction: infrastructure has been validated, deployment processes are clear and controllable. Its limitation lies in depth—namely, context length.

A promising emerging direction is multi-agent architectures, which extend context itself. If a single model is limited to, say, a 128K-token window, a cluster of collaborating agents—each holding its own context, specializing in a subproblem, and communicating results—can collectively approximate effectively unbounded working memory. Each agent performs in-context learning within its own window, and results are aggregated system-wide. Recent projects from Karpathy and Cursor’s web-browsing work are early examples. This is a purely non-parametric approach (no weight changes) that greatly raises the ceiling of context-based systems.
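
The fan-out idea can be sketched schematically. Everything below is hypothetical scaffolding—not any particular framework’s API—showing only the shape: each sub-agent sees a bounded slice of the input and returns a compressed artifact for a coordinator to merge.

```python
# Hypothetical multi-agent fan-out: no single agent ever holds more than
# its own context budget; the coordinator aggregates compressed reports.

def sub_agent(chunk):
    """Each agent works within its private context window and returns a
    compressed artifact instead of its raw context."""
    return {"n_lines": len(chunk), "first": chunk[0], "last": chunk[-1]}

def coordinator(document, window=2):
    """Split the work so no agent exceeds its window, then merge results."""
    chunks = [document[i:i + window] for i in range(0, len(document), window)]
    reports = [sub_agent(c) for c in chunks]       # could run in parallel
    return {"agents": len(reports),
            "total_lines": sum(r["n_lines"] for r in reports)}

doc = ["a", "b", "c", "d", "e"]
print(coordinator(doc))   # no agent ever held more than `window` lines
```

Note this is still non-parametric: the “memory” lives in the artifacts passed between agents, not in any model’s weights.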

Modularization

In the modular approach, teams develop attachable knowledge modules (compressed key-value caches, adapters, external repositories) that enable general models to acquire specialized capabilities without retraining. An 8-billion-parameter model combined with suitable modules can match the performance of a 109-billion-parameter model on specific tasks, with minimal memory footprint. The appeal is compatibility: modules can be directly integrated into existing Transformer architectures, swapped or updated independently, and experimental costs are much lower than retraining.
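
A minimal sketch of the adapter idea in the style of LoRA-like low-rank updates (the specific modules described above may differ): the frozen base weight W is never modified, and task knowledge lives in a small, attachable low-rank update B·A. Pure-Python 2×2 matrices, for illustration only.

```python
# LoRA-style adapter sketch: y = x @ (W + B @ A). The base weight W is
# frozen; the adapter (B, A) can be attached, swapped, or removed
# independently. Shapes are hard-coded to 2x2 for this toy example.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def forward(x, W, adapter=None):
    """Apply the base map, optionally augmented by a low-rank update."""
    W_eff = W
    if adapter is not None:
        B, A = adapter
        delta = matmul(B, A)                      # rank-1 update, 2x2
        W_eff = [[W[i][j] + delta[i][j] for j in range(2)] for i in range(2)]
    return matmul(x, W_eff)

W = [[1.0, 0.0], [0.0, 1.0]]                      # frozen base (identity)
adapter = ([[1.0], [0.0]], [[0.0, 2.0]])          # B (2x1) @ A (1x2)

x = [[3.0, 4.0]]
print(forward(x, W))            # base behavior: [[3.0, 4.0]]
print(forward(x, W, adapter))   # specialized behavior; W itself untouched
```

Because only B and A carry task knowledge, modules can be trained, stored, and hot-swapped at a tiny fraction of the cost of touching the base weights—which is the compatibility argument made above.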

Weight Updates

In the weight update realm, researchers are exploring true parameterized learning—such as sparse memory layers that update only relevant parameters, reinforcement learning loops that continuously optimize based on feedback, and training to compress context into weights during inference. These are the most advanced, but also the most challenging to deploy. They enable models to fully internalize new information or skills.

Research on weight-level updates spans multiple parallel techniques. Regularization and weight-space methods are the oldest: Elastic Weight Consolidation penalizes changes to important parameters, and weight interpolation blends old and new weights in parameter space; however, these often lack stability at scale. Test-time training, pioneered by Sun et al. in 2020, performs gradient descent on test data, compressing new information into the model’s weights at inference time. Meta-learning explores training models to learn how to learn—from MAML-style initializations suited to few-shot scenarios, to nested optimization structures inspired by biological memory consolidation, pairing fast-adaptation modules with slow-update modules.
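
As a concrete anchor, here is EWC’s core idea reduced to scalar parameters: a quadratic penalty, weighted by an importance estimate (the diagonal Fisher information in the original formulation), pulls parameters back toward their old-task values. The numbers below are illustrative.

```python
# Toy Elastic Weight Consolidation (EWC) loss: new-task loss plus a
# quadratic penalty that protects parameters important to the old task.

def ewc_loss(new_task_loss, params, old_params, fisher, lam=1.0):
    """L = L_new + (lam/2) * sum_i F_i * (theta_i - theta_i_old)^2"""
    penalty = sum(f * (p - p_old) ** 2
                  for p, p_old, f in zip(params, old_params, fisher))
    return new_task_loss + 0.5 * lam * penalty

old_params = [1.0, -2.0]       # weights after training on the old task
fisher     = [10.0, 0.1]       # param 0 matters a lot; param 1 barely

# Moving the *important* parameter by 1.0 is penalized far more than
# moving the unimportant one by the same amount:
move_important   = ewc_loss(0.0, [2.0, -2.0], old_params, fisher)
move_unimportant = ewc_loss(0.0, [1.0, -1.0], old_params, fisher)
print(move_important, move_unimportant)   # 5.0 vs 0.05
```

This is exactly the stability-plasticity trade-off in miniature: a large `lam` freezes the old task in place; a small one lets new learning overwrite it.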

Knowledge distillation trains a student model to match a frozen teacher’s outputs, preserving old-task knowledge. LoRD prunes both the model and replay buffers for efficient continual operation. Self-distillation reverses the signal source, using the model’s own outputs generated under expert conditions as training signals, avoiding catastrophic forgetting during fine-tuning. Recursive self-improvement is similar: STaR iterates capabilities via model-generated reasoning; AlphaEvolve discovers algorithmic improvements; Silver and Sutton’s “Era of Experience” builds agent learning on continuous streams of experience.
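
The distillation objective itself is simple—cross-entropy between the student’s distribution and the frozen teacher’s soft targets. A pure-Python toy with illustrative logits (no temperature scaling, which real distillation usually adds):

```python
# Minimal knowledge-distillation loss: the student matches the frozen
# teacher's output distribution, not its weights.
import math

def softmax(logits):
    m = max(logits)                         # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits):
    """Cross-entropy of the student's distribution against the teacher's
    soft targets; minimized when the two distributions match."""
    p_teacher = softmax(teacher_logits)     # teacher stays frozen
    p_student = softmax(student_logits)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

teacher = [2.0, 0.5, -1.0]
good_student = [2.0, 0.5, -1.0]             # matches the teacher exactly
bad_student  = [-1.0, 0.5, 2.0]             # reversed preferences
assert distill_loss(good_student, teacher) < distill_loss(bad_student, teacher)
```

Self-distillation, as described above, keeps this same loss but swaps in the model’s own (expert-conditioned) outputs as the teacher signal.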

These approaches are increasingly converging. TTT-Discover combines test-time training with reinforcement learning-driven exploration; HOPE nests fast and slow learning cycles within a single model; SDFT turns distillation into a self-improvement foundation. The boundaries between techniques are blurring—next-generation continual learning systems will likely integrate multiple strategies, using regularization for stability, meta-learning for speed, and self-improvement for capability compounding. Many startups are actively exploring this multi-layered ecosystem.

The Startup Ecosystem of Continual Learning

The non-parametric route is currently the most familiar. Agent framework vendors (Letta, mem0, Subconscious) develop orchestration layers and auxiliary architectures to manage input context; external storage and retrieval-augmented generation (RAG) infrastructure (like Pinecone, xmemory) provide retrieval support. Data already exists; the core challenge is selecting and providing precise data snippets at the right moments. As context windows grow, the design space expands, especially in frameworks, with many new startups emerging to manage increasingly complex context strategies.

The parameterized route is earlier-stage but technically more diverse. Companies are experimenting with deployment-time compression schemes that internalize new information into weights. These post-deployment learning approaches include:

Local compression: learning without retraining. Some teams develop attachable knowledge modules (compressed key-value caches, adapters, external memories) that enable general models to acquire specialized skills without changing core weights. The key idea is to achieve meaningful information compression with a balance of stability and plasticity—focusing on learning rather than just retrieval, isolating the learning process rather than dispersing it across parameters. An 8-billion-parameter model with suitable modules can outperform much larger models on specific tasks. The advantage is modularity: modules can be directly integrated into existing Transformers, swapped or updated independently, and require far less cost than retraining.

Reinforcement learning and feedback loops: learning from signals. Other teams believe the richest signals for post-deployment learning are embedded in the deployment process itself—user corrections, task success/failure, real-world rewards. The core idea is that every interaction can serve as a training signal, not just inference requests. This mirrors how humans improve skills through practice and feedback. The engineering challenge is transforming sparse, noisy, and sometimes adversarial feedback into stable weight updates, avoiding catastrophic forgetting. Once models can truly learn from deployment, their value accumulates over time—something pure context systems cannot achieve.
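
One way to picture the engineering challenge above is a per-interaction update with a hard cap on step size—a crude stand-in for the trust-region machinery a real system would need against noisy or adversarial feedback. All names and numbers here are illustrative.

```python
# Toy deployment feedback loop: each interaction yields a reward signal;
# a per-step clip bounds how far any single (possibly adversarial)
# signal can move the weights. Entirely schematic.

def clipped_update(weight, gradient, reward, lr=0.1, max_step=0.05):
    """Reward-weighted step, clipped to limit per-interaction drift."""
    step = lr * reward * gradient
    step = max(-max_step, min(max_step, step))   # trust-region-style cap
    return weight + step

w = 0.0
# (gradient, reward) pairs; the last one is an outsized/suspect signal
feedback = [(1.0, +1), (1.0, +1), (1.0, -1), (5.0, +1)]
for g, r in feedback:
    w = clipped_update(w, g, r)
print(w)   # the outsized final signal was clipped like any other
```

A real system would replace the scalar with model weights and the clip with proper regularization (KL penalties, replay, EWC-style constraints), but the shape of the problem—turning sparse, noisy rewards into bounded, stable updates—is the same.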

Data-centric approaches: learning from high-quality signals. A related but distinct path emphasizes curated, generated, or synthesized high-quality data to drive continual updates. The premise is that if models can access rich, structured learning signals, fewer gradient steps are needed for effective improvement. This aligns with feedback loop strategies but focuses more on upstream data quality and learning content.

Novel architectures: embedding learning into design. The most radical view suggests that the Transformer architecture itself is a bottleneck. True continual learning may require fundamentally different computational units—models with continuous-time dynamics and built-in memory mechanisms. The core idea is structural: to build systems capable of ongoing learning, the architecture must incorporate learning mechanisms directly.

Major labs are actively exploring these directions. Some focus on better context management and reasoning chains; others experiment with external memory modules or offline computation (sleep cycles). Many startups are developing entirely new architectures. The field remains early-stage, with no single approach dominating, and future applications will likely involve hybrid solutions.

Why Direct Weight Updates Are Not Yet Feasible

Updating model parameters directly in production causes a cascade of failure modes that remain unresolved at scale.

These engineering issues are well documented. Catastrophic forgetting occurs when fitting new data overwrites existing representations—the stability-plasticity dilemma. Sequential updates can interact destructively: stable rules and mutable facts are compressed into the same weights, so updating one can damage the other. Logical inconsistency arises because fact updates do not propagate to derived conclusions: edits operate on token sequences, not semantic concepts. And knowledge erasure remains unsolved: there is no differentiable subtraction, so false or harmful knowledge cannot be precisely removed.

There is another, less-discussed problem. The current separation of training and deployment is not just an engineering convenience but a boundary for safety, auditability, and governance. Breaking it introduces multiple risks. Alignment may degrade unpredictably—even small fine-tunes on benign data can produce large misaligned behaviors. Continual updates create data-poisoning vectors—slow, persistent prompt-injection attacks embedded in weights. Auditability disappears, because a continually updated model is a moving target that cannot be versioned, regression-tested, or certified once and for all. Privacy risks increase as sensitive information becomes embedded in representations, where it is harder to filter than in retrieval-based contexts.

These are unresolved issues, not fundamental impossibilities. Solving them is as critical as tackling core architecture challenges and remains an essential part of the continual learning research agenda.

From “Memento” to True Memory

In “Memento,” Leonard’s tragedy is not that he cannot live normally: he is resourceful, even clever in each scene. His tragedy is that he can never achieve the compound benefits of capability—his experiences are all external: a Polaroid, a tattoo, a note. He can retrieve, but cannot compress new knowledge.

As Leonard navigates his labyrinth, the line between truth and belief blurs. His condition not only deprives him of memory but forces him to constantly reconstruct meaning, making him both investigator and unreliable narrator of his own story.

Today’s AI faces the same dilemma. We have built highly capable retrieval systems: longer context windows, smarter frameworks, collaborative multi-agent clusters—and they are effective! But retrieval is not learning. A system that can look up any fact has not been asked to discover structure or generalize. It is the lossy compression mechanism that makes training so powerful—transforming raw data into transferable representations—that we turn off at deployment.

The future may not be a single breakthrough but a layered system. In-context learning remains the first line of defense: it is native, validated, and continuously improving. Modular mechanisms can handle personalization and domain specialization. But for original discovery, adversarial adaptation, and tacit knowledge that cannot be expressed in language, models may need to keep compressing experience into parameters after training. This requires advances in sparse architectures, meta-learning objectives, and self-improvement loops. It may also force us to redefine what a “model” is: no longer a fixed set of weights, but an evolving system with memory, update algorithms, and the ability to abstract rules from experience.

File cabinets will only grow larger. But no matter how big, they are still just cabinets. The real breakthrough is enabling models to do after deployment what made them powerful during training: compress, abstract, learn. We stand at a critical juncture—moving from amnesiac models to agents with a hint of accumulated experience. Otherwise, we risk being trapped forever in our own “Memento.”
