After AI devours everything, what remains untrainable?

> Original Title: The Untrainable
>
> Original Author: Sarah Guo, Conviction
>
> Translation: Peggy, BlockBeats
>
> Editor's Note: As AI capabilities continue to leap forward, a new wave of pessimism is emerging in the investment community: if models keep getting stronger, every application company will eventually be swallowed by the models and compute of Anthropic, OpenAI, Nvidia, and similar giants, leaving only frontier models, hardware, and a few infrastructure players.
>
> But Sarah Guo believes this judgment is only half right. "Thin wrappers" (simple shell applications built on models) will indeed be absorbed, and tasks that can be benchmarked, trained on public data, and validated at low cost will gradually become commoditized. The real question is: after AI devours everything trainable, what remains untrainable?
>
> The article's answer is the value that exists inside real organizations and cannot be easily copied from outside: private corporate data, complex workflows, user trust, system permissions, industry judgment, compliance responsibilities, and the experience accumulated over long-term operation. Models can become smarter, but they cannot automatically access a bank’s production systems; they can generate medical answers but cannot directly earn doctors’ trust or integrate into hospital decision-making; they can write legal texts but cannot assume responsibility as senior lawyers do, nor define out of thin air what counts as qualified legal work.
>
> Therefore, the truly defensible AI companies of the future will not simply be smarter than general-purpose models. They will embed themselves deeply within specific industries and do the difficult yet critical "translation" work: organizing clients’ private realities, tools, workflows, and standards into systems that models can act upon, and, over long-term service, gradually defining what constitutes a good result.
> The stronger AI becomes, the more it devalues measurable and replicable tasks, and the more it highlights the "things that cannot be trained": those rooted in history, relationships, permissions, and professional judgment. This is the value that may remain after models have swallowed everything else.

Below is the original text:

By mid-2026, the investor version of "AI madness" is a sense of despair that nothing is worth investing in anymore: it seems we should just pour all our money into Anthropic and Nvidia, then go home and sleep.

But I’ve never felt that way. For several minor model versions now, I’ve been convinced that models are already smarter than I am; I’d be happy to buy Anthropic and Nvidia at market prices; my smartest friends are quite certain that model self-improvement will soon truly take off. Yet I still don’t feel despair.

This despair isn’t foolish. Its logic runs as follows: if models keep getting stronger across all domains, then every company built on models is just a thin shell waiting to be absorbed, and ultimately the only remaining value will be hardware and the weights of frontier models.

Take software, the case this despair leans on most heavily. When Devin launched in 2024, it could solve only 13% of the tasks in a standard benchmark, so the market largely dismissed it. A year and a half later, the strongest agents score over 80% and are beginning to handle real work inside Goldman Sachs and the US Army. Almost everyone drew the same mistaken conclusion: models are swallowing software engineering.

But now that models have taken over the easiest-to-measure parts of software engineering, we are rediscovering something teams have long known: engineering has always resisted measurement, and the parts easiest to measure are not the only parts that matter.
MIT’s Mert Demirer and his colleagues finally quantified this: across more than 100,000 developers, the latest coding agents increased the amount of code written by about 180%, while the code actually delivered to production increased by only around 30%. Writing code has become cheaper, but the remaining steps still require humans, and those steps are crucial. The overall net impact is, of course, still astonishing.

Benchmarks are something you can measure, and anything measurable can be used for training. That is why coding agents were among the first to mature: compilers are free validators, and so are test suites. When answers can be checked at nearly zero cost, you can keep optimizing against the check until you break through.

But passing tests never means that a change is correct for a codebase that has been running for ten years. A module may exist for three undocumented reasons; the deployment pipeline might rely on a cron job no one wants to admit they wrote. This kind of correctness cannot be read off a leaderboard, nor directly inferred from anything else. You can only know whether a change works by letting such a complex system run long enough in the real world, and smarter models won’t make the real world run faster.

No one would run the unit tests on a system as large as Google’s and feel completely at ease just because they see a green check. Trust comes from surviving years of real load. This correctness is not only private but also a slowly forming moat, one whose timeline capital cannot directly compress.

Even optimists admit this clock cannot be skipped. Noam Brown, a pioneer of OpenAI’s reasoning models, recently wrote that the only reliable way to evaluate an agent’s performance over a year might be to let it run for a year. As Gabe Pereyra says, true automation isn’t just about models becoming stronger. It requires products, models, workflows, and organizational structures to change together, and three of those four evolve at the pace of the organization.
What truly moves people is something no benchmark can reach: convincing a skeptical partner to change how she handles her work, or holding a team together through a rebuild. That’s why, when hiring a CEO, we value the ability to manage people at least as much as analytical skill. Smarter models won’t change that weighting. The feedback here is fuzzy, the time horizon is measured in years, and trust belongs to specific individuals.

Every company I know has already equipped every engineer with frontier coding models, but none has changed organizationally at anything close to the pace of model progress. Adopting the tools took only a quarter (and what a miraculous quarter of token growth!), but true rebuilding takes years.

Readable work is leaving. Truly valuable work is structurally unreadable. Anything that can be ranked can be used for training, so everything measurable is heading toward commoditization. This process takes time and will never be fully complete, but the direction is irreversible.

My friend Matt MacInnis of Rippling puts this in monetary terms: a token used solely to answer a general question is nearly worthless because anyone’s model can answer it, but a token that reasons over your company’s data is far more valuable because it does what you actually want rather than just generating a plausible answer.

Readable work will be swallowed from two directions. From below, tasks saturate: once a task can be checked cheaply, buyers no longer care which model completed it, only what it costs. So this work falls to whichever open-source or distilled model is cheapest that week. As long as margins can be sustained, this will inevitably happen. From above, labs are trying to make models swallow their own scaffolding.
Routing among retrieval, cheap and expensive calls, tool use, even reasoning strategies: all the mechanisms that once sat outside the model are being folded into the weights, until the "shell" itself becomes a model. This is the absorption boundary.

Margin pressure also works from another angle: a general agent must be ready to handle anything, which makes it costly, whereas a specialized application can optimize a workflow to consume only a small fraction of the tokens. Moreover, unlike labs selling tokens, application companies can keep the margin.

So for any task we can ask two questions. Is its correctness private and expensive, a truth that exists only within a company’s data? Is it isolated within a system inaccessible to outsiders? Combining these questions with the saturation level of the task yields a 2×2 matrix. Work that is saturated and has publicly available answers belongs to the commoditized token space, where open-source models will dominate. Frontier tasks with open answers, like coding benchmarks, are where labs will win, because evaluation is free and owning the task itself isn’t valuable. The real prize is the last corner, the "untrainable" corner: frontier work whose correctness exists only in private environments. You can see this on the inference clouds serving AI-native pioneers: most tokens are generated by custom models, not general open-source ones.

The wall around this last corner varies in height and depth. A developer’s toy codebase is portable and standardized, so it’s not hard to get into. A bank’s production system is neither. You won’t gain root access just because you’re 2% better on SWE-Bench Verified.

Capability will swallow many things, but better models won’t turn private standards of correctness into public ones. They won’t hold licenses, sign for liability, or own a company’s documents; when an answer goes wrong, they can’t be sued. The bottleneck isn’t intelligence but permissions and accountability.
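The 2×2 matrix described above (task saturation against whether correctness is public or private) can be sketched as a small lookup. This is only an illustration under stated assumptions: the names `Quadrant` and `classify` and the two boolean inputs are hypothetical, and the essay describes only three of the four corners, so the fourth is marked as not discussed rather than filled in.

```python
from enum import Enum


class Quadrant(Enum):
    """Corners of the 2x2 matrix. Names and labels are illustrative, not the author's."""

    COMMODITY_TOKENS = "saturated, public answers: cheapest open-source or distilled model wins"
    LAB_TERRITORY = "frontier, public answers: labs win because evaluation is free"
    UNTRAINABLE = "frontier, private correctness: the untrainable corner"
    NOT_DISCUSSED = "saturated, private correctness: the essay does not describe this corner"


def classify(saturated: bool, private_correctness: bool) -> Quadrant:
    """Place a task in the matrix.

    saturated: the task can already be checked cheaply and models have maxed it out.
    private_correctness: what counts as correct exists only inside a company's
        data or systems (an assumption-level simplification of the two questions).
    """
    if private_correctness:
        return Quadrant.NOT_DISCUSSED if saturated else Quadrant.UNTRAINABLE
    return Quadrant.COMMODITY_TOKENS if saturated else Quadrant.LAB_TERRITORY


if __name__ == "__main__":
    # A public coding benchmark that models have not yet saturated.
    print(classify(saturated=False, private_correctness=False).name)  # LAB_TERRITORY
    # Frontier work whose correctness lives inside a bank's production system.
    print(classify(saturated=False, private_correctness=True).name)  # UNTRAINABLE
```

In practice both axes are matters of degree rather than booleans; the sketch only makes the three corners named in the text explicit.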
You can imagine a model far smarter than any person, but it still must be let in, and someone must sign off on what it does. That door has a lock and a latch. The lock is the environment: only once trust has been established inside a system, through security reviews, integration, and contracts that assign responsibility for results, can you verify whether the AI has actually done something useful. The latch is the user. Today, most American doctors open OpenEvidence daily, something no amount of compute can buy. A lab could train a perfect medical model tomorrow and still be unable to insert it into doctors’ habits or hospital decision workflows. Trust is built slowly, through relationships and user acceptance, and gradient descent cannot wash that away.

This is also the work of application companies. An application can occupy the "untrainable" corner by doing the unglamorous work: organizing a company’s private reality so models can act on it, providing tools for taking action, and working with clients to change their operational workflows. A company capable of this kind of "translation" is hard to replicate, and the translation never truly ends: integration and maintenance continue for as long as the client relationship lasts. The winners are the teams that embed domain-specific engineers and tools in their clients’ environments.

Consider a top-tier, long-established law firm that handles nearly a thousand M&A transactions a year. You can’t have hundreds of lawyers and assistants individually downloading client files for a general agent to read. Confidentiality already forbids it, not to mention a dozen other problems. Even if it were possible, what you would learn is only fragments: an assistant corrects one small part once, and no one can see how an entire deal flows. The truly important signals exist at the transaction level.
Each deal has its shape. For M&A: NDAs, term sheets, due diligence, purchase agreements, annexes, closing checklists. For IP litigation: motions, evidence disclosures, prior art, more motions. Every practice area has its own structure, and lawyers and tools cannot be swapped at will.

And the firm’s real, higher-level challenge is how to run every practice area at once, the way top partners manage hundreds of matters in parallel while onboarding new clients and training junior lawyers. Transforming such a firm isn’t a single evaluation task. It takes a handler who manages it like "data baseball": with highly fuzzy intermediate goals, incomplete feedback, long cycles, and an environment that never stands still.

Unfortunately, unreadable value is also hard to sell, for the same reason it’s hard to commoditize: from the outside, a company cannot judge whether an AI product will truly transform its operations the way benchmarks suggest. So the strongest companies will stop trying to prove themselves from outside and instead go inside the client and price on results. Sierra charges only when its agent solves the client’s problem; if the issue is handed off to a human, it doesn’t charge. Price itself becomes an evaluation mechanism. This works because Sierra has the authority to define "solved." Cognition did the same in software with Devin, launching a "performance guarantee." Only when you are trusted enough to operate inside a system can you offer such guarantees.

Even at the token-serving layer, which everyone calls a pure commodity, performance isn’t truly commoditized. The best AI-native companies concentrate their inference with one or two providers, such as Baseten or Fireworks, because while the cost per token trends toward commodity pricing, reliability under real load and stable access to scarce compute are not commodities. Where inference is served and which models are used are two separate choices.
The only part of inference that truly resembles a commodity is the price.

A common rebuttal goes: labs are your suppliers, so why wouldn’t they dump their own first-party products below cost to crush you? Or revoke your API access and take the market? That is the real version of the despair. But it holds only if the model layer is a one-player game, and clearly it isn’t. The model layer is more like a death race among three and a half players, with a few international competitors six months behind in training progress, and a development alliance five times the size of last year. Customers want competition among suppliers; labs want market share, not the death of any particular application.

You can see this in markets where labs compete head-to-head. In consumer chat, the best model has never simply taken the whole market. ChatGPT has held its lead through years of real competition; its recent share loss has gone to Gemini, driven by Android and search distribution rather than model quality. Anthropic is currently judged to have the best models by prediction markets and internet sentiment, yet it is hardly a major player in consumer chat and has instead built its business in enterprise and coding. If a better model can’t steal users from core applications, it won’t easily take over a hospital’s medical records system or a bank’s risk management system through integration. Today, public product choices are based not just on coding ability. If the frontier stays crowded, the application layer above it will hold value.

If a task cannot be scored from outside, someone inside must decide what counts as a good answer. That decision is the game itself. When enough such decisions are written down, they become benchmarks. Harvey has released legal benchmarks; Sierra has benchmarks for voice agents. Your right to define what "good" means in a domain comes from the fact that the domain already uses you.
These companies earned that right through the hard process of real adoption. The evaluation that truly moves money is private and company-specific: what a given firm considers good work on a given kind of matter. And it is far from complete, because the depth of the law exceeds any public test. OpenEvidence is accumulating a record of what constitutes a safe clinical answer. None of this is "measurement" in the strict sense; these are judgments about what is real and what is good, written down until they become standards everyone must accept.

No matter how smart the foundation model labs become, they cannot invent these standards out of thin air, because that authority exists only within the domain, and it tends to stay where it already resides. Senior lawyers write the legal benchmarks. Doctors define safe clinical answers. What "solved" means is decided by the company that already has the client relationship.

The absorption boundary will keep rising as we learn to measure more kinds of work, and what is measurable will be swallowed. The untrainable ground will shrink beneath those standing on it, so you can’t just find a defensible position and stop. You must keep moving toward the areas that cannot yet be scored, continuously re-underwriting and reassessing your risks.

On narrow tasks, with private data and your own evaluation systems, you can train to the frontier and beat general models in the scenarios that matter; that specialized model becomes part of your moat. Competing on general model capability, by contrast, is a capital war that you will lose to whoever has the most compute. That’s why companies with only shallow access and highly readable tasks are the most vulnerable. When a company chooses, for survival, to train beyond the frontier on a broad set of general tasks, the outcome often depends on data center scale. The final result is rarely a standalone champion; more often, the company is sold to a well-resourced player.

All of this is defensive.
The harder part is offense: deciding, first, what to build. This is what I’ve been searching for this year, and I’ve only found it three times. Models can’t help here. Point them at anything and they will do it, but they can’t tell you what is worth pointing at. You can’t build a benchmark for it, so you can’t train for it. That’s why the giants won’t take everything: they will defend their existing turf, while the next breakthrough comes from someone who discovers a new use before anyone else. Perhaps intent is a scarcer input than compute.

Half of this despair is correct. Thin shells are indeed being absorbed, and many things that look like companies today are just shells. But its judgment of "what remains after absorption" is wrong. The mechanism is clear; the endpoint is not. I am willing to bet on this direction: intelligence will keep getting cheaper, and value will keep sliding into places that few models can reach. The value of the untrainable is rooted in history.

So enter one of these domains, do the unglamorous translation work, and then define what "good" means there, because someone will. This year’s most-cited benchmark scores are really a map of territory that will soon be worthless, and a notice that some will soon lose the right to define what "good" means.
