AI investors' 2026 anxiety: When models consume everything, what is left of startups' moat?

Author: Sarah Guo
Compiled by: Deep Tide TechFlow

Deep Tide Introduction: When large models begin to crush humans on every leaderboard, investors fall into a kind of despair: besides Anthropic and NVIDIA, what else is worth betting on? This top Silicon Valley investor uses data and case studies to show that the real moat isn’t on the charts. It’s hidden in places that benchmarks can’t measure.

By mid-2026, the investor’s version of AI madness has become a form of despair: there’s nothing worth investing in anymore. We should put all our money into Anthropic and NVIDIA and go home.

I’ve never felt that way. I’m convinced models are already several versions smarter than I am, and I’m happy to buy Anthropic and NVIDIA at market prices. All my smartest friends are also quite sure that self-improvement will succeed soon. Yet I still can’t feel that despair.

This despair isn’t stupid. The logic is simple: if models keep getting better at everything, then every company built on top of them is just a thin wrapper waiting to be absorbed. The only value that survives is compute and frontier weights.

Take software, the case the despairing rely on most. When Devin was released in 2024, it could solve only 13% of tasks on a standard software benchmark and was basically ignored. A year and a half later, the best agents score above 80%, and they’re doing real work inside Goldman Sachs and the U.S. Army. Almost everyone drew the same incorrect lesson: models are eating software engineering.
But as models devour the most easily measured parts of software engineering, we’re rediscovering what many teams have known for a long time: engineering has always resisted measurement, and the most measurable parts are not the only important ones.

MIT’s Mert Demirer and his coauthors finally put numbers on it: among more than 100,000 developers, the latest coding agent increased the amount of code written by about 180%, while the amount of code actually deployed increased by about 30%. Writing code got cheaper. The rest still requires humans, and it matters. Of course, the net impact is still astonishing.

A benchmark is something you can measure, and what you can measure is what you can train on. That’s why coding agents matured first: compilers are free validators, test suites are free validators, and when answers can be checked for free, you can keep iterating until you beat the benchmark. But tests have never told you whether a change is correct for a decade-old repository with three undocumented modules, kept alive by a cron job that nobody wants to admit they wrote.

You can’t read that kind of correctness off a leaderboard. In fact, you can’t read it off anything. You learn whether such a complex system works by running it long enough in the real world, and smarter models don’t make the world run faster. Nobody runs unit tests on something the scale of Google and then trusts the green checkmark. They trust it because it has withstood years of real load. This kind of correctness isn’t just private. It’s a slow moat, the kind that capital can’t collapse. Even optimists admit the clock can’t be skipped: Noam Brown, a pioneer of OpenAI’s reasoning models, recently wrote that the only reliable way to evaluate an agent over a one-year time horizon might be… to run it for a year.

As Gabe Pereyra put it, true automation isn’t only models getting better.
It’s products, models, workflows, and companies moving together, and of those four, three move at organizational speed.

The parts that move at organizational speed are the parts benchmarks can’t reach: getting a skeptical partner to change how she handles tasks, and keeping the team cohesive during a rebuild. That’s why, when we hire CEOs, people-management ability matters at least as much as analytical ability, and a smarter model won’t change that weighting. Feedback is fuzzy, the time horizon is measured in years, and trust belongs to a person. I know companies where every engineer has adopted cutting-edge coding models, but none that has changed its engineering organization at anything like that speed. Adoption takes a quarter (and what a miraculous quarter of token growth!), but rebuilding takes years.

What’s visible is what’s leaving. Valuable work is structurally invisible: anything you can put on a leaderboard, you can train for, so anything measurable is already on the road to commoditization. This process takes time and will never be fully complete, but the direction never reverses. My friend Matt MacInnis at Rippling puts it in money terms: tokens spent answering general questions are almost worthless, because anyone’s model can answer them. Tokens spent reasoning over your company’s data are worth much more, because they do what you actually want, not just what looks reasonable.

Visible work is being consumed from two directions. From below: task saturation. Once a job can be checked cheaply, buyers stop asking which model did it and start asking how much it costs. The work shifts to whichever open-source or distilled model is cheapest that week. Wherever margins can make a difference, they eventually do. From above: labs are trying to get models to devour their own scaffolding. Retrieval, routing between cheap and expensive calls, tool use, even reasoning strategies: everything that once wrapped the model is pulled into the weights until the wrapper becomes the model. That’s absorption of the frontier.
And margin pressure pushes back in the opposite direction too: general agents must be ready for anything, which is expensive. Focused applications can tune a workflow until it runs on a small fraction of the token spend, and unlike the labs that sell those tokens, they keep the spread.

So we can ask two questions about any kind of work. Is its correctness private and costly to establish, a truth that exists only inside someone’s data? Is it isolated, locked inside a system you can’t access? Cross those with how saturated the task is, and you get a 2x2 matrix. Saturated work with public answers is commodity tokens, and open-source models own it. Frontier work with public answers, where coding benchmarks live, is where labs win, because when evaluation is free, owning it doesn’t matter. The prize sits in the final corner, the untrainable one: the private frontier, where correctness exists only in private. You can see it in the inference clouds that serve AI-native pioneers, where most tokens are generated by custom models rather than generic open-source models.

The height of the wall into that last corner varies. A toy codebase from an individual developer is portable and standardized, so the climb is short. Bank production systems are neither: you won’t get root access just because you score 2% higher on SWE-bench Verified.

Capabilities eat many things, but better models won’t turn private ground truth into public fact. They don’t hold licenses, don’t sign for liability, and don’t own a company’s documents. When an answer is wrong, they can’t be the party that gets sued. Intelligence isn’t the bottleneck here. Licensing is. So is liability. You can imagine a model far smarter than everyone else, but it still has to be allowed through the door, and someone has to sign off on what it does.

That door has a lock and a latch.
The lock is the environment: you can only verify whether AI has done something useful once it is trusted inside the system, after security review, integration, and a signed contract for the results. The latch is the user. Today, most doctors in the U.S. open OpenEvidence every day, and no amount of compute can buy that. Labs could train a perfect medical model tomorrow and still fail to enter doctors’ routines, or the decision-making process at UCSF, because trust is built slowly, through relationships, and requires the user’s consent, not gradient descent.

That is also what the work is. An application earns its spot in the untrainable corner by doing unglamorous work: arranging a company’s private reality so the model can act on it, providing the tools that let the model take action, and working with customers to change their employees’ reality. A company that does this translation is hard to copy, and the translation never ends. Integration and maintenance last as long as the relationship does. Teams that place domain-specific engineers and tools right next to the customer win.

For example, at a top white-shoe law firm, M&A alone runs to nearly a thousand deals per year. For reasons of confidentiality and many others, you can’t let hundreds of associates each download client files to their desktops and ask a general agent to review them. Even if you could, what you would learn would be fragments, one associate’s correction at a time, without seeing how the entire deal flows. The critical signals live at the transaction layer, which has a shape: for M&A, confidentiality agreements, term sheets, due diligence, purchase agreements, annexes, and closing checklists; for IP litigation, motions, disclosures, prior art, and more motions. Each business domain has its own structure, and lawyers and tools can’t be swapped across domains.
The actual problem a law firm solves sits one layer above all of that: running each practice area in parallel, like top partners running hundreds of matters at the same time, while taking on new matters and training associates. Transforming such a firm isn’t a single task you can evaluate with a single metric. It takes an operator working from data, with extremely vague goals, incomplete feedback, long time horizons, and an environment that never stands still.

Unfortunately, invisible value is also hard to sell, for the same reason it’s hard to commoditize: a company can’t judge from the outside whether AI will transform its operations, any more than a benchmark can. So the strongest companies stop trying to prove it externally and instead move inside and price on results. Sierra charges when its agents solve a customer’s problem, but not when the issue is handed back to a human, so the price becomes the evaluation. That only works when Sierra has a definition of what “solved” means. Cognition’s Devin takes the same approach in software: it offers “performance guarantees,” and those guarantees only make sense inside systems you’re trusted to enter.

Even token serving, the layer everyone likes to call a pure commodity, doesn’t behave like one. The best AI-native companies concentrate their serving on one or two providers (Baseten or Fireworks), because token prices are expected to commoditize, but real reliability under real traffic and guaranteed access to scarce compute are not. Where you serve is a different choice from which models you use. Price is the only part of inference that functions like a commodity.

A commonly raised counterargument is that labs are your suppliers: why wouldn’t they run their own first-party products below cost to squeeze you, or revoke your API access and take over the market? That’s the real version of despair, and it only holds if the model layer is a single-player game.
Clearly, it’s not. What it looks like is a death race among three and a half parties, with a group of international players training six months behind, and an alliance scale five times larger than last year’s. Customers want competition among suppliers, and labs want market share, not to kill any particular application.

You can see this in the market where labs face off head-to-head. In consumer chat, the best model has never simply won. ChatGPT has held its lead through years of real competition; the market share it has lost is flowing to Gemini, carried by Android and search rather than by better models. Anthropic, rated highly in prediction markets (and in internet sentiment) as the company with the best models, is almost irrelevant in consumer chat, but it has built its business in enterprise and coding. If better models can’t steal users in the most central applications, they won’t cut through the integrations around hospital records or bank liabilities either. Public choices today aren’t just about coding. If the frontier stays crowded, the layers above it will be valuable.

If work can’t be scored from the outside, someone inside has to decide what even counts as a good answer, and that decision is the entire game. Write down enough of those decisions and you have a benchmark. Harvey published one for law, and Sierra published one for voice agents. By being the one already used in a domain, they win the right to define what “good” means for that domain, through real adoption and the struggle to earn it.

The decisive evaluations are private and vary by company: what one company considers good work on a given matter is far from the whole picture, because the depth of the law makes any public test look pale by comparison. OpenEvidence is defining what safe clinical answers look like. None of these are truly measurements.
They’re judgments about what is true and good, written down until they become the standard everyone else measures against. Foundation-model labs, no matter how smart, can’t write them, because that authority exists only inside the domain. This kind of authority tends to stay where it already sits. Senior lawyers write legal benchmarks. Defining safe clinical answers falls to doctors. And “solved” means whatever the company that owns the customer says it means.

As the frontier is absorbed, the bar keeps rising, because we keep learning to measure more kinds of work, and what becomes measurable gets eaten. The untrainable ground shrinks under everyone’s feet, so you can’t find a defensible point and then rest. You keep moving into things that can’t yet be scored, continually re-underwriting. On narrow tasks, using your private data and your own evaluations, you can train up to the frontier and beat general models in the places that matter, which makes that specialized model part of the moat. Competing on general models, on the other hand, is a capital war. You lose to the people with the most compute, which makes it a trap for companies with shallow access and visible tasks. The pitch is that one day, to survive, their models will surpass the frontier across general tasks. But the winner is mostly determined by data-center scale, and the outcome is usually not an independent champion but an acquisition by someone with abundant compute.

All of this is defense. The harder part is offense: choosing what to build first. That’s what I spent a year trying to find, and I might have found it three times. Models don’t help here. They’ll do anything you point them at, but they can’t tell you what’s worth pointing at. You can’t benchmark that, so you can’t train it. That’s also why incumbents don’t take everything: they hold the ground they already have, and the next thing comes from the people who discover new uses before the rest of us.
Maybe the truly scarce input is something rarer than compute.

Despair is right about half of it. Thin wrappers really are being absorbed, and many things that look like companies today are indeed thin wrappers. But it’s wrong about what remains. The mechanism is clear; the destination isn’t. My bet is on the direction: intelligence keeps getting cheaper, and value keeps sliding into the few places models can’t reach. The untrainable has historically been valuable. So go there, do the unnoticed translation, and start writing down what it means to be good there, because someone will. The most cited benchmark scores this year are a map of a territory about to become worthless, and a notice of who’s about to lose the right to define what counts as good.
