When reasoning becomes a scarce resource, who captures the value?
Original author: Frank Fu
Original source: IOSG Ventures
The hole David Cahn pointed out in 2023 was never filled on the training side. It was filled on the inference side, and the market only began to price it in over the past few weeks. Now that Nvidia has reorganized its financial reporting around "service tokens," Cerebras has gone public with a 20x oversubscribed IPO, and the bottleneck debate is settled, the real question is the next one: when inference becomes a scarce resource, where does its value settle within the compute stack?
Following GPU trends: from a $200 billion problem to a $600 billion problem
In 2023, Sequoia’s David Cahn raised the question hanging over the entire AI buildout, known as the "$200 billion problem." For every dollar spent on GPUs, roughly another dollar is spent on data center power to run them, meaning each year's GPU CapEx must ultimately generate about $200 billion in revenue to recoup that capital. Even with very generous assumptions about AI revenue, he found a gap of over $125 billion between "investment" and "actual customer payments." The concern was straightforward: GPUs are being overbuilt ahead of actual demand.
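The arithmetic behind the "$200 billion" and "$125 billion" figures can be sketched in a few lines. The sketch below is a rough reconstruction, not Cahn's own model: the $50 billion GPU run-rate, the 2x margin multiplier (buyers of compute needing roughly a 50% gross margin), and the $75 billion revenue estimate are assumptions recalled from the 2023 framing and do not appear in this article.

```python
# Rough reconstruction of the "$200 billion problem" arithmetic.
# All inputs are assumptions for illustration, in billions of USD.
GPU_RUN_RATE_BN = 50.0         # assumed annual GPU spend
DATA_CENTER_MULTIPLIER = 2.0   # ~$1 of data center/energy cost per $1 of GPU
MARGIN_MULTIPLIER = 2.0        # assumed: compute buyers need ~50% gross margin
ASSUMED_AI_REVENUE_BN = 75.0   # assumed "generous" AI revenue estimate


def required_revenue(gpu_spend_bn: float,
                     dc_mult: float = DATA_CENTER_MULTIPLIER,
                     margin_mult: float = MARGIN_MULTIPLIER) -> float:
    """Revenue needed to pay back one year of GPU spend."""
    return gpu_spend_bn * dc_mult * margin_mult


def revenue_gap(gpu_spend_bn: float, actual_revenue_bn: float) -> float:
    """Shortfall between required revenue and actual customer payments."""
    return required_revenue(gpu_spend_bn) - actual_revenue_bn


print(required_revenue(GPU_RUN_RATE_BN))                    # 200.0
print(revenue_gap(GPU_RUN_RATE_BN, ASSUMED_AI_REVENUE_BN))  # 125.0
```

Under these assumptions the gap scales linearly with GPU spend, which is why a larger CapEx base a year later produced a larger headline number rather than a smaller one.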
A year later, the gap had not narrowed; it had widened. In his 2024 follow-up, as hyperscaler CapEx ballooned, he restated it as the "$600 billion problem." The bearish logic converges on a familiar shape: overbuilding leads to oversupply, which burns capital.
Both articles are essentially asking the same question: who will fill this gap? The answer has never appeared on the "training" ledger. It appears on the inference side, and only in the past few weeks has the market started to price it in.
Cerebras IPO and inference squeeze
Cerebras went public on Thursday. The IPO was 20 times oversubscribed, and the shares traded at nearly double the final offering price set on Wednesday. The demand wasn’t driven by bets on "the next Nvidia killer," but by a simpler realization: the real bottleneck in AI is inference, not training.
Cerebras’ core strength is a chip architecture that makes inference extremely fast. Not training—just inference. That’s what excites Wall Street. The inference market is recurring; it expands with usage. Every time Claude answers a question, every time an agent executes a task, compute power is consumed. Training happens once; inference never stops.
J.P. Morgan estimates the inference market size to be 10 to 50 times that of training. As machines begin executing tasks assigned by other machines—agentic expansion—demand for inference no longer grows with the number of users but with compute capacity itself.
Nvidia redraws the landscape: inference becomes headline news
If Cerebras represents the market awakening, Nvidia’s latest quarterly report confirms it from the top of the industry. On the earnings call, Jensen Huang made it clear that AI demand is growing exponentially. The reason is simple: agentic AI has arrived. Mainstream AI has moved from one-shot inference to reasoning, and now into the agent stage, where systems call tools and orchestrate tasks themselves. Huang said, "Tokens are now profitable." In the AI era, compute is revenue and profit.
This reshapes the entire industry. Training is a one-time cost to build a model; inference is the ongoing operational expense. The current bottleneck is inference, not training.
Nvidia has built this view into its financial reporting. It now reports two platforms instead of one: Data Center (about $75 billion this quarter, +92% YoY) and Edge Computing (about $6.4 billion, +29% YoY). Data Center is further divided into Hyperscale (~$38 billion, +12% QoQ) and ACIE, which covers AI cloud, industrial, and enterprise (~$37 billion, +31% QoQ). Edge Computing is the new line; it covers agentic AI and physical AI endpoints such as PCs, workstations, AI-RAN base stations, robots, and cars.
Edge still accounts for less than 8% of total revenue, but Nvidia has elevated it to a "second platform" alongside data centers. The signal is that inference is splitting into two fronts: cloud inference in data centers and endpoint inference at the edge, where AI needs to see, move, and act in the physical world. The roadmap follows the same logic: Vera Rubin, which starts shipping in Q3, is slated to deliver up to 35 times the inference throughput of Blackwell, and Huang also introduced a new $200 billion TAM for Vera CPUs designed for agentic workloads. Leading model companies are expected to switch to it on day one.
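As a quick check on the "less than 8%" claim, the segment shares can be recomputed from the rounded figures quoted above. This is a minimal sketch; the dictionary and function names are illustrative, and the inputs are the approximate values in the text rather than exact reported numbers.

```python
# Approximate quarterly segment revenue from the text, in billions of USD.
segments_bn = {
    "Hyperscale": 38.0,      # part of Data Center
    "ACIE": 37.0,            # part of Data Center
    "Edge Computing": 6.4,
}


def revenue_shares(segments: dict) -> dict:
    """Return each segment's share of the combined total."""
    total = sum(segments.values())
    return {name: value / total for name, value in segments.items()}


shares = revenue_shares(segments_bn)
data_center_bn = segments_bn["Hyperscale"] + segments_bn["ACIE"]

print(f"Data Center total: ${data_center_bn:.0f}bn")
print(f"Edge share of total: {shares['Edge Computing']:.1%}")
```

With these inputs the two Data Center lines sum to the roughly $75 billion headline, and Edge lands just under 8% of the combined total.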
As the world’s most valuable company reorganizes its financial disclosures around "service tokens," the bottleneck debate is settled. The rest of this article discusses who captures the value when inference (not training) becomes a scarce resource.
Let’s clarify the scope. Of these two fronts, this article discusses cloud inference: API token services served from rented data center GPUs. Endpoint inference runs on-device, on local chips (Nvidia Jetson, RTX, Drive, AI-RAN), completely independent of the GPU leasing and aggregation stack. Think of endpoint inference as something that amplifies the inference economy and supports the bottleneck argument, not as the market where Hyperbolic and Venice operate, which is entirely cloud-based.
The squeeze is already happening
Anthropic is the canary in the coal mine. Usage far exceeds pre-provisioned capacity, and complaints about Claude being "brain-dead," slow inference, and compressed context windows flood the internet. The solution is raw compute: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX, with over 220k Nvidia GPUs and 300+ MW, dedicated to inference, not training.
This capacity unlocked a series of quota adjustments, each of them a signal. On May 6, Anthropic doubled the five-hour limit for Claude Code, removed peak-time throttling, and significantly raised Opus API rate limits. On May 13, it raised Claude Code’s weekly quota by another 50% (until July 13). Then, starting June 15, it moved in the opposite direction: it shifted agentic and programmatic use (Agent SDK, headless claude -p, CI pipelines) out of flat subscriptions into a separate, metered credit pool ($20–$200/month, billed at API rates). This step captures the core argument: agents consume far more inference than flat subscriptions can support, so that usage must be priced as a recurring operational cost.
Training is a one-time capital expense. Inference is a recurring operational cost that compounds with each new user and agent.
The stack: six layers, one bottleneck
Every AI application sits within a supply chain that runs from TSMC wafer fabs to API endpoints.
Most companies only own one layer. Nvidia owns silicon, CoreWeave owns bare metal, Together AI owns inference optimization, OpenRouter owns model API routing.
Only one company is an exception.
Hyperbolic: the only company spanning three layers
Hyperbolic launched its on-demand GPU marketplace in June 2025. In the first few months, it attracted over 200k developers, covering cutting-edge AI labs, search, and large consumer platforms.
Its architecture is interesting.
Hyperbolic doesn’t own a single GPU. Every card comes from neoclouds and data centers, including CoreWeave, Lambda Labs, Nebius, and smaller operators with idle capacity. This might look like a weakness, but it is actually a moat.
By sitting between GPU supply and demand, Hyperbolic sees real-time data that others cannot. It knows who is buying which GPUs, at what prices, and when. It sees excess supply before it becomes public knowledge, and it sees demand surges before they hit the market.
Today, the moat is this multi-cloud aggregation. Hyperbolic stitches together fragmented capacity from dozens of clouds and data centers into a standardized, unified pool, allowing developers to rent the cheapest available GPU anywhere without negotiating with each provider or managing multiple accounts. The more clouds it connects to, the deeper the liquidity, and the richer the pricing data. Going forward, the team is exploring how to model GPU price curves using this data and eventually deploy its own capital to smooth supply and demand, acting as a market maker for physical compute; but this is still early, and the aggregation layer is the real compounder at present.
This is the flywheel:
More cloud access → more aggregated supply
More supply → deeper markets and real-time pricing data
Better data → smarter routing today, and long-term pricing models
Better liquidity and prices → more developers → more clouds wanting to connect
No other company is attempting this. Hyperbolic is the only one spanning GPU leasing, deployment, and model API layers.
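The core routing idea in this section, renting the cheapest available GPU across many providers, can be sketched as follows. This is a minimal illustration and not Hyperbolic's actual system: the `Offer` type, the `cheapest_offer` function, the provider labels, and all prices are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Offer:
    provider: str      # hypothetical provider label
    gpu_model: str
    hourly_usd: float  # price per GPU-hour (made-up figure)
    available: int     # GPUs currently free


def cheapest_offer(offers: Iterable[Offer], gpu_model: str,
                   count: int) -> Optional[Offer]:
    """Return the lowest-priced offer that can supply `count` GPUs, or None."""
    candidates = [o for o in offers
                  if o.gpu_model == gpu_model and o.available >= count]
    return min(candidates, key=lambda o: o.hourly_usd, default=None)


offers = [
    Offer("cloud-a", "H100", 2.40, 16),
    Offer("cloud-b", "H100", 1.95, 8),
    Offer("cloud-c", "H100", 1.80, 4),  # cheapest, but too small for 8 GPUs
    Offer("cloud-b", "A100", 1.10, 32),
]

best = cheapest_offer(offers, "H100", 8)
print(best)
```

Each additional provider in `offers` can only lower or hold the best available price, which is the liquidity side of the flywheel; the stream of prices and fills passing through such a router is the data side.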
Venice as a mirror
Venice exemplifies the inference economy at the application layer and offers a useful contrast to Hyperbolic’s position. It is a privacy-first inference app: an OpenAI-compatible API plus consumer subscriptions (Free / Pro / Pro+ / Max), routing requests to about 75 models. About two-thirds of them are open-source or self-hosted (Llama, Mistral, Qwen, DeepSeek); the rest are closed-source frontier models reached through anonymized passthrough. The key point is that Venice itself doesn’t own significant compute. It rents from undisclosed GPU partners and confidential compute providers (NEAR AI Cloud, Phala) and pays frontier labs for passthrough, so its true cost of revenue is inference compute, not SaaS hosting.
Venice’s real value proposition is privacy. This "privacy" isn’t about turning public compute into private assets but about wrapping commercial inference with guarantees: no data retention, no training on user data, anonymized requests, and some workloads running inside a TEE so operators can’t see plaintext. The underlying compute is a commodity, but the added privacy layer commands a premium. That layer is tiered and uneven: for open-source models running on self-controlled or TEE GPUs, near end-to-end confidential computing is possible; for closed models like Claude or GPT, anonymized passthrough only strips identity, while the frontier labs still process raw prompts. Privacy coverage is strongest for the open-source part; for frontier models, privacy means "anonymity," not "true confidentiality." Venice’s gross profit equals subscription revenue minus the inference costs it pays to its upstream providers, and what it charges above raw API prices is supported almost entirely by this privacy premium, which explains its thin margins and its dependence on frontier passthrough pricing.
Token design packages this inference demand. Venice operates two tokens: VVV (staking and platform access) and DIEM, an inference credit, with each DIEM roughly equivalent to $1 of compute per day. Paid subscriptions trigger a programmatic buyback and burn of VVV (roughly $2 / $5 / $10 for Pro / Pro+ / Max), and issuance decreases on a fixed schedule: 6M → 5M → 4M VVV monthly, down to 3M after July 1. Buybacks are real but discretionary and modest: about $103k in April, about $105k in May, and in June slowly approaching about $110k, well below the $200k monthly line.
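To put the buyback and issuance figures in proportion, the short sketch below expresses each month's buyback as a fraction of the $200k line and each issuance step as a percentage cut. The inputs are the approximate values quoted above (the June figure in particular is an estimate), and the function names are illustrative.

```python
# Approximate figures from the text; June is an estimate, not a final number.
BUYBACKS_USD = {"April": 103_000, "May": 105_000, "June": 110_000}
MONTHLY_LINE_USD = 200_000
ISSUANCE_VVV = [6_000_000, 5_000_000, 4_000_000, 3_000_000]  # monthly schedule


def share_of_line(amount_usd: float, line_usd: float = MONTHLY_LINE_USD) -> float:
    """Buyback size as a fraction of the reference monthly line."""
    return amount_usd / line_usd


def step_cuts(schedule: list) -> list:
    """Fractional reduction at each step of the issuance schedule."""
    return [(prev - nxt) / prev for prev, nxt in zip(schedule, schedule[1:])]


for month, amount in BUYBACKS_USD.items():
    print(f"{month}: {share_of_line(amount):.1%} of the $200k line")

print([f"{cut:.1%}" for cut in step_cuts(ISSUANCE_VVV)])
```

On these numbers each month's buyback sits a little above half of the $200k line, while each issuance step is a proportionally larger cut than the one before it.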
The fundamentals are healthier than the headlines. The widely circulated "$70 million ARR" figure is almost certainly a misinterpretation of renewal revenue as net new customer acquisition; a more defensible observable range is closer to $6 million to $15 million ARR. Beneath that, traction is real: about 136k token holders, roughly 9.9 million website visits per month (about 330k daily), with new Pro subscriptions hovering around 1,400 per day. It’s a real business, but a thin-margin one, its economics constrained by the compute it purchases.
This is precisely why Hyperbolic sits one layer upstream of Venice. If Venice is a gas station, Hyperbolic is a refinery. Venice buys compute from the shared, limited supply everyone depends on; Hyperbolic aggregates and standardizes that fragmented supply, then sells it to Venice and others like it. As inference demand grows, value accumulates not just in the applications that consume compute but also in the layer that aggregates and routes it, capturing the cost of revenue those applications pay.
Why this matters now
Nvidia is reorganizing its financials around "service tokens." Cerebras’ IPO proves the market understands inference is the bottleneck. Anthropic’s capacity hunt proves it’s a real issue. Agentic and physical AI will multiply demand by several orders of magnitude, spanning cloud and edge.
It also completes the circle around the "$600 billion problem." Cahn’s bearish logic (overbuilding, then oversupply) will likely be validated. But oversupply is precisely the best scenario for lightweight aggregators: as GPU prices fall and supply fragments across dozens of clouds, the player that owns no hardware but routes workloads to the cheapest available cards profits from the spread, while operators holding depreciating GPUs bear the losses. Hyperbolic is betting on oversupply, not against it.
The ultimate winner won’t be the one with the most GPUs but the one that can tell you where GPUs are, at what price, and route workloads to the lowest-cost provider.
Hyperbolic is building such a company. It owns no GPUs, is purely software-based, spans three layers, and aims to become the ultimate aggregation layer for inference compute.