People who can't afford tokens have become the lower-tier market of the AI era.

Author: Huang Yiting

In 2026, what is the most “luxurious” consumption in human work? The answer is not buying a top-spec, high-performance computer, nor outfitting yourself with a few decent outfits, but being able to use the world’s most advanced AI tools without restriction and without counting costs.

This means you no longer need to rack your brain optimizing prompts to control costs, dreading the message “Today’s free quota has been used up”; nor do you have to compare options over and over, reluctantly setting aside your beloved Claude (a large language model developed by the American AI company Anthropic) and delegating the less important tasks to cheaper, lighter models.

AI is certainly useful, but every use has a price, and tokens burn so fast that the cost starts to feel unaffordable. Frugality and caution have become the truest state of today’s AI-era “workhorses.”

This reminds me of twenty years ago, the era of dial-up internet. Back then, bandwidth was scarce and expensive, and developers compressed images and streamlined code to save bandwidth, almost never uploading videos. Video startups like Tudou were rare, and the bandwidth consumption from videos became the main cost of website operation.

The scene is replaying itself.

In the AI industry chain, computing power flows like water from top to bottom. Starting from upstream GPUs (graphics processing units) and data centers, passing through cloud providers and model vendors, being encapsulated into APIs, and finally flowing to developers and ordinary users, turning into repeated calls and billable tokens. It seems intangible, but at each link, there are clear costs—GPU depreciation, electricity consumption, high-bandwidth storage—all eventually summed into bills.

Now this water pipe is getting congested. On one side, demand is exploding: complex scenarios such as multimodality and agents multiply token consumption by factors of thousands. On the other side, supply remains constrained: GPUs, HBM (high-bandwidth memory), electricity, and data-center construction all face physical limits, and GPU utilization is still relatively low. Intelligence comes at a price: although explosive growth has pushed token prices down, the total spent on calling them keeps rising.

The price increases are cascading. Upstream, GPUs are expensive and scarce; cloud providers in the middle are the first to adjust prices. Amazon, Google, Baidu, Alibaba, and other cloud providers have all raised fees on some AI-related services in the past quarter. Model vendors have ended their subsidies too: Tencent, Alibaba, and others have stopped free public testing and raised API call prices, with Tencent’s Hunyuan large model seeing a maximum price hike of 463%.

Price hikes on models and applications make computing power no longer an abstract concept in the competition among giants; it is now a paid lesson for ordinary people in the form of tokens. Just like traffic in the past, priced in MB (mobile data units), users can easily run up bills and get cut off.

Nvidia’s Jensen Huang recently proposed the concept of “Token Economics,” believing that inference has become the core workload of AI, and tokens are the new bulk commodity—standardized, measurable, tradable. As a result, tokens have evolved from a technical byproduct of model training into a key factor driving the digital economy.

In Huang’s view, “Tokens” as commodities have quality differences. From free tiers to top-tier levels, the price per million tokens ranges from $0 to $150. Low-latency, high-interaction tokens (like real-time conversations, autonomous driving) require expensive computing power and are priced high; high-throughput, offline processing tokens (like large-scale offline inference, batch data processing) are less sensitive to latency and can be produced with cheaper compute, thus priced lower.

Tokens have already created value layers as “commodities.” But what about the users of these tokens? Perhaps in the future, the definition of “downstream markets” will no longer be limited to those who can afford physical goods.

AI Users, Driven by Anxiety

“Am I not a distinguished member?” On the evening of March 11, Su Yu stared at the pop-up on her computer screen, feeling a flash of anger. The window warned that her token usage for the week had reached 90% of the limit, and that once the quota was exhausted, the relevant models would be suspended until the next weekly refresh.

Su Yu is a doctoral student at a university, currently preparing her graduation thesis. Over the past three years, Google’s Gemini and OpenAI’s ChatGPT have been her best partners, and she is a loyal subscriber to both. In mid-February, Anthropic’s Claude also joined her team and quickly became the one she trusted most.

“Claude is just so useful, such a solid tool,” Su Yu said. She had several AI applications working at once to help her brainstorm and design research models. ChatGPT’s answers lacked logical rigor; Gemini was too flowery and flattering. Only Claude, like a seasoned senior advisor, read the client’s needs carefully before producing genuinely usable, inspiring solutions.

After more than half a month of free use, Su Yu paid about 180 RMB for a monthly Claude subscription. What sets Claude apart from Gemini and ChatGPT is that it imposes daily and weekly token limits even on paying members. This is understandable: according to LMArena, the global blind-test leaderboard for top models, Claude’s flagship model Claude-Opus-4-6-thinking ranked first worldwide as of March 20.

But Su Yu had never felt token restrictions so directly before. The first time she hit Claude’s limit was on a Wednesday. She had only half worked through grounded theory when she could no longer call the model, and in that moment she felt her research grind to a halt. Accustomed to Claude’s assistance, she found it hard to return to her old way of working. She tried doing it “by hand,” flipping through the original theory books, but it was painfully inefficient. Nor did she fully trust some of the translated materials, thinking, “In the end, I’ll still have to recheck everything once Claude is back.” The four days of waiting felt like torture.

Claude’s usage limit made Su Yu intensely anxious. One Tuesday, she sent over a screenshot of Claude’s backend showing that 45% of her weekly limit was already gone. “It’s been less than two days this week! I’ve been very frugal, discussing only one paper topic a day, and it’s already near the limit!” Su Yu was nearly breaking down. Who said AI can’t replace humans? This AI is almost harder to handle than her advisor.

● Su Yu’s Claude backend. Photo source: interviewee

She has developed the habit of checking the backend after every question, afraid of running out of tokens. She even looks back on the days when she chatted idly with Claude and asked it to make PPTs, scolding herself for the waste.

This cautious use of “useful models” is gradually becoming common. An AI film-industry entrepreneur told me that his team uses ByteDance’s AI video model “Yimeng” alongside APIs from several other model vendors: “The better models really are more expensive. All we can do is switch between models to balance costs.”

Not long ago, Yimeng reduced its membership points quota. He thought it was normal: “Consumer users were being subsidized; now they’re just pulling some of it back.” But he also worried about his own position, sighing, “Now it’s even harder to afford.” Rising AI costs can directly threaten the survival of a small startup.

End users worry about tokens; model vendors worry about computing costs.

Discussing the reasons for the surge in token calls, Wang Jian, an academician of the Chinese Academy of Engineering, drew an analogy to the development of electricity. Early AI applications were like “turning on a light,” consuming a limited amount of power. But the new generation of applications, represented by OpenClaw (agents), is like switching on an “air conditioner,” demanding more and more of it.

However, Wang emphasized that this growth rests not only on the proliferation of applications but also on the falling cost of each individual token. “If electricity prices don’t fall, ordinary people can’t afford air conditioning.”

Compared to simple question-answering in the early days, now more tasks are completed through agents. Models need to decompose problems, call tools, write code, debug, and revise. A seemingly simple request often involves multiple reasoning rounds and API calls. Token consumption grows exponentially; although unit prices have fallen, overall computing costs are higher.

“Models have grown larger, and inference costs have risen accordingly. We also hope to bring pricing back to normal commercial value; relying on low prices long-term isn’t good for the industry, and that’s a consideration for us,” said Zhang Peng, CEO of Zhipu. In the past two months, Zhipu has raised prices on its GLM (Zhipu’s large language model) series three times, with some models approaching the pricing of top international models.

Zhang Peng’s other concern is, “The biggest challenge in the next 12 months may be computing power. All technologies, including agent frameworks, have greatly enhanced creativity and efficiency—by ten times. But the prerequisite is that people can actually use them; if computing power isn’t enough, a problem might make an agent think for a long time without giving an answer.”

Flowing Computing Power, Accumulating Costs

By Claude’s own reckoning, 100 tokens correspond to roughly 75 English words or 50 Chinese characters, and output tokens are priced at five times the input rate; that is the simple conversion. In other words, everything behind an AI response (the processing, querying, and generation, even errors caused by model hallucinations) counts as token consumption and ultimately turns into a real bill.
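As a rough illustration only, the conversion above can be written as a small cost estimator. The word-to-token ratio and the 5x output multiplier come from the article; the base input price here is an assumed placeholder, not any vendor’s actual rate.

```python
def estimate_request_cost(input_words, output_words,
                          input_price_per_m=3.0,     # assumed placeholder: USD per 1M input tokens
                          output_multiplier=5.0,     # article: output costs ~5x the input rate
                          words_per_100_tokens=75):  # article: 100 tokens ~ 75 English words
    """Rough cost of one request: convert word counts to tokens, then price them."""
    input_tokens = input_words / words_per_100_tokens * 100
    output_tokens = output_words / words_per_100_tokens * 100
    output_price_per_m = input_price_per_m * output_multiplier
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A 750-word prompt that draws a 1,500-word answer:
print(round(estimate_request_cost(750, 1500), 4))  # 0.033
```

Because output is billed at a multiple of input, verbose answers, retries, and hallucination-induced corrections dominate the bill, which is exactly the dynamic described above.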

Lin Zhijia, founder of AGI Era, did the math. He keeps four “lobsters,” some deployed locally and some in the cloud. For the cloud deployment he subscribes to a Coding Plan (an AI coding subscription service) at about 30-40 RMB per month. Nine days before the end of March, his token consumption was still under 10% of the package quota; as a media professional, his token demand isn’t high.

But paying per token isn’t always cost-effective. “If I just ask it to send me a news brief every morning at nine, the tokens cost about 0.9 RMB. Over 30 days that’s twenty-something RMB, roughly the same as buying the Coding Plan. And there’s waste sometimes: a model update alone can burn three or four RMB worth of tokens.”
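Lin Zhijia’s back-of-envelope comparison can be sketched in a few lines. The figures (0.9 RMB per daily news brief, a 30-40 RMB monthly plan) are his estimates quoted above, not measured prices.

```python
def monthly_pay_per_token(daily_cost_rmb=0.9, days=30):
    """Monthly cost of one fixed daily task billed per token."""
    return daily_cost_rmb * days

def cheaper_mode(plan_price_rmb, daily_cost_rmb=0.9, days=30):
    """Which billing mode wins for this usage pattern."""
    metered = monthly_pay_per_token(daily_cost_rmb, days)
    return "pay-per-token" if metered < plan_price_rmb else "subscription"

print(monthly_pay_per_token())  # 27.0 RMB: "roughly the same" as the plan
print(cheaper_mode(35.0))       # pay-per-token
print(cheaper_mode(25.0))       # subscription
```

At this usage level the two billing modes are nearly a wash, so a small change in habits (or one wasteful model update) can flip which one is cheaper.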

Balancing between different billing methods has become routine for high-frequency users, and the tiny costs of buying tokens all point to one thing—computing power, and the associated GPU depreciation and data center electricity costs.

GPUs are the starting point; the supply of high-end chips determines the system’s upper limit. “Apart from reserved backup machines for some clients, everything else is sold out, not a single card left,” said Liu Hua, deputy general manager of UCloud’s architecture tech center.

Below GPUs, data centers, networks, and storage systems must be built—high-speed interconnection and low-latency transmission are not “plug-and-play” standard parts. Liu Hua mentioned that costs for network and storage alone can account for about 20% of total compute costs.

One layer down are model vendors and API service providers. They deploy large models on this infrastructure, encapsulate them into standardized interfaces for developers. In recent years, these roles have begun to overlap; cloud providers sell both compute and model APIs, gradually becoming the hub connecting GPUs, models, and developers.

● Diagram of how compute flows. Source: AI-generated

Compute seeps down layer by layer. The latest change is on the demand side. “AI used to be paid for mostly by businesses; now consumer payment is becoming common,” said Lin Zhijia. Models are wrapped as APIs and entry points are simplified, lowering the barrier to use. Individual developers and even ordinary users can now call underlying compute directly. “These days, anyone scrolling social media knows how to use it.”

Compute is even turning retail. Since 2024, some cloud providers have launched GPU “day passes,” lightweight cloud hosts, and even “one-click deployment” products. UCloud’s 6.9 RMB trial package for “shrimp farmers,” for example, is essentially a ticket bundling environment setup and compute scheduling, letting users experiment at very low cost. “Many are here to test the waters or try something new,” Liu Hua said. “Everyone’s a bit anxious, afraid of falling behind.”

But lowering the threshold doesn’t mean costs decrease. “Using the internet development analogy, the current cost of compute power is undoubtedly still in an early, expensive stage,” Liu Hua believes. That’s why developers are careful with usage, and platforms dare not easily loosen call scales.

Even top-tier vendors are making trade-offs. OpenAI previously shut down the Sora video generation project, which industry insiders interpret as a balance between compute and ROI—focusing resources on core models and business. Big internet companies like Alibaba, Tencent, ByteDance have recently adjusted their AI businesses, also reflecting a focus on compute resource allocation.

Everyone is coming to the same realization: what matters in the future is not the scale of compute but how efficiently it is used. The chain reaction set off by compute shortages is like a long rainy season in the AI era; no one escapes the damp.

What Happens at the End of the Compute Flow

Su Yu is trying to allocate and schedule compute resources.

She categorizes different models: ChatGPT for official documents and summaries, Gemini for drawing and language details, Claude for core tasks like research frameworks, idea design, and long-text analysis. This maximizes her efficiency and budget.

For example, she recently processed a batch of interview materials, first asking Claude to give an analysis framework, then “throwing” that framework to Gemini for initial coding. “I trust Claude’s guidance more, but the detailed work can be handed to cheaper models.” If Claude had no limit, she would even stop using Gemini.

Of course, this isn’t an advertisement for Claude; she simply believes her needs fit this application best. Good models are a scarce resource, and scarce resources go only where they are needed most.

To save further, many users, like Su Yu, start to cut costs on details.

On social platforms, conversing with AI in classical Chinese briefly became a trend, because classical Chinese packs meaning into fewer characters, and fewer characters mean fewer tokens. Some even argue that saying “hello” or “thank you” to AI is an unnecessary waste of resources; after all, AI doesn’t need emotional value.

In fact, much waste isn’t within user control; sometimes it’s about how models are integrated and run.

Recently, Luo Fuli, head of the MiMo large model team, said, “I can’t precisely calculate the losses caused by third-party harness (interface) access, but I’ve closely observed OpenClaw’s context management, and it’s terrible. In a single user query, it triggers multiple low-value tool calls, each as a separate API request, with context windows often exceeding 100K tokens. The actual number of requests is several times that of Claude Code’s native framework. Converted to API pricing, the real cost is roughly dozens of times the subscription price.”

Back at the usage end, users actively economize on tokens, and platforms dare not fully open up to scale. This cost-saving “restraint” exposes a contradiction. OpenAI, for example, generated $4.3 billion in revenue in the first half of 2025 but lost as much as $13.5 billion, meaning it loses roughly three dollars for every dollar earned. Its biggest expense is investment in compute.

Today, compute power is no longer just about availability but about whether it can be sustained and to what extent. When AI is highly useful, people will reorganize work around it; when tokens become expensive and limited, this new organization will be forced to contract.

If compute power can’t truly become as ubiquitous as electricity in the future, AI will inevitably polarize, and the cognitive gap between people will widen further. For example, Su Yu doesn’t plan to fully share her AI usage methods with colleagues—how she interacts with Claude and what kind of data she feeds is her little secret, and for now, her competitive advantage.

If colleagues ask her for model recommendations, she would strongly suggest Gemini and ChatGPT, “Of course, DeepSeek is also a good choice.” Su Yu winked playfully.

In the era of “one-person companies (OPC)” and “super individuals” gaining popularity, such “little tricks” are not rare. When AI’s usefulness is translated into billable tokens, the real difference lies in how the user employs it.

(The name Su Yu in the article is a pseudonym)

Cover source: “Cosmos Exploration Editorial Department”

References

Emergence of Intelligence: “Yang Zhilin / Zhang Peng / Xia Lixue / Luo Fuli / Huang Chao on ‘Lobsters’ and ‘Token Economics’”

Daily Economic News: “AI Drives Massive Token Consumption: Memory Hardware Shortages, a Surge in Compute Leasing, and Operators Ramping Up Liquid-Cooled Servers” / “Zhipu’s Zhang Peng: When Models Are Strong Enough, APIs Are the Best Business Model”

Jiemian News: “Zhipu’s Stock Hits a New High as New-Generation Models Raise Prices by Another 10%”

TechFlow (深潮): “Tokens Going Overseas: Selling China’s Electricity to the World”

Silicon Star Pro: “Luo Fuli: Wake Up, It’s Time to End the False Carnival of Tokens”
