"Fable 5 costs far more than a Chinese programmer's daily wage. Burning millions of tokens a day on coding already counts as frugal, and even then you get a bill for thousands of RMB."

This is the reality unfolding now. According to the latest data, Anthropic's own spending on computing power has already reached 2.3 times its payroll. Taking $224k as the total annual cost of a senior engineer, that works out to about $515k of compute per engineer per year. In other words, the model costs more than the person.
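The per-engineer figure is simply payroll cost scaled by the compute-to-payroll ratio. The quick check below assumes the 2.3× ratio applies evenly to every engineer, which is a simplification because the article gives only the aggregate ratio:

```python
# Figures quoted above (USD per year).
SENIOR_ENGINEER_COST = 224_000   # total annual cost of one senior engineer
COMPUTE_TO_PAYROLL = 2.3         # Anthropic's compute spend as a multiple of payroll

def compute_cost_per_engineer(payroll_cost: float, ratio: float) -> float:
    """Annual compute spend per engineer, assuming the aggregate ratio holds per head."""
    return payroll_cost * ratio

cost = compute_cost_per_engineer(SENIOR_ENGINEER_COST, COMPUTE_TO_PAYROLL)
print(f"compute per engineer: ${cost:,.0f}")                     # compute per engineer: $515,200
print(f"compute exceeds salary: {cost > SENIOR_ENGINEER_COST}")  # compute exceeds salary: True
```

The result matches the "about $515k" given in the text.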

Facing such bills, even Claude itself has to start saving tokens.

Claude Code: Burning Tokens for the Illusion of "Being Highly Productive"

Recently, a new term has emerged in the industry: Token Apocalypse.

The move from token maxing to token apocalypse marks a major paradigm shift in the AI industry. As recently as March and April of this year, people were flaunting how many tokens they had used, almost like a leaderboard. But using AI does not automatically save money, so attention turned to the cost per token.

More subtly, large models are also inflating work that never needed AI in the first place. We no longer want to read PDFs or long articles ourselves, so we have AI summarize everything. Or we have AI turn them into slides and pass those to someone else, who might then use AI to read the slides... AI seems to be forcibly injecting a layer of value into work that was already rather superficial, while quietly driving up the bill.

Now, cost overruns have become the norm. Companies like Amazon, Adobe, Atlassian, and Citigroup have begun strictly controlling AI usage:

Limit model tiers: Employees at some companies are banned from using high-end models such as Claude Opus and forced to downgrade to cheaper versions;
Set personal limits: Uber capped each engineer's monthly token spending at $1,500;
Revoke permissions entirely: Institutions such as Citibank have completely restricted access to advanced AI tools, and employees who fail to meet usage targets have even had their corporate accounts revoked. Earlier, Uber's CTO admitted that the company had exhausted its annual AI budget within a few months, and Walmart has recently stopped using some tools as well.

Large companies are either scrambling for cost-saving measures or slamming the brakes on token waste. As a result, employees receive a deeply contradictory message: on one hand, "AI can make you 100 times more efficient, you must use it," and on the other, "Stop bankrupting the company."

This is also the most typical problem of the first wave of AI tool adoption: tools launch without enough guardrails to stop companies from spending millions of dollars on large language models, and without any mechanism to alert teams when tokens are burning fast. Whether chatbots or coding tools, many products prioritize "getting it usable" first and put cost governance, usage quotas, model tiering, and context management on the back burner.
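None of these guardrails is complicated to build. The minimal sketch below combines the controls mentioned above: a per-engineer spending cap, an alert when spending climbs fast, and a downgrade to a cheaper model tier. The `EngineerBudget` class, the tier names, and the 80% alert threshold are all hypothetical. Costs are passed in as dollar amounts rather than computed from any real price list:

```python
from dataclasses import dataclass, field

@dataclass
class EngineerBudget:
    """Hypothetical per-engineer monthly budget: hard cap, early alert, tier fallback."""
    monthly_cap_usd: float = 1_500.0  # e.g. a cap like Uber's $1,500/month
    alert_fraction: float = 0.8       # assumed: alert once 80% of the cap is spent
    spent_usd: float = 0.0
    alerts: list[str] = field(default_factory=list)

    @property
    def alert_threshold(self) -> float:
        return self.alert_fraction * self.monthly_cap_usd

    def choose_tier(self) -> str:
        """Route to a cheaper (hypothetical) tier once spending passes the alert threshold."""
        return "budget-tier" if self.spent_usd >= self.alert_threshold else "premium-tier"

    def charge(self, cost_usd: float) -> bool:
        """Record one call's cost; refuse it if it would break the hard cap."""
        if self.spent_usd + cost_usd > self.monthly_cap_usd:
            self.alerts.append(f"blocked: ${cost_usd:,.2f} would exceed the cap")
            return False
        before = self.spent_usd
        self.spent_usd += cost_usd
        # Fire the alert only on the call that crosses the threshold.
        if before < self.alert_threshold <= self.spent_usd:
            self.alerts.append(f"alert: ${self.spent_usd:,.2f} of ${self.monthly_cap_usd:,.2f} used")
        return True


budget = EngineerBudget()
for cost in [500.0, 500.0, 300.0, 300.0]:
    tier = budget.choose_tier()
    accepted = budget.charge(cost)
    print(tier, accepted, budget.spent_usd)
print(budget.alerts)
```

In this run, the third call crosses the 80% threshold and triggers an alert. The fourth call is routed to the cheaper tier and then blocked, because it would push spending past $1,500.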

But Claude Code is essentially not a productivity tool—it's a marketing tool.

Its design goal is clear: to make you feel productive. Boris, the project lead for Claude Code, started from the question: "If the model becomes smart enough, what will code look like? How do I want to use this?" The starting point was not "how do we help developers save tokens" but "how do we showcase the model's intelligence."

Anthropic is willing to burn a lot of tokens for this "feeling"—whether it's your money or their own. Spending $200 in five minutes isn't an accident for Claude Code; it's by design. The underlying logic is: if you can solve a problem by burning more tokens, never look for a more token-efficient solution. All the sub-agents, all the fancy UI animations, all the verbose reasoning traces—none are for efficiency; they're for you to stare at the screen and feel, "This model is really smart, really capable."

Behind this is a carefully designed marketing loop: you burn a lot of tokens, get a feeling of "high productivity," conclude that Claude is good, and keep using it. Anthropic is even willing to absorb a large share of the token cost to earn this emotional buy-in. It is also why Anthropic has clearly underinvested in its desktop app: Claude Code's goal was never to be a good tool, but to be the best showcase for Anthropic's model capabilities.

And it is precisely this design philosophy of "burn tokens for experience" that has caused Claude to fall behind OpenAI in token efficiency.

OpenAI, by contrast, has been aggressively compressing tokens. From compressing reasoning traces to optimizing model efficiency, its philosophy is to do the same work with fewer tokens. Codex 5.5 is the best example.

Models like Fable 5 are intelligent, but their token efficiency is not high compared with other models. A chart from Deep SWE illustrates the point clearly, and putting the same batch of models side by side makes it even more apparent: GPT-5.5 medium used only 20k tokens to achieve a striking score, while Opus 4.8 used 50k tokens and scored lower.

This is the most direct portrayal of the two paths: the industry panics, Claude burns, OpenAI saves. The next question is: if you need to cut costs, what should go first? The answer: the prompts that have been piling up for too long.

Claude Code's Prompt Debt: The More You Stack, the More You Owe

In a recent speech, Anthropic stated that they have deleted 80% of Claude Code's system prompts.

Anthropic technical team member Tariq Shihipar explained that this reflects a fundamental shift in how AI models are guided. In the past, people believed that more instructions and examples led to better model performance; that logic no longer holds. The new model, Fable 5, is more imaginative than the examples the team provided, so the examples actually become limitations.

Of course, there is a marketing element here; he was also talking up Fable's capabilities: "Examples can easily limit the model because it is actually more imaginative than the examples we provide." But one fact cannot be ignored: even Anthropic itself has started trimming system prompts.

So why did they need so many prompts in the past?

Over the past year or two, the AI coding community developed a habitual mindset: bigger context is better, more tool instructions are better, more complete system prompts are better. The model doesn't know how the project is organized? Write Agents.md. The model doesn't know how to use tools? Write tool descriptions. The model isn't proactive enough? Write behavioral guidance. The model isn't stable enough? Keep adding constraints to the system prompt.

Undeniably, system prompts were once the core competitive advantage of AI coding tools. Small tweaks to LLM prompts could bring significant performance improvements. If the same model felt different in Codex, Cursor, OpenCode, and Copilot, it was almost certainly due to subtle differences in prompting.

This is why Cursor spent a lot of time testing system prompts, running A/B tests, and tuning its prompting for each model. Compared with using Opus in Claude Code, Cursor's harness could significantly improve the same model's performance, with some benchmarks showing gains of 10% to 30%. The core difference was often just a few paragraphs of prompt.

But as long as prompts work, teams keep adding more. If a model likes to misuse tools, add a rule; if it isn't proactive enough, add encouragement; if it searches too much, add a restriction; if it doesn't understand the project context, add another markdown file. Every addition has its reason, but over time the system prompt becomes a huge, persistent context burden.

And system prompts are not free: on every call they must be read, billed, and held in context.

With all of its tools and features built in, Claude Code's system prompt once ballooned to 65,000 tokens; even with most features turned off, it was still 12,000 tokens. In other words, before the model writes a single line of code, it is already carrying an entire instruction manual. By comparison, Pi's startup context was under 1,000 tokens.
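Because the system prompt is re-sent with every call, its size multiplies with the length of a session. The back-of-the-envelope sketch below uses the sizes quoted above and rests on several assumptions:

- The 50-call session is a hypothetical figure.
- The "80% cut" row assumes the cut applies to the full 65,000-token prompt.
- The Pi row uses 1,000 tokens as an upper bound.
- Prompt caching is ignored. Caching can bill repeated prefixes at a discount, but the cached tokens still occupy context.

```python
def system_prompt_overhead(prompt_tokens: int, calls: int) -> int:
    """Input tokens spent re-sending the same system prompt on every call (no caching)."""
    return prompt_tokens * calls

CALLS_PER_SESSION = 50  # hypothetical length of one agent session

PROMPT_SIZES = {
    "Claude Code, all features on": 65_000,
    "Claude Code, most features off": 12_000,
    "65k prompt after an 80% cut": 65_000 * 20 // 100,  # assumption: 20% remains
    "Pi startup context (upper bound)": 1_000,
}

for label, size in PROMPT_SIZES.items():
    total = system_prompt_overhead(size, CALLS_PER_SESSION)
    print(f"{label:<34}{size:>8,} tokens/call {total:>11,} tokens/session")
```

At 50 calls, the full prompt alone accounts for 3,250,000 input tokens before any code is read or written. A sub-1,000-token prompt stays under 50,000.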

Worse, prompt debt is more hidden than code debt.

Old code usually reveals itself when you change features, run tests, or fix bugs. Old prompts, however, may silently degrade model performance. Users notice that "Claude Code doesn't seem as smart as before," or that "the new model isn't as strong as advertised," but the real reason could be that old system prompts haven't kept up with the new model.

Once prompts had turned from a competitive advantage into a burden, Anthropic chose to delete 80% of them, further improving token efficiency.

Claude's "Blabber Tax": Every Extra Word, Extra Cost

Claude Code has way too much blabber.

This year, a plugin called Caveman went viral by tackling exactly this issue. As the name suggests, the idea is to talk like a caveman: no politeness, no extra grammar, no filler words, only the core meaning.

At first glance, it sounds like a joke. But once you understand it, you realize it addresses a very real problem in LLMs: too much blabber, too many tokens, unnecessary cost.

And it was built specifically with Claude Code in mind.

"I created Caveman in early April because I was using Claude Code heavily and noticed that a lot of my token spending was wasted on unnecessary text: pleasantries, vague phrasing, transitional phrases, and chatty expressions that aren't really important in the agent loop," said Julius Brussee, creator of Caveman.

Brussee's evaluation showed that Caveman reduces output tokens by 65% to 75% compared to default output, while still outperforming a simple "be concise" instruction. It mainly compresses surrounding language, without affecting code, commands, paths, URLs, function names, and other precise elements.

Reportedly, OpenAI's engineering director Shayne Sweeney also contributed code to the project to support Codex.
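Caveman's savings are measured on output tokens, so they have to happen while the model is generating. A filter applied afterwards cannot un-bill tokens that were already produced. The sketch below is therefore not Caveman's implementation.

It only illustrates the boundary described above: filler in the prose is fair game, while code spans and URLs must survive byte for byte. The filler list and the `compress` function are hypothetical. A filter like this would only save tokens where its output is fed back into later turns.

```python
import re

# Hypothetical filler patterns; these are NOT Caveman's actual rules.
FILLER_RE = re.compile(
    r"\bI'd be happy to help\b[.!]?|\bCertainly[,!]?|\bGreat question[.!]?|\bbasically\b|\bjust\b",
    re.IGNORECASE,
)
# Protected spans: inline code in backticks and bare URLs.
PROTECTED_RE = re.compile(r"(`[^`]*`|https?://\S+)")

def compress(text: str) -> str:
    """Strip filler from prose while leaving code spans and URLs untouched."""
    # With one capturing group, re.split alternates: prose, protected, prose, ...
    parts = PROTECTED_RE.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(part)  # protected span: copy verbatim
        else:
            prose = FILLER_RE.sub("", part)
            out.append(re.sub(r" {2,}", " ", prose))  # tidy gaps left by deletions
    return "".join(out).strip()

print(compress("Certainly! You just need to run `just test` at https://example.com/just"))
# You need to run `just test` at https://example.com/just
```

The filler word "just" disappears from the prose. A command that happens to be called `just` survives inside the backticks, and so does the one in the URL.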

More interestingly, OpenAI has already applied this language pattern to the thinking stage.

Some leaked reasoning traces (not the reasoning summaries shown externally) gave outsiders a glimpse. The content doesn't look like normal English; it reads more like compressed engineering shorthand:

"Use core new nodes. Need infer. Need add VAE encode for images. Try. Try period."

These sentences look funny, even messy, but their goal is token efficiency, not readability. When a model reasons internally, it doesn't need to be polite, complete, and fluent the way it does when talking to a user. It only needs to keep the actions, objects, judgments, and next steps. In other words, as long as the final answer reads normally, the model can think in a shorter, cruder, far more token-efficient internal language.

Compressing reasoning pays off even more than trimming prompts, because agents execute many steps and each step's thinking becomes input for the next. Every time the model "thinks" less, it saves not only the immediate tokens but also the repeated overhead along the entire execution chain.
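The compounding effect can be put in numbers. The sketch below assumes that every step's reasoning is carried forward in full as input to all later steps; real harnesses may prune or summarize it. The per-step sizes of 1,000 tokens for verbose reasoning and 300 for terse reasoning are hypothetical:

```python
def reasoning_tokens(steps: int, per_step: int) -> dict[str, int]:
    """Reasoning-related tokens in an agent loop where each step's thinking is
    re-read as input by every later step (assumption: nothing is pruned)."""
    generated = steps * per_step
    # Step k re-reads the reasoning of the k - 1 steps before it:
    # per_step * (0 + 1 + ... + (steps - 1)) = per_step * steps * (steps - 1) / 2
    reread = per_step * steps * (steps - 1) // 2
    return {"generated": generated, "reread": reread, "total": generated + reread}

for steps in (5, 20):
    verbose = reasoning_tokens(steps, per_step=1_000)
    terse = reasoning_tokens(steps, per_step=300)
    saved = verbose["total"] - terse["total"]
    print(f"{steps:>2} steps: verbose {verbose['total']:>7,}  terse {terse['total']:>6,}  saved {saved:>7,}")
```

Under these assumptions, cutting per-step reasoning by 70% saves 10,500 tokens over 5 steps but 147,000 over 20. The re-read term grows with the square of the number of steps.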

Terse internal reasoning marks a clear difference between OpenAI's path and Claude's.

Claude has always been more conversational, more like an assistant that thinks and expresses itself in full sentences. Judging from its much longer reasoning traces, it is probably thinking in ordinary English. Its outputs and reasoning tend to be longer, so it relies more on large context windows to hold them.

That's why Claude uses a 1-million-token context window by default. Many assume it is there to hold larger codebases, but the reason is simpler: Claude generates so much text that it would not fit in a smaller window. Claude also handles compaction poorly: when you resume an old thread, it suggests working from a compacted version rather than keeping the full context. Nor does it keep reasoning traces; they are cleared after 10 to 20 minutes, because reasoning tokens are too inefficient to retain and keeping them would make the cost absurd.

OpenAI's context window, by contrast, is about 200k tokens or less, but because its models reason in this terse language from the start, that is enough.
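The trace clearing described above amounts to a time-to-live on one kind of history entry. The minimal sketch below is not Anthropic's implementation. The `Entry` type and its `kind` labels are hypothetical, and the 15-minute default is simply the midpoint of the reported 10-to-20-minute range:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    kind: str          # hypothetical labels: "user", "assistant", or "reasoning"
    text: str
    created_at: float  # seconds since the session started

def prune_reasoning(history: list[Entry], now: float, ttl_seconds: float = 15 * 60) -> list[Entry]:
    """Keep all non-reasoning entries; keep reasoning only while it is younger than the TTL."""
    return [
        e for e in history
        if e.kind != "reasoning" or now - e.created_at < ttl_seconds
    ]

history = [
    Entry("user", "fix the failing test", 0),
    Entry("reasoning", "need read test file. check fixture.", 5),
    Entry("assistant", "patched conftest.py", 30),
    Entry("reasoning", "rerun suite. all pass?", 1_000),
]
kept = prune_reasoning(history, now=1_200)
print([e.kind for e in kept])  # ['user', 'assistant', 'reasoning']
```

Dropping old reasoning keeps the context small, at the cost of losing the model's earlier working notes. The user-visible conversation is left untouched.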

A detail worth savoring: if Anthropic fixed the "too much blabber" problem, its revenue would drop significantly. If developers can get the same work done while the model generates fewer tokens, that is money Anthropic no longer earns.

Source: InfoQ

