The era when computational resources could be used freely, without a thought for cost, has ended. Compute is getting more expensive, and that changes everything.
Two years ago, we lived in a different world. You opened an API and let large models churn out code, text, answers to anything. Nobody cared that we dumped thousands of words of documentation into a prompt just to make GPT-4 do something trivial like fix capitalization. Why? Because it was cheap. Investors paid. Companies subsidized. It was an era of free resources.
But the free ride is over. Compute is getting more expensive everywhere; this is not a prediction but the reality unfolding right now. The scramble for NVIDIA H100s has turned into a geopolitical contest. Data-center energy consumption is approaching the limits of the power grid. The major players are no longer running a charity.
When your business scales and daily requests run into millions of calls, even a tiny fee per 1K tokens becomes a flood of expenses. It is a money pump, the nightmare that wakes startup CFOs in the middle of the night. Tokens have become a real monetary unit.
Where do your tokens go? Most people don't actually know. They stare at a monthly bill that keeps growing like an indecipherable ledger. The losses hide in the least noticeable places.
First: you talk to the AI politely. "Hello, could you help? Thank you very much, please..." It feels normal, but in token economics it is robbery. Large models don't need your "please" and "thank you." Every word is a token; every token is money. Worse still are the extremely long system prompts repeated in every session: "Follow the ten principles..." "If you don't know, say 'I don't know'..." Useful? Yes. But repeated millions of times, it is astronomical waste.
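The arithmetic of that waste is easy to sketch. The snippet below uses a crude four-characters-per-token heuristic (a real tokenizer would give different counts) and a hypothetical input price of $2.50 per million tokens; both numbers are illustrative assumptions, not any provider's actual figures.

```python
# Rough illustration: what a system prompt repeated on every call costs.
# Token counts use a crude ~4-chars-per-token heuristic, not a real
# tokenizer; the price per million input tokens is a made-up figure.

def approx_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

# Imagine a long boilerplate preamble resent with every single request.
SYSTEM_PROMPT = (
    "Follow the ten principles... If you don't know, say 'I don't know'... "
) * 10

def monthly_prompt_overhead(calls_per_day: int,
                            price_per_million_tokens: float = 2.50) -> float:
    """Dollars per month spent just on re-sending the system prompt."""
    tokens_per_call = approx_tokens(SYSTEM_PROMPT)
    monthly_tokens = tokens_per_call * calls_per_day * 30
    return monthly_tokens * price_per_million_tokens / 1_000_000

print(f"${monthly_prompt_overhead(1_000_000):,.2f} per month")
```

Even at these toy numbers, a million calls a day turns a few hundred boilerplate tokens into five figures of monthly spend that buys nothing.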
Second: uncontrolled RAG. Ideally, you extract three relevant sentences. In practice, a user asks something, the system pulls in ten 10,000-word PDFs and feeds them all to the model. The developer shrugs: "Let it find the answer itself." That is not laziness; it is a crime against compute. Irrelevant information not only degrades the attention mechanism, it drives astronomical token consumption. You thought you asked a simple question; in reality you forced the model to read half a library.
Third: an agent without limits. ReAct-style loops make the AI think and act like a human. But if an API call fails or the logic falls into a loop, the agent will spin forever. Each reasoning cycle burns expensive output tokens, which cost several times more than input tokens. An agent without a proper emergency-stop mechanism is a black hole that swallows your budget.
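An emergency stop can be as simple as two hard caps: one on iterations, one on cumulative output tokens. In this sketch, `call_model` and `is_done` are hypothetical stand-ins for a real model call and a termination check, not any particular framework's API.

```python
# Minimal "emergency stop" for a ReAct-style loop: hard caps on both
# iteration count and cumulative output-token spend.

class BudgetExceeded(RuntimeError):
    """Raised when the agent exceeds its step or token budget."""

def run_agent(call_model, is_done, max_steps=10, max_output_tokens=8000):
    history, spent = [], 0
    for step in range(max_steps):
        reply, tokens = call_model(history)  # returns (text, output-token count)
        spent += tokens
        if spent > max_output_tokens:
            raise BudgetExceeded(f"{spent} output tokens spent at step {step}")
        history.append(reply)
        if is_done(reply):
            return reply
    raise BudgetExceeded(f"no answer after {max_steps} steps")
```

The key design choice is that the loop fails loudly instead of silently burning money: a raised exception is cheap, an infinite reasoning spiral is not.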
How do you save? First: semantic caching. User requests repeat constantly; "How do I reset my password?" arrives hundreds of times a day. Instead of calling GPT-4 every time, convert the request into a vector and compare it against a cache. If the similarity is high enough, return the cached answer. Zero tokens, and latency drops from seconds to milliseconds. That is not just savings; it is a leap in user experience.
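A semantic cache is a small amount of code around an embedding model. The sketch below assumes query vectors already exist (a real system would call an embedding API); the 0.92 similarity threshold is an illustrative assumption you would tune against false-hit rates.

```python
# Semantic cache sketch: compare an embedded query against stored entries
# and return the cached answer when similarity clears a threshold.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.92):
        self.entries = []          # list of (embedding, answer)
        self.threshold = threshold

    def get(self, query_vec):
        """Return the best cached answer, or None on a cache miss."""
        best, best_sim = None, 0.0
        for emb, answer in self.entries:
            sim = cosine(query_vec, emb)
            if sim > best_sim:
                best, best_sim = answer, sim
        return best if best_sim >= self.threshold else None

    def put(self, query_vec, answer):
        self.entries.append((query_vec, answer))
```

On a hit, the expensive model is never called at all; the marginal cost of the hundredth "reset my password" is a handful of float multiplications.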
Second: prompt compression. Long context is a sin. Entropy-based algorithms analyze which words are critical and which are redundant, so a 1,000-token prompt can be compressed to 300 while preserving the essence. Let machines talk in machine language: it reads awkwardly to humans, but the AI understands, and you can save 70% of the cost.
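Production compressors are far more sophisticated, but the core idea fits in a few lines: score each word by how much information it carries and keep only the highest-scoring fraction. This toy version uses inverse word frequency within the text as the information proxy, which is an oversimplification of the entropy-based methods the article describes.

```python
# Toy prompt compression: keep only the highest-information words, in their
# original order. Frequency-within-the-text stands in for a real entropy
# estimate here; this is a deliberately simplified illustration.
from collections import Counter

def compress(text: str, keep: float = 0.3) -> str:
    words = text.split()
    freq = Counter(w.lower() for w in words)
    # Rare words carry more information; repeated filler scores low.
    by_rarity = sorted(range(len(words)), key=lambda i: freq[words[i].lower()])
    kept = set(by_rarity[: max(1, int(len(words) * keep))])
    return " ".join(w for i, w in enumerate(words) if i in kept)
```

The output is ungrammatical to a human eye, which is exactly the point: the model recovers the meaning, and two-thirds of the tokens never leave the building.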
Third: model routing. Don't throw everything at the most expensive model. Route simple entity extraction or translation to cheaper open models like Llama 3 8B; save GPT-4o or Claude 3.5 Sonnet for genuinely complex reasoning. Run it like a well-organized company: requests the front desk can handle never reach the CEO. Whoever tunes this most precisely can cut total token costs to a tenth of their competitors'.
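A first-pass router does not need machine learning at all; cheap heuristics already capture most of the savings. The tiers, model names, and the 4,000-character threshold below are illustrative assumptions, not a recommended production policy.

```python
# Tiered routing sketch: cheap heuristics decide which model a request
# deserves. Tier assignments and thresholds are illustrative only.

CHEAP = "llama-3-8b"          # simple, high-volume tasks
MID = "claude-3.5-sonnet"     # default tier
EXPENSIVE = "gpt-4o"          # complex reasoning only

def route(task_type: str, prompt: str) -> str:
    """Pick a model tier from the task type and prompt size."""
    if task_type in {"extraction", "translation", "classification"}:
        return CHEAP
    if task_type == "reasoning" or len(prompt) > 4000:
        return EXPENSIVE
    return MID
```

The front-desk analogy holds in code: `route("extraction", ...)` never wakes the CEO, and the expensive tier only sees the requests that actually need it.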
The frontier already understands this. Look at the most advanced agent ecosystems, especially those moving toward mobile devices, and you can see a battle for maximum token optimization. On mobile there is no room for sprawling context: throughput is limited, memory is limited, energy is limited.
OpenClaw controls token usage almost obsessively. Instead of crude full-context dumps, it relies on structured output. It forces the model to produce results in a strict JSON Schema: the AI is not allowed to "chat," it has to "fill out forms." That eliminates unnecessary characters and saves bandwidth.
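The "fill out forms" discipline has two halves: constrain what the model emits, and reject anything that doesn't fit before it propagates. The hand-rolled schema check below is a minimal sketch of the validation half (real systems use a full JSON Schema validator, and the field names here are invented for illustration).

```python
# "Form filling" instead of free chat: the model must return strict JSON,
# which is validated before use. The schema is a minimal hand-rolled
# sketch with invented field names, not a real library or real schema.
import json

SCHEMA = {"action": str, "confidence": float}

def parse_structured(raw: str) -> dict:
    data = json.loads(raw)  # fails fast on any non-JSON chatter
    for key, expected_type in SCHEMA.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"field {key!r} missing or not {expected_type.__name__}")
    return data
```

The token saving comes from the model never emitting preamble, apology, or explanation: every character in the output is a field the caller asked for.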
Hermes Agent from Nous Research demonstrates surgical context management. Instead of storing the entire history, it uses dynamic memory. Working memory keeps the last three to five turns. For long-term memory, when the context overflows, a lightweight model summarizes the dialogue into a few sentences and stores them in a vector database; the old dialogue is deleted, but the knowledge is preserved. This is not discarding, it is surgical excision. Such context management not only overcomes physical limits but also drives costs down at the macro level.
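The same working-memory-plus-summary pattern can be sketched generically (this is not Hermes Agent's actual implementation). `summarize` stands in for a call to a cheap summarizer model, and the vector-database side is reduced to a single string.

```python
# Rolling-memory sketch: keep the last N turns verbatim; when the window
# overflows, fold older turns into a running summary and drop them.
# `summarize` is a stand-in for a lightweight summarizer-model call.

def summarize(items):
    return f"[summary covering {len(items)} earlier item(s)]"

class RollingMemory:
    def __init__(self, window=4):
        self.window = window
        self.summary = ""   # long-term memory (would live in a vector DB)
        self.turns = []     # working memory: recent turns, verbatim

    def add(self, turn: str):
        self.turns.append(turn)
        if len(self.turns) > self.window:
            overflow = self.turns[: -self.window]
            prior = [self.summary] if self.summary else []
            self.summary = summarize(prior + overflow)  # fold, then delete
            self.turns = self.turns[-self.window:]

    def context(self):
        """What actually gets sent to the model: summary + recent turns."""
        return ([self.summary] if self.summary else []) + self.turns
```

However long the conversation runs, the context sent to the model stays bounded at the window size plus one summary line; that is where the macro-level cost reduction comes from.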
The main trend is clear: future agents will compete not by using more tools, but by completing the most complex tasks within extremely limited token budgets. Dancing in chains, and the best dancer wins.
But all of this is technical detail. At bottom, it is a shift in the mindset of the entire AI industry. We used to treat tokens as consumer goods: see a discount, throw it in the cart. It didn't matter whether a large model was truly needed; what mattered was that it "looked cool." Companies blindly wired LLMs into everything and issued accounts to every employee, even for the cafeteria menu. Then the bill arrived, and with it the shock.
Now it is time to switch to investment thinking. Every token spent is an investment, and investments demand ROI. This token was spent: what did it bring me? Did the close rate go up? Did bug-fixing time go down? Or was it just "haha, what a funny AI"?
If a feature built on traditional machine learning costs 10 cents, while the large-model version costs $1 per request in tokens but lifts conversion by only 2%, cut it without hesitation. We no longer chase "big and comprehensive" AI; we aim for "small and refined" precision strikes.
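That cut-it decision is a one-line calculation. All the numbers below are illustrative assumptions (100,000 requests, a $0.90 cost delta per request, a 2% conversion lift worth $5 per conversion); the point is the shape of the check, not the figures.

```python
# Back-of-the-envelope ROI check: is a conversion lift worth the extra
# per-request inference cost? All inputs are illustrative assumptions.

def roi(requests, extra_cost_per_request, conv_lift, value_per_conversion):
    """Net dollars gained (or lost) from upgrading to the pricier model."""
    gain = requests * conv_lift * value_per_conversion
    cost = requests * extra_cost_per_request
    return gain - cost

# 100k requests; LLM costs $0.90 more per request than the ML baseline,
# and a +2% conversion lift is worth $5 per conversion.
delta = roi(100_000, 0.90, 0.02, 5.0)
print(delta)
```

When `delta` comes out negative, as it does with these inputs, the "cool" model is a money pump and the boring 10-cent feature wins.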
Learn to say "no" to business units. When someone asks, "Can the AI read all 100,000 reports and give me a summary?" ask back: "Will your revenue cover several million tokens of expenses?" Do the math. Save. Count tokens the way an old-fashioned shopkeeper counts inventory.
It doesn’t sound cyberpunk. It sounds rustic. But it’s a necessary step toward mature AI.
The broad rise in compute costs is not a crisis but an overdue cleansing. It burst the bubble of unlimited subsidies and brought everyone back to cold reality. And that is good: it forced us to abandon blind faith in "scale works miracles" and restored respect for engineering efficiency.
The companies that survive and grow will not be the ones with the most expensive models, but the ones that can watch the token counter spin and stay calm, confident they earn more than they spend. When the tide goes out, you see who has been swimming naked. This time, the receding tide is subsidized compute. Only those who treat every token like gold will come out wearing real armor.