I noticed something interesting happening in the AI market over the past few months. The party is over. That period when big companies were footing the bill for everything and we could use tokens as if they were running water? That’s in the past.

For two years, we lived in a comfortable illusion. OpenAI, Anthropic, and the other giants were burning investor money to subsidize our usage. So what did we do? We sent massive prompts, a thousand words at a time, and asked GPT-4 to do ridiculous tasks a simple rule could have solved. Because it was cheap. Because we didn’t have to think about the cost.

But now reality is knocking at the door. Tokens have become real money. Every word, every space, every punctuation mark: everything has a price. And once you start scaling, once your daily volume climbs into the millions or billions of calls, that “1K tokens” that once seemed insignificant turns into a bleed you can’t stop.

The problem is that most companies have no idea where the money is being wasted. People look at their monthly bill climbing and don’t know what to do.

Take this: are you polite when you talk to AI? “Hello, could you help me out? Thank you so much…” Well, every “please” and “thank you” is a token being charged. The models have no emotions and don’t need manners. Even more frightening are the enormous system prompts developers write for the sake of stability: thousands of tokens of instructions, reprocessed on every single request. Pure waste.
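To see the waste concretely, here is a minimal sketch using the tiktoken tokenizer to count what pleasantries and an oversized system prompt actually cost; the per-token price and the traffic numbers are assumptions for illustration.

```python
# A rough sketch, using the tiktoken tokenizer, of what politeness and a
# bloated system prompt cost. Prices and volumes below are assumptions.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-era models

polite = "Hello, could you help me out? Please summarize this ticket. Thank you so much!"
blunt = "Summarize this ticket."
print(len(enc.encode(polite)), "vs", len(enc.encode(blunt)), "tokens")

# Now multiply an oversized system prompt by real traffic:
system_prompt_tokens = 2_000        # assumed size of the instructions resent on every call
requests_per_day = 1_000_000        # assumed daily volume
usd_per_1k_input = 0.0025           # assumed input price per 1K tokens
daily_cost = system_prompt_tokens / 1_000 * usd_per_1k_input * requests_per_day
print(f"The system prompt alone: ${daily_cost:,.0f} per day")
```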

Then there’s uncontrolled RAG. In theory, it’s perfect: retrieve the three most relevant documents and you’re done. In practice? The vector database pulls ten barely relevant PDFs, each ten thousand words long, and dumps them all into the model. “You figure it out,” the developer thinks. The result: the model reads half a library, and you pay for every page.
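A simple defense is a retrieval budget: admit only the best chunks that fit a fixed token allowance. Here is a sketch where `search` is a hypothetical stand-in for a vector store query that yields chunks sorted by relevance.

```python
# A sketch of a retrieval budget: keep only the top chunks that fit a
# fixed token allowance, instead of dumping every hit into the prompt.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def build_context(query: str, search, max_docs: int = 3, token_budget: int = 1_500) -> str:
    chunks, used = [], 0
    for doc in search(query):  # hypothetical: yields text chunks, best first
        cost = len(enc.encode(doc))
        if len(chunks) >= max_docs or used + cost > token_budget:
            break  # stop before the model has to read half a library
        chunks.append(doc)
        used += cost
    return "\n\n".join(chunks)
```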

And I won’t even start on agents stuck in infinite loops. That’s a black hole of tokens. If an API goes down or the logic hits a dead end, the agent keeps spinning wildly, consuming output tokens, which cost several times more than input tokens. Your credit card drains while you sleep.
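The blunt fix is a circuit breaker: hard caps on steps and spend, so a stuck agent fails fast instead of burning tokens overnight. A sketch, with `run_step` as a hypothetical function that returns its output, a done flag, and the cost of that step:

```python
# A sketch of a circuit breaker for agent loops: hard caps on iterations
# and spend. `run_step` is hypothetical, returning (output, done, cost).

class BudgetExceeded(RuntimeError):
    pass

def run_agent(task: str, run_step, max_steps: int = 10, max_cost_usd: float = 0.50) -> str:
    spent = 0.0
    for step in range(1, max_steps + 1):
        output, done, cost = run_step(task)
        spent += cost
        if spent > max_cost_usd:
            raise BudgetExceeded(f"spent ${spent:.2f} after {step} steps")
        if done:
            return output
    raise BudgetExceeded(f"no result after {max_steps} steps (${spent:.2f} spent)")
```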

But here’s the good part: the industry is waking up to solutions. Semantic caching is the most straightforward. User questions are repetitive by nature. “How do I reset my password?” gets asked thousands of times. Why call GPT-4 every single time? Semantic caching turns the question into a vector, matches it against previous questions, and if it finds something similar, returns the answer directly from the cache. Zero tokens consumed. Latency drops from seconds to milliseconds. This isn’t just saving money; it’s a qualitative shift in the experience.
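Here is a minimal sketch of the idea, assuming an `embed` function that maps text to a unit-length numpy vector (any embedding model would do); the 0.92 similarity threshold is an assumed starting point, not a recommendation.

```python
# A minimal semantic cache, assuming `embed` maps text to a unit-length
# numpy vector. Similar questions are served from the cache; only real
# misses reach the LLM.
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed          # assumed: text -> normalized np.ndarray
        self.threshold = threshold  # cosine similarity required for a hit
        self.vectors: list[np.ndarray] = []
        self.answers: list[str] = []

    def get(self, question: str) -> str | None:
        if not self.vectors:
            return None
        q = self.embed(question)
        sims = np.stack(self.vectors) @ q  # cosine similarity, since vectors are unit-length
        best = int(np.argmax(sims))
        return self.answers[best] if sims[best] >= self.threshold else None

    def put(self, question: str, answer: str) -> None:
        self.vectors.append(self.embed(question))
        self.answers.append(answer)
```

The threshold is the whole game: set it too low and merely similar questions get someone else’s cached answer; too high and you never hit the cache. Tune it on real traffic.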

Then there’s prompt compression. This isn’t you manually deleting words. Algorithms based on information entropy identify what’s essential and what’s noise, compressing a thousand-token text into three hundred tokens while keeping the core meaning. Let the machines “talk” to each other in a kind of “Martian text” we can’t read but the model parses perfectly. You save 70% on fees.
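Production tools such as LLMLingua do this with a small language model’s token probabilities; the toy sketch below only illustrates the principle, scoring words by their surprisal under a unigram model of the text itself and keeping the most informative ones.

```python
# A toy illustration of entropy-based compression: score each word by its
# surprisal and keep only the highest-information fraction. Real
# compressors use a small LM's probabilities, not raw word frequencies.
import math
from collections import Counter

def compress(text: str, keep_ratio: float = 0.3) -> str:
    words = text.split()
    freq = Counter(w.lower() for w in words)
    total = len(words)
    # surprisal = -log p(word): frequent filler scores low, rare content high
    scored = [(-math.log(freq[w.lower()] / total), i, w) for i, w in enumerate(words)]
    keep = sorted(scored, reverse=True)[: max(1, int(total * keep_ratio))]
    keep.sort(key=lambda t: t[1])  # restore original word order
    return " ".join(w for _, _, w in keep)
```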

But the real turning point is model routing. Don’t send everything to the most expensive model. Simple entity extraction, translation, format conversion? Send it to Llama 3 8B running locally or to Claude 3 Haiku; the cost is almost negligible. Deep reasoning, complex programming? That’s when you call GPT-4o or Claude 3.5 Sonnet. It’s like an efficient company: the receptionist handles simple inquiries, and the CEO only deals with strategy. Anyone who implements this well can cut total token costs to a tenth of what the competition pays.
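A router can start as something embarrassingly crude: a keyword-and-length gate in front of two models. The sketch below does exactly that; the model names and the keyword list are placeholder assumptions, and a production router would put a cheap classifier here instead.

```python
# A sketch of model routing with a deliberately crude heuristic gate.
CHEAP = "claude-3-haiku"          # or a local Llama 3 8B
EXPENSIVE = "claude-3-5-sonnet"   # reserved for work that earns its price

SIMPLE_VERBS = ("extract", "translate", "convert", "classify", "reformat")

def pick_model(request: str) -> str:
    text = request.lower()
    if any(verb in text for verb in SIMPLE_VERBS) and len(text) < 500:
        return CHEAP
    return EXPENSIVE  # deep reasoning, long context, complex code

print(pick_model("Translate this sentence into German"))               # cheap
print(pick_model("Debug this distributed lock and explain the race"))  # expensive
```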

What impresses me most is seeing frameworks like OpenClaw and Hermes already operating in this reality. OpenClaw is obsessive about efficiency. It doesn’t use the brute-force approach of throwing the entire context at the model. It forces the model to produce structured output—strict JSON, binary formats. It eliminates redundant characters during generation. The AI doesn’t “chat”; it “delivers the table.” It sounds simple, but it’s an elegant data-economy trick.
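I can’t vouch for OpenClaw’s internals, but the general pattern is easy to sketch: the prompt forbids prose, and the response is rejected unless it parses as exactly the JSON shape requested. `call_llm` here is a hypothetical client function.

```python
# A generic sketch of the structured-output pattern (not OpenClaw's
# actual API): forbid prose, then reject anything that isn't the exact
# JSON shape we asked for.
import json

SYSTEM = (
    "Return ONLY a JSON object, with no prose and no markdown fences, "
    'shaped exactly like: {"name": "<str>", "priority": <int>}'
)

def extract_ticket(text: str, call_llm) -> dict:
    raw = call_llm(system=SYSTEM, user=text)
    data = json.loads(raw)                    # raises if the model "chatted"
    assert set(data) == {"name", "priority"}  # no extra, token-wasting keys
    return data
```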

Hermes takes a different route. Dynamic memory. It keeps only the last 3–5 dialogue rounds in working memory. Once it exceeds the limit, a lightweight model summarizes everything into a few key phrases and stores them in a vector database. Knowledge stays; history gets discarded. It’s like memory surgery, not trash being thrown away.
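A sketch of that sliding-window-plus-summary pattern as described, with `summarize` (a cheap model call) and `store_in_vector_db` as hypothetical stand-ins:

```python
# Keep recent turns verbatim; squash overflow into summaries stored
# elsewhere. Only the recent window ever enters the prompt.
from collections import deque

class DynamicMemory:
    def __init__(self, summarize, store_in_vector_db, window: int = 4):
        self.turns: deque[str] = deque()
        self.summarize = summarize       # assumed: str -> short summary
        self.store = store_in_vector_db  # assumed: persists the summary
        self.window = window

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        if len(self.turns) > self.window:
            oldest = self.turns.popleft()
            self.store(self.summarize(oldest))  # knowledge stays, history goes

    def context(self) -> str:
        return "\n".join(self.turns)  # what actually gets sent to the model
```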

But do you know what the most important mental shift is? Stop thinking of tokens as consumption and start thinking in terms of ROI. Every token spent is an investment. What’s the return? Did the ticket closure rate go up? Did bug-fixing time decrease? Or was it just a meaningless sentence?

If a feature costs 0.1 yuan per use under traditional rule-based logic but 1 yuan with a large model bolted on, and conversion improves by only 2%, cut it without hesitation. Stop chasing the appeal of “big and comprehensive” AI and move to “small and elegant” AI. Learn how to say “no” to business departments.
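A back-of-the-envelope check makes the call obvious. The baseline conversion rate and the worth of one conversion below are assumed numbers, but under anything like them, the 2% lift earns far less than the extra 0.9 yuan per use costs:

```python
# Back-of-the-envelope ROI for the example above (assumed numbers).
old_cost, new_cost = 0.10, 1.00   # yuan per use: rule-based vs. LLM-backed
baseline_conversion = 0.10        # assumed: 10% of uses convert today
lift = 0.02                       # the quoted 2% relative improvement
value_per_conversion = 30.0       # assumed worth of one conversion, in yuan

extra_revenue = value_per_conversion * baseline_conversion * lift  # 0.06 yuan per use
extra_cost = new_cost - old_cost                                   # 0.90 yuan per use
print(f"net per use: {extra_revenue - extra_cost:+.2f} yuan")      # -0.84: cut it
```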

I know it’s anticlimactic. It feels very old-fashioned. But that’s exactly how the AI industry will mature. It’s not cyberpunk; it’s more like managing a traditional supermarket—calculating every token the way a grocer calculates each product.

In the end, when the tide goes out, we’ll find out who’s been swimming naked. And this time, the tide that went out was the wave of subsidies. Only those who can forge every last token into gold will be dressed for what comes next.