Claude Code money-saving tips: an engineer saves 300 million tokens a week through caching, and the key is not to break the cache

Does your Claude Code quota run out during long conversations? Engineer Nate Herk reveals that prompt caching saved him 300 million tokens in one week, with a single-day peak of 91 million. The key is not how much code you write but how you avoid "breaking" the cache, so that repeated context stops being billed at full price.

Table of Contents

  • Cache cost is only 10%, 91 million tokens equals 9 million
  • Three-layer architecture: system, project, conversation, layered stacking
  • Most common "break" trap: model switching and 1-hour gaps
  • Engineer-made dashboard: view Cache Read and Create
  • Practical tips: Session Handoff saves more than /compact

Many developers find that when writing code with Claude Code, the biggest headache is that the token quota drains like water, which makes long conversations almost a luxury.

But Nate Herk, who often shares AI usage tips with the community, revealed in a post on X that the real cost driver is not the volume of code but whether prompt caching is being used well. He saved over 300 million tokens in a week, with a single-day peak of 91 million cached tokens. Because cached tokens cost only 10% of regular input tokens, those 91 million are billed at roughly the price of 9 million regular tokens, which makes extending a long coding conversation almost "free".


What follows is Herk’s own account.

This week I saved over 300 million tokens, including 91 million in a single day.

I didn’t change any settings. This is just prompt caching working normally in the background.

But once I truly understood what the cache is and how to avoid "breaking" it, I could keep conversations going longer within the same quota. So here is an 80/20 beginner’s guide to Claude Code prompt caching, without going deep into API-level details.

Cache tokens cost only 10% of regular input tokens. 91 million cache tokens roughly equate to 9 million tokens billed.

The cache TTL (time to live) is 1 hour on a Claude Code subscription and 5 minutes by default on the API; sub-agents are always 5 minutes.

Cache is divided into three layers: system layer, project layer, conversation layer.

Switching models mid-conversation will break the cache, including turning on "opus plan" mode.

Related post on X from Beau Johnson (@BeauJohnson89), May 24, 2026:

"coding agents need glass boxes now. jianshuo/ccglass: 111 stars on github, created yesterday, mit + javascript. local proxy + web dashboard for claude code, codex, deepseek-tui, and kimi. shows the full system prompt, tool schemas, message history, token/cache/cost, and…"

Cache cost is only 10%, 91 million tokens equals 9 million

Every cached token costs only 10% of a regular input token.

So when my dashboard shows 91 million tokens hitting the cache on a given day, the amount actually billed is roughly what 9 million regular tokens would cost. This is why, compared with running without a cache, extending a long Claude Code conversation feels almost "free".
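The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the 10% figure comes from the article, while the name `billed_equivalent` and the integer-percent parameter are made up for this example.

```python
# Cached reads are billed at 10% of a regular input token (figure from the article).
CACHE_READ_PERCENT = 10


def billed_equivalent(cache_read_tokens: int, percent: int = CACHE_READ_PERCENT) -> int:
    """Return the number of regular input tokens that cost the same as the cached reads."""
    # Integer math avoids floating-point rounding on large token counts.
    return cache_read_tokens * percent // 100


peak_day = 91_000_000
print(billed_equivalent(peak_day))             # billed like about 9.1 million regular tokens
print(peak_day - billed_equivalent(peak_day))  # tokens' worth of cost avoided that day
```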

Two numbers in the dashboard are worth paying attention to:

Cache create: the one-time cost of writing content into the cache. The cached content becomes usable from the next conversation turn.
Cache read: tokens Claude reuses from the cache, such as your CLAUDE.md, tool definitions, and previous messages. Reading them costs one tenth of what reprocessing them as regular input would.

If your Cache read number is high, it indicates effective cache utilization; if low, it means you’re paying repeatedly for the same context.
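One way to turn those two numbers into a single health figure is a hit rate. The sketch below assumes a usage record shaped like the `usage` object of the Anthropic Messages API (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); whether a given dashboard exposes exactly these names is an assumption, and `cache_hit_rate` is a hypothetical helper.

```python
def cache_hit_rate(usage: dict) -> float:
    """Share of all input tokens that were served from the cache."""
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)  # assumed to exclude cached tokens
    total = read + created + uncached
    return read / total if total else 0.0


# Illustrative numbers only, not taken from the article.
usage = {
    "input_tokens": 1_000,
    "cache_creation_input_tokens": 4_000,
    "cache_read_input_tokens": 95_000,
}
print(f"{cache_hit_rate(usage):.0%}")
```

A value close to 100% means most of the context is being reused; a low value means the same context is being paid for again and again.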

A remark from Anthropic’s Thariq left a deep impression on me: "We actually monitor prompt cache hit rates, and if the hit rate drops too low, alerts are triggered, even SEV-level incidents are declared."

He also wrote a very good article on X. When the cache hit rate is high, four things happen at once: Claude Code feels faster, Anthropic’s serving costs go down, your subscription quota lasts longer, and long coding sessions become more feasible.

But if the hit rate is low, everyone loses.

Thus, both sides’ incentives are aligned: Anthropic wants your cache hit rate to be high, and so do you. The real drag comes from seemingly trivial habits that quietly force the cache to be rebuilt.

Three-layer architecture: system, project, conversation, layered stacking

Caching relies on prefix matching, that is, matching from the beginning of the prompt.

There is no need to dive into the technical details; just understand one point: as long as everything before a given position exactly matches what is cached, Claude can reuse the cached tokens up to that position.
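Prefix matching can be illustrated with a toy function. This is a simplified sketch, not how the service is implemented: it compares whole text segments rather than tokens, and `shared_prefix_len` is a hypothetical name.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[str], request: Sequence[str]) -> int:
    """Count how many leading segments of `request` exactly match `cached`."""
    count = 0
    for old, new in zip(cached, request):
        if old != new:
            break  # the first mismatch ends the reusable prefix
        count += 1
    return count


cached = ["system prompt", "CLAUDE.md", "message 1", "reply 1"]

# Appending to the end keeps the whole cached prefix reusable.
print(shared_prefix_len(cached, cached + ["message 2"]))  # 4

# Changing an earlier segment makes everything from that point on a miss.
edited = ["system prompt", "CLAUDE.md", "message 1 (edited)", "reply 1", "message 2"]
print(shared_prefix_len(cached, edited))  # 2
```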

According to the Claude Code documentation, a new session usually unfolds like this:

First turn: no cache yet. System prompt, your project context (like CLAUDE.md, memory, rules), and your first message are processed anew and written into cache.

Second turn: everything from the first turn is now cached. Only the newest reply and your next message need to be processed, so the cost is much lower.

Third turn: same logic. The previous dialogue stays cached, and only the latest exchange needs to be processed.
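The three turns above can be simulated with token counts. The sketch assumes an idealized cache (no expiry, nothing changes in the prefix), and the sizes are invented for illustration; `simulate_session` is a hypothetical helper.

```python
def simulate_session(base_tokens, turn_tokens):
    """Return (turn, tokens_read_from_cache, tokens_processed_fresh) for each turn.

    base_tokens: system prompt plus project context, sent with every request.
    turn_tokens: new tokens added in each turn (latest reply plus next message).
    """
    rows = []
    cached = 0
    for turn, new_tokens in enumerate(turn_tokens, start=1):
        # Only the first turn pays full price for the base context.
        fresh = new_tokens + (base_tokens if turn == 1 else 0)
        rows.append((turn, cached, fresh))
        cached += fresh  # everything processed this turn is cached for the next one
    return rows


for row in simulate_session(base_tokens=20_000, turn_tokens=[500, 800, 600]):
    print(row)
# (1, 0, 20500)
# (2, 20500, 800)
# (3, 21300, 600)
```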

According to Thariq’s article, the cache itself can be divided into three layers:

System layer: includes core instructions, tool definitions (read, write, bash, grep, glob), and output styles. This layer is cached globally.

Project layer: includes CLAUDE.md, memory, and project rules. Cached per project.

Conversation layer: includes replies and messages, and grows with each turn.

Most common "break" trap: model switching and 1-hour gaps

If anything in the system or project layer changes during a session, all content must be re-cached from scratch, which is the most "expensive" operation. Imagine you are at message 16 and suddenly change the system prompt or pause for an hour: every token from message 1 onward has to be reprocessed.
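The message-16 scenario can be sketched by stacking the three layers into one prompt and applying exact prefix matching. The layer contents below mirror the article's lists, but the segment granularity and the helper names (`build_prompt`, `split_by_cache`) are assumptions made for this example.

```python
def build_prompt(system, project, conversation):
    """Stack the three layers in order: system, then project, then conversation."""
    return [*system, *project, *conversation]


def split_by_cache(cached_prompt, new_prompt):
    """Return (reused, reprocessed) segments under exact prefix matching."""
    cut = 0
    for old, new in zip(cached_prompt, new_prompt):
        if old != new:
            break
        cut += 1
    return new_prompt[:cut], new_prompt[cut:]


system = ["core instructions", "tool definitions", "output style"]
project = ["CLAUDE.md", "memory", "project rules"]
conversation = [f"message {i}" for i in range(1, 17)]  # 16 messages so far

cached = build_prompt(system, project, conversation)

# Normal next turn: everything cached is reused, only the new message is processed.
reused, reprocessed = split_by_cache(cached, cached + ["message 17"])
print(len(reused), len(reprocessed))  # 22 1

# System prompt changed mid-session: the first segment differs, so nothing is reused.
changed = build_prompt(
    ["core instructions v2", *system[1:]], project, conversation + ["message 17"]
)
reused, reprocessed = split_by_cache(cached, changed)
print(len(reused), len(reprocessed))  # 0 23
```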

The cache TTL (time to live) is the most commonly misunderstood part.

Claude Code subscription: default TTL is 1 hour.

Engineer-made dashboard: view Cache Read and Create

Claude API: default TTL is 5 minutes. You can pay more to extend it to 1 hour.
Sub-agents, on any plan: always 5 minutes.

Claude.ai web chat: not officially documented. Possibly the same as the subscription, but unconfirmed.
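The TTL values listed above can be put into a small lookup to check whether a cache is likely still warm. This is a sketch under two assumptions that the article does not state: the clock is measured from the most recent request, and the dictionary keys are labels invented here.

```python
from datetime import datetime, timedelta

# TTLs as listed in the article; the key names are hypothetical labels.
CACHE_TTL = {
    "claude_code_subscription": timedelta(hours=1),
    "api_default": timedelta(minutes=5),
    "sub_agent": timedelta(minutes=5),
}


def cache_is_warm(last_request: datetime, now: datetime, context: str) -> bool:
    """True if less than the context's TTL has passed since the last request."""
    return now - last_request < CACHE_TTL[context]


last = datetime(2026, 1, 1, 9, 0)
coffee_break = last + timedelta(minutes=20)
long_lunch = last + timedelta(minutes=61)

print(cache_is_warm(last, coffee_break, "claude_code_subscription"))  # True
print(cache_is_warm(last, coffee_break, "sub_agent"))                 # False
print(cache_is_warm(last, long_lunch, "claude_code_subscription"))    # False
```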

A few months ago, many users complained that their Claude quota was being consumed too quickly. Some suspected that Anthropic had quietly lowered the TTL from 1 hour to 5 minutes. That is not the case: the TTL for Claude Code is still 1 hour.

The confusion arose because Claude Code and the API are documented separately, and the two behave differently.

If you run many sub-agent workflows or use the API directly, the 5-minute figure matters a lot. But for 95% of Claude Code users, the real focus should be on that 1-hour window.

Here are the parts I find truly useful in daily use:

If you’ve been idle for more than an hour, the previous content has essentially expired from the cache, and your next message will rebuild it. In that case, rather than continuing the "expired" old session, it is often cheaper to do a clean handoff and start a new session.

/compact and /clear always break the cache, so the best time to run them is when the cache has to be rebuilt anyway.

Practical tips: Session Handoff saves more than /compact

I’ve developed a session handoff skill to replace /compact. It summarizes what’s been done, pending decisions, key files, and where to continue. Then I run /clear, paste this summary in, and continue as if nothing was interrupted.

The /compact command can sometimes be slow. This handoff usually takes less than a minute.
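The four parts of such a handoff summary can be sketched as a small data structure that renders to plain text. This is not Herk's actual skill, which the article does not show; the class name, field names, section titles, and sample contents are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Handoff:
    """The four things the article says a handoff summary should capture."""

    done: List[str]
    pending_decisions: List[str]
    key_files: List[str]
    next_step: str

    def render(self) -> str:
        """Render the summary as plain text that can be pasted into a fresh session."""

        def section(title: str, items: List[str]) -> str:
            return "\n".join([title] + [f"- {item}" for item in items])

        return "\n\n".join(
            [
                section("Done so far", self.done),
                section("Pending decisions", self.pending_decisions),
                section("Key files", self.key_files),
                section("Continue from", [self.next_step]),
            ]
        )


# Sample contents are invented for illustration.
note = Handoff(
    done=["Added retry logic to the upload client"],
    pending_decisions=["Keep exponential backoff or switch to fixed delay?"],
    key_files=["client/upload.py", "tests/test_upload.py"],
    next_step="Write the timeout test in tests/test_upload.py",
)
print(note.render())
```

A short summary like this adds far fewer tokens to the new session than carrying the whole old conversation forward.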

Claude.ai’s cache mechanism isn’t fully documented officially, but Projects clearly use different optimization strategies than regular conversation threads. So, if you want to paste large files, it’s better to put them into a Project rather than directly into the conversation.

Certain actions can force a cache rebuild without any obvious warning:

Model switching: the cache relies on prefix matching, and each model has its own cache. After a switch, the next request has to re-read the entire history without any cache hits.
"Opus plan" mode: this setting uses Opus during planning and Sonnet during execution. I have recommended it in some token optimization videos, and for good reason. But each switch between planning and execution is a model switch, which means a cache rebuild. Over the long run it still helps stretch your quota, but you should understand what is happening under the hood.

Editing CLAUDE.md mid-conversation is okay: changes won’t take effect immediately, only after restart. So current cache isn’t affected.
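Why a model switch costs a full re-read can be shown with a toy cache that keeps a separate prefix per model. This is an illustrative model only: real caches work on tokens and expire after the TTL, and `PerModelCache` is a hypothetical class.

```python
class PerModelCache:
    """Toy model in which each model keeps its own cached prompt prefix."""

    def __init__(self):
        self._cached = {}  # model name -> last prompt seen by that model

    def send(self, model, prompt):
        """Return how many leading segments were served from this model's cache."""
        previous = self._cached.get(model, [])
        hits = 0
        for old, new in zip(previous, prompt):
            if old != new:
                break
            hits += 1
        self._cached[model] = list(prompt)
        return hits


cache = PerModelCache()
history = ["system", "project", "message 1", "reply 1"]

print(cache.send("opus", history))                  # 0: first request, nothing cached yet
print(cache.send("opus", history + ["message 2"]))  # 4: same model, prefix reused
# Switching models: the other model has no cache, so the whole history is a miss.
print(cache.send("sonnet", history + ["message 2", "reply 2", "message 3"]))  # 0
```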

The screenshot I showed earlier comes from a token dashboard.

https://github.com/nateherkai/token-dashboard
This is a simple GitHub repository. Give Claude Code the link and it sets the dashboard up locally on localhost. It reads all your past sessions rather than starting from zero, so you can see daily input, output, cache create, and cache read figures right away.

But note: this dashboard tracks tokens on your local device. If you switch from desktop to laptop, the numbers won’t match exactly. Each device has its own stats.
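The kind of per-day aggregation such a dashboard performs can be sketched as follows. This is not the code of the token-dashboard repository; the record shape (an ISO 8601 `timestamp` plus a `usage` dictionary with API-style counter names) and the sample values are assumptions made for this example.

```python
from collections import defaultdict

FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def daily_totals(records):
    """Sum token counters per calendar day from an iterable of usage records."""
    totals = defaultdict(lambda: dict.fromkeys(FIELDS, 0))
    for record in records:
        day = record["timestamp"][:10]  # "YYYY-MM-DD" part of the timestamp
        usage = record.get("usage", {})
        for name in FIELDS:
            totals[day][name] += usage.get(name, 0)
    return dict(totals)


# Invented sample records standing in for locally stored session logs.
records = [
    {"timestamp": "2026-05-20T09:00:00Z",
     "usage": {"input_tokens": 1200, "cache_creation_input_tokens": 20000}},
    {"timestamp": "2026-05-20T09:02:00Z",
     "usage": {"input_tokens": 300, "output_tokens": 900, "cache_read_input_tokens": 21200}},
    {"timestamp": "2026-05-21T10:00:00Z",
     "usage": {"input_tokens": 500, "cache_read_input_tokens": 22400}},
]
for day, counters in sorted(daily_totals(records).items()):
    print(day, counters)
```

Because the input is whatever is stored on the current machine, two devices will naturally report different totals.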

Prompt caching is a deep topic. Thariq’s article covers it more comprehensively. If you want the full picture, it’s worth reading.

But you don’t need to understand every detail to benefit. Just grasp the key 80/20: cached tokens cost 10% of normal tokens; the Claude Code TTL is 1 hour; model switching breaks the cache; and a clean handoff is usually more cost-effective than letting an old session "expire" and then continuing it.
