Claude Code money-saving tips: an engineer saves 300 million tokens a week through caching; the key is not to break the cache
Do long Claude Code conversations burn through your quota? Engineer Nate Herk reveals that prompt caching saved him 300 million tokens in a week, with a daily peak of 91 million. The key is not how much code you write but how you avoid "breaking" the cache, so that repeated context stops wasting money.
Many developers find that the biggest headache when coding with Claude Code is how quickly the token quota drains, which makes long conversations almost a luxury.
But influencer Nate Herk, who often shares AI usage tips with the community, revealed in a post on X that the real cost killer isn’t code volume but whether the system makes good use of prompt caching. He saved over 300 million tokens in a week, with a peak of 91 million cached tokens in a single day. Since cached tokens cost only 10% of regular input tokens, those 91 million are billed like roughly 9 million, which makes extending a coding conversation almost "free".
This week I saved over 300 million tokens, 91 million of them in a single day.
I didn’t change any settings. This is just prompt caching working normally in the background.
But once I truly understood what the cache is and how to avoid "breaking" it, I could keep conversations going longer within the same quota. So here’s an 80/20 beginner’s guide to Claude Code prompt caching, without deep API-layer details.
Cache tokens cost only 10% of regular input tokens. 91 million cache tokens roughly equate to 9 million tokens billed.
Claude Code subscription TTL is 1 hour; API default is 5 minutes; Sub-agent is always 5 minutes.
Cache is divided into three layers: system layer, project layer, conversation layer.
Switching models mid-conversation will break the cache, including turning on "opus plan" mode.
Cache cost is only 10%, 91 million tokens equals 9 million
Every cached token costs only 10% of a regular input token.
So when my dashboard shows that 91 million tokens hit the cache on a given day, the amount actually billed is roughly equivalent to processing 9 million tokens. This is why, compared with running without a cache, keeping a long Claude Code conversation alive feels almost "free".
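The arithmetic can be checked with a short sketch. The 10% multiplier is the figure quoted in the article; the function name `billed_equivalent` is only illustrative, and the sketch ignores cache-write costs and output tokens.

```python
CACHE_READ_MULTIPLIER = 0.10  # from the article: a cache read costs 10% of a regular input token


def billed_equivalent(cache_read_tokens: int, multiplier: float = CACHE_READ_MULTIPLIER) -> float:
    """Return how many regular input tokens the cached reads are billed as."""
    return cache_read_tokens * multiplier


daily_peak = 91_000_000
print(f"{billed_equivalent(daily_peak):,.0f}")  # 9,100,000
```

The exact result is 9.1 million, which the article rounds to "about 9 million".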
Two numbers in the dashboard are worth paying attention to:
Cache create: the one-time cost when writing content into cache. It takes effect in the next conversation turn.
Cache read: tokens reused from cache by Claude, such as your CLAUDE.md, tool definitions, previous messages, etc. Compared to reprocessing as input, this cost is 10 times lower.
If your Cache read number is high, it indicates effective cache utilization; if low, it means you’re paying repeatedly for the same context.
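As a rough illustration of how such a dashboard figure might be derived, the sketch below computes a cache hit rate from per-turn counters. The field names follow the usage counters reported by the Anthropic Messages API; how the author's dashboard actually gathers its data is not described, so the `TurnUsage` type and the sample numbers are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TurnUsage:
    input_tokens: int                 # uncached input processed at full price
    cache_creation_input_tokens: int  # "Cache create": written into cache this turn
    cache_read_input_tokens: int      # "Cache read": reused from cache


def cache_hit_rate(turns: list[TurnUsage]) -> float:
    """Share of all input tokens that were served from cache."""
    read = sum(t.cache_read_input_tokens for t in turns)
    total = read + sum(t.input_tokens + t.cache_creation_input_tokens for t in turns)
    return read / total if total else 0.0


# Made-up numbers for a three-turn session.
session = [
    TurnUsage(input_tokens=50, cache_creation_input_tokens=20_000, cache_read_input_tokens=0),
    TurnUsage(input_tokens=60, cache_creation_input_tokens=800, cache_read_input_tokens=20_000),
    TurnUsage(input_tokens=40, cache_creation_input_tokens=900, cache_read_input_tokens=20_800),
]
print(f"hit rate: {cache_hit_rate(session):.1%}")
```

The rate climbs as a session gets longer, because each additional turn reads the whole earlier prefix from cache while adding only a small amount of new input.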
A remark from Anthropic’s Thariq left a deep impression on me: "We actually monitor prompt cache hit rates, and if the hit rate drops too low, alerts are triggered and even SEV-level incidents are declared."
He also wrote a very good X article. When cache hit rate is high, four things happen simultaneously: Claude Code feels faster, Anthropic’s service costs decrease, your subscription quota lasts longer, and long coding sessions become more feasible.
But if the hit rate is low, everyone loses.
Thus, both sides’ incentives are aligned: Anthropic wants your cache hit rate to be high, and so do you. What really drags it down is a handful of seemingly trivial habits that quietly force the cache to be rebuilt.
Three-layer architecture: system, project, conversation, layered stacking
Cache relies on prefix matching—that is, "matching the beginning of the string."
No need to dive into too many technical details; just understand one point: as long as the content before a certain position matches exactly what’s cached, Claude can reuse that part of the cache tokens.
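Prefix matching can be pictured with a toy model. The sketch below compares a cached sequence with a new request block by block and counts how much of the beginning is identical. Real caching works on the request's actual content rather than on Python strings, so the block granularity and the name `cached_prefix_len` are simplifying assumptions.

```python
def cached_prefix_len(cached: list[str], request: list[str]) -> int:
    """Number of leading blocks in `request` that exactly match `cached`."""
    n = 0
    for old, new in zip(cached, request):
        if old != new:
            break
        n += 1
    return n


cached = ["system prompt", "CLAUDE.md", "user: fix the bug", "assistant: done"]

# Appending a new message keeps the whole cached prefix reusable.
appended = cached + ["user: now add tests"]
print(cached_prefix_len(cached, appended))  # 4

# Changing an early block invalidates everything from that point on.
edited = ["system prompt", "CLAUDE.md (edited)", "user: fix the bug", "assistant: done"]
print(cached_prefix_len(cached, edited))  # 1
```

The second case shows why a small edit near the start is costly: the later blocks are unchanged, yet they no longer sit behind an identical prefix.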
According to the Claude Code documentation, a new session usually proceeds as follows:
First turn: no cache yet. System prompt, your project context (like CLAUDE.md, memory, rules), and your first message are processed anew and written into cache.
Second turn: all content from the first turn is now cached. Claude only needs to process your new reply and next message. Costs are much lower here.
Third turn: same logic. The previous dialogue remains cached; only the latest exchange needs processing.
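The turn-by-turn pattern can be sketched as a small simulation. It assumes, for simplicity, that everything processed in one turn is readable from cache in the next and that nothing expires; the token counts and the function name `simulate_session` are hypothetical.

```python
def simulate_session(fixed_context: int, turn_sizes: list[int]) -> list[tuple[int, int]]:
    """Return (cache_read, newly_processed) token counts for each turn.

    fixed_context: tokens in the system prompt plus project context.
    turn_sizes: tokens added by each turn (user message plus reply).
    """
    cached = 0
    rows = []
    for i, size in enumerate(turn_sizes):
        # Only the first turn has to process the fixed context from scratch.
        new = size + (fixed_context if i == 0 else 0)
        rows.append((cached, new))
        cached += new  # everything processed this turn is cached for the next one
    return rows


for turn, (read, new) in enumerate(simulate_session(20_000, [500, 700, 600]), start=1):
    print(f"turn {turn}: read from cache={read:>6}, newly processed={new:>6}")
```

After the first turn, the newly processed column stays small while the cached column keeps growing, which is the effect the three steps describe.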
Most common "break" trap: model switching and 1-hour gaps
According to Thariq’s article, the cache itself can be divided into three layers:
System layer: includes core instructions, tool definitions (read, write, bash, grep, glob), and output styles. This is globally cached.
Project layer: includes CLAUDE.md, memory, project rules. Cached per project.
Conversation layer: includes replies and messages, which grow with each turn.
If anything changes in the system or project layer during a session, all content must be re-cached from scratch. This is the most "expensive" operation. Imagine you’re at message 16 and then change the system prompt or pause for an hour: every token from message 1 onward has to be reprocessed.
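To put a rough number on the message-16 example, the sketch below compares the next turn with a warm cache and with a broken one. It uses the article's 10% read price, expresses cost in full-price input tokens, and ignores any extra charge for writing the cache; the 80,000-token history is a made-up figure.

```python
CACHE_READ_DIVISOR = 10  # from the article: a cache read costs 10% of a regular input token


def next_turn_cost(history_tokens: int, new_tokens: int, cache_warm: bool) -> int:
    """Cost of the next turn, expressed in full-price input tokens."""
    history_cost = history_tokens // CACHE_READ_DIVISOR if cache_warm else history_tokens
    return history_cost + new_tokens


history = 80_000  # hypothetical size of messages 1-16 plus system and project context
print(next_turn_cost(history, 500, cache_warm=True))   # 8500
print(next_turn_cost(history, 500, cache_warm=False))  # 80500
```

Under these assumptions, a single turn after a cache break costs about as much as nine or ten turns on a warm cache.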
How long the cache lasts, its time-to-live (TTL), is the most commonly misunderstood part:
Claude Code subscription: default TTL is 1 hour.
Claude API: default TTL is 5 minutes. You can pay more to extend it to 1 hour.
Sub-agents on any plan: always 5 minutes.
Claude.ai web chat: no official documentation. Possibly the same as the subscription, but unconfirmed.
Engineer-made dashboard: view Cache Read and Create
Months ago, many complained that Claude quota was consumed too quickly. Some thought Anthropic secretly lowered TTL from 1 hour to 5 minutes without notice. But that’s not true—the TTL for Claude Code remains 1 hour.
The confusion arose because Claude Code and the API are documented separately, and their cache defaults are fundamentally different.
If you run many Sub-agent workflows or use the API directly, then the 5-minute figure matters a lot. But for 95% of Claude Code users, the real focus should be on that 1-hour window.
Here are the parts I find truly useful in daily use:
If you’ve been idle for over an hour, previous content has basically expired from cache. Your next message will rebuild the cache. In this case, instead of continuing an "expired" old session, it’s often cheaper to do a clean handoff and start a new session.
/compact and /clear always break the cache anyway, so this is a good point to run them and rebuild it.
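The idle-time rule can be written as a small check. The TTL values are the ones quoted in the article; measuring the gap from the last activity, and the names `TTL` and `cache_expired`, are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# TTL values as stated in the article.
TTL = {
    "claude_code_subscription": timedelta(hours=1),
    "api_default": timedelta(minutes=5),
    "sub_agent": timedelta(minutes=5),
}


def cache_expired(last_activity: datetime, now: datetime, context: str) -> bool:
    """True when the idle gap is at least as long as the TTL for this context."""
    return now - last_activity >= TTL[context]


last = datetime(2025, 1, 1, 9, 0)
print(cache_expired(last, datetime(2025, 1, 1, 9, 40), "claude_code_subscription"))   # False
print(cache_expired(last, datetime(2025, 1, 1, 10, 30), "claude_code_subscription"))  # True
print(cache_expired(last, datetime(2025, 1, 1, 9, 10), "sub_agent"))                  # True
```

When the check returns True for a long session, the next message pays for the full history either way, which is why a clean handoff into a new session can be the cheaper choice.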
Practical tip: Session Handoff saves more than /compact
I’ve developed a session handoff skill to replace /compact. It summarizes what’s been done, pending decisions, key files, and where to continue. Then I run /clear, paste this summary in, and continue as if nothing was interrupted.
The /compact command can sometimes be slow. This handoff usually takes less than a minute.
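The author's handoff skill is not shown, so the following is only a hypothetical sketch of what such a summary might contain, built from the four items listed above. The `Handoff` class, its field names, and the output layout are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    done: list[str] = field(default_factory=list)
    pending_decisions: list[str] = field(default_factory=list)
    key_files: list[str] = field(default_factory=list)
    resume_at: str = ""

    def render(self) -> str:
        """Format the summary as plain text to paste after /clear."""
        def section(title: str, items: list[str]) -> list[str]:
            return [f"{title}:"] + [f"- {item}" for item in (items or ["(none)"])]

        lines = section("Done so far", self.done)
        lines += section("Pending decisions", self.pending_decisions)
        lines += section("Key files", self.key_files)
        lines.append(f"Continue from: {self.resume_at or '(unspecified)'}")
        return "\n".join(lines)


note = Handoff(
    done=["Extracted the parser into its own module"],
    key_files=["src/parser.py"],
    resume_at="write unit tests for the parser",
)
print(note.render())
```

Keeping the summary short matters here: it becomes part of the new session's first turn, so every line in it is processed at full price once before it can be cached.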
Claude.ai’s cache mechanism isn’t fully documented officially, but Projects clearly use different optimization strategies than regular conversation threads. So, if you want to paste large files, it’s better to put them into a Project rather than directly into the conversation.
Certain actions can rebuild the cache without obvious warning:
Model switching: cache relies on prefix matching, and each model has its own cache. After a switch, the next request has to reprocess the entire history with no cache hits.
"Opus plan" mode: this setting uses Opus during planning and Sonnet during execution. I have recommended it in some token optimization videos, and for good reason. But each switch between planning and execution is essentially a model switch, which means a cache rebuild. Over the long run it still helps stretch your quota, but you should understand what’s happening under the hood.
Editing CLAUDE.md mid-conversation is okay: changes won’t take effect immediately, only after restart. So current cache isn’t affected.
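Why a model switch misses the cache can be shown with a toy model in which each model keeps its own record of the prefix it has already seen. This is a sketch of the behavior described above, not of Anthropic's implementation; the class name and the model labels are illustrative.

```python
class PerModelCache:
    """Toy cache: each model keeps its own record of the prefix it has seen."""

    def __init__(self) -> None:
        self._seen: dict[str, list[str]] = {}

    def request(self, model: str, messages: list[str]) -> int:
        """Return how many leading messages hit the cache, then store the new prefix."""
        previous = self._seen.get(model, [])
        hits = 0
        for old, new in zip(previous, messages):
            if old != new:
                break
            hits += 1
        self._seen[model] = list(messages)
        return hits


cache = PerModelCache()
history = ["system", "CLAUDE.md", "user: plan the refactor"]
print(cache.request("opus", history))    # 0: first request, nothing cached yet
history.append("assistant: here is the plan")
print(cache.request("opus", history))    # 3: the earlier prefix is reused
print(cache.request("sonnet", history))  # 0: different model, separate cache
```

The history sent to the second model is identical, yet nothing is reused, because the lookup happens in a cache that model has never written to.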
The screenshot I showed earlier comes from a token dashboard.