Saving 300 million tokens in a week: Anthropic engineer's Claude Code caching guide
Editor's Note: Many people feel that when using Claude Code, token consumption is too fast and long conversations easily eat up your quota. But from the perspective of Anthropic engineers, what truly impacts costs is often not how much code you write, but whether the system continuously reuses already processed context.
The core of this article is how to save tokens through caching. The author reused over 300 million tokens via caching in one week, with a single-day cache volume of 91 million. Since cached tokens cost only 10% of regular input tokens, 91 million cached tokens are effectively billed as about 9 million regular tokens. Long Claude Code sessions last longer on the same quota not because the model works for free, but because a large amount of repeated context is successfully reused.
The key to prompt caching is "not breaking the cache." Claude Code organizes system prompts, tool definitions, CLAUDE.md, project rules, and conversation history into cache layers; as long as subsequent requests share the same prefix, Claude can read from the cache instead of reprocessing the entire context. Anthropic also monitors prompt cache hit rates internally, because they affect not only user quotas but also model serving costs and operational efficiency.
Ordinary users do not need to understand every underlying detail. A few habits are enough: do not let conversations idle for more than an hour; hand off the session when switching tasks; avoid frequent model switching; and put large documents into Projects instead of pasting them into conversations repeatedly.
This article is less a token-saving trick than an engineering-minded way of using Claude Code: treat context as an asset, keep the cache continuously reusable, and minimize repeated computation in long sessions.
Below is the original text:
This week, prompt caching saved me more than 300 million tokens, including 91 million in a single day.
I haven't changed any settings. This is just prompt caching working normally in the background.
But once I truly understood what caching is and how to avoid "breaking" the cache, I could sustain longer conversations within the same quota. So, here is a simple 80/20 beginner’s guide to Claude Code prompt caching, without deep API-level details.
TL;DR
Cached tokens cost only 10% of regular input tokens. 91 million cached tokens are billed roughly as 9 million tokens.
The Claude Code subscription TTL is 1 hour; the API default is 5 minutes; Sub-agents are always 5 minutes.
Cache layers are threefold: system layer, project layer, conversation layer.
Switching models mid-conversation breaks the cache, including when enabling "opus plan" mode.
How is cache billing calculated?
Each cached token costs 10% of a regular input token.
So when my dashboard shows that 91 million tokens were cached on a given day, the actual bill is roughly equivalent to processing 9 million regular tokens. This is why, with caching, extending a long Claude Code conversation feels almost "free" compared with reprocessing everything on every turn.
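The arithmetic above can be sketched in a few lines. This is only an illustration of the 10% rule stated in this article; the function name and the constant are made up for the example, and cache-creation costs are ignored.

```python
# Illustrative only: the 10% ratio is the figure quoted in this article.
CACHE_READ_PRICE_PERCENT = 10


def effective_input_tokens(cached_tokens: int) -> float:
    """Regular input tokens that would cost the same as `cached_tokens` cache reads."""
    return cached_tokens * CACHE_READ_PRICE_PERCENT / 100


if __name__ == "__main__":
    daily_cached = 91_000_000    # single-day figure from the dashboard
    weekly_cached = 300_000_000  # weekly figure, rounded down
    print(f"daily:  {effective_input_tokens(daily_cached):,.0f}")   # 9,100,000
    print(f"weekly: {effective_input_tokens(weekly_cached):,.0f}")  # 30,000,000
```

In other words, the 91 million cached tokens are billed like about 9.1 million regular input tokens.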
Two dashboard figures are worth noting:
Cache create: the one-time cost of writing content into the cache. The written content becomes reusable from the next turn onward.
Cache read: tokens that Claude reuses from the cache, such as your CLAUDE.md, tool definitions, and previous messages. A cache read costs one-tenth as much as reprocessing the same tokens as input.
If your cache read number is high, it indicates effective cache utilization; if low, you're paying repeatedly for the same context.
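One rough way to judge whether the cache read number is "high" is to look at it as a share of all input-side tokens. The sketch below is an assumption about how you might compute that share from the dashboard figures; the function name and the sample numbers are hypothetical.

```python
def cache_read_share(input_tokens: int, cache_create_tokens: int, cache_read_tokens: int) -> float:
    """Fraction of input-side tokens that were served from cache (0.0 to 1.0)."""
    total = input_tokens + cache_create_tokens + cache_read_tokens
    if total == 0:
        return 0.0  # no traffic yet, so nothing to measure
    return cache_read_tokens / total


if __name__ == "__main__":
    # Hypothetical daily figures, not taken from the article.
    share = cache_read_share(input_tokens=1_000, cache_create_tokens=1_000, cache_read_tokens=8_000)
    print(f"cache read share: {share:.0%}")
```

A share close to 1 means most of the context is being reused; a low share means the same context is being paid for again and again.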
Thariq from Anthropic once said: "We actually monitor prompt cache hit rates, and if the hit rate drops too low, an alert is triggered, even an SEV-level incident is declared."
He also wrote a very good X article. When cache hit rates are high, four things happen simultaneously: Claude Code feels faster, Anthropic’s service costs decrease, your subscription quota lasts longer, and long coding sessions become more feasible.
But if the hit rate is low, everyone suffers.
Thus, both sides have aligned incentives: Anthropic wants higher cache hit rates, and users also want higher hit rates. The real obstacles are some seemingly minor habits that quietly reset the cache.
How does the cache grow with each turn?
The cache relies on prefix matching: a request can reuse cached content only as far as its beginning matches what was cached before.
No need to delve too deep into technical details—just understand this: as long as the content before a certain point matches exactly with cached content, Claude can reuse those cached tokens.
According to the Claude Code documentation, a new session typically unfolds as follows:
First turn: no cache yet. System prompts, your project context (like CLAUDE.md, memory, rules), and your first message are processed anew and written into cache.
Second turn: all content from the first turn is now cached. Claude only needs to process its previous reply and your next message, so the cost is much lower.
Third turn: same logic. The previous dialogue remains in the cache; only the latest exchange needs to be processed.
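The three turns above can be simulated with a toy model. This is a simplified sketch, not how Claude Code is implemented: it counts whole blocks (system prompt, project context, messages) instead of tokens, and all names are hypothetical.

```python
def shared_prefix_len(cached: list[str], request: list[str]) -> int:
    """Number of leading blocks that are identical in both lists."""
    count = 0
    for old, new in zip(cached, request):
        if old != new:
            break
        count += 1
    return count


def simulate(user_messages: list[str]) -> list[tuple[int, int]]:
    """Return (cached_blocks, newly_processed_blocks) for each turn."""
    cached: list[str] = []                 # what the previous request wrote to cache
    context = ["system", "project"]        # sent at the start of every request
    report = []
    for message in user_messages:
        request = context + [message]
        hits = shared_prefix_len(cached, request)
        report.append((hits, len(request) - hits))
        cached = request                               # this request is now cached
        context = request + [f"reply to {message}"]    # the reply joins the history
    return report


if __name__ == "__main__":
    for turn, (hits, new) in enumerate(simulate(["msg 1", "msg 2", "msg 3"]), start=1):
        print(f"turn {turn}: {hits} blocks from cache, {new} processed fresh")
```

In this model the first turn processes everything fresh, and each later turn processes only the previous reply plus the new message.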
According to Thariq’s X article, the cache itself is divided into three layers:
System layer: includes basic instructions, tool definitions (read, write, bash, grep, glob), and output styles. This layer is globally cached.
Project layer: includes CLAUDE.md, memory, project rules. This layer is cached per project.
Conversation layer: includes replies and messages, which grow with each turn.
If anything in the system layer or project layer changes during a session, all content must be re-cached from scratch. This is the most "expensive" operation. Imagine: after 16 messages, you suddenly change the system prompt or pause for more than an hour, and every token from message 1 onward has to be reprocessed.
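The 16-message scenario can be illustrated with the same prefix-matching idea. The sketch below is a toy model under the assumption that the system layer sits at the very front of every request; block names are hypothetical and blocks stand in for tokens.

```python
from itertools import takewhile


def cached_blocks(previous: list[str], current: list[str]) -> int:
    """Count leading blocks of `current` that match `previous` exactly."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(previous, current)))


# A session after 16 messages: system layer, project layer, then the conversation.
previous = ["system v1", "CLAUDE.md v1"] + [f"message {i}" for i in range(1, 17)]

# Case 1: nothing earlier changed, one message is appended.
appended = previous + ["message 17"]

# Case 2: the system prompt changed, so the very first block differs.
system_changed = ["system v2"] + previous[1:] + ["message 17"]

if __name__ == "__main__":
    print("appended:      ", cached_blocks(previous, appended), "of", len(appended))
    print("system changed:", cached_blocks(previous, system_changed), "of", len(system_changed))
```

Because matching stops at the first difference, a change at the front leaves nothing after it reusable, even though messages 1 to 16 are unchanged.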
Confusion about 1 hour vs. 5 minutes
This is the most common misunderstanding.
Claude Code subscription: default TTL is 1 hour.
Claude API: default TTL is 5 minutes. You can pay more to extend it to 1 hour.
Sub-agents on any plan: always 5 minutes.
Claude.ai web chat: not officially documented. Possibly the same as the subscription, but unconfirmed.
A few months ago, many complained that Claude subscription quotas were consumed too quickly. Some thought Anthropic quietly reduced TTL from 1 hour to 5 minutes without notification. But that’s not true—the TTL for Claude Code remains 1 hour.
The confusion arises because Claude Code and the API are documented separately and have different defaults, so figures from one are easily misapplied to the other.
If you run a lot of Sub-agent workflows or use the API directly, the 5-minute figure is important. But for 95% of Claude Code users, the real focus should be on that 1-hour window.
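The TTL figures listed above can be turned into a small lookup. This is a sketch using only the numbers quoted in this article; the dictionary keys and the function are hypothetical, and treating an idle time exactly equal to the TTL as expired is an assumption.

```python
from datetime import timedelta

# TTLs as quoted in this article; Claude.ai web chat is omitted because it is unconfirmed.
CACHE_TTL = {
    "claude_code_subscription": timedelta(hours=1),
    "api_default": timedelta(minutes=5),
    "sub_agent": timedelta(minutes=5),
}


def cache_still_warm(context: str, idle: timedelta) -> bool:
    """True if the idle time is still inside the TTL for the given context."""
    return idle < CACHE_TTL[context]


if __name__ == "__main__":
    coffee_break = timedelta(minutes=20)
    for context in CACHE_TTL:
        state = "warm" if cache_still_warm(context, coffee_break) else "expired"
        print(f"{context}: {state} after {coffee_break}")
```

A 20-minute break is harmless in a Claude Code subscription session but long enough to expire the cache for a Sub-agent or a default API call.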
Three habits that cover 95% of users
Here are the practices I find most useful in daily use:
Don’t leave sessions idle too long
If you’re idle for over an hour, most previous content has expired from cache. Your next message will rebuild the cache. In this case, instead of continuing an "aged" session, it’s better to do a clear handoff and start fresh, which is usually cheaper.
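To see why a fresh start can be cheaper, compare how many tokens have to be processed without cache help in each case. The numbers below are invented for illustration, the function names are hypothetical, and the sketch ignores any extra cost of writing to the cache.

```python
def resume_expired_session(base_context: int, conversation: int, message: int) -> int:
    """Tokens processed fresh when the whole old session has dropped out of cache."""
    return base_context + conversation + message


def fresh_session_with_handoff(base_context: int, summary: int, message: int) -> int:
    """Tokens processed fresh when starting over with a short handoff summary."""
    return base_context + summary + message


if __name__ == "__main__":
    # Hypothetical sizes: system + project context, old conversation, summary, new message.
    base, conversation, summary, message = 20_000, 130_000, 2_000, 500
    print("resume expired session:", resume_expired_session(base, conversation, message))
    print("handoff + fresh start: ", fresh_session_with_handoff(base, summary, message))
```

The gap grows with the length of the old conversation, which is why long idle sessions are the ones most worth handing off.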
When switching tasks, start a fresh session
Both /compact and /clear break the cache anyway, so a task switch is the natural point to reset.
I’ve developed a session handoff skill to replace /compact. It summarizes what’s been done, pending decisions, key files, and next steps. Then I execute /clear and paste this summary in, allowing continuation as if nothing was interrupted.
The handoff usually takes less than a minute, whereas /compact can sometimes be slow.
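The handoff summary described above has four parts: what has been done, pending decisions, key files, and next steps. The sketch below is a hypothetical template for such a note, not the author's actual skill; the class, field names, and sample entries are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """A plain-text note to paste into a fresh session after /clear."""
    done: list[str] = field(default_factory=list)
    pending_decisions: list[str] = field(default_factory=list)
    key_files: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        sections = [
            ("Done so far", self.done),
            ("Pending decisions", self.pending_decisions),
            ("Key files", self.key_files),
            ("Next steps", self.next_steps),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            # Keep empty sections visible so nothing is silently skipped.
            lines.extend(f"- {item}" for item in items or ["(none)"])
        return "\n".join(lines)


if __name__ == "__main__":
    note = Handoff(
        done=["Added login form"],
        key_files=["src/login.py"],
        next_steps=["Write tests for the login form"],
    )
    print(note.render())
```

Keeping the note short matters: it becomes part of the new session's context, so every extra line is processed again on the first turn.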
For large documents, put them into Projects whenever possible
Claude.ai’s cache mechanism isn’t fully documented, but Projects clearly use different optimization strategies than regular conversations. So, if you need to paste large documents, it’s better to put them into a Project rather than directly into the chat.
What actions can quietly break the cache?
Several things can reset the cache without obvious warning.
Switching models: the cache depends on prefix matching, and each model has its own cache. After a model change, the next request reprocesses the entire history with no cache hits.
"Opus plan" mode: this setting uses Opus during planning and Sonnet during execution. I have recommended it in some token optimization videos, and for good reason. But moving between planning and execution means switching models, which resets the cache. Over the long run it still helps stretch your session quota, but you should know what’s happening under the hood.
Editing CLAUDE.md mid-session is okay: changes won’t take effect immediately and will only apply after restart. So, current cache remains unaffected.
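The model-switching point can be pictured as a cache that is keyed by model. The class below is a toy model under that assumption, not Anthropic's implementation; the model labels are just strings, blocks stand in for tokens, and TTL expiry is ignored.

```python
class PerModelPrefixCache:
    """Toy cache: each model remembers only the last request it has seen."""

    def __init__(self) -> None:
        self._last_request: dict[str, list[str]] = {}

    def request(self, model: str, blocks: list[str]) -> int:
        """Send `blocks` to `model`; return how many leading blocks hit its cache."""
        cached = self._last_request.get(model, [])
        hits = 0
        for old, new in zip(cached, blocks):
            if old != new:
                break
            hits += 1
        self._last_request[model] = list(blocks)
        return hits


if __name__ == "__main__":
    cache = PerModelPrefixCache()
    history = ["system", "project", "msg 1"]
    print("opus, turn 1:  ", cache.request("opus", history))
    history += ["reply 1", "msg 2"]
    print("opus, turn 2:  ", cache.request("opus", history))
    history += ["reply 2", "msg 3"]
    print("sonnet, turn 3:", cache.request("sonnet", history))  # different model
    history += ["reply 3", "msg 4"]
    print("sonnet, turn 4:", cache.request("sonnet", history))
```

The first request after the switch gets no hits even though the history is unchanged; hits resume once the new model has seen the conversation.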
My free Token Dashboard
The screenshot I showed earlier comes from a token dashboard.
It’s a simple GitHub repo. You give Claude Code the link, and it sets the dashboard up on localhost. The dashboard reads all your past sessions rather than starting from zero, so you can see daily input, output, cache create, and cache read stats right away.
Note: this dashboard tracks tokens on your local device. If you switch from desktop to laptop, the numbers won’t match exactly. Each device has its own stats view.
Summary
Prompt caching is a deep topic. Thariq’s article covers it more comprehensively. If you want the full picture, it’s worth reading.
But you don’t need to understand every detail to benefit. Just grasp the key 80/20: cached tokens cost one-tenth as much as regular input tokens; Claude Code’s TTL is 1 hour; changing models breaks the cache; and clearing and handing off between tasks is usually more cost-effective than letting an old session "expire" and then resuming it.