300 million tokens saved in a week: an Anthropic engineer's Claude Code caching guide
Title: How Anthropic Engineers Actually Save Tokens
Author: Nate Herk
Translation: Peggy, BlockBeats
Author: Rhythm BlockBeats
Source:
Repost: Mars Finance
Editor's note: For many Claude Code users, the most immediate impression is that tokens are consumed too quickly and that long conversations easily eat up the quota. From the perspective of Anthropic engineers, however, what really drives cost is often not how much code you write, but whether the system keeps reusing context it has already processed.
The core of this article is how to save tokens through caching. The author reused over 300 million tokens in a week via caching, with a single-day peak of 91 million. Since cached tokens cost only about 10% of regular input tokens, 91 million cached tokens are effectively billed as roughly 9 million regular tokens. Long Claude Code sessions seem more "durable" not because the model works for free, but because a large amount of repeated context is successfully reused.
The key to prompt caching is "not breaking the cache." Claude Code organizes system prompts, tool definitions, CLAUDE.md, project rules, and conversation history into cache layers; as long as subsequent requests share the same prefix, Claude can read directly from the cache instead of reprocessing the entire context. Anthropic also monitors prompt cache reuse rates internally, because they affect not only user quotas but also model serving costs and operational efficiency.
For ordinary users, there's no need to understand all the underlying details. A few key habits are enough: do not let conversations sit idle for more than an hour; perform a session handoff when switching tasks; avoid frequent model switching; and put large documents into Projects instead of repeatedly pasting them into conversations.
This article is less about a token-saving trick and more about providing a Claude Code usage approach closer to engineering thinking: treat context as an asset, keep cache reuse ongoing, and minimize repetitive calculations in long sessions.
Below is the original text:
I saved over 300 million tokens this week, including 91 million in a single day.
I didn't change any settings. This is just prompt caching working normally in the background.
But once I truly understood what caching is and how to avoid "breaking" the cache, my sessions could last longer under the same quota. So, here is a simplified 80/20 beginner's guide to Claude Code prompt caching, without deep API-level details.
TL;DR
Cached tokens cost only about 10% of regular input tokens. 91 million cached tokens are billed roughly like 9 million regular tokens.
The Claude Code subscription TTL is 1 hour; the API default is 5 minutes; sub-agents always use 5 minutes.
Caching is divided into three layers: system layer, project layer, conversation layer.
Switching models mid-session will break the cache, including when enabling "opus plan" mode.
How is cache billing calculated?
Each cached token costs about 10% of a regular input token.
So, when my dashboard shows 91 million tokens hit the cache in a day, the actual billing is roughly equivalent to processing 9 million tokens. This is why, compared to no caching, long-term use of Claude Code can feel almost "free" in extending conversations.
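The arithmetic above can be written out as a small sketch. The 10% rate is the approximate figure quoted in this article, not an exact price, and the function name is made up for illustration.

```python
# Assumption from the article: a cache-read token is billed at roughly 10%
# of a regular input token. Actual pricing may differ.
CACHE_READ_RATE = 0.10


def input_equivalent_tokens(cache_read_tokens: int, rate: float = CACHE_READ_RATE) -> float:
    """Convert cache-read tokens into the number of regular input tokens
    that would cost about the same."""
    return cache_read_tokens * rate


daily = input_equivalent_tokens(91_000_000)
weekly = input_equivalent_tokens(300_000_000)
print(f"91M cached tokens  ~ {daily / 1e6:.1f}M regular input tokens")
print(f"300M cached tokens ~ {weekly / 1e6:.1f}M regular input tokens")
```

With these inputs the daily figure comes out at about 9.1 million, which the article rounds to 9 million.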
Two dashboard figures worth noting:
Cache create: the one-time cost of writing content into the cache. The benefit starts with the next turn.
Cache read: tokens Claude reuses from the cache, such as your CLAUDE.md, tool definitions, and previous messages. These cost about one tenth of what reprocessing them as regular input would cost.
If your cache read number is high, you're effectively utilizing cache; if it's low, you're paying repeatedly for the same context.
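One rough way to turn the two dashboard figures into a single number is the share of input-side tokens that were served from cache. This is an illustrative definition only; it is not necessarily how Anthropic computes its internal hit rate, and the sample numbers are invented.

```python
def cache_read_share(input_tokens: int, cache_create_tokens: int, cache_read_tokens: int) -> float:
    """Fraction of all input-side tokens that were read from cache.

    Illustrative metric only: uncached input, cache writes, and cache reads
    are all counted as input-side tokens.
    """
    total = input_tokens + cache_create_tokens + cache_read_tokens
    if total == 0:
        return 0.0
    return cache_read_tokens / total


# Hypothetical day: little fresh input, some cache writes, mostly cache reads.
share = cache_read_share(input_tokens=1_000, cache_create_tokens=9_000, cache_read_tokens=90_000)
print(f"cache read share: {share:.0%}")
```

A share close to 100% means most of the context is being reused; a low share suggests the same context is being paid for repeatedly.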
Anthropic's Thariq once said: "We actually monitor prompt cache hit rates, and if the hit rate drops too low, it triggers an alert or even a SEV-level incident."
He also wrote a very good X article. When cache hit rates are high, four things happen simultaneously: Claude Code feels faster, Anthropic's service costs decrease, your subscription quota lasts longer, and long coding sessions become more feasible.
But if the hit rate is low, everyone suffers.
Thus, both sides have aligned incentives: Anthropic wants higher cache hit rates, and users want the same. The real obstacles are some seemingly minor habits that quietly reset the cache.
How does the cache grow with each turn?
Caching relies on prefix matching.
Without diving too deep into technical details, you only need to understand: as long as the content before a certain point matches exactly with cached content, Claude can reuse that part of the cache tokens.
A new session roughly unfolds like this:
According to Claude Code documentation, a brand-new session typically runs as follows:
First turn: no cache yet. The system prompt, your project context (such as CLAUDE.md, memory, and rules), and your first message are all processed from scratch and written into the cache.
Second turn: everything from the first turn is now cached. Claude only needs to process the new reply and your next message, which reduces costs significantly.
Third turn: same logic. The previous turns remain cached; only the latest exchange needs to be processed.
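The turn-by-turn growth described above can be pictured with a toy simulation. This is a simplified model, not Claude Code's implementation: the context is treated as a list of whole blocks rather than tokens, TTL is ignored, and all names are hypothetical.

```python
def shared_prefix_len(cached: list[str], request: list[str]) -> int:
    """Count how many leading blocks of the request match the cached blocks exactly."""
    count = 0
    for old, new in zip(cached, request):
        if old != new:
            break
        count += 1
    return count


def simulate(turns: list[list[str]]) -> list[tuple[int, int]]:
    """For each turn, return (blocks read from cache, blocks processed fresh)."""
    cached: list[str] = []
    report = []
    for request in turns:
        hits = shared_prefix_len(cached, request)
        report.append((hits, len(request) - hits))
        cached = list(request)  # the whole request becomes the new cached prefix
    return report


base = ["system prompt", "CLAUDE.md"]
turn1 = base + ["user: message 1"]
turn2 = turn1 + ["assistant: reply 1", "user: message 2"]
turn3 = turn2 + ["assistant: reply 2", "user: message 3"]

report = simulate([turn1, turn2, turn3])
for number, (hits, fresh) in enumerate(report, start=1):
    print(f"turn {number}: {hits} blocks from cache, {fresh} processed fresh")
```

In this model the first turn has nothing to reuse, and every later turn only pays full price for the newest reply and message.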
Cache itself can be divided into three layers:
From Thariq's X article:
System layer: includes basic instructions, tool definitions (read, write, bash, grep, glob), and output styles. This is the global cache layer.
Project layer: includes CLAUDE.md, memory, project rules. Cached per project.
Conversation layer: includes replies and messages, which grow with each turn.
If the system or project layer content changes during a session, everything must be re-cached from scratch. This is the most "expensive" operation. Imagine you have reached message 16 and then suddenly change the system prompt, or pause for an hour: all tokens from message 1 onward need to be reprocessed.
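To see why a change in an early layer is so expensive, one can model each layer's cache key as a hash of everything up to and including that layer. This is only a mental model built on the prefix-matching idea; how Anthropic actually keys its cache is not described in this article, and the layer contents below are placeholders.

```python
import hashlib


def layer_keys(layers: list[str]) -> list[str]:
    """Return one key per layer; each key covers that layer and all layers before it."""
    running = hashlib.sha256()
    keys = []
    for content in layers:
        running.update(content.encode("utf-8"))
        running.update(b"\x00")  # separator so layer boundaries stay unambiguous
        keys.append(running.copy().hexdigest())
    return keys


def reusable_layers(old: list[str], new: list[str]) -> int:
    """Count leading layers whose keys are unchanged between two requests."""
    count = 0
    for old_key, new_key in zip(layer_keys(old), layer_keys(new)):
        if old_key != new_key:
            break
        count += 1
    return count


original = ["system v1", "CLAUDE.md v1", "message 1", "message 2"]
appended = original + ["message 3"]
changed_project = ["system v1", "CLAUDE.md v2", "message 1", "message 2"]
changed_system = ["system v2", "CLAUDE.md v1", "message 1", "message 2"]

print(reusable_layers(original, appended))         # all 4 earlier layers reusable
print(reusable_layers(original, changed_project))  # only the system layer survives
print(reusable_layers(original, changed_system))   # nothing survives
```

Appending to the conversation keeps every earlier key intact, while touching the system layer changes every key after it.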
The 1-hour vs. 5-minute TTL confusion
This is the most common misunderstanding.
Claude Code subscription: default TTL is 1 hour.
Claude API: default TTL is 5 minutes. You can pay more to extend it to 1 hour.
Sub-agents on any plan: always 5 minutes.
Claude.ai web chat: not officially documented. Possibly the same as the subscription, but unconfirmed.
A few months ago, many complained that Claude subscription quota was consumed too quickly. Some thought Anthropic quietly reduced TTL from 1 hour to 5 minutes without notice. But that’s not true—the TTL for Claude Code remains 1 hour.
The confusion arose because Claude Code and the API are fundamentally different systems with separate documentation, so figures from one were mistaken for the other.
If you're running many Sub-agent workflows or directly using the API, the 5-minute TTL matters. But for 95% of Claude Code users, the critical window is the 1-hour TTL.
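The TTL values listed above can be captured in a small lookup. This is a sketch using the figures stated in this article; the dictionary keys and the function name are made up, and Claude.ai web chat is left out because its TTL is unconfirmed.

```python
from datetime import timedelta

# TTL values as stated in this article; the keys are illustrative labels.
CACHE_TTL = {
    "claude_code_subscription": timedelta(hours=1),
    "api_default": timedelta(minutes=5),
    "sub_agent": timedelta(minutes=5),
}


def cache_likely_expired(context: str, idle: timedelta) -> bool:
    """Return True if the time since the last request exceeds the TTL for this context."""
    return idle > CACHE_TTL[context]


idle = timedelta(minutes=20)
for context in CACHE_TTL:
    state = "likely expired" if cache_likely_expired(context, idle) else "likely still warm"
    print(f"{context}: {state} after {idle}")
```

A 20-minute coffee break is harmless in a Claude Code subscription session but is well past the window for API calls and sub-agents.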
Three habits to cover 95% of users' needs
Here are the practical tips I find most useful in daily use:
Don’t stay idle for too long
If you’re idle for over an hour, previous content has likely expired from cache. Your next message will rebuild the cache. In this case, instead of continuing an "aged" session, it’s better to do a clear handoff and start fresh, which is usually cheaper.
When switching tasks, start a fresh session
Running /compact or /clear breaks the cache anyway, so a task switch is the natural point to reset.
I made a session handoff skill to replace /compact. It summarizes what we’ve done, pending decisions, key files, and where to continue. Then I run /clear and paste this summary, allowing seamless continuation as if nothing was interrupted.
The handoff usually takes less than a minute, whereas /compact can sometimes be slow.
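The author's handoff skill itself is not shown, so the following is only a hypothetical sketch of what such a summary could contain, based on the four items listed above. The class, field names, and sample file names are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Hypothetical container for the four items a handoff summary covers."""
    done: list[str] = field(default_factory=list)
    pending_decisions: list[str] = field(default_factory=list)
    key_files: list[str] = field(default_factory=list)
    resume_at: str = ""

    def render(self) -> str:
        """Build a plain-text summary that can be pasted into a fresh session."""
        sections = [
            ("What we've done", self.done),
            ("Pending decisions", self.pending_decisions),
            ("Key files", self.key_files),
        ]
        lines = ["Session handoff"]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in (items or ["(none)"]))
        lines.append(f"Where to continue: {self.resume_at or '(unspecified)'}")
        return "\n".join(lines)


# Sample content is made up purely to show the output shape.
summary = Handoff(
    done=["Added login endpoint"],
    pending_decisions=["Session cookies or tokens?"],
    key_files=["auth.py", "tests/test_auth.py"],
    resume_at="Write the logout handler",
).render()
print(summary)
```

The point of the structure is that the new session starts from a short, stable summary rather than from a long history that would have to be rebuilt in the cache anyway.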
For large documents, put them into Projects rather than pasting into the chat
While Claude.ai’s cache mechanism isn’t officially detailed, Projects clearly use a different optimization than regular conversations. So, for large documents, it’s best to put them into a Project rather than directly into the chat.
What actions can quietly break the cache?
Several actions can reset the cache without obvious warning:
Switching models: the cache depends on prefix matching, and each model has its own cache. After a model switch, the next request has to reprocess the entire history with no cache hits.
"Opus plan" mode: this setting uses Opus during planning and Sonnet during execution. I recommended it in some token optimization videos for a reason. But each switch between planning and execution is essentially a model switch, which means the cache is rebuilt. In the long run it still helps extend your quota, but you should understand what’s happening under the hood.
Editing CLAUDE.md mid-session: the change doesn’t take effect immediately; it applies only after the next restart, so the current cache remains unaffected until then.
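The effect of model switching described above can be illustrated with a toy cache that keeps a separate prefix per model. This is a simplified model, not the real service: context is a list of blocks, TTL is ignored, and "opus" and "sonnet" are just labels.

```python
class PerModelCache:
    """Toy cache that stores one cached prefix per model label."""

    def __init__(self) -> None:
        self._cached: dict[str, list[str]] = {}

    def send(self, model: str, blocks: list[str]) -> int:
        """Send a request and return how many leading blocks hit this model's cache."""
        previous = self._cached.get(model, [])
        hits = 0
        for old, new in zip(previous, blocks):
            if old != new:
                break
            hits += 1
        self._cached[model] = list(blocks)
        return hits


cache = PerModelCache()
history = ["system prompt", "CLAUDE.md", "message 1"]

first = cache.send("opus", history)                          # cold start
second = cache.send("opus", history + ["message 2"])         # same model, prefix reused
switched = cache.send("sonnet", history + ["message 2", "message 3"])  # new model, cold
after = cache.send("sonnet", history + ["message 2", "message 3", "message 4"])
print(first, second, switched, after)
```

In this model the first request after the switch gets no hits even though the history is identical, and reuse only resumes once the new model has built its own cache.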
My free Token dashboard
The screenshot I showed earlier comes from a token dashboard.
It’s a simple GitHub repo. You give the link to Claude Code, and it deploys locally on localhost, reading all your past sessions instead of starting from scratch. You can see daily input, output, cache create, and cache read stats right away.
Note: this dashboard tracks tokens on your local device. If you switch from desktop to laptop, the numbers won’t match exactly. Each device has its own stats view.
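The dashboard's code is not reproduced here, so the following is only a sketch of the kind of per-day aggregation it describes. It assumes each record carries an ISO timestamp and a usage dictionary with field names modeled on the usage fields returned by the Claude API; the actual log format read by the dashboard may differ.

```python
from collections import defaultdict

# Field names modeled on the Claude API usage object; treat them as an assumption here.
FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def daily_totals(records: list[dict]) -> dict[str, dict[str, int]]:
    """Sum token usage per calendar day from records shaped like
    {"timestamp": "2025-01-01T10:00:00Z", "usage": {...}} (hypothetical shape)."""
    totals: dict[str, dict[str, int]] = defaultdict(lambda: dict.fromkeys(FIELDS, 0))
    for record in records:
        day = record["timestamp"][:10]  # keep the YYYY-MM-DD part
        usage = record.get("usage") or {}
        for name in FIELDS:
            totals[day][name] += usage.get(name) or 0
    return dict(totals)


# Invented sample records, only to show the shape of the result.
sample = [
    {"timestamp": "2025-01-01T10:00:00Z",
     "usage": {"input_tokens": 10, "output_tokens": 200,
               "cache_creation_input_tokens": 5000, "cache_read_input_tokens": 0}},
    {"timestamp": "2025-01-01T10:05:00Z",
     "usage": {"input_tokens": 12, "output_tokens": 150, "cache_read_input_tokens": 5000}},
    {"timestamp": "2025-01-02T09:00:00Z", "usage": {"input_tokens": 8}},
]
totals = daily_totals(sample)
for day, stats in sorted(totals.items()):
    print(day, stats)
```

Because the totals are built from records on the local machine, running the same aggregation on a second device would naturally produce different numbers.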
Summary
Prompt caching is a deep topic. Thariq’s article covers it more comprehensively if you want the full picture.
But you don’t need to understand every detail to benefit. Just grasp the key 80/20: cached tokens cost about one tenth as much as regular input tokens; Claude Code’s TTL is 1 hour; switching models breaks the cache; and a clear handoff between tasks is usually more cost-effective than letting an old session "expire" and then continuing it.