Saving 300 Million Tokens in a Week: An Anthropic Engineer's Guide to Claude Code Caching

Title: How Anthropic Engineers Actually Save Tokens
Author: Nate Herk
Translation: Peggy, BlockBeats

Author: Rhythm BlockBeats

Source:

Repost: Mars Finance

Editor's note: For many Claude Code users, the most immediate frustration is that tokens burn too quickly and long conversations eat through their quota. From the perspective of Anthropic's engineers, though, what really drives cost is often not how much code you write but whether the system keeps reusing context it has already processed.

This article is about saving tokens through caching. The author reused over 300 million tokens within a week via caching, with a single-day cache volume of 91 million. Since cached tokens cost only about 10% of regular input tokens, those 91 million cached tokens are billed as roughly 9 million regular tokens. Long Claude Code sessions seem more "durable" not because the model works for free, but because a large amount of repeated context is successfully reused.

The key to prompt caching is "not breaking the cache." Claude Code organizes system prompts, tool definitions, CLAUDE.md, project rules, and conversation history into cache layers; as long as subsequent requests share the same prefix, Claude can read directly from the cache instead of reprocessing the entire context. Anthropic also monitors prompt cache reuse rates internally, because they affect not only user quotas but also model serving costs and operational efficiency.

For ordinary users, there's no need to understand all the underlying details. A few key habits are enough: do not let conversations idle for more than an hour; do a session handoff when switching tasks; avoid frequent model switching; and put large documents into Projects instead of repeatedly pasting them into conversations.

This article is less a token-saving trick than an engineering-minded approach to using Claude Code: treat context as an asset, keep cache reuse going, and minimize repeated computation in long sessions.

Below is the original text:

I saved over 300 million tokens this week, including 91 million in a single day.

I didn't change any settings. This is just prompt caching working normally in the background.

But once I truly understood what caching is and how to avoid "breaking" the cache, my sessions could last longer under the same quota. So, here is a simplified 80/20 beginner's guide to Claude Code prompt caching, without deep API-level details.

TL;DR

Cached tokens cost only about 10% of regular input tokens. 91 million cached tokens are billed roughly like 9 million regular tokens.

The Claude Code subscription TTL is 1 hour; the API default is 5 minutes; Sub-agents are always 5 minutes.

Caching is divided into three layers: system layer, project layer, conversation layer.

Switching models mid-session will break the cache, including when enabling "opus plan" mode.

How is cache billing calculated?

Each cached token costs about 10% of a regular input token.

So, when my dashboard shows 91 million tokens hitting the cache in a day, the actual billing is roughly equivalent to processing 9 million tokens. This is why extending a long Claude Code conversation can feel almost "free" compared with running without caching.
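The arithmetic behind that figure can be written out as a minimal sketch. The 10% rate is the approximate figure quoted above, and the function name is illustrative only, not part of any Claude tooling.

```python
# Approximate cache-read price relative to regular input, as quoted in the article.
CACHE_READ_RATE = 0.10


def effective_input_tokens(cache_read_tokens: int, rate: float = CACHE_READ_RATE) -> float:
    """Express cache-read tokens as the number of regular input tokens that would cost the same."""
    return cache_read_tokens * rate


print(f"{effective_input_tokens(91_000_000):,.0f}")   # one day of cache reads
print(f"{effective_input_tokens(300_000_000):,.0f}")  # one week of cache reads
```

Under that assumed rate, a day of 91 million cache reads bills like roughly 9 million regular input tokens, and the week's 300 million like roughly 30 million.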

Two dashboard figures worth noting:

Cache create: the one-time cost of writing content into the cache. That content becomes reusable from the next turn onward.
Cache read: tokens Claude reuses from the cache, such as your CLAUDE.md, tool definitions, and previous messages. They cost about one-tenth as much as reprocessing the same content as regular input.

If your cache read number is high, you're using the cache effectively; if it's low, you're paying repeatedly for the same context.
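One way to turn those two dashboard figures into a single health number is a hit ratio. The sketch below is an assumption about how one might define it (cache reads divided by all input-side tokens); the article does not say how Anthropic computes its internal hit rate, and the sample numbers are made up.

```python
def cache_hit_ratio(cache_read: int, cache_create: int, uncached_input: int) -> float:
    """Share of all input-side tokens that were served from the cache."""
    total = cache_read + cache_create + uncached_input
    return cache_read / total if total else 0.0


# Hypothetical daily totals, for illustration only.
healthy = cache_hit_ratio(cache_read=9_000_000, cache_create=600_000, uncached_input=400_000)
wasteful = cache_hit_ratio(cache_read=1_000_000, cache_create=5_000_000, uncached_input=4_000_000)
print(f"healthy: {healthy:.0%}, wasteful: {wasteful:.0%}")
```

A day dominated by cache create rather than cache read suggests the cache is being rebuilt more often than it is being reused.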

Anthropic's Thariq once said: "We actually monitor prompt cache hit rates, and if the hit rate drops too low, it triggers an alert or even a SEV-level incident."

He also wrote a very good X article. When cache hit rates are high, four things happen simultaneously: Claude Code feels faster, Anthropic's service costs decrease, your subscription quota lasts longer, and long coding sessions become more feasible.

But if the hit rate is low, everyone suffers.

Thus, both sides have aligned incentives: Anthropic wants higher cache hit rates, and users want the same. The real obstacles are some seemingly minor habits that quietly reset the cache.

How does cache grow with each dialogue?

The cache relies on prefix matching.

Without diving too deep into technical details, you only need to understand one thing: as long as the content before a certain point exactly matches what is already cached, Claude can reuse the cached tokens for that part.

According to the Claude Code documentation, a brand-new session roughly unfolds like this:

First turn: no cache yet. The system prompt, your project context (such as CLAUDE.md, memory, and rules), and your first message are processed from scratch and written into the cache.
Second turn: everything from the first turn is now cached. Claude only needs to process the latest reply and your next message, which reduces cost significantly.
Third turn: same logic. Earlier turns remain cached; only the latest exchange needs processing.
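The turn-by-turn pattern above can be simulated with a small sketch. It is a simplified model, assuming every request re-sends the full context and that the whole previous request is available as a cached prefix; the token counts are invented for illustration.

```python
def simulate_session(base_tokens: int, turn_growth: list[int]) -> list[tuple[int, int]]:
    """For each turn, return (tokens read from cache, tokens processed fresh).

    base_tokens: system prompt plus project context, sent with every request.
    turn_growth: tokens each turn adds to the context (new message plus the reply before it).
    """
    cached = 0            # size of the prefix already written to the cache
    context = base_tokens
    rows = []
    for growth in turn_growth:
        context += growth
        rows.append((cached, context - cached))
        cached = context  # the whole request becomes the cached prefix for the next turn
    return rows


for turn, (read, fresh) in enumerate(simulate_session(20_000, [500, 1_500, 1_200]), start=1):
    print(f"turn {turn}: cache read {read:>6}, fresh {fresh:>6}")
```

In this model the first turn processes everything fresh, while later turns only pay full price for the small amount of new content.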

Cache itself can be divided into three layers:

From Thariq's X article:

System layer: includes basic instructions, tool definitions (read, write, bash, grep, glob), and output styles. This is the global cache layer.
Project layer: includes CLAUDE.md, memory, project rules. Cached per project.
Conversation layer: includes replies and messages, which grow with each dialogue.

If the system or project layer content changes mid-session, everything must be re-cached from scratch. This is the most "expensive" operation. Imagine you've reached message 16 and then suddenly change the system prompt or pause for an hour: all tokens from message 1 onward need to be reprocessed.
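Why a change in an early layer is so costly follows directly from prefix matching. The sketch below is a toy model, not Claude's actual implementation: it treats each layer or message as one opaque segment (the segment labels are hypothetical) and counts how many leading segments can still be reused.

```python
from typing import Sequence


def reusable_segments(cached: Sequence[str], request: Sequence[str]) -> int:
    """Count leading segments that match exactly; the first mismatch ends the reusable prefix."""
    count = 0
    for old, new in zip(cached, request):
        if old != new:
            break
        count += 1
    return count


cached = ["system-v1", "project-v1", "msg-1", "msg-2"]

# Appending a message keeps the whole cached prefix usable.
print(reusable_segments(cached, cached + ["msg-3"]))
# Changing the project layer keeps only the system layer.
print(reusable_segments(cached, ["system-v1", "project-v2", "msg-1", "msg-2"]))
# Changing the system layer invalidates everything after it.
print(reusable_segments(cached, ["system-v2", "project-v1", "msg-1", "msg-2"]))
```

The earlier a segment changes, the less of the cached prefix survives, which is why the conversation layer is cheap to extend and the system layer is expensive to touch.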

Confusion about 1 hour vs. 5 minutes TTL

This is the most common misunderstanding.

Claude Code subscription: default TTL is 1 hour.
Claude API: default TTL is 5 minutes. You can pay more to extend it to 1 hour.
Sub-agents on any plan: always 5 minutes.

Claude.ai web chat: not officially documented. Possibly the same as the subscription, but unconfirmed.

A few months ago, many complained that Claude subscription quota was consumed too quickly. Some thought Anthropic quietly reduced TTL from 1 hour to 5 minutes without notice. But that’s not true—the TTL for Claude Code remains 1 hour.

The confusion arose because the Claude Code and API documentation are separate and describe fundamentally different systems.

If you're running many Sub-agent workflows or directly using the API, the 5-minute TTL matters. But for 95% of Claude Code users, the critical window is the 1-hour TTL.
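The TTL values in this section can be captured in a small lookup. This is a sketch under two assumptions: the dictionary keys are illustrative labels, and the TTL window is measured from the most recent request.

```python
from datetime import datetime, timedelta

# TTL values as stated in the article; the keys are illustrative labels only.
CACHE_TTL = {
    "claude_code_subscription": timedelta(hours=1),
    "api_default": timedelta(minutes=5),
    "sub_agent": timedelta(minutes=5),
}


def cache_likely_warm(last_request: datetime, now: datetime, context: str) -> bool:
    """Assume the TTL window is measured from the most recent request."""
    return now - last_request < CACHE_TTL[context]


last = datetime(2025, 1, 1, 9, 0)
after_break = last + timedelta(minutes=20)
print(cache_likely_warm(last, after_break, "claude_code_subscription"))
print(cache_likely_warm(last, after_break, "sub_agent"))
```

A 20-minute coffee break is harmless for a Claude Code subscription session under these figures, but long enough for a Sub-agent's cache to have expired.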

Three habits that cover 95% of users' needs

Here are the practical tips I find most useful in daily use:

Don't idle for too long

If you’re idle for over an hour, previous content has likely expired from cache. Your next message will rebuild the cache. In this case, instead of continuing an "aged" session, it’s better to do a clear handoff and start fresh, which is usually cheaper.
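A rough comparison shows why a fresh start can be cheaper once the cache has expired. The sketch below is a simplification with assumed numbers: it treats the first message after expiry as processing the whole context at full price, charges later turns the approximate 10% read rate, and ignores context growth and any extra charge for cache writes.

```python
def resume_cost(context_tokens: int, followup_turns: int, read_rate: float = 0.10) -> float:
    """Cost in regular-input-token equivalents of resuming with a cold cache."""
    rebuild = context_tokens                             # first message: nothing is cached
    reads = followup_turns * context_tokens * read_rate  # later turns reuse the rebuilt cache
    return rebuild + reads


# Hypothetical sizes: a 180k-token aged session versus 20k of base context plus a 3k summary.
aged_session = resume_cost(context_tokens=180_000, followup_turns=10)
fresh_handoff = resume_cost(context_tokens=20_000 + 3_000, followup_turns=10)
print(f"continue aged session: {aged_session:,.0f}")
print(f"handoff and restart:   {fresh_handoff:,.0f}")
```

The smaller context wins twice in this model: the rebuild is cheaper, and every later cache read is cheaper too.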

When switching tasks, just start fresh

/compact or /clear will break the cache anyway, so it’s better to reset at that point.

I made a session handoff skill to replace /compact. It summarizes what we’ve done, pending decisions, key files, and where to continue. Then I run /clear and paste this summary, allowing seamless continuation as if nothing was interrupted.

The handoff usually takes less than a minute, whereas /compact can sometimes be slow.
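The author's handoff skill itself is not shown, so the following is only a hypothetical sketch of the four parts the summary is described as covering. The class name, field names, and sample file paths are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    """Hypothetical container for the four parts of a session handoff summary."""
    done: list[str]
    pending_decisions: list[str]
    key_files: list[str]
    next_step: str

    def render(self) -> str:
        """Format the handoff as plain text that can be pasted after /clear."""
        def section(title: str, items: list[str]) -> str:
            body = "\n".join(f"- {item}" for item in items) or "- (none)"
            return f"{title}:\n{body}"

        return "\n\n".join([
            section("What we've done", self.done),
            section("Pending decisions", self.pending_decisions),
            section("Key files", self.key_files),
            f"Where to continue:\n{self.next_step}",
        ])


summary = Handoff(
    done=["Added retry logic to the upload client"],
    pending_decisions=["Keep or drop the legacy endpoint"],
    key_files=["src/upload.py", "tests/test_upload.py"],  # made-up paths
    next_step="Write the failing test for timeout handling.",
)
print(summary.render())
```

Keeping the summary short matters: it becomes part of the new session's prefix, so every token in it is carried on each later turn.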

For large documents, put them into Projects rather than pasting into the chat

While Claude.ai’s cache mechanism isn’t officially detailed, Projects clearly use a different optimization than regular conversations. So, for large documents, it’s best to put them into a Project rather than directly into the chat.

What actions can quietly break the cache?

Several actions can reset the cache without obvious warning:

Switching models: the cache depends on prefix matching, and each model has its own cache. After a model switch, the next request has to reprocess the entire history with no cache hits.
"Opus plan" mode: this setting uses Opus for planning and Sonnet for execution. I have recommended it in some token optimization videos, and for good reason. But each switch between planning and execution is essentially a model switch, which means the cache is rebuilt. Over the long run it still helps extend your quota, but you should understand what is happening under the hood.
Editing CLAUDE.md mid-session: this one does not actually break the cache. The change doesn't take effect immediately; it applies after the next restart, so the current cache remains unaffected.
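The model-switch case can be illustrated with a toy cache that keeps a separate prefix per model. This is a sketch of the idea only, not how Anthropic's serving layer is built; the class, the model labels, and the segment names are hypothetical.

```python
class PerModelPrefixCache:
    """Toy cache: each model remembers only the last request it processed."""

    def __init__(self) -> None:
        self._cached: dict[str, list[str]] = {}

    def request(self, model: str, segments: list[str]) -> tuple[int, int]:
        """Return (segments reused from cache, segments processed fresh)."""
        previous = self._cached.get(model, [])
        hits = 0
        for old, new in zip(previous, segments):
            if old != new:
                break
            hits += 1
        self._cached[model] = list(segments)
        return hits, len(segments) - hits


cache = PerModelPrefixCache()
history = ["system", "project", "msg-1"]
print(cache.request("opus", history))                # cold start
print(cache.request("opus", history + ["msg-2"]))    # same model: prefix reused
print(cache.request("sonnet", history + ["msg-2", "msg-3"]))  # new model: nothing reused
```

The history sent to the second model is identical to what the first model had cached, yet none of it counts as a hit, because the lookup is per model.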

My free Token dashboard

The screenshot I showed earlier comes from a token dashboard.

It's a simple GitHub repo. You give the link to Claude Code, and it deploys the dashboard on localhost, reading all your past sessions instead of starting from scratch. You can see daily input, output, cache create, and cache read stats right away.

Note: this dashboard tracks tokens on your local device. If you switch from desktop to laptop, the numbers won’t match exactly. Each device has its own stats view.
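The dashboard's code is not shown in this article, so the following is only a sketch of the kind of per-day aggregation it describes. The record layout (a `date` string plus four token counters) is an assumption made for illustration, not the real session log format.

```python
from collections import defaultdict

# The four figures the dashboard is described as showing.
FIELDS = ("input", "output", "cache_create", "cache_read")


def daily_totals(records: list[dict]) -> dict[str, dict[str, int]]:
    """Sum token counters per day; missing counters are treated as zero."""
    totals: dict[str, dict[str, int]] = defaultdict(lambda: dict.fromkeys(FIELDS, 0))
    for rec in records:
        day = totals[rec["date"]]
        for name in FIELDS:
            day[name] += rec.get(name, 0)
    return dict(totals)


# Hypothetical per-request records from one device.
records = [
    {"date": "2025-01-01", "input": 1_200, "output": 800, "cache_create": 20_000, "cache_read": 0},
    {"date": "2025-01-01", "input": 300, "output": 650, "cache_read": 21_200},
    {"date": "2025-01-02", "input": 500, "output": 400, "cache_read": 22_000},
]
for day, stats in sorted(daily_totals(records).items()):
    print(day, stats)
```

Because the input is whatever is stored on the local machine, totals computed this way on a desktop and on a laptop would naturally differ.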

Summary

Prompt caching is a deep topic. Thariq’s article covers it more comprehensively if you want the full picture.

But you don't need to understand every detail to benefit. Just grasp the key 80/20: cached tokens cost about one-tenth as much as regular input tokens; Claude Code's TTL is 1 hour; switching models breaks the cache; and a clean handoff between tasks is usually more cost-effective than letting an old session "expire" and then continuing it.
