Saving 300 million tokens in a week: an Anthropic engineer's Claude Code caching guide

Original Title: How Anthropic Engineers Actually Save Tokens
Original Author: Nate Herk
Translation: Peggy, BlockBeats

Editor's Note: Many people feel that when using Claude Code, token consumption is too fast and long conversations easily eat up your quota. But from the perspective of Anthropic engineers, what truly impacts costs is often not how much code you write, but whether the system continuously reuses already processed context.

The core of this article is how to save tokens through caching. The author reused more than 300 million tokens via caching in one week, with a single-day cache volume of 91 million. Since cached tokens cost only 10% of regular input tokens, those 91 million cached tokens are billed as roughly 9 million regular tokens. Long Claude Code sessions feel more "durable" not because the model works for free, but because a large amount of repeated context is successfully reused.

The key to prompt caching is "not breaking the cache." Claude Code organizes system prompts, tool definitions, CLAUDE.md, project rules, and conversation history into cache layers; as long as subsequent requests share the same prefix, Claude can read from the cache instead of reprocessing the entire context. Anthropic also monitors prompt cache reuse rates internally, because they affect not only user quotas but also model serving costs and operational efficiency.

For ordinary users, there's no need to understand all the underlying details—just grasp a few key habits: do not let conversations idle for more than an hour; perform session handoff when switching tasks; avoid frequent model switching; and for large documents, try to put them into Projects instead of repeatedly pasting into conversations.

This article is less about a token-saving trick and more about providing a Claude Code usage approach closer to engineering thinking: treat context as an asset, enable continuous cache reuse, and minimize repeated calculations in long sessions.

Below is the original text:

This week I saved more than 300 million tokens, including 91 million in a single day.

I haven't changed any settings. This is just prompt caching working normally in the background.

But once I truly understood what caching is and how to avoid "breaking" the cache, I could sustain longer conversations within the same quota. So, here is a simple 80/20 beginner’s guide to Claude Code prompt caching, without deep API-level details.

TL;DR

Cached tokens cost only 10% of regular input tokens. 91 million cached tokens are billed as roughly 9 million tokens.

Claude Code subscription TTL is 1 hour; the API default is 5 minutes; sub-agents are always 5 minutes.

There are three cache layers: the system layer, the project layer, and the conversation layer.

Switching models mid-conversation breaks the cache, including when enabling "opus plan" mode.

How is cache billing calculated?

Each cached token costs 10% of a regular input token.

So, when my dashboard shows that 91 million tokens were cached on a certain day, the actual billing is roughly equivalent to processing 9 million tokens. This explains why, compared to no caching, long-term use of Claude Code makes conversations feel almost "free" to extend.
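The arithmetic above can be checked with a minimal sketch. The only input taken from the article is the 10% rate for cached tokens; the function name is hypothetical and is not part of any Anthropic tool.

```python
# Relative price of a cached token versus a regular input token (10%, per the article).
CACHE_READ_RATE = 0.10


def billed_equivalent(cached_tokens: int, rate: float = CACHE_READ_RATE) -> float:
    """Return the number of regular input tokens the cached tokens are billed as."""
    return cached_tokens * rate


daily = billed_equivalent(91_000_000)    # the single-day figure from the article
weekly = billed_equivalent(300_000_000)  # the one-week figure from the article

print(f"91M cached tokens  ~ {daily:,.0f} regular tokens")
print(f"300M cached tokens ~ {weekly:,.0f} regular tokens")
```

The exact result for the daily figure is 9.1 million, which the article rounds to "about 9 million."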

Two dashboard figures are worth noting:

Cache create: the one-time cost of writing content into cache. It takes effect in the next dialogue turn.
Cache read: tokens Claude reuses from cache, such as your CLAUDE.md, tool definitions, and previous messages. These cost one-tenth as much as reprocessing the same tokens as input.

If your cache read number is high, it indicates effective cache utilization; if low, you're paying repeatedly for the same context.
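One rough way to read those two figures together is to look at what share of all input-side tokens came from cache reads. The sketch below is an assumption about how to summarize the numbers, not the dashboard's own formula; the class, the field names, and the sample values (other than the 91 million cache-read figure) are illustrative.

```python
from dataclasses import dataclass


@dataclass
class DailyUsage:
    # Hypothetical field names; a real dashboard may label these differently.
    input_tokens: int
    cache_create_tokens: int
    cache_read_tokens: int


def cache_read_share(usage: DailyUsage) -> float:
    """Fraction of all input-side tokens that were served from cache."""
    total = usage.input_tokens + usage.cache_create_tokens + usage.cache_read_tokens
    return usage.cache_read_tokens / total if total else 0.0


# Illustrative numbers only: a day dominated by cache reads versus a day of repeated re-processing.
healthy = DailyUsage(input_tokens=200_000, cache_create_tokens=4_000_000, cache_read_tokens=91_000_000)
wasteful = DailyUsage(input_tokens=60_000_000, cache_create_tokens=30_000_000, cache_read_tokens=5_000_000)

print(f"healthy day:  {cache_read_share(healthy):.1%} from cache")
print(f"wasteful day: {cache_read_share(wasteful):.1%} from cache")
```

A share close to 100% suggests the prefix is being reused turn after turn; a low share suggests the same context is being paid for repeatedly.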

Thariq from Anthropic once said: "We actually monitor prompt cache hit rates, and if the hit rate drops too low, an alert is triggered, even an SEV-level incident is declared."

He also wrote a very good X article. When cache hit rates are high, four things happen simultaneously: Claude Code feels faster, Anthropic’s service costs decrease, your subscription quota lasts longer, and long coding sessions become more feasible.

But if the hit rate is low, everyone suffers.

Thus, both sides have aligned incentives: Anthropic wants higher cache hit rates, and users also want higher hit rates. The real obstacles are some seemingly minor habits that quietly reset the cache.

How does cache grow with each dialogue?

Caching relies on prefix matching: each request is compared against the cache starting from its very beginning.

No need to delve too deep into technical details—just understand this: as long as the content before a certain point matches exactly with cached content, Claude can reuse those cached tokens.

According to the Claude Code documentation, a new session typically unfolds like this:

First turn: no cache yet. System prompts, your project context (like CLAUDE.md, memory, rules), and your first message are processed anew and written into cache.

Second turn: all content from the first turn is now cached. Claude only needs to process your new reply and the next message. Costs are much lower here.

Third turn: same logic. The previous dialogue remains in cache; only the latest interaction needs to be processed.
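The turn-by-turn pattern above can be sketched as a toy simulation. It assumes nothing expires and nothing in the prefix changes; the token counts are made up for illustration, and the function is hypothetical rather than anything Claude Code exposes.

```python
from typing import List, Tuple


def simulate_session(base_context: int, turn_sizes: List[int]) -> List[Tuple[int, int]]:
    """Return (tokens read from cache, tokens newly processed) for each turn.

    base_context: system prompt plus project context, processed once on the first turn.
    turn_sizes:   new tokens added by each turn (your message plus the prior reply).
    """
    results = []
    cached = 0  # tokens already written to cache by earlier turns
    for index, size in enumerate(turn_sizes):
        new_tokens = size + (base_context if index == 0 else 0)
        results.append((cached, new_tokens))
        cached += new_tokens  # everything processed this turn is cached for the next one
    return results


for turn, (read, new) in enumerate(simulate_session(20_000, [500, 800, 600]), start=1):
    print(f"turn {turn}: cache read = {read:>6}, newly processed = {new:>6}")
```

In this sketch the first turn pays for the full 20,500 tokens, while later turns only pay full price for a few hundred new tokens and read the rest from cache.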

Cache itself is divided into three layers:

From Thariq’s X article:

System layer: includes basic instructions, tool definitions (read, write, bash, grep, glob), and output styles. This layer is globally cached.

Project layer: includes CLAUDE.md, memory, project rules. This layer is cached per project.

Conversation layer: includes replies and messages, which grow with each turn.

If during a session, any change occurs in the system layer or project layer, all content must be re-cached from scratch. This is the most "expensive" operation. Imagine: after 16 messages, if you suddenly change the system prompt or pause for an hour, all tokens from message 1 onward need to be reprocessed.
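Why a change near the top is so expensive follows directly from prefix matching. The sketch below treats a request as an ordered list of segments and counts how many leading segments still match the cache. The segment labels are placeholders, and real caching works on tokens rather than named segments.

```python
from typing import List


def reusable_prefix(cached: List[str], request: List[str]) -> int:
    """Count the leading segments of `request` that match `cached` exactly."""
    matched = 0
    for old, new in zip(cached, request):
        if old != new:
            break  # the first mismatch ends reuse; nothing after it can hit the cache
        matched += 1
    return matched


cached = ["system-v1", "project-v1", "msg-1", "reply-1"]

same_prefix = ["system-v1", "project-v1", "msg-1", "reply-1", "msg-2"]
changed_project = ["system-v1", "project-v2", "msg-1", "reply-1", "msg-2"]
changed_system = ["system-v2", "project-v1", "msg-1", "reply-1", "msg-2"]

print(reusable_prefix(cached, same_prefix))      # all four cached segments are reused
print(reusable_prefix(cached, changed_project))  # only the system layer is reused
print(reusable_prefix(cached, changed_system))   # nothing is reused
```

The conversation history is identical in all three requests, yet it only hits the cache when every layer in front of it is unchanged.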

Confusion about 1 hour vs. 5 minutes

This is the most common misunderstanding.

Claude Code subscription: default TTL is 1 hour.

Claude API: default TTL is 5 minutes. You can pay more to extend it to 1 hour.

Any plan’s Sub-agent: always 5 minutes.

Claude.ai web chat: no official record. Possibly same as subscription, but unconfirmed.
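The TTL figures listed above can be put into a small lookup, as a sketch. The durations come from the article; the dictionary keys and the function are hypothetical names, and the web chat entry is left as unknown because the article says it is undocumented.

```python
from typing import Dict, Optional

# TTLs in seconds, as stated in the article. None means "not officially documented".
TTL_SECONDS: Dict[str, Optional[int]] = {
    "claude_code_subscription": 60 * 60,
    "api_default": 5 * 60,
    "sub_agent": 5 * 60,
    "claude_ai_web": None,
}


def cache_still_warm(context: str, idle_seconds: float) -> Optional[bool]:
    """Return True or False if the TTL is known, or None if it is undocumented."""
    ttl = TTL_SECONDS[context]
    if ttl is None:
        return None
    return idle_seconds < ttl


twenty_minutes = 20 * 60
print(cache_still_warm("claude_code_subscription", twenty_minutes))  # True
print(cache_still_warm("api_default", twenty_minutes))               # False
print(cache_still_warm("claude_ai_web", twenty_minutes))             # None
```

The same 20-minute coffee break is harmless in a subscription session but would have expired the cache under the API default or in a sub-agent.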

A few months ago, many complained that Claude subscription quotas were consumed too quickly. Some thought Anthropic quietly reduced TTL from 1 hour to 5 minutes without notification. But that’s not true—the TTL for Claude Code remains 1 hour.

The confusion arises because Claude Code and API documentation are separate and fundamentally different, leading to misunderstandings.

If you run a lot of Sub-agent workflows or use the API directly, the 5-minute figure is important. But for 95% of Claude Code users, the real focus should be on that 1-hour window.

Three habits that cover 95% of users

Here are the practices I find most useful in daily use:

Don’t leave sessions idle too long

If you’re idle for over an hour, most previous content has expired from cache. Your next message will rebuild the cache. In this case, instead of continuing an "aged" session, it’s better to do a clear handoff and start fresh, which is usually cheaper.
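A back-of-the-envelope comparison shows why the fresh start tends to win. This is a simplified sketch: it assumes the first pass over an uncached context is billed like regular input, it ignores context growth during the follow-up turns, and every token count is invented for illustration. Only the 10% cached rate comes from the article.

```python
CACHE_READ_RATE = 0.10  # cached tokens cost 10% of regular input (from the article)
UNCACHED_RATE = 1.00    # assumption: rebuilding the cache is billed like regular input


def session_cost(context_tokens: int, follow_up_turns: int) -> float:
    """Relative cost: one uncached pass over the context, then one cached read per turn."""
    first_pass = context_tokens * UNCACHED_RATE
    cached_reads = follow_up_turns * context_tokens * CACHE_READ_RATE
    return first_pass + cached_reads


stale_history = 150_000         # hypothetical: a long session whose cache has expired
fresh_context = 20_000 + 2_000  # hypothetical: base context plus a short handoff summary

resume = session_cost(stale_history, follow_up_turns=10)
handoff = session_cost(fresh_context, follow_up_turns=10)

print(f"resume stale session: {resume:,.0f} token-equivalents")
print(f"handoff and restart:  {handoff:,.0f} token-equivalents")
```

Under these assumptions the restart is cheaper twice over: the one-time rebuild is smaller, and every later turn reads a smaller prefix from cache.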

When switching tasks, restart directly

/compact or /clear will break the cache anyway, so it’s better to reset at that point.

I’ve developed a session handoff skill to replace /compact. It summarizes what’s been done, pending decisions, key files, and next steps. Then I execute /clear and paste this summary in, allowing continuation as if nothing was interrupted.

The handoff usually takes less than a minute, whereas /compact can sometimes be slow.

For large documents, put them into Projects whenever possible

Claude.ai’s cache mechanism isn’t fully documented, but Projects clearly use different optimization strategies than regular conversations. So, if you need to paste large documents, it’s better to put them into a Project rather than directly into the chat.

What actions can quietly break the cache?

Several things can reset the cache without obvious warning.

Switching models: cache depends on prefix matching, and each model has its own cache. Changing models causes the next request to re-read the entire history without cache hits.
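The per-model behavior described above can be sketched with a toy cache that keeps a separate prefix for each model. The class is hypothetical and the segment labels are placeholders; the model names are only used as dictionary keys.

```python
from typing import Dict, List


class PerModelPrefixCache:
    """Toy cache: each model remembers only the last prefix it processed."""

    def __init__(self) -> None:
        self._cached: Dict[str, List[str]] = {}

    def send(self, model: str, segments: List[str]) -> int:
        """Process a request and return how many leading segments were cache hits."""
        previous = self._cached.get(model, [])
        hits = 0
        for old, new in zip(previous, segments):
            if old != new:
                break
            hits += 1
        self._cached[model] = list(segments)  # this request becomes the model's new prefix
        return hits


cache = PerModelPrefixCache()
turn_1 = ["system", "project", "msg-1"]
turn_2 = turn_1 + ["reply-1", "msg-2"]
turn_3 = turn_2 + ["reply-2", "msg-3"]
turn_4 = turn_3 + ["reply-3", "msg-4"]

first = cache.send("opus", turn_1)           # nothing cached yet
second = cache.send("opus", turn_2)          # the whole first turn is reused
switched = cache.send("sonnet", turn_3)      # new model: the history is re-read without hits
after_switch = cache.send("sonnet", turn_4)  # the new model's cache now applies

print(first, second, switched, after_switch)
```

The switch costs one full re-read of the history; after that, the new model builds and reuses its own cache as long as you stay on it.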

"Opus plan" mode: this setting uses Opus during planning and Sonnet during execution. I recommended it in some token optimization videos for a reason. But understand that switching plans essentially means switching models, which resets the cache. Long-term, it still helps extend session quotas, but you should know what’s happening under the hood.

Editing CLAUDE.md mid-session is okay: changes won’t take effect immediately and will only apply after restart. So, current cache remains unaffected.

My free Token Dashboard

The screenshot I showed earlier comes from a token dashboard.

https://github.com/nateherkai/token-dashboard

It’s a simple GitHub repo. You give Claude Code the link, and it sets the dashboard up to run on localhost. It then reads all your past sessions instead of starting from scratch. You can see daily input, output, cache create, and cache read stats right away.

Note: this dashboard tracks tokens on your local device. If you switch from desktop to laptop, the numbers won’t match exactly. Each device has its own stats view.

Summary

Prompt caching is a deep topic. Thariq’s article covers it more comprehensively. If you want the full picture, it’s worth reading.

But you don’t need to understand every detail to benefit. Just grasp the key 80/20: cached tokens cost one-tenth as much as regular tokens; Claude Code’s TTL is 1 hour; changing models breaks the cache; and clearing with a handoff between tasks is usually more cost-effective than letting an old session "expire" and then resuming it.
