Claude Token Optimization: What Actually Moves the Number
An evidence-backed guide to reducing Claude token spend, ranked by impact. Every claim is cited to primary Anthropic documentation, official engineering writing, or cross-vendor confirmation.
© 2026 Rick Watson / RMW Commerce Consulting. This compilation, its ranking, and the original commentary are copyrighted. The underlying techniques originate from Anthropic’s official Claude API and Claude Code documentation, supplemented by Anthropic engineering writing and named independent practitioners — see Sources & Attribution. Quoting brief excerpts with attribution is fine. Republishing the compilation in whole or in substantial part requires written permission: rick@rmwcommerce.com.
TL;DR — what’s in it for you
If you use Claude via the API or Claude Code with any regularity, the patterns below will measurably reduce what you spend:
- Prompt caching cuts input costs by up to 90% on repeated workloads — and it’s a one-time API change
- Routing mechanical tasks to a cheaper model cuts the per-token price several-fold (3× to 15× at the list prices in Lever 2), with little or no quality penalty on that kind of work
- Context discipline (clearing between tasks, keeping CLAUDE.md under 200 lines) eliminates the slow invisible tax on every session
- Surgical tool call design cut one measured workload from 191,300 tokens to 122,800 — a 36% reduction in a single refactor
Where to spend your time, in priority order
Most readers should implement the first two and stop. They account for the largest share of preventable spend.
| # | Practice | Why it matters | Effort |
|---|---|---|---|
| 1 | Prompt caching — structure stable content first, mark with cache_control | Cache hits bill at 0.1× input. Anthropic quotes up to 90% cost reduction on repeated workloads | One-time refactor, 1–2 hrs |
| 2 | Model routing — Haiku for mechanical work, Sonnet as default, Opus only when needed | Haiku costs $1/MTok input vs. Opus at $15/MTok — a 15× spread. Most tasks don’t need the expensive model | Per-task discipline |
| 3 | Context discipline — /clear between tasks, /compact near 50%, CLAUDE.md under 200 lines | Context overhead applies to every turn. A 200-line CLAUDE.md that bloats to 600 is a cost increase on every session, invisible until you measure it | 20 min to trim, ongoing |
| 4 | Tool call design — consolidate reads, use parallel calls, strip unused tools from the system prompt | Anthropic’s own engineering team measured a 37% token reduction (43,588 → 27,297 avg per call) with programmatic tool calling | Per-tool refactor |
| 5 | Output length control — prompt for concise responses; skip extended thinking on routine work | Thinking tokens are billed even when not shown. Routine tasks get no quality gain from extended reasoning | Prompt-level change |
| 6 | Batch API for async workloads — non-interactive processing at 50% lower cost | Purpose-built for offline jobs. No SLA on timing, so only use when latency is irrelevant | Architectural decision |
How to use this
You don’t need to read all of it. The operational form is a one-line install that runs the audit against your actual Claude session data — no questions, no manual review:
curl --proto '=https' --tlsv1.2 -fsSL https://tokenmin.ai/install.sh | bash
tokenmin --days 30 --out report.md
Tokenmin reads your ~/.claude/ session logs, hashes anything identifying before any analysis runs, and produces a ranked report against the levers below — caching, model routing, context discipline, and the rest. About 60 seconds end-to-end.
Why it’s safe to install. The scanner that decides what (if anything) leaves your machine is Apache-2.0 at github.com/watsonrm/tokenmin-scanner — readable end-to-end before you trust it. Identifiers are HMAC-hashed with a per-install salt; nothing leaves the machine unless you pass --submit-url. See SECURITY.md for the threat model.
If you’d rather do this by hand: the claude-token-optimization skill runs the same audit via Q&A in a Claude Code session. Drop it into ~/.claude/skills/ and say “audit my token spend.” The article below is the reasoning behind each lever — read it for the why; either Tokenmin or the skill is the how.
The honest framing
Token cost is a function of context size × price per token × frequency. Anthropic’s own engineering team states directly:
“Agentic systems often trade latency and cost for better task performance.”
— Anthropic, “Building effective agents” (Dec 19, 2024)
Cost is not always a problem to solve. On high-value work, paying more for a slower, more capable model is correct. The goal is to stop paying for quality on tasks that don’t need it — and to stop reprocessing the same stable content on every call.
Tokenmin’s scanner reads your Claude session logs and surfaces where you’re paying the most. The rankings here align with what its detectors consistently find first in real workloads.
Lever 1: Prompt caching — the largest single reduction
What you’re paying without it
Every API call reprocesses every token you send. A 5,000-token system prompt in a session with 50 turns means 250,000 tokens of overhead before your actual questions touch the model. On any workload with repeated stable content — a reference document, a CLAUDE.md, a codebase read at session start — that overhead dominates the bill.
What caching does
You mark stable sections with cache_control. The API stores them server-side. Subsequent calls that hit the same prefix read from cache instead of reprocessing. Cache reads bill at 0.1× input; cache writes at 1.25× (5-min TTL) or 2× (1-hour TTL).
Anthropic’s published numbers: up to 90% cost savings on cached content. Their own blog post on building Claude Code calls prompt caching “everything”:
“A few percentage points of cache miss rate can dramatically affect cost and latency.”
— Anthropic, “Lessons from building Claude Code: prompt caching is everything”
Pricing table (Claude Sonnet 4.6)
| Token type | Cost per million | Multiplier vs. standard |
|---|---|---|
| Standard input | $3.00 | 1× |
| Cache write (5-min TTL) | $3.75 | 1.25× |
| Cache write (1-hour TTL) | $6.00 | 2× |
| Cache hit | $0.30 | 0.1× |
(Source: Anthropic prompt caching docs)
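As a rough worked example, the multipliers in the table can be applied to the 5,000-token, 50-turn session described earlier. This is only a sketch: the function name is made up for illustration, and it assumes every turn after the first lands inside the cache TTL and ignores the uncached part of each request.

```python
# Prices from the Sonnet 4.6 table above, in dollars per million tokens.
INPUT_PER_MTOK = 3.00
CACHE_WRITE_5M_PER_MTOK = 3.75
CACHE_HIT_PER_MTOK = 0.30


def prefix_cost(prefix_tokens: int, turns: int, cached: bool) -> float:
    """Dollar cost of resending a stable prefix on every turn of a session.

    Assumes (hypothetically) that every turn after the first lands inside
    the cache TTL, so there is exactly one cache write and turns - 1 hits.
    """
    if turns <= 0:
        return 0.0
    if not cached:
        return prefix_tokens * turns * INPUT_PER_MTOK / 1_000_000
    write = prefix_tokens * CACHE_WRITE_5M_PER_MTOK / 1_000_000
    hits = prefix_tokens * (turns - 1) * CACHE_HIT_PER_MTOK / 1_000_000
    return write + hits


uncached = prefix_cost(5_000, 50, cached=False)
with_cache = prefix_cost(5_000, 50, cached=True)
savings = 1 - with_cache / uncached
print(f"uncached ${uncached:.4f}, cached ${with_cache:.4f}, saved {savings:.1%}")
```

Under these assumptions the cached session comes to about an eighth of the uncached cost, slightly short of the 90% headline because the first call pays the write premium.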
The caveat
Cache is worth it on repeated workloads only. Simon Willison flagged this at launch:
“Apps prompting less frequently than once per five minutes will see negative ROI since cache writing costs exceed retrieval savings.”
— Simon Willison, “Prompt caching with Claude” (Aug 14, 2024)
Check your traffic pattern before enabling caching app-wide. Sporadic, one-off queries don’t benefit.
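The break-even point behind that caveat can be worked out from the multipliers alone. The sketch below is illustrative arithmetic, not an Anthropic formula; `relative_cost` and its parameter names are hypothetical, and it assumes each cache write is followed by a fixed number of hits before the entry expires.

```python
def relative_cost(hits_per_write: int, write_multiplier: float = 1.25,
                  hit_multiplier: float = 0.1) -> float:
    """Average cost of a cached prefix relative to sending it uncached.

    One cache write serves `hits_per_write` later requests before the entry
    expires. A result below 1.0 means caching is cheaper; above 1.0 means
    the write premium was never recouped.
    """
    requests = 1 + hits_per_write
    cached = write_multiplier + hits_per_write * hit_multiplier
    return cached / requests


for hits in (0, 1, 5):
    print(hits, round(relative_cost(hits), 3))
```

With the 5-minute write price, a single hit inside the TTL already brings the average below the uncached price (0.675×), while zero hits leaves you paying 1.25× every time. At the 1-hour write price it takes two hits to come out ahead.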
Model minimums
Cache kicks in at 1,024 tokens for Sonnet 4.6 and Sonnet 4.5. Opus 4.7, Opus 4.5, and Haiku 4.5 require 4,096 tokens before caching applies. Shorter prompts process without caching — no error returned, just no savings.
What belongs in the cache
Cache these: system prompts, tool definitions, reference documents, long few-shot blocks, code files read repeatedly at session start.
Don’t cache: the user’s current message, dynamic state, turn-specific tool results. Caching dynamic content provides no benefit and wastes the write cost.
API structure
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "You are a data analyst with these rules...",
      "cache_control": {"type": "ephemeral"}
    },
    {
      "type": "text",
      "text": "[5,000-token reference document here]",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Which SKUs had the highest return rate?"}
  ]
}
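One way to keep the stable-first ordering honest is to build the request body in one place. The sketch below only assembles a dictionary and sends nothing; `build_request` is a hypothetical helper, and the `max_tokens` value is an arbitrary placeholder. If you use Anthropic's Python SDK, a body like this could be passed as keyword arguments to `client.messages.create`.

```python
import json


def build_request(rules: str, reference_doc: str, question: str) -> dict:
    """Assemble a Messages API body with stable content first.

    The two system blocks are identical across calls, so they form the
    cacheable prefix; only the final user message changes per call.
    """
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,  # placeholder value, chosen for illustration
        "system": [
            {"type": "text", "text": rules,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": reference_doc,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [{"role": "user", "content": question}],
    }


first = build_request("You are a data analyst...", "<reference document>",
                      "Which SKUs had the highest return rate?")
second = build_request("You are a data analyst...", "<reference document>",
                       "Which SKUs had the lowest return rate?")

# The serialized system prefix is byte-identical between the two calls.
same_prefix = json.dumps(first["system"]) == json.dumps(second["system"])
print(same_prefix)
```

Keeping the per-call question out of the system blocks is the design choice that matters here: anything interpolated into those blocks would change the prefix and defeat the cache.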
In Claude Code
Claude Code manages caching automatically. Your CLAUDE.md, tool definitions, and stable project context sit in the cached prefix. What invalidates the cache mid-session (Source: Claude Code prompt caching docs):
- Switching models — each model has its own cache namespace
- Connecting or disconnecting an MCP server — tool definitions sit in the system prompt layer
- Running /compact — replaces history with a summary, rebuilding the prefix
- Upgrading Claude Code — new version typically changes the system prompt
Pick model and connect MCP servers at session start. Save /compact for natural breaks between tasks.
Lever 2: Model routing — the simplest cost lever
The spread
| Model | Comparative latency | Input (per MTok) | Output (per MTok) |
|---|---|---|---|
| Claude Haiku 4.5 | Fastest | $1 | $5 |
| Claude Sonnet 4.6 | Fast | $3 | $15 |
| Claude Opus 4.7 | Moderate | $15 | $75 |
(Source: Anthropic models overview)
Opus costs 15× more per input token than Haiku — and 15× more per output token too ($75/MTok vs. $5/MTok). On agentic workloads where output dominates the bill, the output spread is the bigger line item. For mechanical work — file reads, reformatting, simple lookups, yes/no checks — Haiku delivers the same result at a fraction of the cost on both sides.
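To see what the spread means for a single job, the table's prices can be plugged into a small calculator. This is a sketch: the dictionary keys are informal labels rather than API model ids, and the 200k-in / 20k-out job is a made-up example.

```python
# Dollars per million tokens (input, output), copied from the table above.
PRICES = {
    "haiku-4.5": (1.0, 5.0),
    "sonnet-4.6": (3.0, 15.0),
    "opus-4.7": (15.0, 75.0),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one job at the listed per-MTok prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical mechanical job: 200k tokens read, 20k tokens written.
costs = {model: job_cost(model, 200_000, 20_000) for model in PRICES}
for model, dollars in costs.items():
    print(f"{model}: ${dollars:.2f}")
```

At these list prices the same job comes to $0.30 on Haiku, $0.90 on Sonnet, and $4.50 on Opus.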
The discipline
- Haiku: file reads, reformatting, search-and-replace refactors, yes/no checks, single-fact lookups.
- Sonnet: the daily driver. Multi-file code generation, code review, business logic, most API work.
- Opus: hard reasoning, complex architecture decisions, agentic coding where Sonnet fell short. Not the default.
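The three tiers above can be written down as a routing rule so the choice is made once instead of per request. The task labels and the `pick_model` helper below are hypothetical; a real router would classify tasks from whatever signals your application has.

```python
# Hypothetical task labels; the tiers mirror the three bullets above.
MECHANICAL = {"file_read", "reformat", "search_replace", "yes_no", "lookup"}
HARD = {"architecture", "hard_reasoning"}


def pick_model(task_kind: str, sonnet_fell_short: bool = False) -> str:
    """Return the cheapest tier the list above considers adequate."""
    if task_kind in MECHANICAL:
        return "haiku"
    if task_kind in HARD or sonnet_fell_short:
        return "opus"
    return "sonnet"  # the daily driver for everything else


print(pick_model("reformat"), pick_model("code_review"), pick_model("architecture"))
```

The escalation flag reflects the "where Sonnet fell short" wording: Opus is reached by evidence that the default was not enough, not chosen up front.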
In Claude Code, switch with /model haiku or /model sonnet. Each switch invalidates the cache for that session (see Lever 1) — worthwhile if the task genuinely fits a cheaper model, not worth it for one-turn cost savings on a long session.
The effort parameter
For Claude Opus 4.7 (which uses adaptive thinking by default), the Messages API’s effort parameter inside output_config accepts low, medium, high, xhigh, or max. Fewer thinking tokens at low or medium means lower cost on problems that don’t require maximal reasoning depth. (Source: Messages API reference)
Manual extended thinking via budget_tokens is no longer supported on Opus 4.7 — it returns a 400 error. budget_tokens is also deprecated on Sonnet 4.6 and Opus 4.6. Use effort instead.
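A small guard around the effort setting can catch typos before a request goes out. This sketch follows the field placement described above (effort inside output_config) and should be checked against the Messages API reference before use; `with_effort` is a hypothetical helper and the model id is an illustrative assumption.

```python
ALLOWED_EFFORT = ("low", "medium", "high", "xhigh", "max")


def with_effort(body: dict, effort: str) -> dict:
    """Return a copy of a request body with output_config.effort set.

    Field placement follows the description above; verify it against the
    Messages API reference before relying on it.
    """
    if effort not in ALLOWED_EFFORT:
        raise ValueError(f"effort must be one of {ALLOWED_EFFORT}")
    updated = dict(body)
    updated["output_config"] = {**body.get("output_config", {}), "effort": effort}
    return updated


base = {
    "model": "claude-opus-4-7",  # illustrative model id, an assumption
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Rename this variable."}],
}
routine = with_effort(base, "low")
print(routine["output_config"])
```

Returning a copy keeps the base body reusable, so routine and hard requests can share everything except the effort level.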
Lever 3: Context discipline — the invisible per-session tax
Every token in the context window costs money on every turn. A bloated CLAUDE.md, a conversation history that hasn’t been cleared between unrelated tasks, or an over-broad read at session start — all of these multiply across every subsequent call.
CLAUDE.md size
Anthropic’s Claude Code memory docs state directly: “target under 200 lines per CLAUDE.md file. Longer files consume more context and reduce adherence.” (Source: Claude Code memory docs)
A 600-line CLAUDE.md loaded at every session is 3× the overhead of a 200-line one, on every turn, for every project. Trim to under 200 lines. Move scoped rules to .claude/rules/<scope>.md; Claude Code loads those only when matching files are in play.
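Checking the line count is easy to script. The sketch below is a minimal checker, not part of Claude Code or Tokenmin; the function names and the default path are assumptions.

```python
from pathlib import Path

LINE_TARGET = 200  # the per-file target quoted above


def claude_md_report(text: str, target: int = LINE_TARGET) -> dict:
    """Count lines in a CLAUDE.md body and report how far over target it is."""
    lines = len(text.splitlines())
    return {
        "lines": lines,
        "over_by": max(0, lines - target),
        "ratio_to_target": round(lines / target, 2),
    }


def check_file(path: str = "CLAUDE.md") -> dict:
    """Read a CLAUDE.md from disk (the path is an assumption) and report on it."""
    return claude_md_report(Path(path).read_text(encoding="utf-8"))


# A synthetic 600-line file stands in for a bloated CLAUDE.md.
sample = "\n".join(f"rule {i}" for i in range(600))
print(claude_md_report(sample))
```

Run from a pre-commit hook or CI step, a check like this would surface growth before it turns into a per-turn cost.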
Conversation history
In Claude Code, run /clear between unrelated tasks. The full conversation history doesn’t help a new task — it adds tokens that must be processed on every subsequent call without contributing to the answer.
Use /compact when the session has been productive and you want to continue: it summarizes and replaces history rather than nuking it. Use it near the 50% context mark, not as an emergency measure when the window is full.
What Tokenmin’s scanner sees here
The most common pattern in session logs: CLAUDE.md loaded at 400+ lines, context never cleared between tasks, and the same reference files re-read 3–4 times in a session because their contents had dropped out of the working context and Claude needed to re-establish state. The compound effect of all three is typically 2–4× the per-session token cost of a clean workflow.
Lever 4: Tool call design — where agentic workloads lose money
Tool calls have two cost vectors: the tokens in the tool definition (which sit in the system prompt and get cached or not), and the tokens in the tool results (which are always new). Both can be reduced with deliberate design.
Consolidate reads
Anthropic’s own engineering team measured the effect of programmatic tool consolidation on their advanced tool use benchmark:
Tool search reduced context from 191,300 tokens to 122,800 on the same task — a 36% reduction.
Programmatic tool calling cut the average call from 43,588 to 27,297 tokens — a 37% reduction.
— Anthropic, “Advanced tool use: Building an LLM-driven search tool”
The principle: don’t give Claude 20 specialized tools when 5 composable ones cover the same ground. Each tool definition adds tokens to the system prompt. Load only what the task needs.
If you have many MCP servers connected and can’t easily trim, set ENABLE_TOOL_SEARCH=auto in your environment. Claude Code’s runtime then loads tool definitions on demand rather than dumping all of them into the system prompt. This is the trigger that produced Anthropic’s measured 191,300 → 122,800 token recovery above (Anthropic, “Advanced tool use”). One env var.
Parallel tool calls
Claude 4 models support multiple tool_use blocks in a single assistant response. Running them in parallel rather than sequentially eliminates round-trip overhead on independent reads. From the docs:
“Tool calls in a single assistant turn are unordered. You can run them concurrently (Promise.all, asyncio.gather), sequentially, or in any order.”
— Anthropic, “Parallel tool use”
Return all tool results in a single user message. Sending a separate message for each result breaks the parallel behavior, and Claude reverts to sequential calls.
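The single-message rule is easier to see in code. In the sketch below, `run_tool` is a stand-in for real tool execution and the tool ids are invented; the part that carries over is running the calls with asyncio.gather and packing every tool_result block into one user message.

```python
import asyncio


async def run_tool(name: str, tool_input: dict) -> str:
    """Stand-in for a real tool; a hypothetical file reader for illustration."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"{name} ok: {tool_input['path']}"


async def answer_tool_calls(tool_use_blocks: list[dict]) -> dict:
    """Run independent tool calls concurrently, then reply in ONE user message."""
    outputs = await asyncio.gather(
        *(run_tool(block["name"], block["input"]) for block in tool_use_blocks)
    )
    return {
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": block["id"], "content": out}
            for block, out in zip(tool_use_blocks, outputs)
        ],
    }


# Two tool_use blocks as they might appear in one assistant response.
blocks = [
    {"type": "tool_use", "id": "toolu_01", "name": "read_file",
     "input": {"path": "a.py"}},
    {"type": "tool_use", "id": "toolu_02", "name": "read_file",
     "input": {"path": "b.py"}},
]
reply = asyncio.run(answer_tool_calls(blocks))
print(len(reply["content"]))
```

asyncio.gather returns results in the order the calls were passed in, so each result can be paired with its tool_use_id by position.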
The read-once hook pattern
Re-reading the same file multiple times in a session is a documented anti-pattern. One practitioner measured the effect with a simple hook that flags when Claude reads a file it already has in context: the hook eliminated a class of redundant reads that were adding hundreds of tokens per turn on file-heavy workflows. (Source: dev.to/boucle2026)
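The core of a read-once check is small. The sketch below is not the linked hook's implementation; `ReadOnceTracker` is a hypothetical in-memory class that shows the idea, and wiring it into a Claude Code hook is left out.

```python
import hashlib


class ReadOnceTracker:
    """Remember which file versions have already been read this session."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}  # path -> digest of last-read content

    def should_read(self, path: str, content: bytes) -> bool:
        """Return False when this exact content for `path` was already read."""
        digest = hashlib.sha256(content).hexdigest()
        if self._seen.get(path) == digest:
            return False  # unchanged since the last read: redundant
        self._seen[path] = digest
        return True


tracker = ReadOnceTracker()
first = tracker.should_read("src/app.py", b"print('v1')")
repeat = tracker.should_read("src/app.py", b"print('v1')")
after_edit = tracker.should_read("src/app.py", b"print('v2')")
print(first, repeat, after_edit)
```

Keying on a content digest rather than the path alone means an edited file is read again, which avoids serving stale state after a write.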
Subagents: cold cache, isolated context
Subagents start their own conversation with no cache hits on the first call. Don’t spawn a subagent to read two files — that cold-starts a fresh context for a task that doesn’t warrant it. Reserve subagents for (a) genuinely isolated work that would bloat the main session, or (b) parallel work you can’t get from a single agent’s tool calls.
When subagents are warranted, Anthropic’s engineering team describes their compression benefit:
“Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens.”
— Anthropic, “How we built our multi-agent research system” (Jun 13, 2025)
What comes back from a well-designed subagent is a condensed, distilled summary, often 1,000–2,000 tokens, so the full cost of the exploration stays out of your main context.
Lever 5: Output length control
Extended thinking
Extended thinking is off by default. On Opus 4.7, the effort parameter controls thinking depth. On Sonnet 4.6 and earlier Opus models, the now-deprecated budget_tokens parameter set a cap.
Key point: thinking tokens are billed even when not displayed. Setting display: "omitted" reduces perceived latency (server skips streaming thinking tokens) but does not reduce cost. You still pay for the thinking generated.
Don’t enable extended thinking by default. Enable it per-call on tasks where reasoning depth actually improves the output: multi-step math, formal logic, complex architecture decisions. Skip it for: summarization, data extraction, reformatting, file reads, single-fact lookups, boilerplate generation.
Response length
Claude calibrates response length to perceived task complexity. On tasks where you want a short answer, say so explicitly: “Result only. No explanation.” On tasks where you want structured output, specify the schema. Vague prompts generate long responses; precise prompts generate precise responses.
The CLAUDE.md and Skills refactor
A common source of excess tokens: a large CLAUDE.md that contains procedural instructions meant to be referenced rarely. These load on every session regardless of whether the task needs them.
If a block of CLAUDE.md instructions is only relevant for certain task types — say, a 30-line section on how to handle client billing queries — move it to a skill file under .claude/skills/. The skill loads only when triggered. Same instructions, far fewer idle-session tokens.
Lever 6: Batch API for offline workloads
The Message Batches API processes asynchronous requests at 50% lower cost than the standard API. It’s purpose-built for: content moderation at scale, bulk evaluations, dataset analysis, and any workload where you’re processing a large queue and don’t need immediate results. (Source: Anthropic batch processing docs)
This is not a latency optimization. Batches have no SLA on turnaround. Use it only when cost efficiency matters more than response time — typically back-end jobs that run overnight or on a schedule.
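A batch is a list of ordinary Messages requests, each tagged with an id you choose. The sketch below only builds the body and applies the quoted discount; `build_batch` and `batch_price` are hypothetical helpers, and the default model id is an illustrative assumption.

```python
def build_batch(prompts: dict[str, str], model: str = "claude-haiku-4-5",
                max_tokens: int = 256) -> dict:
    """Turn {custom_id: prompt} into a Message Batches request body.

    Each entry pairs a caller-chosen custom_id with ordinary Messages API
    params, so results can be matched back to inputs when the batch ends.
    """
    return {
        "requests": [
            {
                "custom_id": custom_id,
                "params": {
                    "model": model,  # default id is an assumption
                    "max_tokens": max_tokens,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            for custom_id, prompt in prompts.items()
        ]
    }


def batch_price(standard_cost: float, discount: float = 0.5) -> float:
    """Apply the 50% batch discount quoted above to a standard-API cost."""
    return standard_cost * (1 - discount)


body = build_batch({"review-1": "Classify this review: ...",
                    "review-2": "Classify this review: ..."})
print(len(body["requests"]), batch_price(10.0))
```

Stable, meaningful custom_id values matter more than they look: they are the only link between a queued input and its result.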
Run this audit on your own data
The article above tells you what to look for. Tokenmin tells you which of those you are actually paying for.
curl --proto '=https' --tlsv1.2 -fsSL https://tokenmin.ai/install.sh | bash
tokenmin --days 30 --out report.md
What it does:
- Walks your ~/.claude/ session logs (Claude Code) or a chat export (claude.ai / Claude Desktop)
- HMAC-hashes file paths, project names, MCP server names, custom agent names — anything identifying — with a 32-byte salt generated on first run
- Runs detection rules against the result
- Writes a ranked report covering what to fix, in what order, with an estimated savings figure on each finding
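The hashing step in that list can be illustrated with Python's standard library. This is a generic sketch of salted HMAC pseudonymization, not Tokenmin's actual code: the function names are hypothetical and SHA-256 is an assumed digest, since the hash function is not stated here.

```python
import hashlib
import hmac
import secrets


def new_salt() -> bytes:
    """Generate a 32-byte random salt, as would happen once per install."""
    return secrets.token_bytes(32)


def pseudonymize(identifier: str, salt: bytes) -> str:
    """Replace an identifier with a keyed HMAC digest (SHA-256 assumed).

    The same identifier maps to the same token under one salt, so repeated
    reads of a file can still be counted, while the digest itself does not
    reveal the original string.
    """
    return hmac.new(salt, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


salt = new_salt()
a = pseudonymize("/home/me/client-x/billing.py", salt)
b = pseudonymize("/home/me/client-x/billing.py", salt)
c = pseudonymize("/home/me/client-x/billing.py", new_salt())
print(a == b, a == c)
```

A per-install salt is what stops two machines' reports from being joined on a shared path or project name: the same string hashes differently under each salt.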
Current detector coverage (v0.4, May 2026): missing CLAUDE.md, oversized CLAUDE.md, obsolete CLAUDE.md references, long sessions without /clear, no hooks, repeated file reads, high redo signal (“actually do X instead”), long search bursts, no MCP servers configured, Opus overspend on routine work.
Detectors in flight for v0.5 (the levers in this guide that aren’t yet automated): cache hit ratio (Lever 1), mid-session /model switch cache flush (Lever 1), ENABLE_TOOL_SEARCH not set on multi-MCP setups (Lever 4), output style absent (Lever 5), subagent return-size overruns (Lever 4), 1M-context trap past 256K.
What it doesn’t do. Doesn’t read prompt text, doesn’t read tool results, doesn’t send anything off-machine in the default --out mode. The submit-to-hosted-engine mode is opt-in via --submit-url, and the snapshot it would send is inspectable beforehand with --snapshot snap.json so you can see the exact bytes.
The scanner is Apache-2.0 and audit-readable. The engine is proprietary (separate repo, same trust posture). Full security model: SECURITY.md.
Anti-patterns — what to stop doing
Caching on sporadic traffic. If queries arrive less often than once every 5 minutes, cache writes cost more than hits save. Check your traffic pattern first.
Opus as the default model. On file reads, reformatting, and routine lookups, Opus costs 15× more than Haiku per input token for equivalent output. Use it for problems that actually need it.
CLAUDE.md over 200 lines. Loads in full on every session. Every line past 200 is context overhead on every turn that doesn’t need it.
Spawning a subagent for a two-file read. Cold-starts a new cache. Use direct tool calls for small, bounded reads; reserve subagents for large, isolated work.
Sending tool results in separate user messages. Breaks parallel tool behavior. Claude reverts to sequential calls, adding round-trip cost to every multi-step task.
Dynamic content before stable content. Cache is prefix-matched. If anything that changes per call (a timestamp, per-user state, the current question) sits ahead of your system prompt or reference document in the request, the prefix differs on every call and nothing after it gets cached.
Extended thinking left on by default. Thinking tokens are billed even when hidden. Turn it off at the API level; enable per-call on tasks that justify it.
Re-reading the same file multiple times in a session. The file is already in context. Re-reading adds tokens without new information. A read-once hook catches this automatically.
Sources & Attribution
Every claim in this guide ties back to a primary source. URLs were verified at publication time.
Tier 1 — Primary Anthropic documentation
| # | Source | URL | What it backs |
|---|---|---|---|
| 1 | Prompt Caching | https://platform.claude.com/docs/en/build-with-claude/prompt-caching | Cache pricing (0.1× hit, 1.25× write, 2× 1-hr write), model minimums, TTL rules, prefix-match mechanics |
| 2 | Models Overview | https://platform.claude.com/docs/en/about-claude/models/overview | Model pricing, latency labels, capability descriptions |
| 3 | Parallel Tool Use | https://platform.claude.com/docs/en/build-with-claude/tool-use/parallel-tool-use | Multiple tool_use blocks per response, single-message result rule |
| 4 | Message Batches API | https://platform.claude.com/docs/en/build-with-claude/batch-processing | 50% cost reduction, async processing |
| 5 | Claude Code: Memory | https://code.claude.com/docs/en/memory | 200-line CLAUDE.md guidance, path-scoped rules |
| 6 | Claude Code: Prompt Caching | https://code.claude.com/docs/en/prompt-caching | Cache invalidation triggers in Claude Code |
| 7 | Messages API | https://platform.claude.com/docs/en/api/messages | effort parameter, extended thinking deprecation |
Tier 2 — Anthropic engineering writing
| # | Source | URL | What it backs |
|---|---|---|---|
| 8 | “Building effective agents” (Dec 19, 2024) | https://www.anthropic.com/engineering/building-effective-agents | Latency/cost vs. task-performance tradeoff framing |
| 9 | “Advanced tool use: Building an LLM-driven search tool” | https://www.anthropic.com/engineering/advanced-tool-use | Tool search: 191,300 → 122,800 tokens (36% reduction); programmatic tool calling: 43,588 → 27,297 avg (37% reduction) |
| 10 | “How we built our multi-agent research system” (Jun 13, 2025) | https://www.anthropic.com/engineering/built-multi-agent-research-system | Subagent compression: “condensed, distilled summary (often 1,000–2,000 tokens)” |
| 11 | “Lessons from building Claude Code: prompt caching is everything” | https://claude.com/blog/lessons-from-building-claude-code-prompt-caching-is-everything | “A few percentage points of cache miss rate can dramatically affect cost and latency” |
Tier 3 — Named independent practitioners
| # | Source | URL | What it backs |
|---|---|---|---|
| 12 | Simon Willison, “Prompt caching with Claude” (Aug 14, 2024) | https://simonwillison.net/2024/Aug/14/prompt-caching-with-claude/ | Independent confirmation of 10% cache-hit cost; negative-ROI caveat for traffic sparser than one request per 5 minutes |
| 13 | dev.to/boucle2026, “Read-once: a Claude Code hook that stops redundant file reads” | https://dev.to/boucle2026/read-once-a-claude-code-hook-that-stops-redundant-file-reads-4bjk | Read-once hook pattern for eliminating redundant file reads |
Corrections from prior circulating versions:
- The canonical prompt caching blog URL is claude.com/blog/prompt-caching, not anthropic.com/news/prompt-caching. The latter issues a 308 permanent redirect. All references in this guide use the canonical destination.
- Manual extended thinking via budget_tokens returns a 400 error on Opus 4.7. Use the effort parameter instead. budget_tokens is also deprecated on Sonnet 4.6 and Opus 4.6.