Claude Token Optimization: What Actually Moves the Number

An evidence-backed guide to reducing Claude token spend, ranked by impact. Every claim is cited to primary Anthropic documentation, official engineering writing, or cross-vendor confirmation.

© 2026 Rick Watson / RMW Commerce Consulting. This compilation, its ranking, and the original commentary are copyrighted. The underlying techniques originate from Anthropic’s official Claude API and Claude Code documentation, supplemented by Anthropic engineering writing and named independent practitioners — see Sources & Attribution. Quoting brief excerpts with attribution is fine. Republishing the compilation in whole or in substantial part requires written permission: rick@rmwcommerce.com.


TL;DR — what’s in it for you

If you use Claude via the API or Claude Code with any regularity, the patterns below will measurably reduce what you spend.

Where to spend your time, in priority order

Most readers should implement the first two and stop. They account for the largest share of preventable spend.

# | Practice | Why it matters | Effort
1 | Prompt caching: structure stable content first, mark with cache_control | Cache hits bill at 0.1× input. Anthropic quotes up to 90% cost reduction on repeated workloads | One-time refactor, 1–2 hrs
2 | Model routing: Haiku for mechanical work, Sonnet as default, Opus only when needed | Haiku costs $1/MTok input vs. Opus at $15/MTok, a 15× spread. Most tasks don’t need the expensive model | Per-task discipline
3 | Context discipline: /clear between tasks, /compact near 50%, CLAUDE.md under 200 lines | Context overhead applies to every turn. A 200-line CLAUDE.md that bloats to 600 is a cost increase on every session, invisible until you measure it | 20 min to trim, ongoing
4 | Tool call design: consolidate reads, use parallel calls, strip unused tools from the system prompt | Anthropic’s own engineering measured 37% token reduction (43,588 → 27,297 avg per call) with programmatic tool consolidation | Per-tool refactor
5 | Output length control: prompt for concise responses; skip extended thinking on routine work | Thinking tokens are billed even when not shown. Routine tasks get no quality gain from extended reasoning | Prompt-level change
6 | Batch API for async workloads: non-interactive processing at 50% lower cost | Purpose-built for offline jobs. No SLA on timing, so only use when latency is irrelevant | Architectural decision

How to use this

You don’t need to read all of it. The operational form is a one-line install that runs the audit against your actual Claude session data — no questions, no manual review:

curl --proto '=https' --tlsv1.2 -fsSL https://tokenmin.ai/install.sh | bash
tokenmin --days 30 --out report.md

Tokenmin reads your ~/.claude/ session logs, hashes anything identifying before any analysis runs, and produces a ranked report against the levers below — caching, model routing, context discipline, and the rest. About 60 seconds end-to-end.

Why it’s safe to install. The scanner that decides what (if anything) leaves your machine is Apache-2.0 at github.com/watsonrm/tokenmin-scanner — readable end-to-end before you trust it. Identifiers are HMAC-hashed with a per-install salt; nothing leaves the machine unless you pass --submit-url. See SECURITY.md for the threat model.

If you’d rather do this by hand: the claude-token-optimization skill runs the same audit via Q&A in a Claude Code session. Drop it into ~/.claude/skills/ and say “audit my token spend.” The article below is the reasoning behind each lever — read it for the why; either Tokenmin or the skill is the how.


The honest framing

Token cost is a function of context size × price per token × frequency. Anthropic’s own engineering team states directly:

“Agentic systems often trade latency and cost for better task performance.”

— Anthropic, “Building effective agents” (Dec 19, 2024)

Cost is not always a problem to solve. On high-value work, paying more for a slower, more capable model is correct. The goal is to stop paying for quality on tasks that don’t need it — and to stop reprocessing the same stable content on every call.

Tokenmin’s scanner reads your Claude session logs and surfaces where you’re paying the most. The rankings here align with what its detectors consistently find first in real workloads.


Lever 1: Prompt caching — the largest single reduction

What you’re paying without it

Every API call reprocesses every token you send. A 5,000-token system prompt in a session with 50 turns means 250,000 tokens of overhead before your actual questions touch the model. On any workload with repeated stable content — a reference document, a CLAUDE.md, a codebase read at session start — that overhead dominates the bill.

What caching does

You mark stable sections with cache_control. The API stores them server-side. Subsequent calls that hit the same prefix read from cache instead of reprocessing. Cache reads bill at 0.1× input; cache writes at 1.25× (5-min TTL) or 2× (1-hour TTL).

Anthropic’s published numbers: up to 90% cost savings on cached content. Their own blog post on building Claude Code calls prompt caching “everything”:

“A few percentage points of cache miss rate can dramatically affect cost and latency.”

— Anthropic, “Lessons from building Claude Code: prompt caching is everything”

Pricing table (Claude Sonnet 4.6)

Token type | Cost per million | Multiplier vs. standard
Standard input | $3.00 | 1×
Cache write (5-min TTL) | $3.75 | 1.25×
Cache write (1-hour TTL) | $6.00 | 2×
Cache hit | $0.30 | 0.1×

(Source: Anthropic prompt caching docs)
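To see what the table means for a real session, here is a minimal sketch that prices the earlier example (a 5,000-token stable prefix over 50 turns) with and without caching. The function names are illustrative, and the cached path assumes every later turn arrives before the 5-minute TTL lapses, which will not hold for every workload.

```python
# Sonnet 4.6 prices from the table above, in USD per million tokens.
INPUT_PER_MTOK = 3.00
CACHE_WRITE_5M_PER_MTOK = 3.75
CACHE_HIT_PER_MTOK = 0.30


def prefix_cost_uncached(prefix_tokens: int, turns: int) -> float:
    """Cost of resending the stable prefix at the standard input rate every turn."""
    return prefix_tokens * turns * INPUT_PER_MTOK / 1_000_000


def prefix_cost_cached(prefix_tokens: int, turns: int) -> float:
    """One cache write on the first turn, then a cache hit on every later turn.

    Assumption: no turn arrives after the 5-minute TTL has expired.
    """
    write = prefix_tokens * CACHE_WRITE_5M_PER_MTOK / 1_000_000
    hits = prefix_tokens * (turns - 1) * CACHE_HIT_PER_MTOK / 1_000_000
    return write + hits


uncached = prefix_cost_uncached(5_000, 50)
cached = prefix_cost_cached(5_000, 50)
print(f"uncached prefix cost: ${uncached:.4f}")
print(f"cached prefix cost:   ${cached:.4f}")
print(f"saved on the prefix:  {1 - cached / uncached:.0%}")
```

Under those assumptions the prefix costs $0.75 uncached and about $0.09 cached, a saving of roughly 88% on the prefix alone. The user messages and outputs are billed the same either way.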

The caveat

Cache is worth it on repeated workloads only. Simon Willison flagged this at launch:

“Apps prompting less frequently than once per five minutes will see negative ROI since cache writing costs exceed retrieval savings.”

— Simon Willison, “Prompt caching with Claude” (Aug 14, 2024)

Check your traffic pattern before enabling caching app-wide. Sporadic, one-off queries don’t benefit.
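A rough way to check your own traffic is to compare the two billing paths in units of "one uncached read of the prefix". The sketch below uses the write and hit multipliers quoted earlier. The idea of a fixed "window" is a simplifying assumption for illustration, and the function name is hypothetical.

```python
def caching_pays_off(requests_per_window: int,
                     write_multiplier: float = 1.25,
                     hit_multiplier: float = 0.1) -> bool:
    """Return True if caching is cheaper than not caching for one cache lifetime.

    Costs are expressed as multiples of one standard read of the prefix.
    Assumption: the first request pays the write, every later one is a hit.
    """
    if requests_per_window < 1:
        return False
    without_cache = float(requests_per_window)
    with_cache = write_multiplier + hit_multiplier * (requests_per_window - 1)
    return with_cache < without_cache


for n in (1, 2, 3):
    print(n, caching_pays_off(n), caching_pays_off(n, write_multiplier=2.0))
```

A single request per cache lifetime pays the write premium and never collects a hit, which is the negative-ROI case. With the 1-hour TTL's 2× write, it takes a third request before caching comes out ahead.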

Model minimums

Cache kicks in at 1,024 tokens for Sonnet 4.6 and Sonnet 4.5. Opus 4.7, Opus 4.5, and Haiku 4.5 require 4,096 tokens before caching applies. Shorter prompts process without caching — no error returned, just no savings.
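Because a too-short prefix fails silently, a pre-flight check can be worth having. The following is a small sketch built only from the thresholds quoted above. The dictionary keys are informal labels chosen for this example, not API model IDs, and the token count is assumed to come from whatever tokenizer or counting endpoint you already use.

```python
# Minimum cacheable prompt length per model, as quoted in the paragraph above.
# Keys are informal labels for this sketch, not real API model identifiers.
CACHE_MINIMUM_TOKENS = {
    "sonnet-4.6": 1024,
    "sonnet-4.5": 1024,
    "opus-4.7": 4096,
    "opus-4.5": 4096,
    "haiku-4.5": 4096,
}


def will_cache(model_label: str, prefix_tokens: int) -> bool:
    """True if a prefix of this length meets the model's caching minimum."""
    return prefix_tokens >= CACHE_MINIMUM_TOKENS[model_label]


# A 2,000-token system prompt clears the Sonnet minimum but not the Haiku one.
print(will_cache("sonnet-4.6", 2_000))
print(will_cache("haiku-4.5", 2_000))
```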

What belongs in the cache

Cache these: system prompts, tool definitions, reference documents, long few-shot blocks, code files read repeatedly at session start.

Don’t cache: the user’s current message, dynamic state, turn-specific tool results. Caching dynamic content provides no benefit and wastes the write cost.

API structure

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "You are a data analyst with these rules...",
      "cache_control": {"type": "ephemeral"}
    },
    {
      "type": "text",
      "text": "[5,000-token reference document here]",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Which SKUs had the highest return rate?"}
  ]
}
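The same structure can be produced from code so that the stable prefix is identical across calls and only the final user message changes. This is a sketch that builds the request body as a plain dictionary. The helper name and the max_tokens value are assumptions for illustration, and actually sending the request is left to whichever client you use.

```python
def build_request(rules: str, reference_doc: str, question: str) -> dict:
    """Assemble a Messages API request body with the stable prefix first."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,  # illustrative value
        "system": [
            # Stable blocks come first and carry the cache marker.
            {"type": "text", "text": rules,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": reference_doc,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Dynamic content goes last and carries no cache_control marker.
        "messages": [{"role": "user", "content": question}],
    }


first = build_request("You are a data analyst.", "[reference document]",
                      "Which SKUs had the highest return rate?")
second = build_request("You are a data analyst.", "[reference document]",
                       "Which SKUs had the lowest return rate?")
# The cacheable prefix is unchanged between the two calls.
print(first["system"] == second["system"])
```

Keeping timestamps, user names, or other per-call values out of the two system blocks is what lets the second call match the cached prefix.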

In Claude Code

Claude Code manages caching automatically. Your CLAUDE.md, tool definitions, and stable project context sit in the cached prefix. Some mid-session changes invalidate that prefix and force a fresh cache write (source: Claude Code prompt caching docs), so the practical rule is:

Pick your model and connect MCP servers at session start. Save /compact for natural breaks between tasks.


Lever 2: Model routing — the simplest cost lever

The spread

Model | Comparative latency | Input (per MTok) | Output (per MTok)
Claude Haiku 4.5 | Fastest | $1 | $5
Claude Sonnet 4.6 | Fast | $3 | $15
Claude Opus 4.7 | Moderate | $15 | $75

(Source: Anthropic models overview)

Opus costs 15× more per input token than Haiku — and 15× more per output token too ($75/MTok vs. $5/MTok). On agentic workloads where output dominates the bill, the output spread is the bigger line item. For mechanical work — file reads, reformatting, simple lookups, yes/no checks — Haiku delivers the same result at a fraction of the cost on both sides.
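To put the spread in dollar terms, the sketch below prices one hypothetical task (20,000 input tokens, 2,000 output tokens) on each model using the table above. The task size and the short model labels are assumptions chosen for illustration.

```python
# USD per million tokens as (input, output), taken from the table above.
PRICES = {
    "haiku": (1.0, 5.0),
    "sonnet": (3.0, 15.0),
    "opus": (15.0, 75.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Uncached cost in USD of a single request on the given model."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical task size: 20,000 input tokens and 2,000 output tokens.
for model in PRICES:
    print(f"{model:>6}: ${request_cost(model, 20_000, 2_000):.3f}")
```

For that task the figures come out to $0.03 on Haiku, $0.09 on Sonnet, and $0.45 on Opus. The per-request difference looks small until it is multiplied by every turn of every session.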

The discipline

In Claude Code, switch with /model haiku or /model sonnet. Each switch invalidates the cache for that session (see Lever 1) — worthwhile if the task genuinely fits a cheaper model, not worth it for one-turn cost savings on a long session.

The effort parameter

For Claude Opus 4.7 (which uses adaptive thinking by default), the Messages API’s effort parameter inside output_config accepts low, medium, high, xhigh, or max. Fewer thinking tokens at low or medium means lower cost on problems that don’t require maximal reasoning depth. (Source: Messages API reference)

Manual extended thinking via budget_tokens is no longer supported on Opus 4.7 — it returns a 400 error. budget_tokens is also deprecated on Sonnet 4.6 and Opus 4.6. Use effort instead.
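As a sketch of where the setting lives, the helper below adds an effort level to a request body under output_config, following the description above. The helper name is hypothetical and the model string in the example is a placeholder, so check the Messages API reference for the exact model ID and request shape before relying on it.

```python
# Effort levels as listed above for the effort parameter.
VALID_EFFORT = ("low", "medium", "high", "xhigh", "max")


def with_effort(request: dict, effort: str) -> dict:
    """Return a copy of the request body with output_config.effort set."""
    if effort not in VALID_EFFORT:
        raise ValueError(f"effort must be one of {VALID_EFFORT}, got {effort!r}")
    updated = dict(request)
    updated["output_config"] = {**request.get("output_config", {}),
                                "effort": effort}
    return updated


base = {
    "model": "<opus-4.7-model-id>",  # placeholder, not a real model ID
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Rename this variable."}],
}
routine = with_effort(base, "low")
print(routine["output_config"])
```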


Lever 3: Context discipline — the invisible per-session tax

Every token in the context window costs money on every turn. A bloated CLAUDE.md, a conversation history that hasn’t been cleared between unrelated tasks, or an over-broad read at session start — all of these multiply across every subsequent call.

CLAUDE.md size

Anthropic’s Claude Code memory docs state directly: “target under 200 lines per CLAUDE.md file. Longer files consume more context and reduce adherence.”

A 600-line CLAUDE.md loaded at every session is 3× the overhead of a lean one, on every turn, for every project. Trim to under 200 lines. Move scoped rules to .claude/rules/<scope>.md — Claude Code loads those only when matching files are in play.
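Checking this takes a few lines. The sketch below counts how far a file runs past the 200-line target. It assumes a CLAUDE.md in the current directory and simply does nothing if the file is absent. The function name is illustrative.

```python
from pathlib import Path


def lines_over_target(text: str, target: int = 200) -> int:
    """Number of lines by which the text exceeds the target (0 if within it)."""
    return max(0, len(text.splitlines()) - target)


if __name__ == "__main__":
    path = Path("CLAUDE.md")  # assumed location; adjust for your project
    if path.exists():
        over = lines_over_target(path.read_text(encoding="utf-8"))
        status = f"{over} lines over target" if over else "within target"
        print(f"{path}: {status}")
```

Running it in CI or a pre-commit step is one way to catch a file that creeps from 200 lines toward 600 without anyone noticing.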

Conversation history

In Claude Code, run /clear between unrelated tasks. The full conversation history doesn’t help a new task — it adds tokens that must be processed on every subsequent call without contributing to the answer.

Use /compact when the session has been productive and you want to continue: it summarizes and replaces history rather than nuking it. Use it near the 50% context mark, not as an emergency measure when the window is full.

What Tokenmin’s scanner sees here

The most common pattern in session logs: CLAUDE.md loaded at 400+ lines, context never cleared between tasks, and the same reference files re-read 3–4 times in a session because they fell out of the prior response and Claude needed to re-establish state. The compound effect of all three is typically 2–4× the per-session token cost of a clean workflow.


Lever 4: Tool call design — where agentic workloads lose money

Tool calls have two cost vectors: the tokens in the tool definition (which sit in the system prompt and get cached or not), and the tokens in the tool results (which are always new). Both can be reduced with deliberate design.

Consolidate reads

Anthropic’s own engineering team measured the effect of programmatic tool consolidation on their advanced tool use benchmark:

Tool search reduced context from 191,300 tokens to 122,800 on the same task — a 36% reduction.
Programmatic tool calling cut the average call from 43,588 to 27,297 tokens — a 37% reduction.

— Anthropic, “Advanced tool use: Building an LLM-driven search tool”

The principle: don’t give Claude 20 specialized tools when 5 composable ones cover the same ground. Each tool definition adds tokens to the system prompt. Load only what the task needs.

If you have many MCP servers connected and can’t easily trim, set ENABLE_TOOL_SEARCH=auto in your environment. Claude Code’s runtime then loads tool definitions on demand rather than dumping all of them into the system prompt. This is the trigger that produced Anthropic’s measured 191,300 → 122,800 token recovery above (Anthropic, “Advanced tool use”). One env var.

Parallel tool calls

Claude 4 models support multiple tool_use blocks in a single assistant response. Running them in parallel rather than sequentially eliminates round-trip overhead on independent reads. From the docs:

“Tool calls in a single assistant turn are unordered. You can run them concurrently (Promise.all, asyncio.gather), sequentially, or in any order.”

— Anthropic, “Parallel tool use”

Return all tool results in a single user message. Sending a separate message for each result breaks the parallel behavior, and Claude reverts to sequential calls.
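Here is a minimal sketch of that pattern on the client side: run every tool_use block from one assistant turn concurrently with asyncio.gather, then package all the tool_result blocks into one user message. The read_file tool and the sample tool_use IDs are stand-ins invented for the example.

```python
import asyncio


async def read_file(path: str) -> str:
    """Stand-in tool for this sketch; a real one would do actual I/O."""
    await asyncio.sleep(0)
    return f"<contents of {path}>"


TOOLS = {"read_file": read_file}


async def run_tool_calls(assistant_content: list) -> dict:
    """Run all tool_use blocks concurrently and return ONE user message."""
    calls = [b for b in assistant_content if b.get("type") == "tool_use"]
    # gather() returns results in the same order as the awaitables passed in.
    outputs = await asyncio.gather(
        *(TOOLS[c["name"]](**c["input"]) for c in calls)
    )
    return {
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": c["id"], "content": out}
            for c, out in zip(calls, outputs)
        ],
    }


# Hypothetical assistant turn containing two independent reads.
assistant_content = [
    {"type": "tool_use", "id": "toolu_01", "name": "read_file",
     "input": {"path": "a.py"}},
    {"type": "tool_use", "id": "toolu_02", "name": "read_file",
     "input": {"path": "b.py"}},
]
message = asyncio.run(run_tool_calls(assistant_content))
print(len(message["content"]), "results in one message")
```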

The read-once hook pattern

Re-reading the same file multiple times in a session is a documented anti-pattern. One practitioner measured the effect with a simple hook that flags when Claude reads a file it already has in context: the hook eliminated a class of redundant reads that were adding hundreds of tokens per turn on file-heavy workflows. (Source: dev.to/boucle2026)
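The core of such a check is small. The sketch below is not the linked hook's implementation. It only illustrates the bookkeeping: remember each file's modification time when it is read, and flag a repeat read only if the file has not changed since. Using modification time as the change signal is an assumption of this example, as is the class name.

```python
class ReadTracker:
    """Tracks which files have been read in the current session."""

    def __init__(self) -> None:
        self._seen = {}  # path -> modification time at the last read

    def is_redundant(self, path: str, mtime: float) -> bool:
        """True if this path was already read and has not changed since."""
        previous = self._seen.get(path)
        self._seen[path] = mtime
        return previous is not None and previous == mtime


tracker = ReadTracker()
print(tracker.is_redundant("src/app.py", 1000.0))  # first read
print(tracker.is_redundant("src/app.py", 1000.0))  # unchanged repeat
print(tracker.is_redundant("src/app.py", 1001.0))  # file was edited
```

Wiring this into a session would mean calling it wherever file reads are intercepted and surfacing a warning rather than silently blocking, since a changed file is a legitimate re-read.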

Subagents: cold cache, isolated context

Subagents start their own conversation with no cache hits on the first call. Don’t spawn a subagent to read two files — that cold-starts a fresh context for a task that doesn’t warrant it. Reserve subagents for (a) genuinely isolated work that would bloat the main session, or (b) parallel work you can’t get from a single agent’s tool calls.

When subagents are warranted, Anthropic’s engineering team describes their compression benefit:

“Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens.”

— Anthropic, “How we built our multi-agent research system” (Jun 13, 2025)

The return from a well-designed subagent is a condensed, distilled summary, often 1,000–2,000 tokens, so the full exploration cost never lands in your main context.


Lever 5: Output length control

Extended thinking

Extended thinking is off by default. On Opus 4.7, the effort parameter controls thinking depth. On Sonnet 4.6 and earlier Opus models, the now-deprecated budget_tokens parameter set a cap.

Key point: thinking tokens are billed even when not displayed. Setting display: "omitted" reduces perceived latency (server skips streaming thinking tokens) but does not reduce cost. You still pay for the thinking generated.

Don’t enable extended thinking by default. Enable it per-call on tasks where reasoning depth actually improves the output: multi-step math, formal logic, complex architecture decisions. Skip it for: summarization, data extraction, reformatting, file reads, single-fact lookups, boilerplate generation.

Response length

Claude calibrates response length to perceived task complexity. On tasks where you want a short answer, say so explicitly: “Result only. No explanation.” On tasks where you want structured output, specify the schema. Vague prompts generate long responses; precise prompts generate precise responses.

The CLAUDE.md and Skills refactor

A common source of excess tokens: a large CLAUDE.md that contains procedural instructions meant to be referenced rarely. These load on every session regardless of whether the task needs them.

If a block of CLAUDE.md instructions is only relevant for certain task types — say, a 30-line section on how to handle client billing queries — move it to a skill file under .claude/skills/. The skill loads only when triggered. Same instructions, far fewer idle-session tokens.


Lever 6: Batch API for offline workloads

The Message Batches API processes asynchronous requests at 50% lower cost than the standard API. It’s purpose-built for: content moderation at scale, bulk evaluations, dataset analysis, and any workload where you’re processing a large queue and don’t need immediate results. (Source: Anthropic batch processing docs)

This is not a latency optimization. Batches have no SLA on turnaround. Use it only when cost efficiency matters more than response time — typically back-end jobs that run overnight or on a schedule.
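For a sense of what moving a job to batch involves, the sketch below turns a queue of prompts into the list of request entries a batch submission carries, each with a custom_id and a params object holding an ordinary Messages request. The helper names, the model string, and the max_tokens value are assumptions for illustration. Submitting the list and polling for results are left to your client.

```python
def build_batch(prompts: dict, model: str = "claude-haiku-4-5",
                max_tokens: int = 512) -> list:
    """Turn {custom_id: prompt} into a list of batch request entries."""
    return [
        {
            "custom_id": custom_id,  # used to match results back to inputs
            "params": {
                "model": model,  # assumed model alias; verify before use
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for custom_id, text in prompts.items()
    ]


def batch_cost(standard_cost_usd: float) -> float:
    """Apply the 50% batch discount quoted above to a standard-API cost."""
    return standard_cost_usd * 0.5


requests = build_batch({
    "review-001": "Classify the sentiment of: 'Arrived late, works fine.'",
    "review-002": "Classify the sentiment of: 'Broke after two days.'",
})
print(len(requests), "requests queued")
print(f"${batch_cost(12.00):.2f} instead of $12.00")
```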


Run this audit on your own data

The article above tells you what to look for. Tokenmin tells you which of those you are actually paying for.

curl --proto '=https' --tlsv1.2 -fsSL https://tokenmin.ai/install.sh | bash
tokenmin --days 30 --out report.md

What it does:

Current detector coverage (v0.4, May 2026): missing CLAUDE.md, oversized CLAUDE.md, obsolete CLAUDE.md references, long sessions without /clear, no hooks, repeated file reads, high redo signal (“actually do X instead”), long search bursts, no MCP servers configured, Opus overspend on routine work.

Detectors in flight for v0.5 (the levers in this guide that aren’t yet automated): cache hit ratio (Lever 1), mid-session /model switch cache flush (Lever 1), ENABLE_TOOL_SEARCH not set on multi-MCP setups (Lever 4), output style absent (Lever 5), subagent return-size overruns (Lever 4), 1M-context trap past 256K.

What it doesn’t do. Doesn’t read prompt text, doesn’t read tool results, doesn’t send anything off-machine in the default --out mode. The submit-to-hosted-engine mode is opt-in via --submit-url, and the snapshot it would send is inspectable beforehand with --snapshot snap.json so you can see the exact bytes.

The scanner is Apache-2.0 and audit-readable. The engine is proprietary (separate repo, same trust posture). Full security model: SECURITY.md.


Anti-patterns — what to stop doing

Caching on sporadic traffic. If queries come in less than once every 5 minutes, cache writes cost more than hits save. Check your traffic pattern first.

Opus as the default model. On file reads, reformatting, and routine lookups, Opus costs 15× more than Haiku per input token for equivalent output. Use it for problems that actually need it.

CLAUDE.md over 200 lines. Loads in full on every session. Every line past 200 is context overhead on every turn that doesn’t need it.

Spawning a subagent for a two-file read. Cold-starts a new cache. Use direct tool calls for small, bounded reads; reserve subagents for large, isolated work.

Sending tool results in separate user messages. Breaks parallel tool behavior. Claude reverts to sequential calls, adding round-trip cost to every multi-step task.

Dynamic content before stable content. Cache is prefix-matched. If the user’s message precedes your system prompt or reference document in the API call structure, nothing downstream gets cached.

Extended thinking left on by default. Thinking tokens are billed even when hidden. Turn it off at the API level; enable per-call on tasks that justify it.

Re-reading the same file multiple times in a session. The file is already in context. Re-reading adds tokens without new information. A read-once hook catches this automatically.


Sources & Attribution

Every claim in this guide ties back to a primary source. URLs were verified at publication time.

Tier 1 — Primary Anthropic documentation

# | Source | URL | What it backs
1 | Prompt Caching | https://platform.claude.com/docs/en/build-with-claude/prompt-caching | Cache pricing (0.1× hit, 1.25× write, 2× 1-hr write), model minimums, TTL rules, prefix-match mechanics
2 | Models Overview | https://platform.claude.com/docs/en/about-claude/models/overview | Model pricing, latency labels, capability descriptions
3 | Parallel Tool Use | https://platform.claude.com/docs/en/build-with-claude/tool-use/parallel-tool-use | Multiple tool_use blocks per response, single-message result rule
4 | Message Batches API | https://platform.claude.com/docs/en/build-with-claude/batch-processing | 50% cost reduction, async processing
5 | Claude Code: Memory | https://code.claude.com/docs/en/memory | 200-line CLAUDE.md guidance, path-scoped rules
6 | Claude Code: Prompt Caching | https://code.claude.com/docs/en/prompt-caching | Cache invalidation triggers in Claude Code
7 | Messages API | https://platform.claude.com/docs/en/api/messages | effort parameter, extended thinking deprecation

Tier 2 — Anthropic engineering writing

# | Source | URL | What it backs
8 | “Building effective agents” (Dec 19, 2024) | https://www.anthropic.com/engineering/building-effective-agents | Latency/cost vs. task-performance tradeoff framing
9 | “Advanced tool use: Building an LLM-driven search tool” | https://www.anthropic.com/engineering/advanced-tool-use | Tool search: 191,300 → 122,800 tokens (36% reduction); programmatic tool calling: 43,588 → 27,297 avg (37% reduction)
10 | “How we built our multi-agent research system” (Jun 13, 2025) | https://www.anthropic.com/engineering/built-multi-agent-research-system | Subagent compression: “condensed, distilled summary (often 1,000–2,000 tokens)”
11 | “Lessons from building Claude Code: prompt caching is everything” | https://claude.com/blog/lessons-from-building-claude-code-prompt-caching-is-everything | “A few percentage points of cache miss rate can dramatically affect cost and latency”

Tier 3 — Named independent practitioners

# | Source | URL | What it backs
12 | Simon Willison, “Prompt caching with Claude” (Aug 14, 2024) | https://simonwillison.net/2024/Aug/14/prompt-caching-with-claude/ | Independent confirmation of 10% cache-hit cost; negative-ROI caveat for apps that prompt less often than once every five minutes
13 | dev.to/boucle2026, “Read-once: a Claude Code hook that stops redundant file reads” | https://dev.to/boucle2026/read-once-a-claude-code-hook-that-stops-redundant-file-reads-4bjk | Read-once hook pattern for eliminating redundant file reads


