Battle-tested in production. Build on it with confidence.
Prompt Caching
Table-stakes optimisation for any production LLM app with repeated context — but prefix ordering is fragile, and changing tools mid-conversation invalidates the cache.
Infrastructure · DevTool
Our Take
What It Is
Prompt caching avoids recomputing the same prompt prefix on every request. Anthropic offers explicit cache breakpoints (up to 4 per request; cache reads are billed at roughly 10% of base input price, a ~90% saving), OpenAI provides automatic caching (a 50% discount with no code changes), and Google Gemini supports both implicit and explicit caching. All three require a minimum prompt length of roughly 1,024 tokens before caching activates. OpenAI recently added 24-hour extended retention.
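For Anthropic's explicit style, a breakpoint is a `cache_control` marker on a content block: everything up to and including that block becomes the cacheable prefix. A minimal sketch of the request shape (the model name is a placeholder, and no API call is made here):

```python
# Anthropic-style request body with an explicit cache breakpoint.
# The `cache_control` marker on the last system block asks the API to cache
# everything up to and including that block; subsequent requests that reuse
# the identical prefix read it back at the discounted cache-read rate.
LONG_REFERENCE_DOC = "..."  # imagine >= 1,024 tokens of stable context

request = {
    "model": "claude-sonnet-4-20250514",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You are a careful research assistant."},
        {
            "type": "text",
            "text": LONG_REFERENCE_DOC,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        },
    ],
    "messages": [{"role": "user", "content": "Summarise section 2."}],
}
```

Only the request dict is built here; passing it to the real Messages API (and verifying the block format against current docs) is left to the reader.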
Why It Matters
Prompt caching is Proven because it's become invisible infrastructure. Claude Code's entire architecture depends on it for conversation context. Any production LLM application with system prompts, tool definitions, or multi-turn conversations benefits immediately. One Anthropic customer saved $50,000+ per month on academic paper analysis.
The practical reality: if you're not using prompt caching in production, you're overpaying by 50-90%. It's one of the few optimisations that requires near-zero effort for substantial cost reduction.
Key Developments
- Jan 2026: arXiv paper "Don't Break the Cache" evaluates prompt caching for long-horizon agentic tasks.
- Early 2026: OpenAI adds 24-hour extended caching via prompt_cache_retention parameter.
- Late 2025: Gemini 2.5 enables auto-caching with guaranteed discounts on explicit context caching.
- Ongoing: All three major providers now support caching. Anthropic reports 85% latency reduction on long prompts.
What to Watch
The fragility of prefix ordering is the pain point. Adding or removing a tool, shuffling system prompt sections, or inserting timestamps breaks the cache entirely. Watch for providers to offer more flexible cache invalidation strategies — partial prefix matching or content-addressed caching would be a significant improvement over the current exact-prefix requirement.
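Until providers relax the exact-prefix requirement, the main defence is assembling requests so every volatile value lands *after* the stable prefix. A minimal sketch (the field names mirror common chat-API request shapes, not any specific provider's SDK):

```python
import json

# Cache-friendly request assembly: serialise the stable parts (system prompt,
# tool list) deterministically, and append volatile data (timestamps,
# retrieved context) only in the newest turn, past the cacheable prefix.
SYSTEM_PROMPT = "You are a build assistant."
TOOLS = [  # fixed order; inserting, removing, or reordering tools breaks the cache
    {"name": "run_tests", "description": "Run the test suite."},
    {"name": "read_file", "description": "Read a file from the repo."},
]

def build_request(history: list[dict], user_input: str, now: str) -> dict:
    return {
        "system": SYSTEM_PROMPT,  # stable prefix, part 1
        "tools": TOOLS,           # stable prefix, part 2
        "messages": history + [
            # volatile data rides in the newest turn, after the cached prefix
            {"role": "user", "content": f"(sent {now}) {user_input}"},
        ],
    }

# The cacheable prefix is byte-identical no matter when the request is sent:
a = build_request([], "run the tests", now="2026-01-01T00:00:00Z")
b = build_request([], "run the tests", now="2026-01-02T12:34:56Z")
assert json.dumps(a["system"]) == json.dumps(b["system"])
assert json.dumps(a["tools"]) == json.dumps(b["tools"])
```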
Strengths
- Dramatic cost savings: 50-90% reduction on cached input tokens. Real-world savings of $50,000+/month on heavy workloads.
- Significant latency reduction: Anthropic reports up to 85% lower latency on long prompts; OpenAI reports similar gains.
- Zero code changes (OpenAI): Automatic caching activates for any prompt exceeding 1,024 tokens.
- Critical for agentic workflows: Claude Code's architecture depends on it. Sequential API calls in agents benefit enormously.
Considerations
- Prefix ordering fragility: Cache hits require exact prefix matching. Shuffling tool definitions or adding timestamps invalidates the cache entirely.
- Cold start penalty: Cache misses incur full processing cost plus write overhead. Anthropic's 1-hour TTL writes cost 2x base input price.
- Tool changes break the cache: Adding or removing a tool mid-conversation invalidates the cache for the entire conversation.
- Provider implementation differences: Anthropic's explicit breakpoints, OpenAI's automatic approach, and Google's dual mode are fundamentally different APIs.
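Those API differences extend to observability: each provider reports cache hits under different usage fields. A sketch of a normalised hit-rate metric (field names follow the documented usage payloads — OpenAI's `prompt_tokens_details.cached_tokens`, Anthropic's `cache_read_input_tokens` — but verify them against current docs before relying on this):

```python
# Normalised cache-hit rate across OpenAI- and Anthropic-style usage payloads,
# useful for logging whether your prefix layout is actually getting cache hits.
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache."""
    if "prompt_tokens_details" in usage:
        # OpenAI-style: cached tokens are a subset of prompt_tokens.
        total = usage["prompt_tokens"]
        cached = usage["prompt_tokens_details"].get("cached_tokens", 0)
    else:
        # Anthropic-style: input_tokens excludes cached tokens, so the
        # true total is the sum of uncached, cache-read, and cache-write.
        cached = usage.get("cache_read_input_tokens", 0)
        total = (usage.get("input_tokens", 0)
                 + cached
                 + usage.get("cache_creation_input_tokens", 0))
    return cached / total if total else 0.0

# e.g. an OpenAI-style response where 4,096 of 5,000 prompt tokens were cached:
rate = cache_hit_rate({"prompt_tokens": 5000,
                       "prompt_tokens_details": {"cached_tokens": 4096}})
# rate == 0.8192
```

A hit rate that drops to zero mid-conversation is the telltale sign that something volatile slipped into the prefix.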
Resources
Documentation (anthropic.com)