Battle-tested in production. Build on it with confidence.
Prompt Caching
Table-stakes optimisation for any production LLM app with repeated context — but prefix ordering is fragile, and changing tools mid-conversation invalidates the cache.
Infrastructure · DevTool
Our Take
What It Is
Prompt caching avoids recomputing the same prompt prefix on every request. Anthropic offers explicit cache breakpoints (up to 4 per request; cache reads are billed at roughly 10% of base input price, a ~90% saving), OpenAI provides automatic caching (a 50% discount with no code changes), and Google Gemini supports both implicit and explicit caching. All three require a minimum prompt length of roughly 1,024 tokens before caching activates. OpenAI recently added 24-hour extended retention.
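For Anthropic's explicit style, a breakpoint is a `cache_control` marker on a content block: everything up to and including that block becomes the cacheable prefix. A minimal sketch of the request shape (the model name is a placeholder, and no API call is made here):

```python
# Anthropic-style request body with an explicit cache breakpoint.
# The `cache_control` marker on the last system block asks the API to cache
# everything up to and including that block; subsequent requests that reuse
# the identical prefix read it back at the discounted cache-read rate.
LONG_REFERENCE_DOC = "..."  # imagine >= 1,024 tokens of stable context

request = {
    "model": "claude-sonnet-4-20250514",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You are a careful research assistant."},
        {
            "type": "text",
            "text": LONG_REFERENCE_DOC,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        },
    ],
    "messages": [{"role": "user", "content": "Summarise section 2."}],
}
```

Only the request dict is built here; passing it to the real Messages API (and verifying the block format against current docs) is left to the reader.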
Why It Matters
Prompt caching is Proven because it's become invisible infrastructure. Claude Code's entire architecture depends on it for conversation context. Any production LLM application with system prompts, tool definitions, or multi-turn conversations benefits immediately. One Anthropic customer saved $50,000+ per month on academic paper analysis.
The practical reality: if you're not using prompt caching in production, you're overpaying by 50-90%. It's one of the few optimisations that requires near-zero effort for substantial cost reduction.
Key Developments
- Jan 2026: arXiv paper "Don't Break the Cache" evaluates prompt caching for long-horizon agentic tasks.
- Early 2026: OpenAI adds 24-hour extended caching via prompt_cache_retention parameter.
- Late 2025: Gemini 2.5 enables auto-caching with guaranteed discounts on explicit context caching.
- Ongoing: All three major providers now support caching. Anthropic reports 85% latency reduction on long prompts.
What to Watch
The fragility of prefix ordering is the pain point. Adding or removing a tool, shuffling system prompt sections, or inserting timestamps breaks the cache entirely. Watch for providers to offer more flexible cache invalidation strategies — partial prefix matching or content-addressed caching would be a significant improvement over the current exact-prefix requirement.
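Until providers relax the exact-prefix requirement, the main defence is assembling requests so every volatile value lands *after* the stable prefix. A minimal sketch (the field names mirror common chat-API request shapes, not any specific provider's SDK):

```python
import json

# Cache-friendly request assembly: serialise the stable parts (system prompt,
# tool list) deterministically, and append volatile data (timestamps,
# retrieved context) only in the newest turn, past the cacheable prefix.
SYSTEM_PROMPT = "You are a build assistant."
TOOLS = [  # fixed order; inserting, removing, or reordering tools breaks the cache
    {"name": "run_tests", "description": "Run the test suite."},
    {"name": "read_file", "description": "Read a file from the repo."},
]

def build_request(history: list[dict], user_input: str, now: str) -> dict:
    return {
        "system": SYSTEM_PROMPT,  # stable prefix, part 1
        "tools": TOOLS,           # stable prefix, part 2
        "messages": history + [
            # volatile data rides in the newest turn, after the cached prefix
            {"role": "user", "content": f"(sent {now}) {user_input}"},
        ],
    }

# The cacheable prefix is byte-identical no matter when the request is sent:
a = build_request([], "run the tests", now="2026-01-01T00:00:00Z")
b = build_request([], "run the tests", now="2026-01-02T12:34:56Z")
assert json.dumps(a["system"]) == json.dumps(b["system"])
assert json.dumps(a["tools"]) == json.dumps(b["tools"])
```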
Strengths
- Dramatic cost savings: 50-90% reduction on cached input tokens. Real-world savings of $50,000+/month on heavy workloads.
- Significant latency reduction: Anthropic reports up to 85% lower latency on long prompts; OpenAI reports similar gains.
- Zero code changes (OpenAI): Automatic caching activates for any prompt exceeding 1,024 tokens.
- Critical for agentic workflows: Claude Code's architecture depends on it. Sequential API calls in agents benefit enormously.
Considerations
- Prefix ordering fragility: Cache hits require exact prefix matching. Shuffling tool definitions or adding timestamps invalidates the cache entirely.
- Cold start penalty: Cache misses incur full processing cost plus write overhead. Anthropic's 1-hour TTL writes cost 2x base input price.
- Tool changes break the cache: Adding or removing a tool mid-conversation invalidates the cache for the entire conversation.
- Provider implementation differences: Anthropic's explicit breakpoints, OpenAI's automatic approach, and Google's dual mode are fundamentally different APIs.
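Those API differences extend to observability: each provider reports cache hits under different usage fields. A sketch of a normalised hit-rate metric (field names follow the documented usage payloads — OpenAI's `prompt_tokens_details.cached_tokens`, Anthropic's `cache_read_input_tokens` — but verify them against current docs before relying on this):

```python
# Normalised cache-hit rate across OpenAI- and Anthropic-style usage payloads,
# useful for logging whether your prefix layout is actually getting cache hits.
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache."""
    if "prompt_tokens_details" in usage:
        # OpenAI-style: cached tokens are a subset of prompt_tokens.
        total = usage["prompt_tokens"]
        cached = usage["prompt_tokens_details"].get("cached_tokens", 0)
    else:
        # Anthropic-style: input_tokens excludes cached tokens, so the
        # true total is the sum of uncached, cache-read, and cache-write.
        cached = usage.get("cache_read_input_tokens", 0)
        total = (usage.get("input_tokens", 0)
                 + cached
                 + usage.get("cache_creation_input_tokens", 0))
    return cached / total if total else 0.0

# e.g. an OpenAI-style response where 4,096 of 5,000 prompt tokens were cached:
rate = cache_hit_rate({"prompt_tokens": 5000,
                       "prompt_tokens_details": {"cached_tokens": 4096}})
# rate == 0.8192
```

A hit rate that drops to zero mid-conversation is the telltale sign that something volatile slipped into the prefix.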
Resources
Documentation (anthropic.com)