How Prompt Caching Works
Every LLM API request includes input tokens that the model processes before generating output. Many applications send the same system prompt, few-shot examples, or reference documents with every request. Without caching, you pay the full input token price every time.
Prompt caching tells the provider to store the processed state of your prompt prefix. Subsequent requests that share the same prefix reuse the cached version, reducing both cost and processing time.
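The mechanics can be sketched with a toy simulation (this is a conceptual model, not any provider's API; names like `PrefixCache` are illustrative): the provider stores work keyed by the prompt prefix, so a repeated prefix only pays for the variable suffix.

```python
# Conceptual sketch of prefix caching: the processed prefix is stored
# under a hash of its text, so repeat requests only "pay" for the suffix.
import hashlib

class PrefixCache:
    def __init__(self):
        self._store = {}

    def process(self, prefix: str, suffix: str) -> dict:
        """Return token counts, reusing cached work for a repeated prefix."""
        key = hashlib.sha256(prefix.encode()).hexdigest()
        hit = key in self._store
        if not hit:
            # First request: the full prefix is processed and stored.
            self._store[key] = len(prefix.split())
        return {
            "cache_hit": hit,
            "cached_tokens": self._store[key] if hit else 0,
            # Only the suffix is processed fresh on a cache hit.
            "fresh_tokens": len(suffix.split()) + (0 if hit else len(prefix.split())),
        }

cache = PrefixCache()
first = cache.process("You are a helpful assistant. " * 100, "Summarize this.")
second = cache.process("You are a helpful assistant. " * 100, "Translate this.")
print(first["cache_hit"], second["cache_hit"])  # False True
```

The second request reprocesses only the two-word suffix; the 500-token prefix is served from the cache.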
Provider Caching Comparison
| Provider | Cache Discount | Min Cache Size | Cache Duration | Implementation |
|---|---|---|---|---|
| Anthropic | 90% off input | 1,024 tokens | 5 min (auto-extend on use) | Explicit cache_control blocks |
| OpenAI | 50% off input | 1,024 tokens | Automatic | Automatic (no code changes) |
| Google (Gemini) | ~75% off input | 32,768 tokens | Configurable | CachedContent API |
Discounts and limits are approximate; last updated April 2026.
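For Anthropic's explicit approach, caching is opted into per content block. The request body below follows the documented `cache_control` format (the model ID and prompt text are illustrative; OpenAI needs no equivalent marker since its caching is automatic for 1,024+ token prefixes):

```python
# Anthropic Messages API request body with an explicit cache_control
# block. Everything up to and including the marked block is cacheable.
LONG_SYSTEM_PROMPT = "You are a support agent for ExampleCo. ..."  # assume 1,024+ tokens

request_body = {
    "model": "claude-sonnet-4-20250514",  # illustrative model ID
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks this prefix for caching (5-minute ephemeral TTL).
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
```

Subsequent requests whose system block matches byte-for-byte read the prefix from cache at the discounted rate.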
Real-World Savings Examples
Consider an application that sends a 4,000-token system prompt with every request, processing 100,000 requests per month with an additional 1,000 variable tokens per request:
| Scenario | Input Cost (Claude Sonnet 4) | Savings |
|---|---|---|
| No caching (500M tokens) | $1,500 | — |
| With caching (400M cached + 100M fresh) | $420 | 72% |
The 400M cached tokens cost only $0.30/1M instead of $3.00/1M (Anthropic's 90% discount on cache reads), while the 100M variable tokens pay full price. Total savings: $1,080 per month from this single optimization. (Cache writes carry a small one-time premium, omitted here for simplicity.)
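The table's arithmetic can be checked directly (Claude Sonnet 4 pricing of $3.00/1M fresh and $0.30/1M cached is taken from above; cache-write premiums are ignored):

```python
# Reproduce the savings table: 100k requests/month, 4,000-token cached
# system prompt plus 1,000 variable tokens per request.
BASE = 3.00 / 1_000_000    # $ per fresh input token
CACHED = 0.30 / 1_000_000  # $ per cached input token (90% off)

requests = 100_000
system_tokens, variable_tokens = 4_000, 1_000

no_cache = requests * (system_tokens + variable_tokens) * BASE
with_cache = requests * (system_tokens * CACHED + variable_tokens * BASE)
savings = 1 - with_cache / no_cache

print(f"${no_cache:,.0f} vs ${with_cache:,.0f} ({savings:.0%} saved)")
# $1,500 vs $420 (72% saved)
```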
Best Practices for Prompt Caching
- Put cacheable content first: System prompts, instructions, and reference documents should be at the beginning of your prompt since caching works on prefixes.
- Maximize the cached prefix: The more tokens you can cache, the bigger your savings. If you reference the same documents repeatedly, include them in the cached prefix.
- Keep cache warm: Most providers expire caches after minutes of inactivity. Ensure your request patterns keep the cache alive, or schedule periodic warm-up requests.
- Monitor cache hit rates: Track what percentage of your input tokens are served from cache. Aim for 60%+ cache hit rates for meaningful savings.
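A hit rate like the one in the last bullet can be computed from per-request usage counts. The field names below mirror Anthropic's `usage` object (`input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`); other providers report equivalents under different names:

```python
# Fraction of all input tokens served from cache across a batch of requests.
def cache_hit_rate(usages: list[dict]) -> float:
    cached = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = sum(
        u.get("input_tokens", 0)
        + u.get("cache_read_input_tokens", 0)
        + u.get("cache_creation_input_tokens", 0)
        for u in usages
    )
    return cached / total if total else 0.0

usages = [
    {"input_tokens": 1000, "cache_creation_input_tokens": 4000},  # cold start
    {"input_tokens": 1000, "cache_read_input_tokens": 4000},
    {"input_tokens": 1000, "cache_read_input_tokens": 4000},
]
print(f"{cache_hit_rate(usages):.0%}")  # 53%
```

The cold-start write drags the rate down; at steady state this workload's hit rate approaches 80% (4,000 of every 5,000 tokens cached).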
Combining Caching with Hybrid Routing
Prompt caching and hybrid routing are complementary optimizations. Token Landing applies caching automatically where supported and routes each request to the most cost-effective model for its task. The combined effect can reduce your effective per-token costs by 70-90% compared to uncached, single-model usage.
For workloads with repetitive prompts and mixed task complexity, this combination delivers the most significant savings available in the current LLM API market.
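A generic sketch of how the two levers stack (this is not Token Landing's actual implementation; the prices, model names, and routing signal are all assumptions for illustration):

```python
# Combine routing (pick a cheaper model for simple tasks) with cached
# pricing for the shared prefix. Prices are assumed $/1M input tokens.
PRICES = {  # model: (fresh rate, cached rate)
    "cheap-model": (0.25, 0.025),
    "capable-model": (3.00, 0.30),
}

def route(complex_task: bool) -> str:
    """Crude routing signal: send only complex tasks to the capable model."""
    return "capable-model" if complex_task else "cheap-model"

def request_cost(model: str, cached_tokens: int, fresh_tokens: int) -> float:
    fresh_rate, cached_rate = PRICES[model]
    return (cached_tokens * cached_rate + fresh_tokens * fresh_rate) / 1_000_000

# A simple task with a warm 4,000-token cached prefix and 1,000 fresh tokens:
model = route(complex_task=False)
cost = request_cost(model, cached_tokens=4_000, fresh_tokens=1_000)
baseline = request_cost("capable-model", cached_tokens=0, fresh_tokens=5_000)
print(f"{model}: ${cost:.6f} vs uncached capable-model ${baseline:.6f}")
```

Under these assumed prices the routed, cached request costs a small fraction of the uncached baseline; actual savings depend on the mix of task complexity and cache hit rate.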