How Prompt Caching Cuts Your API Costs: 90% Off Cached Input Tokens

What is prompt caching?

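Prompt caching stores the beginning of your prompt (typically the system prompt) on the API server after the first request. Later requests that start with the identical prefix reuse the cached copy, and those cached input tokens are billed at 10% of the normal input price, a 90% discount. For example, a 2,000-token system prompt sent 100 times a day adds up to 200,000 input tokens. At Claude Sonnet 4's list price of $3 per million input tokens, that costs $0.60/day, so caching saves at most about $0.54/day on the system prompt, slightly less once the first request's cache-write premium is included.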
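The arithmetic for a 2,000-token system prompt sent 100 times a day can be checked with a minimal sketch. It assumes Anthropic's published Claude Sonnet 4 rates ($3 per million input tokens, with 5-minute cache writes at 1.25x and cache reads at 0.1x that rate), a cache that stays warm all day so only the first request pays the write price, and it counts system-prompt tokens only. The function name is illustrative, not part of any SDK.

```python
# Assumed price multipliers relative to the base input price (5-minute cache).
CACHE_WRITE_MULTIPLIER = 1.25
CACHE_READ_MULTIPLIER = 0.10


def system_prompt_cost_per_day(
    prompt_tokens: int,
    requests_per_day: int,
    price_per_mtok: float,
    use_cache: bool,
) -> float:
    """Daily cost (USD) of sending the system prompt; ignores user and output tokens."""
    per_request = prompt_tokens * price_per_mtok / 1_000_000
    if not use_cache or requests_per_day == 0:
        return per_request * requests_per_day
    # Assumes one cache write on the first request, then cache hits all day.
    write = per_request * CACHE_WRITE_MULTIPLIER
    reads = per_request * CACHE_READ_MULTIPLIER * (requests_per_day - 1)
    return write + reads


without = system_prompt_cost_per_day(2_000, 100, 3.00, use_cache=False)
with_cache = system_prompt_cost_per_day(2_000, 100, 3.00, use_cache=True)
print(f"without: ${without:.4f}  with: ${with_cache:.4f}  saved: ${without - with_cache:.4f}")
```

Under these assumptions the system prompt costs about $0.60/day uncached versus roughly $0.067/day cached, a saving of about $0.53/day.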

How it works

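First request: The prompt prefix is sent and written to the cache server-side. Cache writes cost 25% more than normal input tokens (1.25x the base input price)
Next requests (within 5 min of the last use): Cached tokens are billed at 10% of the base input price, and each hit restarts the 5-minute timer. The uncached part, such as the new user message, is charged at the normal input rate, and output tokens are priced as usual
After 5 min idle: The cache expires, and the next request writes it again at the cache-write price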
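To make the lifecycle concrete, here is a small simulation that labels each request in a timeline as a cache write or a cache read. It models the 5-minute idle expiry described above as a 300-second TTL that restarts whenever the cache is used; it does not model anything else about server-side eviction, and `classify_requests` is a hypothetical helper.

```python
from typing import List, Optional

TTL_SECONDS = 300.0  # 5-minute lifetime, restarted every time the cache is used


def classify_requests(timestamps: List[float], ttl: float = TTL_SECONDS) -> List[str]:
    """Label each request (ascending times in seconds) as a cache 'write' or 'read'."""
    labels: List[str] = []
    last_use: Optional[float] = None
    for t in timestamps:
        if last_use is None or t - last_use > ttl:
            labels.append("write")  # no live cache entry: the prefix is cached again
        else:
            labels.append("read")   # within the TTL of the last use: cache hit
        last_use = t  # writes and hits both restart the timer
    return labels


# Requests at 0 s, 2 min, 6 min, and 16 min.
print(classify_requests([0, 120, 360, 960]))
```

This prints `['write', 'read', 'read', 'write']`: the 6-minute request still hits because the 2-minute request reset the timer, while the 10-minute gap before the last request lets the cache expire.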

Python implementation (Anthropic SDK)

Python

import anthropic

client = anthropic.Anthropic(
    api_key="izzi-YOUR_KEY_HERE",
    base_url="https://api.izziapi.com/anthropic"
)

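# Static system prompt, marked for caching below. Note: on Claude Sonnet 4 the cached
# prefix must be at least 1,024 tokens; this sample is far shorter, so extend it with
# real guidelines or examples before expecting cache hits.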
SYSTEM_PROMPT = """You are an expert Python code reviewer specializing in:
1. Security: SQL injection, XSS, SSRF, path traversal
2. Performance: N+1 queries, missing indexes, blocking calls
3. Best practices: Type hints, error handling, testing patterns
4. Architecture: SOLID principles, dependency injection, clean code

When reviewing, provide:
- Severity (Critical/Major/Minor/Nitpick)
- Line number
- Current code
- Suggested fix
- Explanation"""

def review_code(code: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=3000,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{
            "role": "user",
            "content": f"Review this code:\n\n{code}"
        }]
    )
    
    # Log cache performance (the SDK may report these fields as None, so default to 0)
    usage = response.usage
    cached = getattr(usage, 'cache_read_input_tokens', 0) or 0
    created = getattr(usage, 'cache_creation_input_tokens', 0) or 0

    if cached > 0:
        # 90% of $3.00 per million input tokens (Claude Sonnet 4 list price)
        savings = cached * 3.00 / 1_000_000 * 0.9
        print(f"💰 Cache HIT: {cached} tokens read from cache, saved ${savings:.4f}")
    elif created > 0:
        print(f"📝 Cache CREATED: {created} tokens (first call, billed at 1.25x input price)")
    
    return response.content[0].text
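The logging above only estimates savings from cache reads. A fuller per-request cost can be computed from the four usage counters the Messages API returns (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`, `output_tokens`); in Anthropic's accounting, `input_tokens` counts only the uncached input. This sketch assumes Claude Sonnet 4 list prices ($3 per million input tokens, $15 per million output tokens, 1.25x for cache writes, 0.1x for cache reads), so substitute your provider's rates. It accepts any object with those attributes, such as `response.usage` or the hypothetical `SimpleNamespace` stand-in below.

```python
from types import SimpleNamespace

# Assumed Claude Sonnet 4 list prices, USD per million tokens.
INPUT_PER_MTOK = 3.00
OUTPUT_PER_MTOK = 15.00
CACHE_WRITE_PER_MTOK = INPUT_PER_MTOK * 1.25
CACHE_READ_PER_MTOK = INPUT_PER_MTOK * 0.10


def request_cost(usage) -> float:
    """Dollar cost of one request, computed from a Messages API usage object."""
    # Cache fields can be missing or None, so coerce everything to 0.
    uncached = getattr(usage, "input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    output = getattr(usage, "output_tokens", 0) or 0
    return (
        uncached * INPUT_PER_MTOK
        + written * CACHE_WRITE_PER_MTOK
        + read * CACHE_READ_PER_MTOK
        + output * OUTPUT_PER_MTOK
    ) / 1_000_000


# Hypothetical cache hit: 2,000 cached prompt tokens, 1,000 new input, 800 output.
hit = SimpleNamespace(input_tokens=1_000, cache_creation_input_tokens=0,
                      cache_read_input_tokens=2_000, output_tokens=800)
print(f"${request_cost(hit):.4f}")
```

In this made-up example the output tokens account for most of the cost, which is worth keeping in mind when comparing cached and uncached totals.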

Node.js implementation (Anthropic SDK)

TypeScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: "izzi-YOUR_KEY_HERE",
  baseURL: "https://api.izziapi.com/anthropic",
});

const SYSTEM_PROMPT = `You are a senior code reviewer...`; // Use the full prompt from the Python example (1,024+ tokens to be cacheable)

async function reviewCode(code: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 3000,
    system: [{
      type: "text",
      text: SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    }],
    messages: [{
      role: "user",
      content: `Review this code:\n\n${code}`,
    }],
  });

  return response.content[0].type === "text" ? response.content[0].text : "";
}

Cost comparison: with vs. without caching

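Scenario: 2,000-token system prompt, 500 requests/day, Claude Sonnet 4 on Izzi API. The figures imply an input price of $2.10 per million tokens and user messages of about 1,000 tokens. The table counts input tokens only; output tokens cost the same with or without caching and are not included: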

Metric	Without caching	With caching
System prompt cost/day	$2.10	$0.21
User message cost/day	$1.05	$1.05
Total/day	$3.15	$1.26
Total/month (22 days)	$69.30	$27.72
Monthly savings	—	$41.58 (60%)
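The table can be reproduced with a few lines of arithmetic. The rates here are inferences, not quoted prices: $2.10/day for 1,000,000 system-prompt tokens (2,000 tokens × 500 requests) implies $2.10 per million input tokens, and $1.05/day of user messages implies about 1,000 tokens per message. Like the table, the sketch counts input tokens only, prices every cached request as a 10%-rate read, and ignores the cache-write premium and output tokens.

```python
PRICE_PER_MTOK = 2.10     # implied by the table, USD per million input tokens
SYSTEM_TOKENS = 2_000
USER_TOKENS = 1_000       # implied by the $1.05/day user-message row
REQUESTS_PER_DAY = 500
WORKDAYS_PER_MONTH = 22
CACHE_READ_MULTIPLIER = 0.10


def daily_input_cost(cached: bool) -> float:
    """Input-token cost per day; ignores cache writes and output tokens."""
    system_rate = PRICE_PER_MTOK * (CACHE_READ_MULTIPLIER if cached else 1.0)
    system = SYSTEM_TOKENS * REQUESTS_PER_DAY * system_rate / 1_000_000
    user = USER_TOKENS * REQUESTS_PER_DAY * PRICE_PER_MTOK / 1_000_000
    return system + user


without = daily_input_cost(cached=False)
with_cache = daily_input_cost(cached=True)
monthly_savings = (without - with_cache) * WORKDAYS_PER_MONTH
print(f"per day: ${without:.2f} vs ${with_cache:.2f}")
print(f"per month: ${without * WORKDAYS_PER_MONTH:.2f} vs ${with_cache * WORKDAYS_PER_MONTH:.2f}")
print(f"savings: ${monthly_savings:.2f} ({1 - with_cache / without:.0%})")
```

Because output tokens are billed identically in both columns, adding them would raise both totals by the same amount and lower the percentage saving, even though the dollar saving stays the same.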

Best practices

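Put static content first: System prompt, few-shot examples, and documentation go before anything that changes, because the cache only matches an identical prefix
Keep the cache warm: Make a request at least every 5 minutes during active use; each cache hit resets the timer
Meet the minimum length: Claude Sonnet 4 caches a prefix only if it is at least 1,024 tokens; shorter prompts are processed normally but not cached
Don't cache the per-request user message: It changes every time, so its cache entry would never be read and you would pay the write premium for nothing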
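A sketch of the first and third practices together: a hypothetical `build_request` helper that places all static blocks ahead of the per-request message, puts `cache_control` on the last static block only, and warns when the cached prefix is probably below the 1,024-token minimum. The token count is a rough assumption (about four characters per token for English text), not a real tokenizer; Anthropic's token-counting endpoint gives exact numbers. The returned dict mirrors the keyword arguments passed to `client.messages.create` in the Python example.

```python
import warnings

MIN_CACHEABLE_TOKENS = 1_024  # Claude Sonnet 4 minimum cacheable prefix
CHARS_PER_TOKEN = 4           # rough heuristic, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def build_request(static_blocks: list, user_message: str, model: str, max_tokens: int) -> dict:
    """Build Messages API kwargs with all static text first and one cache breakpoint."""
    if not static_blocks:
        raise ValueError("need at least one static block to cache")
    system = [{"type": "text", "text": block} for block in static_blocks]
    # The breakpoint caches everything up to and including this last static block.
    system[-1]["cache_control"] = {"type": "ephemeral"}

    prefix_tokens = estimate_tokens("".join(static_blocks))
    if prefix_tokens < MIN_CACHEABLE_TOKENS:
        warnings.warn(
            f"static prefix is ~{prefix_tokens} tokens; below {MIN_CACHEABLE_TOKENS} it will not be cached"
        )
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system,
        # The per-request message comes last and carries no cache_control.
        "messages": [{"role": "user", "content": user_message}],
    }


# Hypothetical stand-in for a long static document such as review guidelines.
LONG_GUIDELINES = "Review guidelines: " + "x" * 5_000

request = build_request(
    ["You are an expert Python code reviewer...", LONG_GUIDELINES],
    "Review this code:\n\nprint('hi')",
    model="claude-sonnet-4-20250514",
    max_tokens=3000,
)
print(request["system"][-1]["cache_control"])
```

Unpacking the result as `client.messages.create(**request)` would send the same request shape as the Python example, with the cache breakpoint placed after all static content.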