What is prompt caching?
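Prompt caching stores your system prompt on the API server after the first request. Subsequent requests reuse the cached version at a 90% discount on the cached input tokens. If your system prompt is 2,000 tokens and you make 100 requests/day, that is 200,000 system-prompt tokens per day. At Anthropic's list price of $3 per million input tokens for Claude Sonnet 4, those tokens cost $0.60/day uncached, so caching saves up to about $0.54/day when every request hits the cache.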
How it works
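- First request: The system prompt is sent and cached server-side. Under Anthropic's pricing, a cache write is billed at 1.25× the normal input price, so this request costs slightly more than an uncached one
- Next requests (within 5 min): 90% discount on the cached system prompt. Only the user message is charged at full price
- After 5 min idle: Cache expires. The next request recreates it and pays the cache-write price again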
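The lifecycle above can be sketched as a small simulation. This is a minimal illustration, not the server's actual logic: the function name `classify_requests` is hypothetical, and it assumes that every use of the cache (write or hit) restarts the 5-minute countdown.

```python
CACHE_TTL_SECONDS = 5 * 60  # the 5-minute idle lifetime described above


def classify_requests(timestamps, ttl=CACHE_TTL_SECONDS):
    """Label each request as a cache "write" or a cache "hit".

    timestamps: request times in seconds, in ascending order.
    Assumption: each write or hit restarts the TTL countdown.
    """
    labels = []
    last_use = None  # time of the most recent request that touched the cache
    for t in timestamps:
        if last_use is not None and t - last_use < ttl:
            labels.append("hit")    # cache still alive: discounted read
        else:
            labels.append("write")  # no cache yet, or it expired: recreate it
        last_use = t
    return labels


# Requests at 0s, 60s and 200s share one cache; the 320s gap before 520s lets it expire.
print(classify_requests([0, 60, 200, 520, 600]))
```

Running requests through a model like this before deployment gives a rough idea of how many calls will pay the write price versus the discounted read price for a given traffic pattern.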
Python implementation (Anthropic SDK)
import anthropic
client = anthropic.Anthropic(
    api_key="izzi-YOUR_KEY_HERE",
    base_url="https://api.izziapi.com/anthropic"
)
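# Define your system prompt. This is the part that gets cached.
# Note: this sample is far below the 1,024-token caching minimum; a real prompt must be longer to be cached.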
SYSTEM_PROMPT = """You are an expert Python code reviewer specializing in:
1. Security: SQL injection, XSS, SSRF, path traversal
2. Performance: N+1 queries, missing indexes, blocking calls
3. Best practices: Type hints, error handling, testing patterns
4. Architecture: SOLID principles, dependency injection, clean code
When reviewing, provide:
- Severity (Critical/Major/Minor/Nitpick)
- Line number
- Current code
- Suggested fix
- Explanation"""
def review_code(code: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=3000,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{
            "role": "user",
            "content": f"Review this code:\n\n{code}"
        }]
    )
    # Log cache performance (the fields may be missing or None, so fall back to 0)
    usage = response.usage
    cached = getattr(usage, 'cache_read_input_tokens', 0) or 0
    created = getattr(usage, 'cache_creation_input_tokens', 0) or 0
    if cached > 0:
        # Assumes $3 per million input tokens; cached reads save 90% of that
        savings = cached * 3.00 * 0.9 / 1_000_000
        print(f"💰 Cache HIT: {cached} tokens cached, saved ${savings:.4f}")
    elif created > 0:
        print(f"📝 Cache CREATED: {created} tokens (first call, cache write)")
    return response.content[0].text
Node.js implementation (Anthropic SDK)
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({
  apiKey: "izzi-YOUR_KEY_HERE",
  baseURL: "https://api.izziapi.com/anthropic",
});
const SYSTEM_PROMPT = `You are a senior code reviewer...`; // Same as above
async function reviewCode(code: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 3000,
    system: [{
      type: "text",
      text: SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    }],
    messages: [{
      role: "user",
      content: `Review this code:\n\n${code}`,
    }],
  });
  return response.content[0].type === "text" ? response.content[0].text : "";
}
Cost comparison: with vs. without caching
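Scenario: 2,000-token system prompt, 500 requests/day, Claude Sonnet 4 on Izzi API. The figures correspond to an input rate of $2.10 per million tokens and about 1,000 user-message tokens per request, and they assume every request is a cache hit: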
| Metric | Without caching | With caching |
|---|---|---|
| System prompt cost/day | $2.10 | $0.21 |
| User message cost/day | $1.05 | $1.05 |
| Total/day | $3.15 | $1.26 |
| Total/month (22 days) | $69.30 | $27.72 |
| Monthly savings | — | $41.58 (60%) |
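The table's arithmetic can be reproduced with a short calculation. This is a sketch under stated assumptions: the $2.10 per million tokens rate and the 1,000 user-message tokens per request are inferred from the table's own figures rather than quoted from a price list, every request is treated as a cache hit, and the one-time cache-write surcharge is ignored. The function name `daily_cost` is hypothetical.

```python
def daily_cost(system_tokens, user_tokens, requests_per_day,
               price_per_mtok, cached, cache_read_multiplier=0.1):
    """Return the daily input-token cost in dollars, rounded to cents.

    cache_read_multiplier=0.1 models the 90% discount on cached tokens.
    Assumption: every request hits the cache; cache writes are ignored.
    """
    system_cost = system_tokens * requests_per_day * price_per_mtok / 1_000_000
    user_cost = user_tokens * requests_per_day * price_per_mtok / 1_000_000
    if cached:
        system_cost *= cache_read_multiplier  # only the system prompt is discounted
    return round(system_cost + user_cost, 2)


# Inputs inferred from the table: $2.10/MTok and 1,000 user tokens per request.
without = daily_cost(2_000, 1_000, 500, 2.10, cached=False)
with_cache = daily_cost(2_000, 1_000, 500, 2.10, cached=True)
monthly_savings = round((without - with_cache) * 22, 2)  # 22 working days

print(f"Without caching: ${without:.2f}/day")
print(f"With caching:    ${with_cache:.2f}/day")
print(f"Monthly savings: ${monthly_savings:.2f}")
```

Changing the token counts or the rate shows how the savings scale: the larger the system prompt is relative to the user message, the closer the total saving gets to the full 90%.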
Best practices
- Put static content first: System prompt, few-shot examples, documentation
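- Keep cache warm: Make requests at least every 5 minutes during active use; each cache hit restarts the 5-minute timer
- Minimum 1,024 tokens: Anthropic requires at least 1,024 tokens for caching on Claude Sonnet models; shorter prompts are processed without caching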
- Don't cache user messages: They change on every request, so a cached copy would never be reused
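The "static content first" rule can be illustrated with a small helper that builds the `system` list for a request. This is a minimal sketch: the helper name `build_system_blocks` and its parameters are hypothetical, and it assumes that `cache_control` on a block marks the end of the cacheable prefix, so everything up to and including that block is cached and anything after it is not.

```python
from typing import Optional


def build_system_blocks(static_parts, dynamic_part: Optional[str] = None):
    """Build a `system` list with static content first and one cache breakpoint.

    static_parts: strings that are identical on every request
                  (instructions, few-shot examples, documentation).
    dynamic_part: optional per-request text, placed after the breakpoint
                  so that it does not invalidate the cached prefix.
    """
    blocks = [{"type": "text", "text": part} for part in static_parts]
    if blocks:
        # Mark the last static block: the prefix up to here is cacheable.
        blocks[-1]["cache_control"] = {"type": "ephemeral"}
    if dynamic_part:
        blocks.append({"type": "text", "text": dynamic_part})
    return blocks


blocks = build_system_blocks(
    ["You are an expert Python code reviewer...", "Example reviews: ..."],
    dynamic_part="Today's focus: security issues only.",
)
for block in blocks:
    print(block)
```

The returned list can be passed as the `system` argument of `client.messages.create`. Keeping the per-request text last means a change to it leaves the cached prefix untouched.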
