Caching
Reduce costs and latency by caching identical requests.
LLM Gateway provides intelligent response caching that can significantly reduce your API costs and response latency. When caching is enabled, identical requests are served from the cache instead of triggering redundant calls to LLM providers.
How It Works
When you make an API request:
- LLM Gateway generates a cache key based on the request parameters
- If a matching cached response exists, it's returned immediately
- If no cache exists, the request is forwarded to the provider
- The response is cached for future identical requests
This means repeated identical requests are served instantly from cache without incurring additional provider costs.
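The flow above can be exercised with any OpenAI-compatible client pointed at the gateway. The sketch below is illustrative only: the SDK choice, base URL, and environment variable names are assumptions, not documented values.

```ts
import OpenAI from "openai";

// Assumed setup: the standard OpenAI SDK pointed at the gateway.
// The baseURL and API key environment variables are placeholders.
const client = new OpenAI({
  baseURL: process.env.LLM_GATEWAY_BASE_URL,
  apiKey: process.env.LLM_GATEWAY_API_KEY,
});

// The gateway derives a cache key from these parameters. The first time
// this exact request is sent it misses the cache and is forwarded to the
// provider; any later identical request is answered from the cache.
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain quantum computing" }],
  temperature: 0,
});
```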
Cost Savings
Caching can dramatically reduce costs for applications with repetitive requests:
| Scenario | Without Caching | With Caching | Savings |
|---|---|---|---|
| 1,000 identical requests | $10.00 | $0.01 | 99.9% |
| 50% duplicate rate | $10.00 | $5.00 | 50% |
| Retry after transient error | $0.02 | $0.01 | 50% |
Cached responses incur no provider costs; you only pay for the initial request that populates the cache.
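For a concrete sense of the first row, here is the arithmetic, assuming an illustrative price of $0.01 per request (not a quoted rate):

```ts
// Illustrative arithmetic only; $0.01 per request is an assumed price.
const costPerRequest = 0.01;
const totalRequests = 1_000;

const withoutCaching = totalRequests * costPerRequest; // $10.00
const withCaching = 1 * costPerRequest;                // $0.01: only the first request reaches the provider
const savings = 1 - withCaching / withoutCaching;      // 0.999, i.e. 99.9%
```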
Requirements
Caching requires Data Retention to be enabled at the "Retain All Data" level. This allows LLM Gateway to store and retrieve response payloads.
To use caching:
- Enable Data Retention in your organization settings at the "Retain All Data" level
- Enable Caching in your project settings under Preferences
- Configure the cache duration (TTL) as needed
- Make requests as normal—caching is automatic
Cache Key Generation
The cache key is generated from these request parameters:
- Model identifier
- Messages array (roles and content)
- Temperature
- Max tokens
- Top P
- Tools/functions
- Tool choice
- Response format
- System prompt
- Other model-specific parameters
Requests whose parameter values differ, even slightly, will not share a cache entry.
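The exact key derivation is internal to LLM Gateway, but conceptually it behaves like hashing a canonical serialization of the parameters above. The function below is a rough sketch of that idea, not the gateway's actual algorithm:

```ts
import { createHash } from "node:crypto";

// Conceptual sketch: hash a fixed-order serialization of the
// cache-relevant parameters so identical requests map to the same key.
function cacheKey(req: {
  model: string;
  messages: { role: string; content: string }[];
  temperature?: number;
  max_tokens?: number;
  top_p?: number;
  tools?: unknown;
  tool_choice?: unknown;
  response_format?: unknown;
}): string {
  const canonical = JSON.stringify({
    model: req.model,
    messages: req.messages,
    temperature: req.temperature,
    max_tokens: req.max_tokens,
    top_p: req.top_p,
    tools: req.tools,
    tool_choice: req.tool_choice,
    response_format: req.response_format,
  });
  return createHash("sha256").update(canonical).digest("hex");
}
```

Because the key covers every listed parameter, changing any one of them, even an extra space in a message, produces a different key and therefore a cache miss.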
Cache Behavior
Cache Hits
When a cache hit occurs:
- Response is returned immediately (sub-millisecond latency)
- No provider API call is made
- No inference costs are incurred
Cache Misses
When a cache miss occurs:
- Request is forwarded to the LLM provider
- Response is stored in cache
- Normal inference costs apply
- Future identical requests will hit the cache (see the sketch below)
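Continuing with the assumed client from the earlier sketch, sending the same request twice makes the difference visible in the reported usage:

```ts
// Same parameters sent twice: the first call misses, the second hits.
const params = {
  model: "gpt-4o",
  messages: [{ role: "user" as const, content: "What is your return policy?" }],
  temperature: 0,
};

const miss = await client.chat.completions.create(params); // forwarded to the provider, then cached
const hit = await client.chat.completions.create(params);  // served from the cache

// Cached responses report zero usage (see Identifying Cached Responses below).
console.log(miss.usage?.total_tokens); // some positive token count
console.log(hit.usage?.total_tokens);  // 0
```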
Streaming and Caching
Caching works with both streaming and non-streaming requests:
- Non-streaming: Full response is cached and returned
- Streaming: The complete response is reconstructed from cache and streamed back (see the example below)
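For example, a streaming request looks the same to the caller whether the chunks come from the provider or are replayed from the cache (again assuming the client setup shown earlier):

```ts
// Streamed completions can also be served from the cache; a cached
// response is replayed to the caller as a stream of chunks.
const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain quantum computing" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```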
Cache TTL (Time-to-Live)
Cache duration is configurable per project in your project settings. You can set the cache TTL from 10 seconds up to 1 year (31,536,000 seconds).
The default cache duration is 60 seconds. Adjust this based on your use case—longer durations work well for static content, while shorter durations are better for frequently changing data.
Identifying Cached Responses
Cached responses show zero or minimal token usage since no inference occurred:
```json
{
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0,
    "cost_usd_total": 0
  }
}
```

Use Cases
Development and Testing
During development, you often send the same prompts repeatedly:
```ts
// This prompt will only incur costs once
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain quantum computing" }],
});
```

Chatbots with Common Questions
FAQ-style interactions often have repeated questions:
```ts
// Common questions are served from cache after the first request
const faqs = [
  "What are your business hours?",
  "How do I reset my password?",
  "What is your return policy?",
];
for (const question of faqs) {
  const answer = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: question }],
  });
}
```

Batch Processing
Processing large datasets with potentially duplicate items:
```ts
// Duplicate items in batch are served from cache
for (const item of items) {
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: `Classify: ${item}` }],
  });
}
```

Best Practices
Maximize Cache Hits
- Use consistent prompt formatting
- Normalize input data before sending (one approach is sketched below)
- Use deterministic parameters (temperature: 0)
- Avoid including timestamps or random values in prompts
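As an illustration of the first two points, a small client-side normalization step keeps semantically identical inputs byte-identical, so they share a cache entry. The helper below is a sketch, not part of LLM Gateway:

```ts
// Sketch: trim and collapse whitespace so equivalent inputs produce
// the exact same request body (and therefore the same cache key).
function normalizePrompt(input: string): string {
  return input.trim().replace(/\s+/g, " ");
}

const userInput = "  How do I   reset my password? ";

const response = await client.chat.completions.create({
  model: "gpt-4o",
  temperature: 0, // deterministic parameters, as recommended above
  messages: [{ role: "user", content: normalizePrompt(userInput) }],
});
```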
Appropriate Use Cases
Caching is most effective for:
- Static knowledge queries
- Classification tasks
- FAQ responses
- Development/testing
- Retry scenarios
When to Avoid Caching
Caching may not be suitable for:
- Real-time data requirements
- Highly personalized responses
- Time-sensitive information
- Creative tasks requiring variety
Storage Costs
Since caching requires data retention, storage costs apply:
- Rate: $0.01 per 1 million tokens
- Applies to: All tokens in cached requests and responses
See Data Retention for complete pricing details.
The cost savings from caching typically far outweigh the storage costs, especially for applications with high request duplication.
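As a rough comparison with assumed numbers (2,000 tokens per request/response pair and $0.01 of inference per request, neither of which is a quoted rate):

```ts
// Illustrative comparison only; token counts and per-request price are assumed.
const duplicateRequests = 10_000;
const tokensPerRequest = 2_000;

const storageCost =
  (duplicateRequests * tokensPerRequest / 1_000_000) * 0.01; // $0.20 at $0.01 per 1M tokens
const inferenceSaved = duplicateRequests * 0.01;             // $100.00 of avoided provider spend
```

Even with generous assumptions, retention adds cents while the avoided inference runs to dollars.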