Caching, Deduplication & Cost Optimization
Reduce AI generation costs by 30–50% with prompt hashing, request deduplication, output TTL strategies, and budget guardrails.
What you'll be able to build after this module
A multi-layer caching system that serves repeat requests from cache, deduplicates near-identical prompts, and sets hard daily budget limits — reducing your monthly AI spend significantly without degrading user experience.
The cost problem
Without caching, every user request hits the API:
500 users × 5 generations/day × $0.04/prediction = $100/day = $3,000/month
Many of those predictions are for similar or identical prompts. Real-world observation: 20–40% of prompts are duplicates or near-duplicates in most consumer apps.
With a caching layer:
$3,000/month × 0.35 (35% cache hit rate) = $1,050 saved per month
Layer 1: Exact prompt caching
Hash the entire generation config (prompt + model + parameters) and store the output. Identical requests never hit the API.
// lib/generation-cache.ts
import { createHash } from 'crypto';
import { Redis } from '@upstash/redis';
const redis = Redis.fromEnv();
const TTL_SECONDS = 86_400 * 7; // Cache outputs for 7 days
interface CacheKey {
model: string;
prompt: string;
width?: number;
height?: number;
seed?: number;
}
function hashKey(key: CacheKey): string {
// Sort keys for deterministic hashing
return createHash('sha256')
.update(JSON.stringify(key, Object.keys(key).sort()))
.digest('hex');
}
export async function getCached(key: CacheKey): Promise<string[] | null> {
return redis.get<string[]>(`gen:${hashKey(key)}`);
}
export async function setCached(key: CacheKey, output: string[]): Promise<void> {
await redis.set(`gen:${hashKey(key)}`, output, { ex: TTL_SECONDS });
}
// Usage in your API route
export async function generateWithCache(params: CacheKey): Promise<string[]> {
// 1. Check cache
const cached = await getCached(params);
if (cached) {
console.log('Cache hit — saved $0.04');
return cached;
}
// 2. Generate (`client` is your initialized Skytells SDK client)
const prediction = await client.predictions.create({
model: params.model,
input: { prompt: params.prompt, width: params.width, height: params.height, seed: params.seed },
});
// 3. Store in cache
await setCached(params, prediction.output!);
return prediction.output!;
}
Important: only cache requests that set an explicit seed. When seed is unset, two identical prompts are expected to produce different images, so serving a cached copy would silently change that behavior. For reproducible, cacheable generations, set a seed value and include it in the cache key (as above).
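A detail worth verifying: the sorted replacer array passed to `JSON.stringify` makes the hash independent of property order, so the same logical request always maps to the same cache key. A quick standalone check, using an illustrative flat key:

```typescript
import { createHash } from 'node:crypto';

// Same hashing scheme as above: sorted keys → deterministic JSON → SHA-256
function hashKey(key: Record<string, unknown>): string {
  return createHash('sha256')
    .update(JSON.stringify(key, Object.keys(key).sort()))
    .digest('hex');
}

// Identical parameters in a different property order hash to the same key
const a = hashKey({ model: 'truefusion-pro', prompt: 'a red fox', width: 1024 });
const b = hashKey({ width: 1024, prompt: 'a red fox', model: 'truefusion-pro' });
console.log(a === b); // true
```

One caveat: a replacer array filters keys at every nesting level, so keep the cache key object flat, as the `CacheKey` interface above already is.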
Layer 2: Request coalescing (deduplication in-flight)
When two users submit the exact same prompt at the same time, without coalescing you pay for both. With coalescing, the second request waits for the first to complete and reuses its result. Note that an in-process Map only deduplicates within a single server instance; on serverless platforms, each instance keeps its own map, so coalescing is best-effort there.
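The effect is easy to see in isolation. In this self-contained sketch (separate from the production module that follows), two concurrent callers with the same key trigger exactly one execution of the underlying function:

```typescript
// Minimal request coalescing: one in-flight promise per key
const inflight = new Map<string, Promise<string>>();

function coalesce(key: string, fn: () => Promise<string>): Promise<string> {
  const existing = inflight.get(key);
  if (existing) return existing; // later callers piggyback on the first
  const promise = fn().finally(() => inflight.delete(key));
  inflight.set(key, promise);
  return promise;
}

// Stand-in for an expensive generation call we only want to pay for once
let apiCalls = 0;
const fakeGenerate = (): Promise<string> =>
  new Promise((resolve) => {
    apiCalls++;
    setTimeout(() => resolve('https://example.com/image.png'), 10);
  });

async function demo(): Promise<{ sameResult: boolean; apiCalls: number }> {
  // Two "users" submit the same prompt concurrently
  const [a, b] = await Promise.all([
    coalesce('prompt-hash', fakeGenerate),
    coalesce('prompt-hash', fakeGenerate),
  ]);
  return { sameResult: a === b, apiCalls };
}
```

Running `demo()` resolves to `{ sameResult: true, apiCalls: 1 }`: the second caller never reaches the API.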
// lib/coalesce.ts
const inflightRequests = new Map<string, Promise<string[]>>();
export async function generateCoalesced(
cacheKey: string,
generateFn: () => Promise<string[]>,
): Promise<string[]> {
// If there's already an in-flight request for this key, wait for it
const existing = inflightRequests.get(cacheKey);
if (existing) {
console.log('Coalesced — piggyback on existing request');
return existing;
}
// Start a new request and register it
const promise = generateFn().finally(() => {
inflightRequests.delete(cacheKey); // clean up when done
});
inflightRequests.set(cacheKey, promise);
return promise;
}
Layer 3: CDN for output delivery
Skytells output URLs expire in 24 hours. Don't store them — store the outputs in your own CDN:
// lib/store-output.ts
import { put } from '@vercel/blob';
export async function storeAndGetPermanentUrl(
predictionId: string,
cdnUrl: string,
): Promise<string> {
// Download from Skytells CDN (expires in 24h)
const response = await fetch(cdnUrl);
if (!response.ok) throw new Error(`Failed to download output: ${response.status}`);
const buffer = await response.arrayBuffer();
// Upload to your permanent storage
const { url } = await put(`generations/${predictionId}.png`, buffer, {
access: 'public',
contentType: 'image/png',
});
return url; // This URL never expires
}
Skytells CDN URLs expire after 24 hours. If you store Skytells URLs in your database and show them to users days later, they'll get broken image links. Always re-host to your own storage before persisting the URL.
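If you are not on Vercel, any object store works. Here's a hedged sketch of the same download-and-persist flow writing to local disk; `downloadAndPersist` and the temp-directory layout are illustrative, and in production you would swap the `writeFile` call for your storage SDK's upload (S3, GCS, R2, etc.) and return the resulting public URL:

```typescript
import { writeFile } from 'node:fs/promises';
import { join } from 'node:path';
import { tmpdir } from 'node:os';

// Download an expiring output URL and persist the bytes somewhere durable.
// Writing to tmpdir() here is a stand-in for a real object-store upload.
async function downloadAndPersist(
  predictionId: string,
  sourceUrl: string,
  destDir: string = tmpdir(),
): Promise<string> {
  const response = await fetch(sourceUrl);
  if (!response.ok) {
    throw new Error(`Failed to download output: ${response.status}`);
  }
  const bytes = Buffer.from(await response.arrayBuffer());
  const path = join(destDir, `${predictionId}.png`);
  await writeFile(path, bytes);
  return path; // in production: the permanent public URL from your store
}
```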
Budget guardrails
Set a hard daily spending limit that blocks requests once reached:
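The essential pattern is a single counter per day, incremented before the limit check so two concurrent requests can never both pass a stale read. An in-memory sketch of those semantics (the Redis version that follows gets the same atomicity from `INCRBYFLOAT`):

```typescript
// In-memory sketch of a hard daily budget. In production the counter must
// live in shared storage (e.g. Redis) so every instance sees one total.
const spentByDay = new Map<string, number>();
const DAILY_LIMIT_USD = 50;

function recordSpend(costUsd: number, day: string): number {
  // Increment first, then check (race-free when the counter is atomic)
  const total = (spentByDay.get(day) ?? 0) + costUsd;
  spentByDay.set(day, total);
  if (total > DAILY_LIMIT_USD) {
    // Refund the rejected request so smaller ones can still fit today
    spentByDay.set(day, total - costUsd);
    throw new Error(`Daily budget exceeded ($${(total - costUsd).toFixed(2)} / $${DAILY_LIMIT_USD})`);
  }
  return total;
}
```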
// lib/budget.ts
import { Redis } from '@upstash/redis';
const redis = Redis.fromEnv();
const MODEL_COSTS: Record<string, number> = {
'truefusion-edge': 0.01,
'truefusion': 0.02,
'truefusion-pro': 0.04,
'truefusion-2.0': 0.06,
'truefusion-ultra': 0.08,
'beatfusion-2.0': 0.75,
'truefusion-video-pro': 1.50,
};
const DAILY_LIMIT_USD = 50;
function todayKey() { return `budget:${new Date().toISOString().split('T')[0]}`; }
export async function checkAndRecordBudget(model: string): Promise<void> {
const cost = MODEL_COSTS[model] ?? 0.04;
const key = todayKey();
// Increment first, then check: INCRBYFLOAT is atomic, so two concurrent
// requests can't both pass a check against a stale read
const total = Number(await redis.incrbyfloat(key, cost));
await redis.expire(key, 86_400 * 2);
if (total > DAILY_LIMIT_USD) {
// Refund the rejected request so smaller ones can still fit today
await redis.incrbyfloat(key, -cost);
throw new Error(
`Daily budget exceeded ($${(total - cost).toFixed(2)} / $${DAILY_LIMIT_USD}). Service paused until midnight UTC.`
);
}
}
Cost tracking dashboard
Track spending by model and user to identify where money is going:
interface SpendEvent {
timestamp: string;
model: string;
userId: string;
costUsd: number;
cacheHit: boolean;
predictionId?: string;
}
async function trackSpend(event: SpendEvent) {
// Append to your analytics store (Postgres, ClickHouse, BigQuery, etc.);
// `db` here is your ORM client (e.g. a Prisma client)
await db.spendEvents.create({ data: event });
// Or just log structured JSON for ingestion by Datadog/Logtail/CloudWatch
console.log(JSON.stringify({ event_type: 'generation_cost', ...event }));
}
Caching impact at scale
| Monthly volume | No cache | 30% hit rate | 50% hit rate |
|---|---|---|---|
| 10,000 predictions | $400 | $280 | $200 |
| 50,000 predictions | $2,000 | $1,400 | $1,000 |
| 200,000 predictions | $8,000 | $5,600 | $4,000 |
Assumes truefusion-pro at $0.04/prediction.
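The table comes from a one-line model: with hit rate h, you pay only for the (1 - h) fraction of predictions that miss the cache. A sketch reproducing the numbers above:

```typescript
// Monthly API cost after caching: only cache misses hit the paid API.
// Rounded to cents to keep floating-point noise out of the output.
function monthlyCost(predictions: number, costPerPrediction: number, hitRate: number): number {
  return Math.round(predictions * costPerPrediction * (1 - hitRate) * 100) / 100;
}

console.log(monthlyCost(10_000, 0.04, 0.3));  // 280
console.log(monthlyCost(200_000, 0.04, 0.5)); // 4000
```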
Summary
A well-implemented caching layer typically reduces AI generation costs by 30–50% with no visible impact on user experience.
The caching stack:
- Exact hash cache — Redis, 7-day TTL, covers identical prompts
- Request coalescing — prevents duplicate in-flight requests
- Permanent CDN storage — re-host Skytells outputs before storing URLs
- Budget guardrails — hard daily limit with Redis, checked before every prediction
- Spend tracking — structured events for attribution and optimization
Next: multi-model pipelines — chain image, video, and audio models for complex workflows.