Caching, Deduplication & Cost Optimization

Reduce AI generation costs by 60–80% with prompt hashing, semantic deduplication, output TTL strategies, and budget guardrails.

What you'll be able to build after this module

A multi-layer caching system that serves repeat requests from cache, deduplicates near-identical prompts, and sets hard daily budget limits — reducing your monthly AI spend significantly without degrading user experience.


The cost problem

Without caching, every user request hits the API:

500 users × 5 generations/day × $0.04/prediction = $100/day = $3,000/month

Many of those predictions are for similar or identical prompts. Real-world observation: 20–40% of prompts are duplicates or near-duplicates in most consumer apps.
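Exact-match caching (Layer 1 below) misses near-duplicates such as "A cat in space" vs "a cat in space.". A cheap first step toward the semantic deduplication mentioned above, short of comparing embeddings, is to normalize prompts before hashing so trivially different strings map to the same cache key. A minimal sketch (the function name and rules are illustrative, not from any SDK):

```typescript
// Normalize a prompt before it is hashed into a cache key, so that
// case, whitespace, and trailing-punctuation variants all collide.
// True semantic dedup would compare embeddings; this handles only
// the cheap, lossless cases.
function normalizePrompt(prompt: string): string {
  return prompt
    .trim()
    .toLowerCase()
    .replace(/\s+/g, ' ')     // collapse runs of whitespace
    .replace(/[.,!?]+$/, ''); // drop trailing punctuation
}
```

Apply it to `prompt` before building the cache key; because it is lossless for meaning, it only increases the hit rate.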

With a caching layer:

$3,000/month × 0.35 (35% cache hit rate) = $1,050 saved per month
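The arithmetic above can be captured in a small helper (illustrative; it assumes a cache hit costs nothing relative to a $0.04 prediction, which is effectively true for a Redis read):

```typescript
// Back-of-envelope savings model: monthly API spend times the fraction
// of requests served from cache, rounded to cents.
function estimateMonthlySavings(monthlySpendUsd: number, cacheHitRate: number): number {
  return Math.round(monthlySpendUsd * cacheHitRate * 100) / 100;
}
```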

Layer 1: Exact prompt caching

Hash the entire generation config (prompt + model + parameters) and store the output. Identical requests never hit the API.

// lib/generation-cache.ts
import { createHash } from 'crypto';
import { Redis } from '@upstash/redis';

const redis = Redis.fromEnv();
const TTL_SECONDS = 86_400 * 7; // Cache outputs for 7 days

interface CacheKey {
  model: string;
  prompt: string;
  width?: number;
  height?: number;
  seed?: number;
}

function hashKey(key: CacheKey): string {
  // Sort keys for deterministic hashing
  return createHash('sha256')
    .update(JSON.stringify(key, Object.keys(key).sort()))
    .digest('hex');
}

export async function getCached(key: CacheKey): Promise<string[] | null> {
  return redis.get<string[]>(`gen:${hashKey(key)}`);
}

export async function setCached(key: CacheKey, output: string[]): Promise<void> {
  await redis.set(`gen:${hashKey(key)}`, output, { ex: TTL_SECONDS });
}

// Usage in your API route
export async function generateWithCache(params: CacheKey): Promise<string[]> {
  // 1. Check cache
  const cached = await getCached(params);
  if (cached) {
    console.log('Cache hit — saved $0.04');
    return cached;
  }

  // 2. Generate on cache miss ("client" is your initialized Skytells SDK client).
  // Forward the seed too: it is part of the cache key, so it must also be
  // part of the request, or cached results won't match what the API returned.
  const prediction = await client.predictions.create({
    model: params.model,
    input: { prompt: params.prompt, width: params.width, height: params.height, seed: params.seed },
  });

  // 3. Store in cache
  await setCached(params, prediction.output!);
  return prediction.output!;
}

Layer 2: Request coalescing (deduplication in-flight)

When two users submit the exact same prompt at the same time, without coalescing you pay for both. With coalescing, the second request waits for the first to complete and reuses its result. Note that the in-memory Map below only deduplicates requests handled by the same process; in a serverless or multi-instance deployment, each instance coalesces independently, and cross-instance duplicates fall through to the exact cache instead.

// lib/coalesce.ts
const inflightRequests = new Map<string, Promise<string[]>>();

export async function generateCoalesced(
  cacheKey: string,
  generateFn: () => Promise<string[]>,
): Promise<string[]> {
  // If there's already an in-flight request for this key, wait for it
  const existing = inflightRequests.get(cacheKey);
  if (existing) {
    console.log('Coalesced — piggyback on existing request');
    return existing;
  }

  // Start a new request and register it
  const promise = generateFn().finally(() => {
    inflightRequests.delete(cacheKey); // clean up when done
  });

  inflightRequests.set(cacheKey, promise);
  return promise;
}
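Layers 1 and 2 compose naturally: check the cache, then the in-flight map, then call the API. A self-contained sketch with in-memory stand-ins for Redis and the model API (the fake generator and names are illustrative):

```typescript
// In-memory stand-ins for the Redis cache and the Skytells API.
const cache = new Map<string, string[]>();
const inflight = new Map<string, Promise<string[]>>();

let apiCalls = 0; // counts how many times the "API" is actually hit

async function fakeGenerate(prompt: string): Promise<string[]> {
  apiCalls++;
  await new Promise((res) => setTimeout(res, 10)); // simulate API latency
  return [`image-for:${prompt}`];
}

async function generate(prompt: string): Promise<string[]> {
  const hit = cache.get(prompt);
  if (hit) return hit; // Layer 1: exact cache

  const existing = inflight.get(prompt);
  if (existing) return existing; // Layer 2: coalesce concurrent duplicates

  const promise = fakeGenerate(prompt)
    .then((out) => {
      cache.set(prompt, out); // populate the cache for later requests
      return out;
    })
    .finally(() => inflight.delete(prompt)); // clean up the in-flight slot
  inflight.set(prompt, promise);
  return promise;
}
```

Two concurrent calls with the same prompt produce one API call; a third call after they settle is a pure cache hit.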

Layer 3: CDN for output delivery

Skytells output URLs expire in 24 hours. Don't store them — store the outputs in your own CDN:

// lib/store-output.ts
import { put } from '@vercel/blob';

export async function storeAndGetPermanentUrl(
  predictionId: string,
  cdnUrl: string,
): Promise<string> {
  // Download from Skytells CDN (expires in 24h)
  const response = await fetch(cdnUrl);
  const buffer = await response.arrayBuffer();

  // Upload to your permanent storage
  const { url } = await put(`generations/${predictionId}.png`, buffer, {
    access: 'public',
    contentType: 'image/png',
  });

  return url; // This URL never expires
}
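The snippet above hard-codes `image/png`, but the same re-hosting flow applies to video and audio outputs. A hypothetical helper (the extension-to-MIME mapping is an assumption; adjust it for the models you actually call) picks the extension and content type from the output URL:

```typescript
// Hypothetical helper: derive a file extension and MIME type from the
// output URL instead of hard-coding image/png, so video and audio
// outputs are re-hosted correctly too.
function contentTypeFor(url: string): { ext: string; mime: string } {
  const path = new URL(url).pathname.toLowerCase();
  if (path.endsWith('.mp4')) return { ext: 'mp4', mime: 'video/mp4' };
  if (path.endsWith('.mp3')) return { ext: 'mp3', mime: 'audio/mpeg' };
  if (path.endsWith('.webp')) return { ext: 'webp', mime: 'image/webp' };
  return { ext: 'png', mime: 'image/png' }; // default for image outputs
}
```

Use it to build both the blob path (`generations/${predictionId}.${ext}`) and the `contentType` option.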

Budget guardrails

Set a hard daily spending limit that blocks requests once reached:

// lib/budget.ts
import { Redis } from '@upstash/redis';

const redis = Redis.fromEnv();

const MODEL_COSTS: Record<string, number> = {
  'truefusion-edge': 0.01,
  'truefusion': 0.02,
  'truefusion-pro': 0.04,
  'truefusion-2.0': 0.06,
  'truefusion-ultra': 0.08,
  'beatfusion-2.0': 0.75,
  'truefusion-video-pro': 1.50,
};

const DAILY_LIMIT_USD = 50;

function todayKey() { return `budget:${new Date().toISOString().split('T')[0]}`; }

export async function checkAndRecordBudget(model: string): Promise<void> {
  const cost = MODEL_COSTS[model] ?? 0.04;
  const key = todayKey();

  // Increment first, then check. A separate read-then-write would race under
  // concurrent requests; INCRBYFLOAT is atomic and returns the new total.
  const total = await redis.incrbyfloat(key, cost);
  await redis.expire(key, 86_400 * 2); // garbage-collect old day keys

  if (total > DAILY_LIMIT_USD) {
    await redis.incrbyfloat(key, -cost); // roll back the over-limit charge
    throw new Error(
      `Daily budget exceeded ($${(total - cost).toFixed(2)} / $${DAILY_LIMIT_USD}). Service paused until midnight UTC.`
    );
  }
}
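One small refinement: the error message promises a reset at midnight UTC, and since the key is named after the UTC date, an expiry that lands exactly on the next midnight keeps Redis tidy without a flat two-day TTL. A sketch of that calculation (the helper name is illustrative), used as `redis.expire(key, secondsUntilMidnightUtc())`:

```typescript
// Seconds from `now` until the next midnight UTC, suitable as a Redis
// TTL for a key scoped to the current UTC date.
function secondsUntilMidnightUtc(now: Date = new Date()): number {
  const midnight = new Date(Date.UTC(
    now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1,
  ));
  return Math.ceil((midnight.getTime() - now.getTime()) / 1000);
}
```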

Cost tracking dashboard

Track spending by model and user to identify where money is going:

interface SpendEvent {
  timestamp: string;
  model: string;
  userId: string;
  costUsd: number;
  cacheHit: boolean;
  predictionId?: string;
}

async function trackSpend(event: SpendEvent) {
  // Append to your analytics store (Postgres, ClickHouse, BigQuery, etc.).
  // "db" stands in for your ORM client (e.g. a Prisma instance).
  await db.spendEvents.create({ data: event });

  // Or just log structured JSON for ingestion by Datadog/Logtail/CloudWatch
  console.log(JSON.stringify({ event_type: 'generation_cost', ...event }));
}
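On the read side of the dashboard, a quick aggregation over a batch of events yields the numbers worth watching: actual spend, savings attributed to the cache, and the hit rate. A sketch (in production you would aggregate in the database, not in application code):

```typescript
// Minimal event shape for aggregation (a subset of SpendEvent above).
type SpendRow = { model: string; costUsd: number; cacheHit: boolean };

// Summarize a batch of spend events: cache hits count as savings,
// misses count as spend, and hitRate is the fraction served from cache.
function summarize(events: SpendRow[]) {
  const spent = events.reduce((s, e) => s + (e.cacheHit ? 0 : e.costUsd), 0);
  const saved = events.reduce((s, e) => s + (e.cacheHit ? e.costUsd : 0), 0);
  const hits = events.filter((e) => e.cacheHit).length;
  return {
    spentUsd: Math.round(spent * 100) / 100,
    savedUsd: Math.round(saved * 100) / 100,
    hitRate: events.length ? hits / events.length : 0,
  };
}
```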

Caching impact at scale

Monthly volume       | No cache | 30% hit rate | 50% hit rate
10,000 predictions   | $400     | $280         | $200
50,000 predictions   | $2,000   | $1,400       | $1,000
200,000 predictions  | $8,000   | $5,600       | $4,000

Assumes truefusion-pro at $0.04/prediction.


Summary

The caching stack:

  1. Exact hash cache — Redis, 7-day TTL, covers identical prompts
  2. Request coalescing — prevents duplicate in-flight requests
  3. Permanent CDN storage — re-host Skytells outputs before storing URLs
  4. Budget guardrails — hard daily limit with Redis, checked before every prediction
  5. Spend tracking — structured events for attribution and optimization

Next: multi-model pipelines — chain image, video, and audio models for complex workflows.
