Advanced · 45 min · Module 6 of 6

Building Production Agents

Deploy reliable AI agents with cost optimization, testing strategies, observability, and battle-tested deployment patterns.

What you'll learn in this module

  • How to make agents reliable with retries, fallbacks, and graceful degradation
  • How to control and optimize costs in production
  • How to test agents systematically before and after deployment
  • How to deploy with observability and monitoring

Reliability Engineering

Production agents fail. External APIs go down, LLMs hallucinate, and rate limits kick in. Reliability engineering is about designing for these failures.

The reliability hierarchy

  • Level 1: Input validation - reject bad inputs early
  • Level 2: Retries with backoff - handle transient failures
  • Level 3: Fallback models - if the primary model fails, use a backup
  • Level 4: Graceful degradation - return partial results over nothing
  • Level 5: Human escalation - route to humans when confidence is low

Retry strategies

Not all failures are equal. Match your retry strategy to the failure type:

Failure type | Retry? | Strategy
Rate limit (429) | Yes | Exponential backoff with jitter
Timeout | Yes (once) | Increase the timeout; if it still fails, use a fallback
Invalid output format | Yes | Retry with a stricter prompt; add format examples
Hallucination | No | Switch to evaluation + retry with feedback
Auth failure (401) | No | Fix credentials; don't retry
Input too large | No | Truncate or chunk the input; don't retry as-is

Exponential backoff implementation

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000,
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      // Out of attempts: surface the last error
      if (attempt === maxRetries) throw error;

      const isRetryable =
        error.status === 429 || error.status === 503 || error.code === "TIMEOUT";

      if (!isRetryable) throw error;

      // Exponential backoff with jitter: 1s, 2s, 4s, ... plus up to 1s of noise
      const delay = baseDelay * Math.pow(2, attempt) + Math.random() * 1000;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error("Unreachable");
}

Fallback chains

When the primary model is unavailable or too slow, fall back to alternatives:

Request → Primary (GPT-4-class model). On success, return the response; on failure, try Fallback 1 (GPT-3.5-class model). If that also fails, try Fallback 2 (cached response): on a cache hit, serve the cached answer; on a miss, return a graceful error message.
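As a sketch, this chain can be expressed as nested try/catch layers. `ModelCall`, the model wrappers, and the in-memory `Map` cache are illustrative assumptions, not a specific SDK:

```typescript
// Sketch of a fallback chain. The model functions are assumed wrappers
// around your LLM client; the cache is a plain in-memory Map.
type ModelCall = (prompt: string) => Promise<string>;

async function withFallbacks(
  prompt: string,
  primary: ModelCall,
  fallback: ModelCall,
  cache: Map<string, string>,
): Promise<string> {
  try {
    return await primary(prompt); // e.g. a GPT-4-class model
  } catch {
    try {
      return await fallback(prompt); // e.g. a GPT-3.5-class model
    } catch {
      const cached = cache.get(prompt); // last resort: cached response
      if (cached !== undefined) return cached;
      // Graceful error message instead of an unhandled failure
      return "Sorry, I can't answer right now. Please try again later.";
    }
  }
}
```

In practice you would combine this with `withRetry` so each layer retries transient failures before handing off to the next.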

Cost Optimization

AI agents can be expensive. A poorly designed agent might make 50 LLM calls to answer a simple question. Systematic cost optimization keeps budgets under control.

Cost drivers

Factor | Impact | Mitigation
Input tokens | Proportional to context size | Trim context, use summaries, compress history
Output tokens | Proportional to response length | Set max_tokens, ask for concise responses
Number of calls | Multiplies per step × retries | Cache results, limit iterations, use routing
Model tier | 10-50x cost difference between tiers | Route simple tasks to cheaper models

The model routing pattern

Not every task needs the most powerful model. Route based on complexity:

Input → Classifier (cheap, fast model) → routes to one of: Simple → small model (~$0.001/call), Moderate → medium model (~$0.01/call), Complex → large model (~$0.06/call).
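A minimal routing sketch, using a length-based heuristic as a stand-in for the cheap classifier model (the model names and per-call prices are illustrative, not real pricing):

```typescript
// Sketch of complexity-based model routing. In production the classifier
// would itself be a small, fast model; here a length heuristic stands in.
type Complexity = "simple" | "moderate" | "complex";

function classify(input: string): Complexity {
  if (input.length < 80) return "simple";
  if (input.length < 400) return "moderate";
  return "complex";
}

// Illustrative model names and costs (~$0.001 / $0.01 / $0.06 per call)
const MODEL_BY_COMPLEXITY: Record<Complexity, string> = {
  simple: "small-model",
  moderate: "medium-model",
  complex: "large-model",
};

function routeModel(input: string): string {
  return MODEL_BY_COMPLEXITY[classify(input)];
}
```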

Caching strategies

Cache type | What to cache | When to use
Exact match | Identical inputs → cached outputs | FAQ-style queries, repeated classification
Semantic cache | Similar inputs (by embedding distance) → cached outputs | Search queries, summaries of similar content
Result cache | Tool/API call results | External data that doesn't change frequently
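An exact-match result cache with a TTL can be sketched in a few lines; the eviction policy here is deliberately minimal and the class name is our own:

```typescript
// Sketch of an exact-match result cache with a TTL, suitable for tool/API
// results that change infrequently. Expired entries are evicted lazily on read.
class ResultCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

A semantic cache follows the same shape, but `get` compares embedding distance against a threshold instead of requiring key equality.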

Cost monitoring

Track these metrics per agent, per day:

Metric | Why
Total token usage | Overall cost tracking
Tokens per task | Efficiency - is it getting worse over time?
Calls per task | Agent "chattiness" - too many round trips
Cache hit rate | Is caching working?
Cost per successful outcome | The metric that matters
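The last metric can be computed from per-task usage records. `TaskUsage` and the per-token price below are assumptions; substitute your provider's real pricing:

```typescript
// Sketch of cost-per-successful-outcome tracking. Failed tasks still cost
// tokens, which is exactly why this metric differs from cost per task.
interface TaskUsage {
  tokens: number;
  calls: number;
  success: boolean;
}

function costPerSuccessfulOutcome(
  tasks: TaskUsage[],
  pricePerToken = 0.00001, // illustrative blended price
): number {
  const totalCost = tasks.reduce((sum, t) => sum + t.tokens * pricePerToken, 0);
  const successes = tasks.filter((t) => t.success).length;
  return successes === 0 ? Infinity : totalCost / successes;
}
```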

Testing Strategies

Agents are non-deterministic. You can't write a unit test that asserts exact output. Instead, use a layered testing approach.

Testing pyramid for agents

  • Unit tests - tool functions, parsers, validators; deterministic and fast
  • Integration tests - tool + LLM interaction; mock LLM responses, real tools
  • Evaluation tests - full agent on a test dataset; measure accuracy, latency, cost
  • Canary tests - live traffic subset; compare against baseline

Unit tests

Test everything that doesn't involve an LLM call:

// Test tool adapters
test("listOpenIssues returns simplified format", async () => {
  const issues = await listOpenIssues("org/repo");
  expect(issues[0]).toHaveProperty("number");
  expect(issues[0]).toHaveProperty("title");
  expect(issues[0]).not.toHaveProperty("body"); // Should be stripped
});

// Test input validation
test("rejects invalid repo format", async () => {
  await expect(listOpenIssues("not-a-repo")).rejects.toThrow("owner/repo");
});

// Test output parsers
test("parseAgentResponse handles malformed JSON", () => {
  const result = parseAgentResponse("```json\n{broken\n```");
  expect(result.success).toBe(false);
  expect(result.error).toContain("JSON");
});

Evaluation tests

Build a test dataset that covers your agent's expected scenarios:

[
  {
    "input": "What are the open bugs for project-x?",
    "expected_tools": ["list_open_issues"],
    "expected_contains": ["bug", "project-x"],
    "max_tool_calls": 3,
    "max_latency_ms": 5000
  },
  {
    "input": "Summarize last week's standup notes",
    "expected_tools": ["search_documents", "summarize"],
    "expected_contains": ["standup", "summary"],
    "max_tool_calls": 5,
    "max_latency_ms": 10000
  }
]

Run the evaluation:

Metric | What to measure | Acceptable threshold
Task completion rate | Did the agent produce a valid answer? | > 95%
Tool selection accuracy | Did it use the right tools? | > 90%
Output quality (LLM-judge) | Score from the evaluation LLM | > 4.0/5.0
Latency | Time from input to final output | < P95 target
Cost per task | Total token spend | < budget per task
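A scorer for the dataset format above might look like the following sketch, assuming your harness can report the agent's output, tools used, and latency (`AgentResult` is our own shape, not a specific framework's):

```typescript
// Sketch of an evaluation scorer for the JSON test-case format above.
interface EvalCase {
  input: string;
  expected_tools: string[];
  expected_contains: string[];
  max_tool_calls: number;
  max_latency_ms: number;
}

// Assumed result shape from your agent harness
interface AgentResult {
  output: string;
  toolsUsed: string[];
  latencyMs: number;
}

function scoreCase(c: EvalCase, r: AgentResult): boolean {
  const toolsOk = c.expected_tools.every((t) => r.toolsUsed.includes(t));
  const containsOk = c.expected_contains.every((s) =>
    r.output.toLowerCase().includes(s.toLowerCase()),
  );
  const budgetOk =
    r.toolsUsed.length <= c.max_tool_calls && r.latencyMs <= c.max_latency_ms;
  return toolsOk && containsOk && budgetOk;
}

// Task completion rate over a batch of scored cases
function completionRate(results: boolean[]): number {
  return results.filter(Boolean).length / results.length;
}
```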

Regression testing

After every change to prompts, tools, or models:

  1. Run the full evaluation test suite
  2. Compare metrics against the previous baseline
  3. Flag any regression > 5% in key metrics
  4. Investigate before deploying
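Step 3 can be sketched as a comparison against a stored baseline, where each metric is a number and higher is better (the metric names and the two record shapes are illustrative):

```typescript
// Sketch of regression flagging: report every metric that dropped more
// than `tolerance` (5% by default) relative to the baseline. Assumes both
// records share the same keys and higher values are better.
function findRegressions(
  baseline: Record<string, number>,
  current: Record<string, number>,
  tolerance = 0.05,
): string[] {
  return Object.keys(baseline).filter((metric) => {
    const drop = (baseline[metric] - current[metric]) / baseline[metric];
    return drop > tolerance; // metric fell more than the allowed amount
  });
}
```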

Observability

You can't fix what you can't see. Agent observability means capturing enough data to diagnose failures and optimize performance.

What to log

Data | Purpose
Full conversation trace | Reproduce the exact agent path
Per-step latency | Identify bottleneck steps
Token usage per call | Cost attribution
Tool inputs and outputs | Debug tool failures
LLM response metadata | Model version, finish reason, token counts
Evaluation scores | Track quality over time

Tracing structure

Agent execution trace (task-4821): Plan span (12ms, 450 tokens) → search_web span (340ms) → Synthesize span (890ms, 1200 tokens) → Evaluate span (450ms, 800 tokens).

Each span captures:

  • Start/end timestamp
  • Operation type (LLM call, tool call, evaluation)
  • Token usage (for LLM calls)
  • Input/output (optionally sampled for privacy)
  • Error information (if any)
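A minimal span recorder matching these fields might look like this; a real system would export spans to a tracing backend such as OpenTelemetry rather than keep plain objects:

```typescript
// Sketch of a span record matching the fields listed above.
interface Span {
  traceId: string;
  operation: "llm_call" | "tool_call" | "evaluation";
  start: number; // epoch ms
  end?: number;
  tokens?: number; // LLM calls only
  error?: string; // set when the operation failed
}

function startSpan(traceId: string, operation: Span["operation"]): Span {
  return { traceId, operation, start: Date.now() };
}

function endSpan(span: Span, tokens?: number, error?: string): Span {
  return { ...span, end: Date.now(), tokens, error };
}
```

Inputs and outputs can be attached the same way, sampled or redacted before export to respect the privacy note above.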

Dashboards

Build dashboards around these four views:

Dashboard | Shows | Alert if
Health | Success rate, error rate, latency P50/P95/P99 | Error rate > 5%
Cost | Daily spend, tokens per task, cost per outcome | Spend > daily budget
Quality | Eval scores, user feedback, task completion rate | Quality score drops > 10%
Throughput | Tasks per hour, queue depth, concurrent executions | Queue depth growing

Deployment Patterns

Canary deployment

Roll out changes to a small percentage of traffic first:

The traffic router sends 95% of requests to the current Agent v1 and 5% to the new Agent v2. A metrics collector checks whether v2's metrics are within tolerance: if yes, increase its share to 50%, then 100%; if no, roll back v2.

Shadow mode

Run the new agent in parallel without serving its results to users:

  1. Send every request to both v1 and v2
  2. Serve v1's response to the user
  3. Log and evaluate v2's response silently
  4. Compare quality metrics before switching
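The steps above can be sketched as follows, where `logShadowResult` stands in for your metrics pipeline and both agent versions share a simple function signature:

```typescript
// Sketch of shadow mode: v1 serves the user while v2 runs silently.
type Agent = (input: string) => Promise<string>;

async function handleWithShadow(
  input: string,
  v1: Agent,
  v2: Agent,
  logShadowResult: (v2Output: string | Error) => void,
): Promise<string> {
  // Fire v2 without awaiting it in the user's critical path;
  // its result (or error) goes only to the metrics pipeline.
  v2(input)
    .then(logShadowResult)
    .catch((e) => logShadowResult(e));

  return v1(input); // only v1's answer reaches the user
}
```

Because v2 is never awaited, a slow or broken v2 cannot affect user-facing latency, which is the main point of shadow mode.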

Feature flags

Wrap agent capabilities behind feature flags:

async function handleTask(task: Task) {
  if (featureFlags.get("use_new_planner")) {
    return await newPlannerAgent(task);
  }
  return await currentAgent(task);
}

This lets you enable new capabilities for specific users, teams, or a percentage of traffic.


Production Checklist

Before deploying any agent to production:

Reliability

  • Retry logic with exponential backoff for transient failures
  • Fallback model or cached response for complete outages
  • Iteration limits on all loops (max 10-15 iterations)
  • Timeout on all external calls
  • Input validation at system boundary

Cost

  • Token budget per task
  • Model routing (cheap model for simple tasks)
  • Caching for repeated queries
  • Daily spend alerts

Testing

  • Unit tests for all tool adapters
  • Evaluation dataset with 50+ test cases
  • Quality baseline established
  • Regression tests passing

Observability

  • Execution tracing enabled
  • Dashboards for health, cost, quality, throughput
  • Alerts configured for error rate and cost spikes
  • Log retention policy set

Security

  • Tools have minimal permissions (principle of least privilege)
  • User input is sanitized before inclusion in prompts
  • Sensitive data redacted from logs
  • Agent scope boundaries documented and enforced

What you now understand

Area | Key takeaway
Reliability | Five levels: validation → retries → fallbacks → degradation → escalation
Cost | Route by complexity, cache aggressively, monitor cost per outcome
Testing | Layer unit → integration → evaluation → canary tests
Observability | Trace every execution, dashboard four views, alert on thresholds
Deployment | Canary or shadow mode; never big-bang deployments for agents

Congratulations — you've completed the Agentic AI Workflows learning path. You now have the architecture patterns, implementation strategies, and production practices to build AI agent systems that are reliable, cost-effective, and observable.
