Advanced · 45 min · Module 6 of 6

Building Production Agents

Deploy reliable AI agents with cost optimization, testing strategies, observability, and battle-tested deployment patterns.

What you'll learn in this module

  • How to make agents reliable with retries, fallbacks, and graceful degradation
  • How to control and optimize costs in production
  • How to test agents systematically before and after deployment
  • How to deploy with observability and monitoring

Reliability Engineering

Production agents fail. External APIs go down, LLMs hallucinate, and rate limits kick in. Reliability engineering is about designing for these failures.

The reliability hierarchy

  • Level 1: Input validation - reject bad inputs early
  • Level 2: Retries with backoff - handle transient failures
  • Level 3: Fallback models - if the primary model fails, use a backup
  • Level 4: Graceful degradation - return partial results over nothing
  • Level 5: Human escalation - route to humans when confidence is low

Retry strategies

Not all failures are equal. Match your retry strategy to the failure type:

Failure type | Retry? | Strategy
Rate limit (429) | Yes | Exponential backoff with jitter
Timeout | Yes (once) | Increase the timeout; if it still fails, use a fallback
Invalid output format | Yes | Retry with a stricter prompt; add format examples
Hallucination | No | Switch to evaluation + retry with feedback
Auth failure (401) | No | Fix credentials; don't retry
Input too large | No | Truncate or chunk the input; don't retry as-is

Exponential backoff implementation

async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000,
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      // Out of attempts: surface the last error
      if (attempt === maxRetries) throw error;

      const isRetryable =
        error.status === 429 || error.status === 503 || error.code === "TIMEOUT";

      if (!isRetryable) throw error;

      // Exponential backoff with jitter: 1s, 2s, 4s, ... plus up to 1s of noise
      const delay = baseDelay * Math.pow(2, attempt) + Math.random() * 1000;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error("Unreachable");
}

Fallback chains

When the primary model is unavailable or too slow, fall back to alternatives:

Request → Primary (GPT-4-class model). On success, return the response; on failure, try Fallback 1 (GPT-3.5-class model). If that also fails, try Fallback 2 (cached response): on a cache hit, serve the cached answer; on a miss, return a graceful error message.
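As a sketch, this chain can be expressed as nested try/catch layers. `ModelCall`, the model wrappers, and the in-memory `Map` cache are illustrative assumptions, not a specific SDK:

```typescript
// Sketch of a fallback chain. The model functions are assumed wrappers
// around your LLM client; the cache is a plain in-memory Map.
type ModelCall = (prompt: string) => Promise<string>;

async function withFallbacks(
  prompt: string,
  primary: ModelCall,
  fallback: ModelCall,
  cache: Map<string, string>,
): Promise<string> {
  try {
    return await primary(prompt); // e.g. a GPT-4-class model
  } catch {
    try {
      return await fallback(prompt); // e.g. a GPT-3.5-class model
    } catch {
      const cached = cache.get(prompt); // last resort: cached response
      if (cached !== undefined) return cached;
      // Graceful error message instead of an unhandled failure
      return "Sorry, I can't answer right now. Please try again later.";
    }
  }
}
```

In practice you would combine this with `withRetry` so each layer retries transient failures before handing off to the next.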

Cost Optimization

AI agents can be expensive. A poorly designed agent might make 50 LLM calls to answer a simple question. Systematic cost optimization keeps budgets under control.

Cost drivers

Factor | Impact | Mitigation
Input tokens | Proportional to context size | Trim context, use summaries, compress history
Output tokens | Proportional to response length | Set max_tokens, ask for concise responses
Number of calls | Multiplies per step × retries | Cache results, limit iterations, use routing
Model tier | 10-50x cost difference between tiers | Route simple tasks to cheaper models

The model routing pattern

Not every task needs the most powerful model. Route based on complexity:

Input → Classifier (cheap, fast model) → routes to one of: Simple → small model (~$0.001/call), Moderate → medium model (~$0.01/call), Complex → large model (~$0.06/call).
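A minimal routing sketch, using a length-based heuristic as a stand-in for the cheap classifier model (the model names and per-call prices are illustrative, not real pricing):

```typescript
// Sketch of complexity-based model routing. In production the classifier
// would itself be a small, fast model; here a length heuristic stands in.
type Complexity = "simple" | "moderate" | "complex";

function classify(input: string): Complexity {
  if (input.length < 80) return "simple";
  if (input.length < 400) return "moderate";
  return "complex";
}

// Illustrative model names and costs (~$0.001 / $0.01 / $0.06 per call)
const MODEL_BY_COMPLEXITY: Record<Complexity, string> = {
  simple: "small-model",
  moderate: "medium-model",
  complex: "large-model",
};

function routeModel(input: string): string {
  return MODEL_BY_COMPLEXITY[classify(input)];
}
```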

Caching strategies

Cache type | What to cache | When to use
Exact match | Identical inputs → cached outputs | FAQ-style queries, repeated classification
Semantic cache | Similar inputs (by embedding distance) → cached outputs | Search queries, summaries of similar content
Result cache | Tool/API call results | External data that doesn't change frequently
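An exact-match result cache with a TTL can be sketched in a few lines; the eviction policy here is deliberately minimal and the class name is our own:

```typescript
// Sketch of an exact-match result cache with a TTL, suitable for tool/API
// results that change infrequently. Expired entries are evicted lazily on read.
class ResultCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();

  constructor(private ttlMs: number) {}

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

A semantic cache follows the same shape, but `get` compares embedding distance against a threshold instead of requiring key equality.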

Cost monitoring

Track these metrics per agent, per day:

Metric | Why
Total token usage | Overall cost tracking
Tokens per task | Efficiency - is it getting worse over time?
Calls per task | Agent "chattiness" - too many round trips
Cache hit rate | Is caching working?
Cost per successful outcome | The metric that matters
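The last metric can be computed from per-task usage records. `TaskUsage` and the per-token price below are assumptions; substitute your provider's real pricing:

```typescript
// Sketch of cost-per-successful-outcome tracking. Failed tasks still cost
// tokens, which is exactly why this metric differs from cost per task.
interface TaskUsage {
  tokens: number;
  calls: number;
  success: boolean;
}

function costPerSuccessfulOutcome(
  tasks: TaskUsage[],
  pricePerToken = 0.00001, // illustrative blended price
): number {
  const totalCost = tasks.reduce((sum, t) => sum + t.tokens * pricePerToken, 0);
  const successes = tasks.filter((t) => t.success).length;
  return successes === 0 ? Infinity : totalCost / successes;
}
```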

Testing Strategies

Agents are non-deterministic. You can't write a unit test that asserts exact output. Instead, use a layered testing approach.

Testing pyramid for agents

  • Unit tests - tool functions, parsers, validators; deterministic and fast
  • Integration tests - tool + LLM interaction; mock LLM responses, real tools
  • Evaluation tests - full agent on a test dataset; measure accuracy, latency, cost
  • Canary tests - live traffic subset; compare against baseline

Unit tests

Test everything that doesn't involve an LLM call:

// Test tool adapters
test("listOpenIssues returns simplified format", async () => {
  const issues = await listOpenIssues("org/repo");
  expect(issues[0]).toHaveProperty("number");
  expect(issues[0]).toHaveProperty("title");
  expect(issues[0]).not.toHaveProperty("body"); // Should be stripped
});

// Test input validation
test("rejects invalid repo format", async () => {
  await expect(listOpenIssues("not-a-repo")).rejects.toThrow("owner/repo");
});

// Test output parsers
test("parseAgentResponse handles malformed JSON", () => {
  const result = parseAgentResponse("```json\n{broken\n```");
  expect(result.success).toBe(false);
  expect(result.error).toContain("JSON");
});

Evaluation tests

Build a test dataset that covers your agent's expected scenarios:

[
  {
    "input": "What are the open bugs for project-x?",
    "expected_tools": ["list_open_issues"],
    "expected_contains": ["bug", "project-x"],
    "max_tool_calls": 3,
    "max_latency_ms": 5000
  },
  {
    "input": "Summarize last week's standup notes",
    "expected_tools": ["search_documents", "summarize"],
    "expected_contains": ["standup", "summary"],
    "max_tool_calls": 5,
    "max_latency_ms": 10000
  }
]

Run the evaluation:

Metric | What to measure | Acceptable threshold
Task completion rate | Did the agent produce a valid answer? | > 95%
Tool selection accuracy | Did it use the right tools? | > 90%
Output quality (LLM-judge) | Score from the evaluation LLM | > 4.0/5.0
Latency | Time from input to final output | < P95 target
Cost per task | Total token spend | < budget per task
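A scorer for the dataset format above might look like the following sketch, assuming your harness can report the agent's output, tools used, and latency (`AgentResult` is our own shape, not a specific framework's):

```typescript
// Sketch of an evaluation scorer for the JSON test-case format above.
interface EvalCase {
  input: string;
  expected_tools: string[];
  expected_contains: string[];
  max_tool_calls: number;
  max_latency_ms: number;
}

// Assumed result shape from your agent harness
interface AgentResult {
  output: string;
  toolsUsed: string[];
  latencyMs: number;
}

function scoreCase(c: EvalCase, r: AgentResult): boolean {
  const toolsOk = c.expected_tools.every((t) => r.toolsUsed.includes(t));
  const containsOk = c.expected_contains.every((s) =>
    r.output.toLowerCase().includes(s.toLowerCase()),
  );
  const budgetOk =
    r.toolsUsed.length <= c.max_tool_calls && r.latencyMs <= c.max_latency_ms;
  return toolsOk && containsOk && budgetOk;
}

// Task completion rate over a batch of scored cases
function completionRate(results: boolean[]): number {
  return results.filter(Boolean).length / results.length;
}
```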

Regression testing

After every change to prompts, tools, or models:

  1. Run the full evaluation test suite
  2. Compare metrics against the previous baseline
  3. Flag any regression > 5% in key metrics
  4. Investigate before deploying
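Step 3 can be sketched as a comparison against a stored baseline, where each metric is a number and higher is better (the metric names and the two record shapes are illustrative):

```typescript
// Sketch of regression flagging: report every metric that dropped more
// than `tolerance` (5% by default) relative to the baseline. Assumes both
// records share the same keys and higher values are better.
function findRegressions(
  baseline: Record<string, number>,
  current: Record<string, number>,
  tolerance = 0.05,
): string[] {
  return Object.keys(baseline).filter((metric) => {
    const drop = (baseline[metric] - current[metric]) / baseline[metric];
    return drop > tolerance; // metric fell more than the allowed amount
  });
}
```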

Observability

You can't fix what you can't see. Agent observability means capturing enough data to diagnose failures and optimize performance.

What to log

Data | Purpose
Full conversation trace | Reproduce the exact agent path
Per-step latency | Identify bottleneck steps
Token usage per call | Cost attribution
Tool inputs and outputs | Debug tool failures
LLM response metadata | Model version, finish reason, token counts
Evaluation scores | Track quality over time

Tracing structure

Agent execution trace (task-4821): Plan span (12ms, 450 tokens) → search_web span (340ms) → Synthesize span (890ms, 1200 tokens) → Evaluate span (450ms, 800 tokens).

Each span captures:

  • Start/end timestamp
  • Operation type (LLM call, tool call, evaluation)
  • Token usage (for LLM calls)
  • Input/output (optionally sampled for privacy)
  • Error information (if any)
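A minimal span recorder matching these fields might look like this; a real system would export spans to a tracing backend such as OpenTelemetry rather than keep plain objects:

```typescript
// Sketch of a span record matching the fields listed above.
interface Span {
  traceId: string;
  operation: "llm_call" | "tool_call" | "evaluation";
  start: number; // epoch ms
  end?: number;
  tokens?: number; // LLM calls only
  error?: string; // set when the operation failed
}

function startSpan(traceId: string, operation: Span["operation"]): Span {
  return { traceId, operation, start: Date.now() };
}

function endSpan(span: Span, tokens?: number, error?: string): Span {
  return { ...span, end: Date.now(), tokens, error };
}
```

Inputs and outputs can be attached the same way, sampled or redacted before export to respect the privacy note above.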

Dashboards

Build dashboards around these four views:

Dashboard | Shows | Alert if
Health | Success rate, error rate, latency P50/P95/P99 | Error rate > 5%
Cost | Daily spend, tokens per task, cost per outcome | Spend > daily budget
Quality | Eval scores, user feedback, task completion rate | Quality score drops > 10%
Throughput | Tasks per hour, queue depth, concurrent executions | Queue depth growing

Deployment Patterns

Canary deployment

Roll out changes to a small percentage of traffic first:

The traffic router sends 95% of requests to the current Agent v1 and 5% to the new Agent v2. A metrics collector checks whether v2's metrics are within tolerance: if yes, increase its share to 50%, then 100%; if no, roll back v2.

Shadow mode

Run the new agent in parallel without serving its results to users:

  1. Send every request to both v1 and v2
  2. Serve v1's response to the user
  3. Log and evaluate v2's response silently
  4. Compare quality metrics before switching
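The steps above can be sketched as follows, where `logShadowResult` stands in for your metrics pipeline and both agent versions share a simple function signature:

```typescript
// Sketch of shadow mode: v1 serves the user while v2 runs silently.
type Agent = (input: string) => Promise<string>;

async function handleWithShadow(
  input: string,
  v1: Agent,
  v2: Agent,
  logShadowResult: (v2Output: string | Error) => void,
): Promise<string> {
  // Fire v2 without awaiting it in the user's critical path;
  // its result (or error) goes only to the metrics pipeline.
  v2(input)
    .then(logShadowResult)
    .catch((e) => logShadowResult(e));

  return v1(input); // only v1's answer reaches the user
}
```

Because v2 is never awaited, a slow or broken v2 cannot affect user-facing latency, which is the main point of shadow mode.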

Feature flags

Wrap agent capabilities behind feature flags:

async function handleTask(task: Task) {
  if (featureFlags.get("use_new_planner")) {
    return await newPlannerAgent(task);
  }
  return await currentAgent(task);
}

This lets you enable new capabilities for specific users, teams, or a percentage of traffic.


Production Checklist

Before deploying any agent to production:

Reliability

  • Retry logic with exponential backoff for transient failures
  • Fallback model or cached response for complete outages
  • Iteration limits on all loops (max 10-15 iterations)
  • Timeout on all external calls
  • Input validation at system boundary

Cost

  • Token budget per task
  • Model routing (cheap model for simple tasks)
  • Caching for repeated queries
  • Daily spend alerts

Testing

  • Unit tests for all tool adapters
  • Evaluation dataset with 50+ test cases
  • Quality baseline established
  • Regression tests passing

Observability

  • Execution tracing enabled
  • Dashboards for health, cost, quality, throughput
  • Alerts configured for error rate and cost spikes
  • Log retention policy set

Security

  • Tools have minimal permissions (principle of least privilege)
  • User input is sanitized before inclusion in prompts
  • Sensitive data redacted from logs
  • Agent scope boundaries documented and enforced

What you now understand

Area | Key takeaway
Reliability | Five levels: validation → retries → fallbacks → degradation → escalation
Cost | Route by complexity, cache aggressively, monitor cost per outcome
Testing | Layer unit → integration → evaluation → canary tests
Observability | Trace every execution, dashboard four views, alert on thresholds
Deployment | Canary or shadow mode; never big-bang deployments for agents

Congratulations — you've completed the Agentic AI Workflows learning path. You now have the architecture patterns, implementation strategies, and production practices to build AI agent systems that are reliable, cost-effective, and observable.
