Building Production Agents
Deploy reliable AI agents with cost optimization, testing strategies, observability, and battle-tested deployment patterns.
What you'll learn in this module
- How to make agents reliable with retries, fallbacks, and graceful degradation
- How to control and optimize costs in production
- How to test agents systematically before and after deployment
- How to deploy with observability and monitoring
Reliability Engineering
Production agents fail: external APIs go down, LLMs hallucinate, and rate limits kick in. Reliability engineering is about designing for these failures instead of hoping they won't happen.
The reliability hierarchy
From cheapest to most drastic, reliability measures stack in five levels: input validation → retries → fallbacks → graceful degradation → human escalation. Each level catches failures the previous one lets through.
Retry strategies
Not all failures are equal. Match your retry strategy to the failure type:
| Failure type | Retry? | Strategy |
|---|---|---|
| Rate limit (429) | Yes | Exponential backoff with jitter |
| Timeout | Yes (once) | Increase timeout; if still fails, use fallback |
| Invalid output format | Yes | Retry with stricter prompt; add format examples |
| Hallucination | No | Switch to evaluation + retry with feedback |
| Auth failure (401) | No | Fix credentials; don't retry |
| Input too large | No | Truncate or chunk input; don't retry as-is |
Exponential backoff implementation

```typescript
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000,
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      if (attempt === maxRetries) throw error;
      const isRetryable =
        error.status === 429 || error.status === 503 || error.code === "TIMEOUT";
      if (!isRetryable) throw error;
      // Exponential backoff with jitter
      const delay = baseDelay * Math.pow(2, attempt) + Math.random() * 1000;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error("Unreachable");
}
```

Fallback chains
When the primary model is unavailable or too slow, fall back to alternatives:
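A fallback chain can be sketched as a loop over models in priority order — try each until one succeeds. `ModelCall` and the chain entries here are illustrative placeholders for your actual LLM client calls:

```typescript
type ModelCall = (prompt: string) => Promise<string>;

// Try each model in order; return the first success, throw only if all fail.
async function withFallbacks(
  prompt: string,
  chain: { name: string; call: ModelCall }[],
): Promise<{ model: string; output: string }> {
  let lastError: unknown;
  for (const { name, call } of chain) {
    try {
      return { model: name, output: await call(prompt) };
    } catch (error) {
      lastError = error; // record the failure and move to the next model
    }
  }
  throw lastError ?? new Error("Empty fallback chain");
}
```

Order the chain from best to cheapest: primary model first, then a smaller backup, then (optionally) a cached or canned response as the last entry.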
Orchestrator mapping: In Orchestrator, you can build fallback chains using Condition nodes. If an AI action fails, the condition checks the error status and routes to an alternative AI action with a different model or integration.
Cost Optimization
AI agents can be expensive. A poorly designed agent might make 50 LLM calls to answer a simple question. Systematic cost optimization keeps budgets under control.
Cost drivers
| Factor | Impact | Mitigation |
|---|---|---|
| Input tokens | Proportional to context size | Trim context, use summaries, compress history |
| Output tokens | Proportional to response length | Set max_tokens, ask for concise responses |
| Number of calls | Multiplied per step × retries | Cache results, limit iterations, use routing |
| Model tier | 10-50x cost difference between tiers | Route simple tasks to cheaper models |
The model routing pattern
Not every task needs the most powerful model. Route based on complexity:
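A minimal sketch of complexity-based routing. The heuristic and tier names below are assumptions for illustration — in practice you might use a cheap LLM classifier instead of string checks:

```typescript
type Tier = "cheap" | "mid" | "frontier";

// Route a task to a model tier using simple heuristics (illustrative only).
function routeByComplexity(task: string): Tier {
  if (task.length < 100 && !/\b(analyze|plan|multi-step)\b/i.test(task)) {
    return "cheap"; // classification, extraction, short Q&A
  }
  if (task.length < 1000) {
    return "mid"; // summaries, single-tool tasks
  }
  return "frontier"; // long-context reasoning and planning
}
```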
Caching strategies
| Cache type | What to cache | When to use |
|---|---|---|
| Exact match | Identical inputs → cached outputs | FAQ-style queries, repeated classification |
| Semantic cache | Similar inputs (by embedding distance) → cached outputs | Search queries, summaries of similar content |
| Result cache | Tool/API call results | External data that doesn't change frequently |
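The exact-match row above can be implemented with a small TTL cache, sketched below. A semantic cache would swap the string key for an embedding lookup with a distance threshold:

```typescript
// Exact-match result cache with time-to-live expiry.
class ResultCache<T> {
  private store = new Map<string, { value: T; expires: number }>();

  constructor(private ttlMs: number) {}

  get(key: string): T | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expires) {
      this.store.delete(key); // stale entry — evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.store.set(key, { value, expires: Date.now() + this.ttlMs });
  }
}
```

Check the cache before every LLM or tool call; a hit skips the call entirely, which is why cache hit rate appears in the monitoring table below.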
Cost monitoring
Track these metrics per agent, per day:
| Metric | Why |
|---|---|
| Total token usage | Overall cost tracking |
| Tokens per task | Efficiency — is it getting worse over time? |
| Calls per task | Agent "chattiness" — too many round trips |
| Cache hit rate | Is caching working? |
| Cost per successful outcome | The metric that matters |
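These metrics fall out of a simple per-task record. The record shape below is an assumed structure for illustration; the key line is the last one — cost divided by successful outcomes, not by total tasks:

```typescript
interface TaskRecord {
  tokens: number;
  calls: number;
  cacheHits: number;
  succeeded: boolean;
  costUsd: number;
}

// Aggregate one day's task records into the metrics from the table above.
function summarizeCosts(records: TaskRecord[]) {
  const successes = records.filter((r) => r.succeeded).length;
  const totalCost = records.reduce((s, r) => s + r.costUsd, 0);
  const totalCalls = records.reduce((s, r) => s + r.calls, 0);
  const totalHits = records.reduce((s, r) => s + r.cacheHits, 0);
  const totalTokens = records.reduce((s, r) => s + r.tokens, 0);
  return {
    totalTokens,
    tokensPerTask: records.length ? totalTokens / records.length : 0,
    cacheHitRate: totalCalls ? totalHits / totalCalls : 0,
    // The metric that matters: cost per successful outcome.
    costPerSuccess: successes ? totalCost / successes : Infinity,
  };
}
```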
Testing Strategies
Agents are non-deterministic. You can't write a unit test that asserts exact output. Instead, use a layered testing approach.
Testing pyramid for agents
Unit tests
Test everything that doesn't involve an LLM call:
```typescript
// Test tool adapters
test("listOpenIssues returns simplified format", async () => {
  const issues = await listOpenIssues("org/repo");
  expect(issues[0]).toHaveProperty("number");
  expect(issues[0]).toHaveProperty("title");
  expect(issues[0]).not.toHaveProperty("body"); // Should be stripped
});

// Test input validation
test("rejects invalid repo format", async () => {
  await expect(listOpenIssues("not-a-repo")).rejects.toThrow("owner/repo");
});

// Test output parsers
test("parseAgentResponse handles malformed JSON", () => {
  const result = parseAgentResponse("```json\n{broken\n```");
  expect(result.success).toBe(false);
  expect(result.error).toContain("JSON");
});
```

Evaluation tests
Build a test dataset that covers your agent's expected scenarios:
```json
[
  {
    "input": "What are the open bugs for project-x?",
    "expected_tools": ["list_open_issues"],
    "expected_contains": ["bug", "project-x"],
    "max_tool_calls": 3,
    "max_latency_ms": 5000
  },
  {
    "input": "Summarize last week's standup notes",
    "expected_tools": ["search_documents", "summarize"],
    "expected_contains": ["standup", "summary"],
    "max_tool_calls": 5,
    "max_latency_ms": 10000
  }
]
```

Run the evaluation:
| Metric | What to measure | Acceptable threshold |
|---|---|---|
| Task completion rate | Did the agent produce a valid answer? | > 95% |
| Tool selection accuracy | Did it use the right tools? | > 90% |
| Output quality (LLM-judge) | Score from the evaluation LLM | > 4.0/5.0 |
| Latency | Time from input to final output | < P95 target |
| Cost per task | Total token spend | < budget per task |
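A minimal runner over that dataset format might look like the sketch below. `runAgent` is a placeholder for your agent's entry point; it must report which tools it called so tool-selection accuracy can be scored:

```typescript
interface EvalCase {
  input: string;
  expected_tools: string[];
  expected_contains: string[];
  max_tool_calls: number;
  max_latency_ms: number;
}

interface AgentResult {
  output: string;
  toolsUsed: string[];
  latencyMs: number;
}

// Score task completion and tool selection across the eval dataset.
async function runEval(
  cases: EvalCase[],
  runAgent: (input: string) => Promise<AgentResult>,
) {
  let completed = 0;
  let toolsCorrect = 0;
  for (const c of cases) {
    const r = await runAgent(c.input);
    const out = r.output.toLowerCase();
    const containsAll = c.expected_contains.every((s) => out.includes(s.toLowerCase()));
    const withinBudget =
      r.toolsUsed.length <= c.max_tool_calls && r.latencyMs <= c.max_latency_ms;
    if (containsAll && withinBudget) completed++;
    if (c.expected_tools.every((t) => r.toolsUsed.includes(t))) toolsCorrect++;
  }
  return {
    taskCompletionRate: completed / cases.length,
    toolSelectionAccuracy: toolsCorrect / cases.length,
  };
}
```

Latency, cost, and LLM-judge quality scores would be collected the same way — one number per case, aggregated against the thresholds in the table above.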
Regression testing
After every change to prompts, tools, or models:
- Run the full evaluation test suite
- Compare metrics against the previous baseline
- Flag any regression > 5% in key metrics
- Investigate before deploying
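The >5% gate in the steps above can be sketched as a baseline comparison. The threshold comes from this section; the helper itself is an illustrative sketch:

```typescript
type Metrics = Record<string, number>;

// Return the names of metrics that dropped more than `threshold`
// (relative to baseline) since the last run.
function findRegressions(
  baseline: Metrics,
  current: Metrics,
  threshold = 0.05,
): string[] {
  const regressions: string[] = [];
  for (const [name, base] of Object.entries(baseline)) {
    const now = current[name];
    if (now === undefined || base === 0) continue; // no comparable value
    if ((base - now) / base > threshold) regressions.push(name);
  }
  return regressions;
}
```

Wire this into CI so a flagged regression blocks the deploy until someone investigates.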
Observability
You can't fix what you can't see. Agent observability means capturing enough data to diagnose failures and optimize performance.
What to log
| Data | Purpose |
|---|---|
| Full conversation trace | Reproduce the exact agent path |
| Per-step latency | Identify bottleneck steps |
| Token usage per call | Cost attribution |
| Tool inputs and outputs | Debug tool failures |
| LLM response metadata | Model version, finish reason, token counts |
| Evaluation scores | Track quality over time |
Tracing structure
Each span captures:
- Start/end timestamp
- Operation type (LLM call, tool call, evaluation)
- Token usage (for LLM calls)
- Input/output (optionally sampled for privacy)
- Error information (if any)
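The span fields listed above map to a small wrapper like the one below. This is an illustrative structure, not any specific tracing library's API; `sink` stands in for wherever your spans get exported:

```typescript
interface Span {
  operation: "llm_call" | "tool_call" | "evaluation";
  startMs: number;
  endMs?: number;
  tokens?: { input: number; output: number };
  input?: unknown;
  output?: unknown;
  error?: string;
}

// Wrap an operation so its timing, I/O, and errors are captured as a span.
async function traced<T>(
  operation: Span["operation"],
  input: unknown,
  fn: () => Promise<T>,
  sink: Span[],
): Promise<T> {
  const span: Span = { operation, startMs: Date.now(), input };
  sink.push(span);
  try {
    const output = await fn();
    span.output = output;
    return output;
  } catch (e) {
    span.error = e instanceof Error ? e.message : String(e);
    throw e; // record the error but let the caller handle it
  } finally {
    span.endMs = Date.now();
  }
}
```

For privacy, sample or redact `input`/`output` before export rather than logging them unconditionally.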
Dashboards
Build dashboards around these four views:
| Dashboard | Shows | Alert if |
|---|---|---|
| Health | Success rate, error rate, latency P50/P95/P99 | Error rate > 5% |
| Cost | Daily spend, tokens per task, cost per outcome | Spend > daily budget |
| Quality | Eval scores, user feedback, task completion rate | Quality score drops > 10% |
| Throughput | Tasks per hour, queue depth, concurrent executions | Queue depth growing |
Orchestrator mapping: Orchestrator provides built-in observability via the Executions panel. Each execution shows per-step status, duration, inputs/outputs (redacted), and error details. For code-exported workflows, add your own tracing with the "use step" directive.
Deployment Patterns
Canary deployment
Roll out changes to a small percentage of traffic first:
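One common way to split traffic is to hash a stable identifier into a bucket, sketched below. Hashing the user ID (rather than picking randomly per request) keeps each user on a consistent version during the canary:

```typescript
// Decide whether a user falls in the canary cohort of `percent` percent.
function inCanary(userId: string, percent: number): boolean {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple deterministic string hash
  }
  return hash % 100 < percent;
}
```

Start at a few percent, watch the health and quality dashboards, and ramp up only when the canary's metrics match or beat the current version.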
Shadow mode
Run the new agent in parallel without serving its results to users:
- Send every request to both v1 and v2
- Serve v1's response to the user
- Log and evaluate v2's response silently
- Compare quality metrics before switching
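The four steps above can be sketched as a handler that serves v1 while evaluating v2 in the background. `logShadowResult` is a placeholder for whatever logging or evaluation pipeline you use:

```typescript
// Serve v1's answer; run v2 silently and log its result for comparison.
async function handleWithShadow(
  input: string,
  v1: (s: string) => Promise<string>,
  v2: (s: string) => Promise<string>,
  logShadowResult: (r: { input: string; v2Output?: string; v2Error?: string }) => void,
): Promise<string> {
  // Fire v2 without awaiting it — it must never affect the user-facing path.
  v2(input)
    .then((out) => logShadowResult({ input, v2Output: out }))
    .catch((err) => logShadowResult({ input, v2Error: String(err) }));
  return v1(input); // the user always gets v1's answer
}
```

Note that shadow mode doubles your per-request cost while it runs, so bound it in time or sample a fraction of traffic.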
Feature flags
Wrap agent capabilities behind feature flags:
```typescript
async function handleTask(task: Task) {
  if (featureFlags.get("use_new_planner")) {
    return await newPlannerAgent(task);
  }
  return await currentAgent(task);
}
```

This lets you enable new capabilities for specific users, teams, or a percentage of traffic.
Production Checklist
Before deploying any agent to production:
Reliability
- Retry logic with exponential backoff for transient failures
- Fallback model or cached response for complete outages
- Iteration limits on all loops (max 10-15 iterations)
- Timeout on all external calls
- Input validation at system boundary
Cost
- Token budget per task
- Model routing (cheap model for simple tasks)
- Caching for repeated queries
- Daily spend alerts
Testing
- Unit tests for all tool adapters
- Evaluation dataset with 50+ test cases
- Quality baseline established
- Regression tests passing
Observability
- Execution tracing enabled
- Dashboards for health, cost, quality, throughput
- Alerts configured for error rate and cost spikes
- Log retention policy set
Security
- Tools have minimal permissions (principle of least privilege)
- User input is sanitized before inclusion in prompts
- Sensitive data redacted from logs
- Agent scope boundaries documented and enforced
What you now understand
| Area | Key takeaway |
|---|---|
| Reliability | Five levels: validation → retries → fallbacks → degradation → escalation |
| Cost | Route by complexity, cache aggressively, monitor cost per outcome |
| Testing | Layer unit → integration → evaluation → canary tests |
| Observability | Trace every execution, dashboard four views, alert on thresholds |
| Deployment | Canary or shadow mode; never big-bang deployments for agents |
Congratulations — you've completed the Agentic AI Workflows learning path. You now have the architecture patterns, implementation strategies, and production practices to build AI agent systems that are reliable, cost-effective, and observable.
Apply what you've learned:
- Build agentic workflows visually in Skytells Orchestrator
- Use the Skytells AI API to power your agents with state-of-the-art models
- Explore the Orchestrator Mastery path to learn the visual workflow builder in depth