Planning, Memory & Evaluation
Learn how agents decompose complex tasks, maintain context across steps, and systematically evaluate outputs for reliability.
What you'll learn in this module
- How agents plan and decompose tasks (with and without feedback)
- How to design memory systems that give agents the right context
- How to build evaluation pipelines that catch failures before users do
- When to use LLM-as-judge vs. programmatic evaluation
Planning: How Agents Decompose Tasks
Planning is the process of converting a high-level goal into a sequence of actionable steps. It sits at the heart of every agentic system that handles tasks more complex than a single tool call.
Planning without feedback
The simplest approach: generate the entire plan upfront, then execute it.
- **Pros:** predictable, fast, easy to debug
- **Cons:** can't adapt if early steps fail or return unexpected results
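The two phases can be sketched as follows. This is a minimal illustration, not a real framework API: `generatePlan` stands in for a single LLM call that returns structured steps, and the step shape is invented for the example.

```typescript
// Plan-then-execute: the full plan is produced once, then run in order.
// `generatePlan` stands in for an LLM call; here it is a hard-coded stub.
type Step = { tool: string; args: Record<string, unknown> };

function generatePlan(goal: string): Step[] {
  // A real agent would make one LLM call here and parse structured steps.
  return [
    { tool: "search_users", args: { email: "[email protected]" } },
    { tool: "get_order", args: { order_id: 8891 } },
    { tool: "process_refund", args: { order_id: 8891, amount: 49.99 } },
  ];
}

function executePlan(plan: Step[], runTool: (s: Step) => string): string[] {
  // No feedback loop: each step runs regardless of earlier results.
  return plan.map(runTool);
}

const results = executePlan(generatePlan("refund order 8891"), (s) => `${s.tool}: ok`);
```

Note that `executePlan` never looks at intermediate results, which is exactly the weakness listed above.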
Planning with feedback (ReAct)
The ReAct pattern interleaves reasoning and acting. After each step, the agent reflects on the result and adjusts its plan.
Each iteration produces a reasoning trace:
```
Thought: I need to find the user's account details before I can process the refund.
Action: search_users(email="[email protected]")
Observation: Found user #4521, account active, 3 recent orders.
Thought: User found. Now I need to check if order #8891 is eligible for refund.
Action: get_order(order_id=8891)
Observation: Order #8891, $49.99, delivered 2 days ago. Return window: 30 days.
Thought: Order is within return window. I can process the refund.
Action: process_refund(order_id=8891, amount=49.99)
```

Planning with reflection
Add a dedicated reflection step after key milestones. The reflection LLM reviews progress and can modify the remaining plan.
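The loop behind these two patterns can be sketched in a few lines. Everything here is a stub: `think`, `act`, and `reflect` stand in for LLM and tool calls, and the shapes are illustrative.

```typescript
// ReAct-style loop with a reflection hook (all LLM/tool calls stubbed).
// After each action the agent records an observation; every `reflectEvery`
// steps a reflection pass gets a chance to review the history so far.
type Trace = { thought: string; action: string; observation: string };

function runAgent(
  maxSteps: number,
  reflectEvery: number,
  think: (history: Trace[]) => { thought: string; action: string; done: boolean },
  act: (action: string) => string,
  reflect: (history: Trace[]) => void,
): Trace[] {
  const history: Trace[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const { thought, action, done } = think(history);
    if (done) break;
    history.push({ thought, action, observation: act(action) });
    if (step % reflectEvery === 0) reflect(history); // milestone review
  }
  return history;
}

// Stubbed run: the "model" finishes after two steps.
const trace = runAgent(
  5,
  2,
  (h) =>
    h.length < 2
      ? { thought: `step ${h.length + 1}`, action: "noop", done: false }
      : { thought: "finished", action: "", done: true },
  () => "ok",
  () => {},
);
```

The key structural difference from plan-then-execute is that `think` sees the full history on every iteration, so the plan can change mid-run.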
Choosing a planning strategy
| Strategy | Best for | Cost | Reliability |
|---|---|---|---|
| Plan-then-execute | Well-understood tasks with predictable steps | Low | High if task is routine |
| ReAct | Tasks requiring adaptation based on intermediate results | Medium | High for variable inputs |
| Plan + Reflect | Long-running tasks where mid-course correction matters | High | Highest |
Memory: Giving Agents the Right Context
LLMs have a fixed context window. Agents that run multi-step tasks need to manage what's in that window carefully.
Types of agent memory
| Memory type | Lifespan | Implementation |
|---|---|---|
| Working memory | Current task only | The prompt itself — include recent context, tool results |
| Short-term memory | Current session | Conversation history, rolling summary |
| Long-term memory | Across sessions | Database, vector store, user profile |
Practical memory patterns
Pattern 1: Sliding window
Keep the most recent N messages in context. Simple and effective for conversational agents.
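A minimal sketch of the idea, assuming a simple message shape (the types here are illustrative, not a specific SDK):

```typescript
// Sliding-window memory: keep the system prompt plus the last N messages.
type Message = { role: "system" | "user" | "assistant"; content: string };

function buildContext(system: Message, history: Message[], windowSize: number): Message[] {
  // slice(-N) keeps only the most recent N entries.
  return [system, ...history.slice(-windowSize)];
}

const sys: Message = { role: "system", content: "You are a support agent." };
const history: Message[] = Array.from({ length: 25 }, (_, i) => {
  const role: Message["role"] = i % 2 === 0 ? "user" : "assistant";
  return { role, content: `message ${i + 1}` };
});
const context = buildContext(sys, history, 10);
```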
```
[System prompt] + [Last 10 messages] + [Current message]
```

Pattern 2: Summary + recent
Periodically summarize older context into a compressed form:
```
[System prompt] + [Summary of messages 1-50] + [Messages 51-60] + [Current]
```

Pattern 3: Retrieval-augmented memory
Store all interactions in a vector database. Before each LLM call, retrieve the most relevant past interactions:
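A toy version of the retrieval step, with cosine similarity over hand-made 3-dimensional vectors. In a real system the embeddings come from an embedding model and the store is a vector database; everything here is a stand-in to show the mechanics.

```typescript
// Toy retrieval-augmented memory: past interactions are stored with an
// embedding; before each LLM call we retrieve the top-k most similar.
type MemoryEntry = { text: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

function retrieve(store: MemoryEntry[], query: number[], k: number): string[] {
  return [...store]
    .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query))
    .slice(0, k)
    .map((e) => e.text);
}

const store: MemoryEntry[] = [
  { text: "User prefers email contact", embedding: [1, 0, 0] },
  { text: "Refund issued for order 8891", embedding: [0, 1, 0] },
  { text: "User asked about shipping times", embedding: [0.9, 0.1, 0] },
];
const top = retrieve(store, [1, 0, 0], 2);
```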
Memory and Orchestrator
| Orchestrator feature | Memory function |
|---|---|
| Template variables | Working memory — pass data between steps within a single execution |
| Execution logs | Short-term memory — review what happened in past runs |
| Integrations | External memory — connect to databases, vector stores, knowledge bases |
The key insight: most production agents work fine with working memory (template variables / step outputs) plus retrieval for domain knowledge. Full long-term memory is only needed for personalized, session-spanning agents.
Evaluation: Catching Failures Before Users Do
If you can't measure it, you can't improve it. Agent evaluation is how you go from "it seems to work" to "it works reliably."
The evaluation spectrum
Code-based evaluation
Programmatic checks for objective criteria:
| Check | Implementation |
|---|---|
| Format validation | Does the output match the expected JSON schema? |
| Length constraints | Is the summary between 50 and 200 words? |
| Required fields | Does the response include all mandatory fields? |
| Factual anchoring | Do cited facts appear in the source material? |
| Safety filters | Does the output contain blocked patterns? |
```typescript
interface AgentOutput { text: string; data: Record<string, unknown> }
interface EvalResult { passed: boolean; checks: Record<string, boolean> }

// Example values; a real agent would define these per task.
const requiredFields = ["summary", "status"];
const blockedPatterns = [/password/i];

function isValidJson(text: string): boolean {
  try { JSON.parse(text); return true; } catch { return false; }
}

function evaluate(output: AgentOutput): EvalResult {
  const checks = {
    validJson: isValidJson(output.text),
    withinLength: output.text.length >= 50 && output.text.length <= 1000,
    hasRequiredFields: requiredFields.every((f) => output.data[f] !== undefined),
    noBlockedContent: !blockedPatterns.some((p) => p.test(output.text)),
  };
  return {
    passed: Object.values(checks).every(Boolean),
    checks,
  };
}
```

LLM-as-judge
Use an LLM to evaluate another LLM's output. Best for subjective or nuanced criteria.
```
You are evaluating an AI assistant's response. Score each criterion 1-5:

1. **Relevance**: Does the response address the user's actual question?
2. **Accuracy**: Are all facts and claims correct?
3. **Completeness**: Does it cover all important aspects?
4. **Clarity**: Is it well-structured and easy to understand?

Response to evaluate:
{{response}}

Original question:
{{question}}

For each criterion, provide:
- Score (1-5)
- One-sentence justification

Then provide an overall PASS/FAIL recommendation.
```

When to use which
| Method | Use for | Accuracy | Cost | Speed |
|---|---|---|---|---|
| Code checks | Objective, measurable criteria | High for what it covers | Negligible | Instant |
| LLM-as-judge | Subjective quality, tone, completeness | Good, ~80% agreement with humans | Moderate | Seconds |
| Human review | Final quality gate, edge cases, safety | Best | High | Minutes-hours |
Building an evaluation pipeline
For production agents, combine all three:
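One way to layer the three methods, sketched with a stubbed judge. The thresholds and the `judgeScore` signature are illustrative assumptions: cheap code checks gate everything, the judge runs only on outputs that pass them, and borderline judge scores escalate to a human.

```typescript
// Layered evaluation: code checks -> LLM judge (stubbed) -> human review.
type Verdict = "pass" | "fail" | "needs_human_review";

function evaluatePipeline(
  output: string,
  codeChecks: ((o: string) => boolean)[],
  judgeScore: (o: string) => number, // 1-5; stands in for an LLM judge call
): Verdict {
  if (!codeChecks.every((check) => check(output))) return "fail"; // objective gate
  const score = judgeScore(output);
  if (score >= 4) return "pass";
  if (score <= 2) return "fail";
  return "needs_human_review"; // borderline scores go to a human
}

const checks = [
  (o: string) => o.length >= 10,       // length constraint
  (o: string) => !/password/i.test(o), // safety filter
];
const verdict = evaluatePipeline("A helpful, complete answer.", checks, () => 5);
```

This ordering keeps cost proportional to uncertainty: most outputs are settled by near-free code checks, and only ambiguous cases reach the expensive layers.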
Building your evaluation dataset
Start collecting evaluation data from day one:
- Log all inputs and outputs — this is your raw evaluation corpus
- Manually label a subset — even 50-100 labeled examples are valuable
- Track failure patterns — categorize failures to focus improvement
- Version your eval set — as you fix failure modes, add new test cases
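One possible shape for an entry in a versioned eval set (field names are invented for illustration): each case records what it checks, which failure mode it guards against, and when it was added.

```typescript
// A hypothetical eval-case record; the schema is an illustrative sketch.
type EvalCase = {
  id: string;
  input: string;
  expected: { mustContain: string[]; maxWords: number };
  failureMode?: string; // the failure category this case guards against
  addedInVersion: string;
};

const evalSet: EvalCase[] = [
  {
    id: "refund-001",
    input: "I want a refund for order 8891",
    expected: { mustContain: ["refund", "8891"], maxWords: 120 },
    failureMode: "wrong-order-id",
    addedInVersion: "v3",
  },
];
```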
Putting It Together: A Complete Agent Design
Combining planning, memory, and evaluation into a production-ready agent:
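A compact sketch of how the three pieces fit together: a plan (here just a list of step names), working memory as a sliding window of recent step outputs, and an evaluation gate on the final result. All names and shapes are illustrative, not a framework API.

```typescript
// End-to-end sketch: plan -> execute with working memory -> evaluate.
type StepResult = { step: string; output: string };

function runPipeline(
  goal: string,
  plan: string[],                            // planning: steps for the goal
  runStep: (step: string, memory: StepResult[]) => string,
  isValid: (finalOutput: string) => boolean, // evaluation gate
  windowSize = 5,                            // memory: last N step outputs
): { output: string; passed: boolean } {
  const memory: StepResult[] = [];
  let output = "";
  for (const step of plan) {
    output = runStep(step, memory.slice(-windowSize)); // working memory
    memory.push({ step, output });
  }
  return { output, passed: isValid(output) };
}

const result = runPipeline(
  "summarize order history",
  ["fetch orders", "summarize"],
  (step, mem) => `${step} (saw ${mem.length} prior results)`,
  (o) => o.length > 0,
);
```

Each concern stays swappable: the planner can become ReAct-style, the window can become summary-plus-recent, and the gate can become the layered pipeline from the evaluation section.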
What you now understand
| Concept | Key takeaway |
|---|---|
| Planning | Choose plan-then-execute for routine tasks, ReAct for adaptive tasks, reflection for long-running tasks |
| Memory | Working memory (step outputs) handles most cases; add retrieval for domain knowledge |
| Code evaluation | Fast, cheap — use for all objective criteria |
| LLM-as-judge | Flexible — use for subjective quality with explicit rubrics |
| Eval pipeline | Layer code checks → LLM judge → human review for production reliability |
| Eval dataset | Start collecting day one; version and grow it as you find failure modes |
Up next: Building Production Agents — reliability engineering, cost optimization, testing strategies, and deployment patterns.