Advanced · 40 min · Module 5 of 6

Planning, Memory & Evaluation

Learn how agents decompose complex tasks, maintain context across steps, and systematically evaluate outputs for reliability.

What you'll learn in this module

  • How agents plan and decompose tasks (with and without feedback)
  • How to design memory systems that give agents the right context
  • How to build evaluation pipelines that catch failures before users do
  • When to use LLM-as-judge vs. programmatic evaluation

Planning: How Agents Decompose Tasks

Planning is the process of converting a high-level goal into a sequence of actionable steps. It sits at the heart of every agentic system that handles tasks more complex than a single tool call.

Planning without feedback

The simplest approach: generate the entire plan upfront, then execute it.

Goal → Planner LLM (generate step list) → Execute step 1 → Execute step 2 → … → Execute step N → Result

Pros: predictable, fast, easy to debug.
Cons: can't adapt if early steps fail or return unexpected results.
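In code, plan-then-execute can be sketched as follows; the planner is stubbed where a real agent would make an LLM call:

```typescript
type Step = (input: string) => string;

// A real planner would be an LLM call; this stub returns a fixed plan.
function generatePlan(goal: string): Step[] {
  return [
    (s) => `${s} -> fetched data`,
    (s) => `${s} -> transformed`,
    (s) => `${s} -> summarized`,
  ];
}

// Run every step in order. There is no feedback loop, so an unexpected
// intermediate result cannot change the remaining steps.
function planThenExecute(goal: string): string {
  return generatePlan(goal).reduce((acc, step) => step(acc), goal);
}
```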

Planning with feedback (ReAct)

The ReAct pattern interleaves reasoning and acting. After each step, the agent reflects on the result and adjusts its plan.

Goal → Think (what should I do next?) → Act (execute tool/action) → Observe (process result) → goal achieved? If no, loop back to Think; if yes, return the Result.

Each iteration produces a reasoning trace:

Thought: I need to find the user's account details before I can process the refund.
Action: search_users(email="[email protected]")
Observation: Found user #4521, account active, 3 recent orders.
Thought: User found. Now I need to check if order #8891 is eligible for refund.
Action: get_order(order_id=8891)
Observation: Order #8891, $49.99, delivered 2 days ago. Return window: 30 days.
Thought: Order is within return window. I can process the refund.
Action: process_refund(order_id=8891, amount=49.99)
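The loop behind this trace can be sketched as below, with `think` and `act` stubbed where a real agent would call the LLM and its tools:

```typescript
interface Thought { action: string; done: boolean }

// Stub for the reasoning step: a real agent would prompt the LLM with
// the goal and the observation trace so far.
function think(goal: string, observations: string[]): Thought {
  return observations.length >= 3
    ? { action: "finish", done: true }
    : { action: `step_${observations.length + 1}`, done: false };
}

// Stub for tool execution.
function act(action: string): string {
  return `result of ${action}`;
}

// Interleave reasoning and acting until the model says it is done.
function reactLoop(goal: string, maxIters = 10): string[] {
  const observations: string[] = [];
  for (let i = 0; i < maxIters; i++) {
    const thought = think(goal, observations); // Thought
    if (thought.done) break;                   // goal achieved
    observations.push(act(thought.action));    // Action + Observation
  }
  return observations;
}
```

The `maxIters` cap matters in practice: without it, a model that never declares the goal achieved loops forever.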

Planning with reflection

Add a dedicated reflection step after key milestones. The reflection LLM reviews progress and can modify the remaining plan.

Goal → Plan → Execute steps 1-3 → Reflect (is the approach working? are results on track? should I adjust?) → either continue or produce a revised plan → Execute steps 4-6 → Result

Choosing a planning strategy

| Strategy | Best for | Cost | Reliability |
| --- | --- | --- | --- |
| Plan-then-execute | Well-understood tasks with predictable steps | Low | High if the task is routine |
| ReAct | Tasks requiring adaptation based on intermediate results | Medium | High for variable inputs |
| Plan + Reflect | Long-running tasks where mid-course correction matters | High | Highest |

Memory: Giving Agents the Right Context

LLMs have a fixed context window. Agents that run multi-step tasks need to manage what's in that window carefully.

Types of agent memory

The three memory stores and their typical contents:

  • Short-term memory: current task context, recent tool results, conversation history
  • Long-term memory: user preferences, past task summaries, domain knowledge
  • External memory: vector store / RAG, database, file system

| Memory type | Lifespan | Implementation |
| --- | --- | --- |
| Working memory | Current task only | The prompt itself: include recent context and tool results |
| Short-term memory | Current session | Conversation history, rolling summary |
| Long-term memory | Across sessions | Database, vector store, user profile |

Practical memory patterns

Pattern 1: Sliding window

Keep the most recent N messages in context. Simple and effective for conversational agents.

[System prompt] + [Last 10 messages] + [Current message]
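A minimal sketch of the window assembly, assuming a simple message shape:

```typescript
interface Message { role: "system" | "user" | "assistant"; content: string }

// Keep only the last `windowSize` history messages in the prompt.
// slice(-n) returns the final n elements, so older messages fall off.
function buildContext(
  system: Message,
  history: Message[],
  current: Message,
  windowSize = 10,
): Message[] {
  return [system, ...history.slice(-windowSize), current];
}
```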

Pattern 2: Summary + recent

Periodically summarize older context into a compressed form:

[System prompt] + [Summary of messages 1-50] + [Messages 51-60] + [Current]
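A sketch of the rolling-summary approach, with the summarizer stubbed where a production agent would make an LLM call:

```typescript
interface Message { role: string; content: string }

// Stub: in production this would be an LLM call that compresses the
// older messages into a few sentences.
function summarize(older: Message[]): Message {
  return { role: "system", content: `Summary of ${older.length} earlier messages` };
}

// Replace everything older than the last `recentSize` messages with a
// single summary message.
function compressHistory(history: Message[], recentSize = 10): Message[] {
  if (history.length <= recentSize) return history;
  const older = history.slice(0, history.length - recentSize);
  return [summarize(older), ...history.slice(-recentSize)];
}
```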

Pattern 3: Retrieval-augmented memory

Store all interactions in a vector database. Before each LLM call, retrieve the most relevant past interactions:

Current query → Embed query → Search vector store → Top-K relevant memories → Include in prompt → LLM generates response
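The retrieval step can be sketched as below. The toy `embed` function (a letter-frequency vector) stands in for a real embedding model; only the shape of `retrieve` matters:

```typescript
// Toy embedding for illustration only: a 26-bucket letter histogram.
function embed(text: string): number[] {
  const v = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) v[i] += 1;
  }
  return v;
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b) || 1);
}

// Score every stored memory against the query and keep the top K.
function retrieve(query: string, memories: string[], k = 3): string[] {
  const q = embed(query);
  return memories
    .map((m) => ({ m, score: cosine(q, embed(m)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.m);
}
```

A real implementation would swap `embed` for an embedding-model call and the array scan for an indexed vector store, but the retrieve-then-include-in-prompt flow is the same.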

Memory and Orchestrator

| Orchestrator feature | Memory function |
| --- | --- |
| Template variables | Working memory: pass data between steps within a single execution |
| Execution logs | Short-term memory: review what happened in past runs |
| Integrations | External memory: connect to databases, vector stores, knowledge bases |

Evaluation: Catching Failures Before Users Do

If you can't measure it, you can't improve it. Agent evaluation is how you go from "it seems to work" to "it works reliably."

The evaluation spectrum

Code-based checks (fastest, cheapest) → LLM-as-judge (flexible, moderate cost) → Human review (most accurate, most expensive)

Code-based evaluation

Programmatic checks for objective criteria:

| Check | Implementation |
| --- | --- |
| Format validation | Does the output match the expected JSON schema? |
| Length constraints | Is the summary between 50 and 200 words? |
| Required fields | Does the response include all mandatory fields? |
| Factual anchoring | Do cited facts appear in the source material? |
| Safety filters | Does the output contain blocked patterns? |
interface AgentOutput { text: string; data: Record<string, unknown> }
interface EvalResult { passed: boolean; checks: Record<string, boolean> }

// Illustrative policy values: requiredFields and blockedPatterns are
// per-agent configuration, not a fixed API.
const requiredFields = ["summary", "status"];
const blockedPatterns: RegExp[] = [/\bpassword\b/i];

const isValidJson = (s: string): boolean => {
  try { JSON.parse(s); return true; } catch { return false; }
};

function evaluate(output: AgentOutput): EvalResult {
  const checks = {
    validJson: isValidJson(output.text),
    withinLength: output.text.length >= 50 && output.text.length <= 1000,
    hasRequiredFields: requiredFields.every((f) => output.data[f] !== undefined),
    noBlockedContent: !blockedPatterns.some((p) => p.test(output.text)),
  };

  return {
    passed: Object.values(checks).every(Boolean),
    checks,
  };
}

LLM-as-judge

Use an LLM to evaluate another LLM's output. Best for subjective or nuanced criteria.

You are evaluating an AI assistant's response. Score each criterion 1-5:

1. **Relevance**: Does the response address the user's actual question?
2. **Accuracy**: Are all facts and claims correct?
3. **Completeness**: Does it cover all important aspects?
4. **Clarity**: Is it well-structured and easy to understand?

Response to evaluate:
{{response}}

Original question:
{{question}}

For each criterion, provide:
- Score (1-5)
- One-sentence justification

Then provide an overall PASS/FAIL recommendation.
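One way to consume the judge's reply is to parse the per-criterion scores and average them. This sketch assumes the judge emits lines like `Relevance: 4`; the regex and the 3.5 threshold are illustrative choices, not a fixed format:

```typescript
interface JudgeResult { scores: Record<string, number>; pass: boolean }

// Parse lines of the form "Relevance: 4" (optionally bolded) and
// average the scores into a pass/fail decision.
function parseJudgeOutput(text: string, passThreshold = 3.5): JudgeResult {
  const scores: Record<string, number> = {};
  for (const line of text.split("\n")) {
    const m = line.trim().match(/^\**(\w+)\**:\s*([1-5])\b/);
    if (m) scores[m[1].toLowerCase()] = Number(m[2]);
  }
  const values = Object.values(scores);
  const mean = values.reduce((s, x) => s + x, 0) / (values.length || 1);
  // No parseable scores means no evidence of quality: fail closed.
  return { scores, pass: values.length > 0 && mean >= passThreshold };
}
```

Constraining the judge's output format this tightly (or requesting JSON) is what makes its verdicts machine-actionable in a pipeline.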

When to use which

| Method | Use for | Accuracy | Cost | Speed |
| --- | --- | --- | --- | --- |
| Code checks | Objective, measurable criteria | High for what it covers | Negligible | Instant |
| LLM-as-judge | Subjective quality, tone, completeness | Good (~80% agreement with humans) | Moderate | Seconds |
| Human review | Final quality gate, edge cases, safety | Best | High | Minutes to hours |

Building an evaluation pipeline

For production agents, combine all three:

Agent output → Code checks (format, length, safety): fail → reject and retry; pass → LLM-as-judge (relevance, accuracy, quality): score below threshold → flag for human review; score at or above threshold → accept. Human review is sample-based, and reviewed cases are added to the eval dataset.
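The gating logic can be sketched as a small function; `codeCheck` and `llmJudge` are stand-ins for the checks described above, and the judge is assumed to return a mean rubric score on the 1-5 scale:

```typescript
type Verdict = "reject" | "flag" | "accept";

// Cheap checks gate first so most failures never reach the (paid) judge.
function runPipeline(
  output: string,
  codeCheck: (o: string) => boolean,
  llmJudge: (o: string) => number,
  threshold = 3.5,
): Verdict {
  if (!codeCheck(output)) return "reject";         // retry upstream
  if (llmJudge(output) < threshold) return "flag"; // route to human review
  return "accept";
}
```

Ordering by cost is the point of the layering: a rejected output never incurs a judge call, and only borderline outputs reach a human.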

Building your evaluation dataset

Start collecting evaluation data from day one:

  1. Log all inputs and outputs — this is your raw evaluation corpus
  2. Manually label a subset — even 50-100 labeled examples are valuable
  3. Track failure patterns — categorize failures to focus improvement
  4. Version your eval set — as you fix failure modes, add new test cases

Putting It Together: A Complete Agent Design

Combining planning, memory, and evaluation into a production-ready agent:

User task → Planner (decompose into steps) → Memory retrieval (fetch relevant context) → execute each step, passing results through working memory → code-check evaluation after each step (fail: retry or adjust plan) → LLM-as-judge evaluation of the result (below threshold: reflect and replan; pass: final output)

What you now understand

| Concept | Key takeaway |
| --- | --- |
| Planning | Choose plan-then-execute for routine tasks, ReAct for adaptive tasks, and reflection for long-running tasks |
| Memory | Working memory (step outputs) handles most cases; add retrieval for domain knowledge |
| Code evaluation | Fast and cheap; use it for all objective criteria |
| LLM-as-judge | Flexible; use it for subjective quality with explicit rubrics |
| Eval pipeline | Layer code checks → LLM judge → human review for production reliability |
| Eval dataset | Start collecting on day one; version and grow it as you find failure modes |

Up next: Building Production Agents — reliability engineering, cost optimization, testing strategies, and deployment patterns.
