Advanced · 40 min · Module 5 of 6

Planning, Memory & Evaluation

Learn how agents decompose complex tasks, maintain context across steps, and systematically evaluate outputs for reliability.

What you'll learn in this module

  • How agents plan and decompose tasks (with and without feedback)
  • How to design memory systems that give agents the right context
  • How to build evaluation pipelines that catch failures before users do
  • When to use LLM-as-judge vs. programmatic evaluation

Planning: How Agents Decompose Tasks

Planning is the process of converting a high-level goal into a sequence of actionable steps. It sits at the heart of every agentic system that handles tasks more complex than a single tool call.

Planning without feedback

The simplest approach: generate the entire plan upfront, then execute it.

Goal → Planner LLM (generate step list) → Execute step 1 → Execute step 2 → … → Execute step N → Result

Pros: predictable, fast, easy to debug.
Cons: can't adapt if early steps fail or return unexpected results.
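In code, plan-then-execute can be sketched as follows; the planner is stubbed where a real agent would make an LLM call:

```typescript
type Step = (input: string) => string;

// A real planner would be an LLM call; this stub returns a fixed plan.
function generatePlan(goal: string): Step[] {
  return [
    (s) => `${s} -> fetched data`,
    (s) => `${s} -> transformed`,
    (s) => `${s} -> summarized`,
  ];
}

// Run every step in order. There is no feedback loop, so an unexpected
// intermediate result cannot change the remaining steps.
function planThenExecute(goal: string): string {
  return generatePlan(goal).reduce((acc, step) => step(acc), goal);
}
```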

Planning with feedback (ReAct)

The ReAct pattern interleaves reasoning and acting. After each step, the agent reflects on the result and adjusts its plan.

Goal → Think (what should I do next?) → Act (execute tool/action) → Observe (process result) → goal achieved? If no, loop back to Think; if yes, return the Result.

Each iteration produces a reasoning trace:

Thought: I need to find the user's account details before I can process the refund.
Action: search_users(email="[email protected]")
Observation: Found user #4521, account active, 3 recent orders.
Thought: User found. Now I need to check if order #8891 is eligible for refund.
Action: get_order(order_id=8891)
Observation: Order #8891, $49.99, delivered 2 days ago. Return window: 30 days.
Thought: Order is within return window. I can process the refund.
Action: process_refund(order_id=8891, amount=49.99)
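The loop behind this trace can be sketched as below, with `think` and `act` stubbed where a real agent would call the LLM and its tools:

```typescript
interface Thought { action: string; done: boolean }

// Stub for the reasoning step: a real agent would prompt the LLM with
// the goal and the observation trace so far.
function think(goal: string, observations: string[]): Thought {
  return observations.length >= 3
    ? { action: "finish", done: true }
    : { action: `step_${observations.length + 1}`, done: false };
}

// Stub for tool execution.
function act(action: string): string {
  return `result of ${action}`;
}

// Interleave reasoning and acting until the model says it is done.
function reactLoop(goal: string, maxIters = 10): string[] {
  const observations: string[] = [];
  for (let i = 0; i < maxIters; i++) {
    const thought = think(goal, observations); // Thought
    if (thought.done) break;                   // goal achieved
    observations.push(act(thought.action));    // Action + Observation
  }
  return observations;
}
```

The `maxIters` cap matters in practice: without it, a model that never declares the goal achieved loops forever.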

Planning with reflection

Add a dedicated reflection step after key milestones. The reflection LLM reviews progress and can modify the remaining plan.

Goal → Plan → Execute steps 1-3 → Reflect (is the approach working? are results on track? should I adjust?) → either continue or produce a revised plan → Execute steps 4-6 → Result

Choosing a planning strategy

| Strategy | Best for | Cost | Reliability |
| --- | --- | --- | --- |
| Plan-then-execute | Well-understood tasks with predictable steps | Low | High if the task is routine |
| ReAct | Tasks requiring adaptation based on intermediate results | Medium | High for variable inputs |
| Plan + Reflect | Long-running tasks where mid-course correction matters | High | Highest |

Memory: Giving Agents the Right Context

LLMs have a fixed context window. Agents that run multi-step tasks need to manage what's in that window carefully.

Types of agent memory

The three memory stores and their typical contents:

  • Short-term memory: current task context, recent tool results, conversation history
  • Long-term memory: user preferences, past task summaries, domain knowledge
  • External memory: vector store / RAG, database, file system

| Memory type | Lifespan | Implementation |
| --- | --- | --- |
| Working memory | Current task only | The prompt itself: include recent context and tool results |
| Short-term memory | Current session | Conversation history, rolling summary |
| Long-term memory | Across sessions | Database, vector store, user profile |

Practical memory patterns

Pattern 1: Sliding window

Keep the most recent N messages in context. Simple and effective for conversational agents.

[System prompt] + [Last 10 messages] + [Current message]
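A minimal sketch of the window assembly, assuming a simple message shape:

```typescript
interface Message { role: "system" | "user" | "assistant"; content: string }

// Keep only the last `windowSize` history messages in the prompt.
// slice(-n) returns the final n elements, so older messages fall off.
function buildContext(
  system: Message,
  history: Message[],
  current: Message,
  windowSize = 10,
): Message[] {
  return [system, ...history.slice(-windowSize), current];
}
```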

Pattern 2: Summary + recent

Periodically summarize older context into a compressed form:

[System prompt] + [Summary of messages 1-50] + [Messages 51-60] + [Current]
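A sketch of the rolling-summary approach, with the summarizer stubbed where a production agent would make an LLM call:

```typescript
interface Message { role: string; content: string }

// Stub: in production this would be an LLM call that compresses the
// older messages into a few sentences.
function summarize(older: Message[]): Message {
  return { role: "system", content: `Summary of ${older.length} earlier messages` };
}

// Replace everything older than the last `recentSize` messages with a
// single summary message.
function compressHistory(history: Message[], recentSize = 10): Message[] {
  if (history.length <= recentSize) return history;
  const older = history.slice(0, history.length - recentSize);
  return [summarize(older), ...history.slice(-recentSize)];
}
```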

Pattern 3: Retrieval-augmented memory

Store all interactions in a vector database. Before each LLM call, retrieve the most relevant past interactions:

Current query → Embed query → Search vector store → Top-K relevant memories → Include in prompt → LLM generates response
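The retrieval step can be sketched as below. The toy `embed` function (a letter-frequency vector) stands in for a real embedding model; only the shape of `retrieve` matters:

```typescript
// Toy embedding for illustration only: a 26-bucket letter histogram.
function embed(text: string): number[] {
  const v = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) v[i] += 1;
  }
  return v;
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b) || 1);
}

// Score every stored memory against the query and keep the top K.
function retrieve(query: string, memories: string[], k = 3): string[] {
  const q = embed(query);
  return memories
    .map((m) => ({ m, score: cosine(q, embed(m)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.m);
}
```

A real implementation would swap `embed` for an embedding-model call and the array scan for an indexed vector store, but the retrieve-then-include-in-prompt flow is the same.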

Memory and Orchestrator

| Orchestrator feature | Memory function |
| --- | --- |
| Template variables | Working memory: pass data between steps within a single execution |
| Execution logs | Short-term memory: review what happened in past runs |
| Integrations | External memory: connect to databases, vector stores, knowledge bases |

Evaluation: Catching Failures Before Users Do

If you can't measure it, you can't improve it. Agent evaluation is how you go from "it seems to work" to "it works reliably."

The evaluation spectrum

Code-based checks (fastest, cheapest) → LLM-as-judge (flexible, moderate cost) → Human review (most accurate, most expensive)

Code-based evaluation

Programmatic checks for objective criteria:

| Check | Implementation |
| --- | --- |
| Format validation | Does the output match the expected JSON schema? |
| Length constraints | Is the summary between 50 and 200 words? |
| Required fields | Does the response include all mandatory fields? |
| Factual anchoring | Do cited facts appear in the source material? |
| Safety filters | Does the output contain blocked patterns? |
interface AgentOutput { text: string; data: Record<string, unknown> }
interface EvalResult { passed: boolean; checks: Record<string, boolean> }

// Illustrative policy values: requiredFields and blockedPatterns are
// per-agent configuration, not a fixed API.
const requiredFields = ["summary", "status"];
const blockedPatterns: RegExp[] = [/\bpassword\b/i];

const isValidJson = (s: string): boolean => {
  try { JSON.parse(s); return true; } catch { return false; }
};

function evaluate(output: AgentOutput): EvalResult {
  const checks = {
    validJson: isValidJson(output.text),
    withinLength: output.text.length >= 50 && output.text.length <= 1000,
    hasRequiredFields: requiredFields.every((f) => output.data[f] !== undefined),
    noBlockedContent: !blockedPatterns.some((p) => p.test(output.text)),
  };

  return {
    passed: Object.values(checks).every(Boolean),
    checks,
  };
}

LLM-as-judge

Use an LLM to evaluate another LLM's output. Best for subjective or nuanced criteria.

You are evaluating an AI assistant's response. Score each criterion 1-5:

1. **Relevance**: Does the response address the user's actual question?
2. **Accuracy**: Are all facts and claims correct?
3. **Completeness**: Does it cover all important aspects?
4. **Clarity**: Is it well-structured and easy to understand?

Response to evaluate:
{{response}}

Original question:
{{question}}

For each criterion, provide:
- Score (1-5)
- One-sentence justification

Then provide an overall PASS/FAIL recommendation.
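One way to consume the judge's reply is to parse the per-criterion scores and average them. This sketch assumes the judge emits lines like `Relevance: 4`; the regex and the 3.5 threshold are illustrative choices, not a fixed format:

```typescript
interface JudgeResult { scores: Record<string, number>; pass: boolean }

// Parse lines of the form "Relevance: 4" (optionally bolded) and
// average the scores into a pass/fail decision.
function parseJudgeOutput(text: string, passThreshold = 3.5): JudgeResult {
  const scores: Record<string, number> = {};
  for (const line of text.split("\n")) {
    const m = line.trim().match(/^\**(\w+)\**:\s*([1-5])\b/);
    if (m) scores[m[1].toLowerCase()] = Number(m[2]);
  }
  const values = Object.values(scores);
  const mean = values.reduce((s, x) => s + x, 0) / (values.length || 1);
  // No parseable scores means no evidence of quality: fail closed.
  return { scores, pass: values.length > 0 && mean >= passThreshold };
}
```

Constraining the judge's output format this tightly (or requesting JSON) is what makes its verdicts machine-actionable in a pipeline.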

When to use which

| Method | Use for | Accuracy | Cost | Speed |
| --- | --- | --- | --- | --- |
| Code checks | Objective, measurable criteria | High for what it covers | Negligible | Instant |
| LLM-as-judge | Subjective quality, tone, completeness | Good (~80% agreement with humans) | Moderate | Seconds |
| Human review | Final quality gate, edge cases, safety | Best | High | Minutes to hours |

Building an evaluation pipeline

For production agents, combine all three:

Agent output → Code checks (format, length, safety): fail → reject and retry; pass → LLM-as-judge (relevance, accuracy, quality): score below threshold → flag for human review; score at or above threshold → accept. Human review is sample-based, and reviewed cases are added to the eval dataset.
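The gating logic can be sketched as a small function; `codeCheck` and `llmJudge` are stand-ins for the checks described above, and the judge is assumed to return a mean rubric score on the 1-5 scale:

```typescript
type Verdict = "reject" | "flag" | "accept";

// Cheap checks gate first so most failures never reach the (paid) judge.
function runPipeline(
  output: string,
  codeCheck: (o: string) => boolean,
  llmJudge: (o: string) => number,
  threshold = 3.5,
): Verdict {
  if (!codeCheck(output)) return "reject";         // retry upstream
  if (llmJudge(output) < threshold) return "flag"; // route to human review
  return "accept";
}
```

Ordering by cost is the point of the layering: a rejected output never incurs a judge call, and only borderline outputs reach a human.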

Building your evaluation dataset

Start collecting evaluation data from day one:

  1. Log all inputs and outputs — this is your raw evaluation corpus
  2. Manually label a subset — even 50-100 labeled examples are valuable
  3. Track failure patterns — categorize failures to focus improvement
  4. Version your eval set — as you fix failure modes, add new test cases

Putting It Together: A Complete Agent Design

Combining planning, memory, and evaluation into a production-ready agent:

User task → Planner (decompose into steps) → Memory retrieval (fetch relevant context) → execute each step, passing results through working memory → code-check evaluation after each step (fail: retry or adjust plan) → LLM-as-judge evaluation of the result (below threshold: reflect and replan; pass: final output)

What you now understand

| Concept | Key takeaway |
| --- | --- |
| Planning | Choose plan-then-execute for routine tasks, ReAct for adaptive tasks, and reflection for long-running tasks |
| Memory | Working memory (step outputs) handles most cases; add retrieval for domain knowledge |
| Code evaluation | Fast and cheap; use it for all objective criteria |
| LLM-as-judge | Flexible; use it for subjective quality with explicit rubrics |
| Eval pipeline | Layer code checks → LLM judge → human review for production reliability |
| Eval dataset | Start collecting on day one; version and grow it as you find failure modes |

Up next: Building Production Agents — reliability engineering, cost optimization, testing strategies, and deployment patterns.
