Cognition: How Agents Plan, Reason, and Self-Correct (NCP-AAI Module 4)

Module 4 of 14 · 21 min read · D5 · 10%

This is Module 4 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.

Scout’s ReAct loop handles narrow questions well. So I gave it a broad one: “Compare the EU and US regulatory approaches to AI agents.” It searched “EU AI regulation” — reasonable. Then “EU AI Act agents”. Then “EU AI Act autonomous systems”, and two more rewordings of the same search. Fourteen tool calls later — several per turn — it hit the iteration cap and printed the line you wrote in Module 2: Stopped after 6 iterations without a final answer. The transcript it left behind was deep on the EU side and had given the US side a single search. At no point did anything in that loop ask the question every human researcher asks first: what are the parts of this task?

An agent that acts without a plan burns tokens on redundant work and silently misses entire halves of the job. This module closes that gap: by the end, Scout decomposes the question into a structured plan, critiques its own plan once, then executes — and you’ll measure exactly what that critique pass buys.

In this module

  • You’ll learn:
    • Apply reasoning frameworks (chain-of-thought and task decomposition) and know what each one buys you (objective 5.2).
    • Engineer a multi-step planning strategy: ReAct vs. plan-and-execute, chosen on latency, cost, and auditability (5.3).
    • Implement a Planner node that turns a question into a structured, validated research plan living in graph state (5.3, 5.4).
    • Build a one-iteration reflection loop and measure what it adds on five test questions (5.5).
    • Budget reasoning: when deliberation is worth its token cost, and when it isn’t.
  • You’ll build: Scout’s Planner node — question → structured research plan, self-critiqued once before execution.
  • Exam domains covered: D5 — Cognition, Planning, and Memory — 10% of the exam. This module covers the planning half; Module 5 covers memory.
  • Prerequisites: Modules 1–3 (your refactored graph runs); NVIDIA API key configured, plus your Tavily key from Module 2 — compare_plans.py is the only part that runs without the Tavily key (it still needs your NVIDIA key).

Where you are

  • ✅ Module 1 — What Is Agentic AI? — vocabulary, landscape, first NIM call
  • ✅ Module 2 — Build Your First AI Agent — ReAct loop, tool calling, first graph
  • ✅ Module 3 — Agent Architecture — patterns, trade-offs, the design doc
  • 👉 Module 4 — Cognition: planning, reasoning, self-correction (you are here)
  • ⬜ Modules 5–14 — memory, RAG, multi-agent, evals, guardrails, deployment, the exam

Scout before: a single-agent ReAct loop, reorganized along the design doc but still purely reactive: it acts on the last observation, nothing more. Scout after: a planner → critic → planner → executor graph. The plan is a typed object in state, critiqued and revised once before a single search runs.

From reactive to deliberative: why agents plan

Module 3 named the three system temperaments from the study guide’s job description; this module builds the second one. Per the official study guide, a reactive system decides its next action from the current observation — sense, act, repeat. A deliberative system constructs a multi-step plan toward a goal before executing it. A hybrid system layers both: deliberate planning above, reactive execution below — which is exactly where Scout lands today.

Why does reactivity fail on broad tasks? Every decision in a ReAct loop is local — picked from the transcript so far. Nothing in the loop owns the global shape of the task. On a narrow question, local decisions suffice; on “compare two regulatory regimes,” the loop optimizes each next search and never notices that the US half of the comparison is starving.

The fix is the oldest tool in engineering: task decomposition — breaking a complex task into smaller, independently verifiable subtasks before solving any of them. Decomposition buys three things:

  • Coverage: enumerating the facets up front is the only reliable way to notice one is missing — before the run, not after.
  • Focus: each subtask gets a context window about its job, instead of one crowded transcript juggling every thread at once.
  • Verifiability: a subtask with its own expected output can be checked off — “found the 2025 EU enforcement actions” either happened or didn’t.

Decomposed steps are also what Scout’s specialists will divide among themselves when it becomes a team in Module 7.

Here’s the opening question, decomposed the way Scout’s Planner will learn to do it:

  1. Inventory the EU rules for AI agents — expected output: the instruments and their key obligations.
  2. Inventory the US approach — expected output: federal and state-level actions, and what they regulate.
  3. Compare enforcement and scope — expected output: a side-by-side of who enforces what, and on whom.
  4. Collect open disagreements — expected output: where the regimes conflict for an agent operating in both.

Four steps, each verifiable, jointly covering the question. The fourteen-call spiral never produced step 2.
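The four steps above can be written down as plain data, which is what makes a coverage check possible before anything runs. This is only a sketch: `Subtask` and `missing_facets` are hypothetical names invented for illustration (the lab's real schema arrives in Step 1), and matching facets against goal words is a deliberately crude assumption.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    """Hypothetical stand-in for one plan step: a goal plus checkable evidence."""
    goal: str
    expected_output: str


plan = [
    Subtask("Inventory the EU rules for AI agents",
            "the instruments and their key obligations"),
    Subtask("Inventory the US approach",
            "federal and state-level actions, and what they regulate"),
    Subtask("Compare enforcement and scope",
            "a side-by-side of who enforces what, and on whom"),
    Subtask("Collect open disagreements",
            "where the regimes conflict for an agent operating in both"),
]


def missing_facets(steps, facets):
    """Return the facets that appear as a word in no step's goal (case-insensitive)."""
    words = {w.lower() for s in steps for w in s.goal.split()}
    return [f for f in facets if f.lower() not in words]


full_check = missing_facets(plan, ["EU", "US"])         # every facet has a step
spiral_check = missing_facets(plan[:1], ["EU", "US"])   # a plan that only covers the EU
print(full_check, spiral_check)
```

Run against a one-step, EU-only plan, the check names the missing US facet up front, which is the gap the fourteen-call spiral only revealed after the fact.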

Reasoning frameworks: chain-of-thought and beyond

The study guide files two reasoning frameworks under objective 5.2, and the exam expects you to keep them apart.

Chain-of-thought (CoT) is prompting a model to reason step by step in text before committing to an answer — reasoning only, no actions. It improves multi-step answers because each intermediate conclusion conditions the next, and it leaves a trace you can read. But nothing in a chain of thought touches the world: no search, no tool, no observation.

ReAct, which you built in Module 2, interleaves that reasoning with tool actions and feeds each observation back into the next thought. That’s the whole distinction, and it’s worth one slow sentence because the exam loves it: CoT reasons; ReAct reasons and acts. A transcript with Action: and Observation: lines is ReAct; a transcript that’s pure Thought: from question to answer is chain-of-thought, however impressive the thinking looks. A third family from the blueprint, logic trees and prompt chains — branching and fixed-sequence structures where your code controls the flow — was Module 3’s territory (objective 1.6), and remains the right answer when the steps are known in advance.
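The transcript test in that paragraph is mechanical enough to sketch in a few lines. `classify_transcript` is a hypothetical helper, and it assumes transcripts use the literal `Thought:`, `Action:` and `Observation:` labels shown in this course; real traces may label things differently.

```python
import re


def classify_transcript(text: str) -> str:
    """Label a transcript 'ReAct' if it contains both Action: and Observation:
    markers, otherwise 'chain-of-thought'."""
    has_action = re.search(r"\bAction:", text) is not None
    has_observation = re.search(r"\bObservation:", text) is not None
    return "ReAct" if has_action and has_observation else "chain-of-thought"


cot = "Thought: 17 * 3 is 51.\nThought: 51 + 9 is 60.\nAnswer: 60"
react = (
    "Thought: I need the 2025 figure.\n"
    "Action: web_search('EU AI Act fines 2025')\n"
    "Observation: three results\n"
    "Thought: now I can compare."
)

print(classify_transcript(cot))
print(classify_transcript(react))
```

Note what the function ignores: the number or quality of `Thought:` lines never changes the label, only the presence of actions and observations does.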

Reasoning isn’t free, and that bill has a name: the reasoning budget — the tokens, latency, and money you allow a system to spend thinking before it acts. You’ve been paying it since Module 1: Nemotron 3 Nano is a reasoning model, its thinking tokens count against max_tokens, and the labs run with MAX_TOKENS = 8192 because smaller budgets returned empty answers with finish_reason="length" — the reasoning consumed everything. The budget exists at the system level too: Scout’s deliberation phase costs three extra LLM calls (draft, critique, revision) before a single search runs. On a multi-faceted research question, those calls pay for themselves in avoided redundant searches; on “what’s the capital of Australia,” they’re pure waste. Budgeting reasoning is the skill — not maximizing it.
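The empty-answer symptom described above is easy to detect in code. A minimal sketch, assuming the response has already been parsed into a dict in the OpenAI-compatible chat-completions shape (`choices[0].finish_reason`, `choices[0].message.content`); `budget_exhausted` is a hypothetical name, and the lab's own `llm.complete()` may surface this differently.

```python
def budget_exhausted(response: dict) -> bool:
    """True when generation stopped at the token cap and left no answer text,
    i.e. the reasoning consumed the whole max_tokens budget."""
    choice = response["choices"][0]
    content = (choice["message"].get("content") or "").strip()
    return choice["finish_reason"] == "length" and not content


# Hand-written example payloads, not real API output.
truncated = {"choices": [{"finish_reason": "length", "message": {"content": ""}}]}
complete = {"choices": [{"finish_reason": "stop", "message": {"content": "Canberra"}}]}

print(budget_exhausted(truncated), budget_exhausted(complete))
```

A check like this turns a silent empty string into an explicit signal that the budget, not the prompt, is what needs adjusting.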

Planning patterns: ReAct vs. plan-and-execute

Objective 5.3 asks you to engineer planning strategies for sequential and multi-step decision-making — in practice, to choose between two patterns and defend the choice.

You know the first: ReAct is interleaved planning — the agent plans exactly one step ahead, executes it, and lets the observation reshape everything that follows. The second is plan-and-execute: the agent drafts a complete multi-step plan up front, then executes the steps, returning to the planner only if something forces a change. Side by side:

flowchart TB
    subgraph react ["ReAct — interleaved"]
        direction TB
        R1[reason: what next?] --> R2[act: one tool call]
        R2 --> R3[observe the result]
        R3 -->|not done| R1
        R3 -->|done| R4([answer])
    end
    subgraph pae ["Plan-and-execute"]
        direction TB
        P1[plan the whole task] --> P2[execute steps 1..n]
        P2 --> P3{step failed or<br/>surprised you?}
        P3 -->|yes| P4[replan the remainder]
        P4 --> P2
        P3 -->|no| P5([answer])
    end

Left: every observation can redirect the run. Right: the global shape is fixed first; observations only redirect through an explicit replan.

The reference table for scenario questions — read the constraint in the question, find the row that decides:

| Criterion | ReAct (interleaved) | Plan-and-execute | Hybrid (plan + replanning) |
| --- | --- | --- | --- |
| Time to first action | Immediate: first tool call in one LLM turn | Delayed: the full plan is drafted first | Delayed at start, adaptive after |
| Token cost | Variable; redundant work on broad tasks | Planning calls up front, then focused execution | Highest worst case: plan + execute + replans |
| Adaptability to surprises | Best: every observation can change course | Worst: a stale plan executes blindly | Good: replans on defined triggers |
| Auditability / traceability | One transcript; intent stays implicit | The plan is an explicit, reviewable artifact before any action | Plan plus recorded replans: full decision history |
| Human approval | Hard: nothing exists to approve before actions start | Natural: pause after planning, approve, execute | Natural at plan time; re-approve on replan |
| Typical use | Narrow, unpredictable tasks in one domain | Multi-faceted research and reports; compliance contexts | Long-running production agents in changing environments |

Two rows decide most exam scenarios. Auditability: a plan that exists before execution is a reviewable artifact — you can log it, diff it against what actually ran, and show an auditor why the system did what it did; a ReAct transcript only reveals intent after the fact. And human approval: you cannot approve what doesn’t exist yet. Plan-and-execute creates a natural pause between planning and acting; in Module 9, a human will approve Scout’s plan at exactly that seam.
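"Diff it against what actually ran" can be taken literally. The sketch below compares the queries a plan listed with the queries an executor issued; `diff_plan_vs_run` and both example lists are hypothetical, and exact string equality is a simplifying assumption (a real audit would probably normalize or fuzzy-match queries).

```python
def diff_plan_vs_run(planned: list[str], executed: list[str]) -> dict:
    """Compare planned queries with executed ones, preserving original order."""
    planned_set, executed_set = set(planned), set(executed)
    return {
        "skipped": [q for q in planned if q not in executed_set],
        "unplanned": [q for q in executed if q not in planned_set],
    }


planned = ["EU AI Act obligations", "US federal AI actions", "US state AI laws"]
executed = ["EU AI Act obligations", "EU AI Act agents", "US federal AI actions"]

audit = diff_plan_vs_run(planned, executed)
print(audit)
```

With a pure ReAct transcript there is no `planned` list to pass in, which is the auditability gap that row of the table describes.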

Don’t over-rotate, though: plan-and-execute does not dominate ReAct “because it plans.” It pays its planning calls even when the task didn’t need them, and when reality diverges from the plan — a source is gone, a step’s premise was wrong — a pure plan-executor either executes nonsense or pays again to replan. Unpredictable, narrow tasks remain ReAct territory. That’s why production systems converge on the hybrid column, and why Scout becomes one today — hybrid in the study guide’s layered sense: deliberate plan above, reactive executor below. The table’s replanning triggers are the production upgrade described in the In-production note; today’s Scout freezes its plan after one revision.

One more exam term lives here. Coordinating a multi-step task through a typed, inspectable state object — which steps exist, what each produced, where the run currently stands — is stateful orchestration (objective 5.4 — the blueprint also files the term under 1.6, where Module 3 used it as a workflow tool; 5.4 is the task-coordination facet you build here). The plan can’t live in a prompt string or a local variable: it lives in ScoutState, next to the messages, where every node, every test, and every future module can read it. The payoff is concrete: when step 4 of 6 fails mid-run, a system that kept its plan and its per-step results in state can replan the remainder; a system that didn’t starts over.
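The "step 4 of 6 fails" payoff fits in a short sketch. `RunState` and `remaining_steps` are hypothetical names: today's ScoutState holds only the plan and a counter, so the per-step results and failure fields here are an assumption about what a replanning-capable state could look like.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunState:
    """Hypothetical run record: plan step ids, per-step results, and any failure."""
    step_ids: list
    results: dict = field(default_factory=dict)   # step id -> what the step produced
    failed_step: Optional[int] = None
    failure_reason: Optional[str] = None


def remaining_steps(state: RunState) -> list:
    """Steps that still need (re)planning: every step without a recorded result."""
    return [s for s in state.step_ids if s not in state.results]


state = RunState(
    step_ids=[1, 2, 3, 4, 5, 6],
    results={1: "EU inventory", 2: "US inventory", 3: "enforcement table"},
    failed_step=4,
    failure_reason="source offline",
)

print(remaining_steps(state))
```

A system that kept only the original prompt has nothing to compute this list from, so its only option is to start over.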

Self-correction: reflection loops and their limits

First drafts are mediocre — for plans as much as prose. Reflection is the self-correction pattern: a model (or a separate critic role) reviews an output and produces a revision informed by the critique. The actor/critic split matters: a model grades its own work generously, while a critic role with explicit instructions — find the gaps, the redundancies, the unverifiable steps, and do not rewrite, do not praise — produces critique you can act on. The Reflexion paper (see references) formalized the deeper point: verbal self-feedback works best when grounded in real signals from the environment, not just the model re-reading itself.

Scout reflects on the plan, not the final answer — deliberately. A plan is small, structured, and nothing has been spent executing it: a critique that adds a missing subtopic costs one cheap revision. The same critique after execution costs a re-run of every search the bad plan caused. Criticize upstream, where fixes are cheap.

How many iterations? One. The first critique catches the structural misses — the absent subtopic, the two steps that are really one, the step whose “expected output” nothing could verify. A second pass mostly rewords the first. I cap reflection at one iteration in Scout — beyond that you’re paying double for synonyms. Uncapped reflection fails in predictable ways: self-congratulation (the critic blesses the draft and adds nothing), paraphrase loops (each revision restates the last with fresher vocabulary), and the unbounded version — “loop until the critic is satisfied” — an infinite loop with an API bill, because the critic’s satisfaction is not a stop criterion you control. Scout’s stop criterion is a counter in graph state, plan_iterations, checked by a routing edge. Here is the whole module as one graph — exactly what you build in the lab:

flowchart LR
    Q([question]) --> P[planner]
    P -->|"iteration 1: draft plan"| C[critic]
    C -->|critique| P
    P -->|"iteration 2: revised plan"| A[agent]
    A -->|tool call| T[tools]
    T -->|observation| A
    A --> E([cited answer])

One planner node, visited twice: the conditional edge sends iteration 1 to the critic and iteration 2 to the executor. The executor is Module 2–3’s ReAct loop, untouched.
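Before wiring this into a real graph, the control flow can be traced with stubs. This is a sketch, not the lab code: the `planner` and `critic` functions below are placeholders that make no LLM calls, and the plain `while` loop stands in for the graph's conditional edge.

```python
MAX_PLAN_ITERATIONS = 2  # draft plus exactly one revision


def planner(state: dict) -> dict:
    """Stub planner: drafts on the first pass, revises afterwards, bumps the counter."""
    mode = "draft" if state["plan_iterations"] == 0 else "revised"
    return {**state, "plan": f"{mode} plan",
            "plan_iterations": state["plan_iterations"] + 1}


def critic(state: dict) -> dict:
    """Stub critic: records a critique and never rewrites the plan."""
    return {**state, "critique": f"critique of {state['plan']}"}


def route_after_planner(state: dict) -> str:
    """The stop criterion is the counter, never the critic's satisfaction."""
    return "critic" if state["plan_iterations"] < MAX_PLAN_ITERATIONS else "agent"


def deliberate(question: str):
    """Run planner/critic until the counter routes to the executor.
    Returns the final state and the ordered list of nodes visited."""
    state = {"question": question, "plan": None, "plan_iterations": 0}
    trace = []
    while True:
        state = planner(state)
        trace.append("planner")
        if route_after_planner(state) == "agent":
            return state, trace
        state = critic(state)
        trace.append("critic")


final_state, trace = deliberate("How did EU AI regulation evolve in 2025?")
print(trace)
```

The loop terminates because every planner pass increments the counter; nothing the critic says can extend it.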

Hands-on lab: build it

Objective, in one sentence: add a Planner node that turns a question into a structured, Pydantic-validated research plan, add a one-iteration reflection loop, and measure the improvement on five test questions. The full lab lives in module-04/ of the labs repo.

Observable result: uv run python -m scout.run "How did EU AI regulation evolve in 2025?" prints the draft plan, the critique, and the revised plan, then runs the familiar ReAct executor with the plan in its context. uv run python compare_plans.py prints a before/after table on five questions.

Step 1 — The plan schemas (frozen from here on)

Everything starts with the contract. In scout/planner.py:

from pydantic import BaseModel, ValidationError

class PlanStep(BaseModel):
    """One verifiable unit of research work. Frozen course-wide in module 04."""
    id: int
    goal: str
    search_queries: list[str]
    expected_output: str

class ResearchPlan(BaseModel):
    """The Planner's structured output. Frozen course-wide in module 04."""
    objective: str
    steps: list[PlanStep]
    open_questions: list[str]

(Pydantic, if it’s new to you, is the standard Python validation library: a BaseModel parses JSON into a typed object, and a failed parse raises a ValidationError naming every bad field.)

These two models are frozen for the rest of the course: the human who approves a plan in Module 9 and the API that returns one in Module 10 consume exactly these fields. expected_output is the field that makes a step verifiable — it names the evidence that completes the step, and it’s the first thing the critic checks.

Step 2 — The state grows by exactly two fields

ScoutState follows the course’s field calendar — added to, never reshaped. Module 4 adds the plan and the loop’s counter:

import operator
from typing import Annotated, TypedDict

# ResearchPlan is the Step 1 model defined in scout/planner.py.

class ScoutState(TypedDict):
    question: str
    messages: Annotated[list, operator.add]
    # Module 04: the Planner's structured output, and the reflection-loop
    # counter that drives the planner -> critic -> planner routing.
    plan: ResearchPlan | None
    plan_iterations: int

plan_iterations in state — rather than a variable inside some function — is stateful orchestration in one line: the loop’s progress is part of the run’s inspectable record, readable by routing edges, tests, and you at 2 a.m. The smoke test pins the state to exactly these four fields.

Step 3 — The Planner node, and the error path you keep

The Planner prompts for JSON only and validates the response against the schema. The interesting part is the failure path — explicit, taught, and capped at one retry:

def _generate_plan(prompt: str) -> ResearchPlan:
    # First attempt: ask for JSON and validate it against the frozen schema.
    raw = llm.complete(prompt, system_prompt=PLANNER_SYSTEM,
                       max_tokens=config.MAX_TOKENS)
    try:
        return ResearchPlan.model_validate_json(_extract_json(raw))
    except ValidationError as exc:
        # One retry only: show the model the exact Pydantic error text.
        retry = (f"{prompt}\n\nYour previous attempt failed validation:\n{exc}\n"
                 "Fix those exact errors and return ONLY the corrected JSON object.")
        raw = llm.complete(retry, system_prompt=PLANNER_SYSTEM,
                           max_tokens=config.MAX_TOKENS)
        # A second ValidationError propagates and stops the run.
        return ResearchPlan.model_validate_json(_extract_json(raw))

The retry feeds the model the exact Pydantic error — models fix named mistakes far more reliably than “try again.” A second failure raises and stops the run: executing a malformed plan is worse than not executing. (_extract_json slices the outermost {…} block — reasoning models like to wrap JSON in prose. Some OpenAI-compatible endpoints accept a JSON-schema response_format; support varies by model and provider, so the lab takes the path that works everywhere.)
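The lab's `_extract_json` isn't reproduced here, but "slice the outermost {…} block" could look roughly like the sketch below. `extract_json_sketch` is a hypothetical name and the first-brace-to-last-brace strategy is an assumption about the approach, not the repo's actual implementation.

```python
import json


def extract_json_sketch(raw: str) -> str:
    """Return the text from the first '{' to the last '}' in raw.
    Raises ValueError when no such span exists."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return raw[start:end + 1]


# Hand-written example of a model wrapping its JSON in prose.
wrapped = (
    "Sure, here is the plan:\n"
    '{"objective": "demo", "steps": [], "open_questions": []}\n'
    "Hope that helps."
)

parsed = json.loads(extract_json_sketch(wrapped))
print(parsed["objective"])
```

The slice is naive on purpose: if the surrounding prose itself contains braces, the result won't parse, and in the lab that failure would surface through schema validation and its single retry rather than being papered over here.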

The node reads plan_iterations to pick its mode — 0 means draft; otherwise revise against the critique found in the transcript — and every pass returns plan_iterations + 1: that counter is what the routing edge in Step 4 reads. On its final pass it appends the rendered plan to messages: that’s the handover to the executor, which follows the plan as context. The handover message also caps the executor at three broad web searches — the plan says what to find, not how many searches to run, and the whole run still has to fit under MAX_ITERATIONS = 6. Step-by-step orchestration of individual plan steps is Module 7’s job.

Step 4 — The critic, and the loop that stops

The critic reads state["plan"] and writes a short, prefixed critique message — it never rewrites the plan:

def critic_node(state: "ScoutState") -> dict:
    critique = llm.complete(
        f"Research question: {state['question']}\n\n"
        f"Draft plan:\n{state['plan'].model_dump_json(indent=2)}",
        system_prompt=CRITIC_SYSTEM,
        max_tokens=config.MAX_TOKENS,
    )
    return {"messages": [{"role": "user",
                          "content": f"{CRITIQUE_PREFIX}\n{critique}"}]}

Why role: "user" for LLM-generated text? Because for the next LLM call, the critique — like the plan handover in Step 3 — is an instruction to follow, not one of the model’s own past turns. And the role is load-bearing: route_after_agent counts only "assistant" turns against MAX_ITERATIONS = 6, so critique and handover must not consume the executor’s budget.
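To see why the role matters for the budget, here is a sketch of role-based counting. `assistant_turns` is a hypothetical helper; the real `route_after_agent` comes from Module 2 and isn't shown in this module, so treat the counting rule below as an assumption that matches the description above rather than the repo's code.

```python
MAX_ITERATIONS = 6


def assistant_turns(messages: list) -> int:
    """Count messages whose role is 'assistant'; other roles are not charged."""
    return sum(1 for m in messages if m.get("role") == "assistant")


# Illustrative transcript: critique and handover arrive as 'user' messages.
messages = [
    {"role": "user", "content": "Plan critique ...\n- no step covers enforcement"},
    {"role": "user", "content": "Follow this plan: ..."},
    {"role": "assistant", "content": "calling web_search"},
    {"role": "tool", "content": "5 results"},
    {"role": "assistant", "content": "cited answer"},
]

used = assistant_turns(messages)
print(used, "of", MAX_ITERATIONS, "iterations used")
```

Had the critique and handover been stored as assistant messages, the executor would start its run with two of its six turns already spent.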

That prefix is what makes the critique findable again: in revision mode, the Planner locates Step 3’s “critique in the transcript” by taking the latest message that starts with CRITIQUE_PREFIX. The critic’s system prompt bans praise and demands 3–6 bullets naming missing subtopics, redundant steps, and unverifiable expected outputs. The loop closes with one conditional edge reading the counter:

def route_after_planner(state: ScoutState) -> str:
    if state["plan_iterations"] < config.MAX_PLAN_ITERATIONS:
        return "critic"
    return "agent"

MAX_PLAN_ITERATIONS = 2 joins the other caps in config.py: the draft plus exactly one revision. Never “until the critic is satisfied.”
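The critique lookup described above ("the latest message that starts with CRITIQUE_PREFIX") can be sketched as a reverse scan. `latest_critique` is a hypothetical name, and the prefix string is an assumption copied from the sample output in Step 5; the constant's real value lives in the lab repo.

```python
from typing import Optional

# Assumed value, taken from the sample critique output shown in this module.
CRITIQUE_PREFIX = "Plan critique (address every point in your revision):"


def latest_critique(messages: list) -> Optional[str]:
    """Return the body of the most recent message starting with CRITIQUE_PREFIX,
    or None when the transcript contains no critique."""
    for message in reversed(messages):
        content = message.get("content")
        if isinstance(content, str) and content.startswith(CRITIQUE_PREFIX):
            return content[len(CRITIQUE_PREFIX):].strip()
    return None


messages = [
    {"role": "user", "content": "How did EU AI regulation evolve in 2025?"},
    {"role": "user", "content": f"{CRITIQUE_PREFIX}\n- No step covers enforcement actions"},
]

print(latest_critique(messages))
```

Scanning from the end means the lookup keeps working even if a later experiment (such as exercise 2) leaves more than one critique in the transcript.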

Step 5 — Wire it and run it

build_graph() puts the deliberation in front of the untouched executor:

builder.add_edge(START, "planner")
builder.add_conditional_edges("planner", route_after_planner,
                              {"critic": "critic", "agent": "agent"})
builder.add_edge("critic", "planner")
# agent -> tools -> agent: the module-02/03 ReAct loop, unchanged

The CLI moves to scout/run.py, the canonical entry point from now on:

cd module-04
uv run python -m scout.run "How did EU AI regulation evolve in 2025?"
=== Plan v1 (draft) ===
Objective: Trace how EU AI regulation changed during 2025
  1. Identify the AI Act milestones that took effect in 2025
     queries: EU AI Act 2025 implementation timeline, ...
     expected: dated list of provisions entering into force
  ...

=== Critique ===
Plan critique (address every point in your revision):
- No step covers enforcement actions, only the rules themselves
- Steps 2 and 3 overlap: both search for guidance documents
  ...

=== Plan v2 (revised) ===
...
[agent] tool_call: web_search({"query": "..."})
[tools] 5 results
...cited answer streams here...
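The plan printout above follows a simple layout, and rendering it is a few lines of string formatting. This is a sketch under assumptions: `render_plan` is a hypothetical name, the plan is a plain dict mirroring the `ResearchPlan` fields so the example has no dependencies, and the indentation is inferred from the sample output rather than taken from the repo.

```python
def render_plan(plan: dict) -> str:
    """Render a plan dict in the objective / numbered-steps layout shown above."""
    lines = [f"Objective: {plan['objective']}"]
    for step in plan["steps"]:
        lines.append(f"  {step['id']}. {step['goal']}")
        lines.append(f"     queries: {', '.join(step['search_queries'])}")
        lines.append(f"     expected: {step['expected_output']}")
    return "\n".join(lines)


plan = {
    "objective": "Trace how EU AI regulation changed during 2025",
    "steps": [
        {
            "id": 1,
            "goal": "Identify the AI Act milestones that took effect in 2025",
            "search_queries": ["EU AI Act 2025 implementation timeline"],
            "expected_output": "dated list of provisions entering into force",
        }
    ],
    "open_questions": [],
}

rendered = render_plan(plan)
print(rendered)
```

The same rendered text can serve two readers: you at the terminal, and the executor that receives the plan as context in the handover message.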

Step 6 — Measure the reflection pass

Still inside module-04/:

uv run python compare_plans.py

The script runs draft → critique → revise on the five questions in test_questions.json (no executor, no search key needed) and prints three counting heuristics per plan — steps, distinct queries, steps with an expected output — plus a guided inspection checklist:

 Q |    steps     |  distinct queries  | steps w/ expected
---------------------------------------------------------
 1 |    4 -> 5    |       7 -> 9       |      4 -> 5
 2 |    4 -> 4    |       8 -> 8       |      4 -> 4
 ...

Numbers count; they don’t judge. Question 2’s unchanged row could be a solid draft — or a critic that paraphrased. The checklist makes you look: did the revision add a missing subtopic, merge redundant steps, sharpen expected outputs — or just reword? There’s deliberately no LLM-as-judge here; automated quality judging is Module 8’s topic. The budget is ~15 LLM calls (the free tier allows 40 req/min as of June 2026; llm.complete() backs off on 429).
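The three counting heuristics are simple enough to sketch. `plan_metrics` is a hypothetical name, the plans are plain dicts shaped like `ResearchPlan`, and treating queries as identical after lowercasing and trimming is my assumption; `compare_plans.py` may count differently.

```python
def plan_metrics(plan: dict) -> dict:
    """Count steps, distinct queries (case- and whitespace-insensitive),
    and steps with a non-empty expected output."""
    steps = plan["steps"]
    queries = {q.strip().lower() for s in steps for q in s["search_queries"]}
    return {
        "steps": len(steps),
        "distinct_queries": len(queries),
        "steps_with_expected": sum(1 for s in steps if s["expected_output"].strip()),
    }


# Hand-written example plan: one duplicated query, one step without evidence.
draft = {
    "objective": "demo",
    "steps": [
        {"id": 1, "goal": "rules",
         "search_queries": ["EU AI Act 2025", "eu ai act 2025 "],
         "expected_output": "dated list of provisions"},
        {"id": 2, "goal": "guidance",
         "search_queries": ["AI Act guidance"],
         "expected_output": ""},
    ],
    "open_questions": [],
}

metrics = plan_metrics(draft)
print(metrics)
```

The example shows both what the counts catch (a reworded duplicate query, a step with no expected output) and what they cannot: whether the remaining queries are any good.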

Step 7 — Verify

cd ..                                                # back to the repo root
uv run pytest module-04/tests/                       # offline, no API calls
SCOUT_LIVE_TESTS=1 uv run pytest module-04/tests/    # + 1 real graph run

And the cumulative rule of the repo, from the root:

uv run pytest module-01/tests/ module-02/tests/ module-03/tests/ module-04/tests/

Try it yourself (no solution provided):

  1. Add a criticality signal: give PlanStep an optional criticality: int = 1 field in your copy and teach the Planner prompt to set it. The smoke test pinning the frozen schema will object — keep the experiment on a branch, and notice why the course freezes contracts.
  2. Pay for a second opinion: set MAX_PLAN_ITERATIONS = 3 and re-run compare_plans.py. Compare what v3 adds over v2 against what v2 added over v1, and what it cost. Diminishing returns, measured on your own runs. Same caveat as exercise 1: the smoke test pins MAX_PLAN_ITERATIONS == 2, so branch and revert after.

Exam corner

What the exam tests here. Per the official study guide, Domain 5 (10%) expects you to: apply reasoning frameworks — chain-of-thought and task decomposition (5.2); engineer planning strategies for sequential and multi-step decision-making (5.3); manage stateful orchestration to coordinate complex tasks (5.4 — this module’s half; the “knowledge retention” half belongs to memory); and adapt reasoning strategies based on experience and feedback (5.5). Objective 5.1 — memory mechanisms — is the other half of this domain and is covered in Module 5. Questions are scenarios: a task with constraints in, “choose the right reasoning or planning approach” out.

Quiz — answers after question 5.

  1. A pharma company’s research agent compiles regulatory submissions. Compliance requires that every research step be reviewed and signed off by a human before any data is gathered. Which approach fits?

    • A) ReAct, so the agent can adapt its searches as it learns
    • B) Plan-and-execute: draft the full plan up front, pause for review, then execute the approved steps
    • C) Chain-of-thought prompting before each individual tool call
    • D) More reflection iterations, so the plan improves itself before acting
  2. An agent must research “the impact of remote work on commercial real estate, public transit, and downtown retail.” Which decomposition is correct?

    • A) One step: “research the impact of remote work thoroughly”
    • B) Three steps split by sub-domain, each with its own queries and a checkable expected output, plus a final synthesis step
    • C) Steps split by tool: one step per search engine the agent can use
    • D) Twelve steps, several of which repeat the same queries with different wording, to guarantee coverage
  3. A team runs six self-critique iterations on every draft plan. Quality scores plateau after the first pass; cost roughly doubles with each additional one. What’s the right adjustment?

    • A) Increase to ten iterations — quality will eventually improve
    • B) Replace the critic with a larger, more expensive model
    • C) Cap reflection at one iteration and ground further corrections in external signals (tool results, validation errors)
    • D) Remove the critique entirely — reflection never adds value
  4. A model’s output reads: “Thought: I need the 2025 figure. Action: web_search(‘EU AI Act fines 2025’). Observation: three results… Thought: now I can compare.” What is this?

    • A) Chain-of-thought — the output shows explicit thoughts
    • B) ReAct — reasoning interleaved with actions and observations
    • C) Plan-and-execute — the steps were decided up front
    • D) Reflection — the model is critiquing its own reasoning
  5. A plan-and-execute agent fails at step 4 of 6: the source it needed is offline. To replan only the remaining work without redoing steps 1–3, what must the system have kept?

    • A) Nothing — restarting from scratch is always cleaner
    • B) Structured run state: the plan, which steps completed with their outputs, and the failure that interrupted step 4
    • C) The original user prompt, which contains everything needed
    • D) The model’s internal reasoning tokens from the failed step

Answers.

  1. B. “Sign off before any data is gathered” requires an artifact that exists before execution, and only plan-and-execute produces one. A starts acting immediately, with nothing to approve; C reasons without producing a reviewable plan; D improves a plan but never pauses for the required human.
  2. B. Good decomposition produces independently verifiable subtasks that jointly cover the question, plus synthesis. A isn’t a decomposition; C splits by mechanics instead of meaning; D’s redundant steps are the token-burning spiral this module opened with.
  3. C. Plateauing quality with compounding cost is the signature of ungrounded reflection. The fix is the cap plus external feedback: A and B pay more for the same plateau; D throws away the genuinely valuable first pass.
  4. B. Action: and Observation: lines are the ReAct signature: reasoning and acting. Pure chain-of-thought never touches a tool.
  5. B. Replanning the remainder requires the plan, the completed steps with their results, and the failure, which is exactly what stateful orchestration keeps in graph state. A wastes three steps of paid work; C and D hold no record of execution progress.

Traps to avoid:

  • CoT vs. ReAct. Chain-of-thought reasons; ReAct reasons and acts through tools. Visible thoughts don’t make a transcript ReAct — actions and observations do.
  • “Plan-and-execute always beats ReAct because it plans.” It buys auditability and approval points, and pays for them in adaptability and upfront calls. On narrow, unpredictable tasks, interleaved ReAct wins.
  • Reflection as fact-checker. Self-critique without external feedback cannot fix factual errors — the model only sees what it already knows. Reflection reshapes; tools, validation, and evals verify.

Key takeaways

  • Reactive systems act on the current observation; deliberative systems plan first; hybrids — like Scout from today — plan deliberately and execute reactively.
  • Decompose before acting: independently verifiable subtasks with expected outputs are what make coverage checkable and spirals avoidable.
  • Chain-of-thought reasons without acting; ReAct interleaves reasoning with tool actions — the exam tests the boundary, transcripts in hand.
  • ReAct is adaptive but opaque; plan-and-execute is auditable and approvable but rigid; production agents hybridize.
  • Reflection earns its cost once: critique the plan (cheap to fix), cap the loop with a counter in state, and never loop “until satisfied.”
  • The plan lives in graph state, not in a prompt — stateful orchestration is what makes replanning, approval, and debugging possible.
  • Reasoning has a budget — thinking tokens and deliberation calls — and spending it is a decision, not a default.

Keep going

Scout can now plan — but it forgets everything between sessions. Module 5 gives it memory: checkpointers, persistence, and long-term recall.

References