Agent Architecture: Patterns, Trade-offs, and When Not to Build an Agent (NCP-AAI Module 3)
This is Module 3 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.
Last year I reviewed a system that summarized support tickets: six agents in a peer-to-peer swarm — a triager, two summarizers, a tone checker, a router, a “quality agent”. It worked, occasionally. Latency ran to minutes; token spend was roughly ten times what the task needed. And when a summary came out wrong, nobody could say which agent had decided what — control had hopped between peers five times with no trace. The replacement, shipped a month later: a three-step prompt chain. Faster, cheaper, debuggable in one read.
Nobody on that team was a bad engineer. They’d just never been asked the question this module drills: does this need an agent at all? The exam asks it relentlessly — Agent Architecture and Design is 15% of your score, tested in scenarios. By the end you’ll answer like an architect: pattern named, trade-offs weighed, decision written.
In this module
- You’ll learn:
- Name and compare the canonical patterns: prompt chain, router, single agent, supervisor, hierarchy, swarm.
- Apply a trade-off grid — latency, cost, reliability, auditability, scalability — to choose a pattern in a scenario, the way Domain 1 questions ask.
- Decide workflow vs. agent, and identify the anti-patterns of over-agentified systems.
- Write an architecture design doc — the artifact that survives the meeting.
- You’ll build: Scout’s architecture design doc — and a refactor of the Module 2 graph to match the target architecture, without changing its behavior.
- Exam domains covered: D1 — Agent Architecture and Design — 15% of the exam.
- Prerequisites: Modules 1–2 (your ReAct graph runs); NVIDIA API key configured.
Where you are
- ✅ Module 1 — What Is Agentic AI? — vocabulary, landscape, first NIM call
- ✅ Module 2 — Build Your First AI Agent — ReAct loop, tool calling, first graph
- 👉 Module 3 — Agent Architecture (you are here)
- ⬜ Modules 4–14 — cognition, memory, RAG, multi-agent, evals, guardrails, deployment, the exam
Scout before: a single-agent ReAct loop with web_search, everything in one
file. Scout after: same behavior — but the graph is reorganized along the pattern
it will grow into, and a design doc traces the route to Module 13. The only module
whose main increment is a document, deliberately: the exam tests this domain in
scenarios, not code.
Workflows first: chains and routers
Module 1 placed systems on a spectrum from a single LLM call to a multi-agent team. Architecture starts at the humble end, which wins more often than demos suggest.
Prompt chaining is the simplest composition: the output of one LLM call becomes
the input of the next, and the order of steps is fixed in your code. The model
decides words; your code decides what happens next — always. Routing adds one
decision: a classifier (often a small, cheap model) dispatches the input to one of
several specialized paths — each path itself a call or a chain. And when steps must
share evolving data — accumulated results, retries, verdicts — you add stateful
orchestration: a typed state object threaded through the steps. You know this one:
it’s LangGraph’s StateGraph from Module 2, minus the
loop. The study guide files these under objective 1.6, with logic trees —
branching if/else structures in code, exactly what a router’s dispatch table is.
Why use these when agents exist? Three properties agents can’t match:
- Determinism: the same input takes the same path. You can unit-test a chain.
- Predictable cost and latency: three steps means three calls. Always.
- Trivial debugging: a linear trace; the failing step has a name.
Concrete numbers: as a chain, that ticket pipeline (extract → summarize → format)
is exactly 3 calls per ticket, every time. The six-agent version averaged 19 calls,
and no two runs took the same path.
The shapes you’re choosing between:
flowchart LR
subgraph chain ["Prompt chain"]
direction LR
A1[call 1] --> A2[call 2] --> A3[call 3]
end
subgraph router ["Router"]
direction TB
B0{classifier} --> B1[path A]
B0 --> B2[path B]
end
subgraph agent ["Single agent"]
direction LR
C1[LLM decides] -->|tool call| C2[tools]
C2 -->|observation| C1
end
subgraph sup ["Supervisor"]
direction TB
D0[supervisor] <--> D1[specialist A]
D0 <--> D2[specialist B]
end
The pattern gallery, left to right in order of increasing autonomy — and increasing cost, latency, and failure modes.
The first two are workflows: code controls the flow. The moment the model starts controlling the flow, you’ve crossed into agent territory.
The single agent: ReAct as an architecture decision
In Module 2 you built a ReAct loop and learned its mechanics. Strip the mechanics
away and look at what you actually decided: in that loop, the LLM controls the
flow — whether to call a tool, which one, how many times, when to stop. That, not
the presence of an LLM or a while loop, defines an agent, and the exam leans on
that boundary hard.
The study guide’s job description expects three system temperaments. A reactive system decides its next action from the current observation — sense, act, repeat; ReAct is reactive by construction. A deliberative system plans first — a multi-step plan toward a goal, then execution; plan-and-execute agents are the LLM incarnation, and Scout grows one in Module 4. A hybrid system layers both — where Scout ultimately lands, like most production agents.
When is a single reactive agent the right architecture, not a stepping stone?
- The task is open-ended — steps can’t be enumerated in advance — but lives in one domain.
- The toolbox is small: under roughly ten tools, the model picks well from descriptions; past that, selection degrades and schemas crowd the context.
- No parallelism required: one transcript, one thing at a time.
- One context window holds the working state of the whole task.
Scout checks every box — one domain, one tool, sequential work. That’s why it stays a single agent until Module 7: the question “when does it stop sufficing” gets answered with criteria, not vibes.
Scaling out: supervisor, hierarchy, swarm
When a task outgrows one agent — distinct skills, parallel work, crowded contexts — there are three canonical ways to arrange a team. The exam tests the taxonomy and trade-offs here; the build comes in Module 7.
A supervisor is a central agent that routes work to specialist agents and collects their results: it decides who acts next; the specialists decide how to act. Every handoff passes through one point — one place to log, audit, and intervene. Typical use: a pipeline with distinct stages, exactly Scout’s future (a supervisor coordinating Searcher, Reader, Fact-checker, and Writer). Terminology, because the exam is picky: supervisor names the agent; orchestration names the concept of coordinating multi-agent workflows — there is no “orchestrator agent” in the blueprint’s vocabulary.
A hierarchy is supervisors of supervisors: each subtree owns a domain, a top-level supervisor delegates between subtrees. It exists for genuinely large systems — tens of specialists, organized like departments — and two levels of routing means two levels of latency, cost, and misrouting: it earns its keep only at scale.
A swarm has no chief: agents hand control directly to each other, peer to peer. It shines when a conversation should flow between personas without a dispatcher — support migrating between a billing and a technical specialist. The price is steep: no single point sees the whole run, and tracing “who decided what” means stitching together every peer’s view.
The reference table to internalize for scenario questions:
| Pattern | Who controls the flow | Latency | Cost | Debuggability / auditability | Typical use |
|---|---|---|---|---|---|
| Prompt chain | Your code, fixed order | Lowest, constant | N calls, fixed | Linear trace — trivial | Fixed multi-step transforms |
| Router | Your code + one classifier call | Low, constant | Fixed + 1 cheap call | One labeled branch decision | Triage into specialized paths |
| Single agent | The LLM (loop, tool choice) | Variable | Variable, capped | One transcript to read | Open-ended task, one domain, few tools |
| Supervisor | Supervisor LLM routes to specialists | High (every hop bills a routing call) | High | Central choke point — best multi-agent audit story | Distinct specialist stages + traceability |
| Hierarchy | Supervisors of supervisors | Highest | Highest | Auditable per level; deep traces | Tens of specialists, domain subtrees |
| Swarm | Peers hand off to peers | Variable | High | Hardest — no central view | Fluid persona handoffs, no natural chief |
We implement the supervisor pattern hands-on in Module 7. Here, what matters is reading a scenario and naming the right row.
Choosing: the trade-off grid
Scenario questions hand you constraints and four plausible architectures. Score them against five criteria — the same five your design docs argue from.
Latency. Every LLM call in the critical path adds seconds. A chain has a constant number; an agent a variable number; a supervisor bills a routing call on every hop. A tight SLA works against every added agent.
Cost. Tokens × calls. The multipliers compound quietly: more agents means more calls, but also more context per call — each specialist re-reads its instructions and state, and reasoning models spend thinking tokens on every turn.
Reliability. LLM steps compose their error rates: if each step is right 95% of the time, five dependent steps are right about 77% of the time (0.95⁵). Every added agent is another decision point, and errors travel — a bad handoff upstream becomes confident nonsense downstream. Fewer dependent LLM decisions is a reliability strategy.
Auditability. Who decided what, when, from what evidence? A chain logs itself. A supervisor concentrates every decision at one inspectable point. A swarm scatters the story across peers. If the scenario mentions compliance or “explain the decision,” weight this criterion heavily — it usually decides the answer.
Scalability and adaptability — objective 1.8 by name. Not “does it handle load” but “can the architecture absorb the next capability without a rewrite?” Adding a specialist to a supervisor is a new node plus a routing rule; adding anything to a do-everything agent means re-tuning one giant prompt. The lab’s refactor prepares exactly this: Scout’s next organs must plug in, not bolt on.
One more decision belongs to architecture, though its implementation waits for Module 6: what shape is your knowledge in? A vector store retrieves by semantic similarity — the right default for unstructured documents. A knowledge graph stores entities and typed relationships, which is what relational reasoning needs: multi-hop questions like “which services depend on a library that depends on X?” follow edges, not embeddings — similarity search can’t hop. Related entities, dependencies, multi-hop → knowledge graph; find relevant passages → vectors (objective 1.7; Scout builds its retrieval in Module 6).
The surface your agent presents — chat, API, or a human approval point — is an architecture decision too (objective 1.1): pausing for plan approval needs an interruptible graph, plumbing you design for, not sprinkle on. Scout’s interrupt arrives in Module 9, its API in Module 10.
With constraints on the table, the decision falls out of a few questions:
flowchart TD
Q1{"Can you write the steps<br/>down in advance?"} -->|yes| W["Workflow — chain,<br/>router. No agent."]
Q1 -->|no| Q2{"One domain,<br/>< ~10 tools,<br/>no parallel work?"}
Q2 -->|yes| SA["Single agent (ReAct)"]
Q2 -->|no| Q3{"Distinct specialist stages?<br/>Central audit/control point<br/>needed?"}
Q3 -->|yes| SUP["Supervisor<br/>(hierarchy if subtrees<br/>of specialists)"]
Q3 -->|no| SW["Swarm — only if fluid<br/>peer handoffs ARE<br/>the requirement"]
Walk it top to bottom, stop at the first fit. Most real systems stop in the first two boxes.
Anti-patterns: when not to build an agent
Four failure shapes account for most agentic wreckage — and most wrong-but-tempting options in exam questions.
Agent-washing. A while loop around an LLM call isn’t an agent — and neither is
an if after one. The term cuts both ways: calling a plain workflow an “agent”, or —
the expensive direction — building an agent, with its variable cost and
non-determinism, for a task whose steps you could have written down in advance. The
test is always: who controls the flow? If your code does, it’s a workflow; name it
and ship it proudly.
Premature multi-agent. Reaching for a specialist team before one agent has actually hit a wall. The coordination tax is real: routing calls, duplicated context, handoff failures, compounded errors. If the specialists never disagree and never run in parallel, they’re one agent wearing different hats — at six times the price. Scale out on evidence of the single-agent ceiling, not because the diagram looks better.
Autonomy without observability. An agent that acts but leaves no trace is undeployable — when (not if) it misbehaves, you have nothing to debug and nothing to show the auditor. No trace, no prod. Scout gets tracing in Module 11; the decision that traces must exist is made here, in the design doc.
The do-everything agent. Thirty tools, one prompt trying to be researcher, coder, and support rep at once. Tool selection degrades, the context saturates with schemas, every new capability makes every old one slightly worse. The opposite failure from premature multi-agent — same fix: match structure to the task’s seams.
Hands-on lab: build it
Architecture is tested in scenarios on the exam, not in code — so this lab makes you
do architecture: write Scout’s design doc, then refactor the Module 2 graph to
match it. The full lab lives in
module-03/
of the labs repo.
Objective: produce docs/scout-design.md and reorganize the code so Scout can
grow toward its target without a rewrite.
Observable result: the design doc is filled in, and uv run python -m scout.graph "…" behaves exactly like Module 2 — same trace, same kind of cited answer, all
smoke tests green including Modules 1–2’s.
Step 1 — Write the design doc
The template (docs/scout-design-template.md), deliberately short:
# Architecture Design Doc — <system name>
1. Problem & success criteria — what must work, measurably
2. Pattern chosen & alternatives rejected — the decision AND the road not taken
3. State & data contracts — what's frozen, what may change
4. Failure modes & mitigations — failure → detection → mitigation
5. Evolution path — how it grows WITHOUT a rewrite
Fill it for Scout’s target: a supervisor coordinating Planner, Searcher, Reader,
Fact-checker, and Writer over shared state. Justify the pattern with this module’s
grid — the exercise has teeth: auditable end to end kills the swarm, distinct
specialist stages outgrow a single agent at Module 7, unpredictable control flow
kills the fixed pipeline. Section 2’s rejection table is the most valuable thing
you’ll write this module — it stops the next person from re-litigating the
architecture. A filled reference ships in the repo (docs/scout-design.md); compare
after you’ve written yours, not before.
Step 2 — Refactor: make the code match the doc
Module 2’s graph works, but one file holds everything: nodes, prompt, routing, construction, CLI. None of that blocks today’s agent; all of it blocks the evolution path you just wrote. Three moves, no behavior change:
Move 1 — call parameters into config.py. The rule was already “the model name
lives only in config.py”; the refactor extends it to every call parameter:
# module-03/scout/config.py (excerpt)
MAX_TOKENS = 8192 # Nemotron 3 reasons before answering — keep headroom
MAX_ITERATIONS = 6 # hard stop: an agent loop without a cap is a token furnace
MAX_RETRIES = 2 # capped: past 2 you're burning tokens on a broken loop
Move 2 — node logic into nodes.py (new). agent_node, the streaming collector,
tools_node, and the system prompt move out of graph.py. The dividing line: a
node decides WHAT one step does; the graph decides WHO runs WHEN.
Move 3 — graph.py keeps only construction. Routing and wiring, nothing else:
# module-03/scout/graph.py (after — node logic now imported, not defined)
from .nodes import SYSTEM_PROMPT, agent_node, tools_node
def route_after_agent(state: ScoutState) -> str:
"""Routing is a graph concern: nodes never decide who runs next."""
llm_turns = sum(m["role"] == "assistant" for m in state["messages"])
last = state["messages"][-1]
if last.get("tool_calls") and llm_turns < config.MAX_ITERATIONS:
return "tools"
return END
def build_graph():
builder = StateGraph(ScoutState)
builder.add_node("agent", agent_node) # canonical name: the future
builder.add_node("tools", tools_node) # base of the M7 specialists
builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", route_after_agent, {"tools": "tools", END: END})
builder.add_edge("tools", "agent")
return builder.compile()
build_graph() becomes the single public constructor — tests call it, Module 10’s
API will call it, nothing ever wires its own copy. One hard rule, enforced by the
smoke tests: ScoutState gains no fields. The refactor reorganizes; it does not
extend. If your refactor needs a new field, your refactor is wrong.
Untouched, deliberately: state.py (frozen contract), llm.py, ask.py, and
react_manual.py — byte-identical as a Module 2 comparison piece. A museum exhibit;
the living architecture reads config.py.
Step 3 — Verify: same behavior, better bones
cd module-03
uv run python -m scout.graph "What did NVIDIA announce at GTC 2026?"
uv run pytest module-01/tests/ module-02/tests/ module-03/tests/ # from repo root
Same trace as Module 2, all three suites green — Module 3’s tests pin the new layout (nodes/graph separated, parameters only in config, design doc filled) and the non-change (state untouched, inherited files byte-identical). A refactor that breaks an old test changed behavior somewhere.
Try it yourself (no solution provided):
- Stress the evolution path on paper: list which files Module 4’s Planner will touch
(a
plannernode,plan+plan_iterationsfields). If your answer edits an existing function innodes.py, revisit step 2. - Add one row to your design doc’s rejection table — a hierarchy of supervisors — and reject it for today’s Scout in two sentences, in the grid’s vocabulary.
Exam corner
What the exam tests here. Per the official study guide, Domain 1 (15%) expects you to: implement reasoning-and-action frameworks like ReAct (1.2 — built in Module 2, placed in the taxonomy here); orchestrate multi-agent workflows (1.5 — supervisor vs. hierarchy vs. swarm); apply logic trees, prompt chains, and stateful orchestration (1.6 — and know when they beat an agent); integrate knowledge graphs for relational reasoning (1.7); and ensure the architecture’s adaptability and scalability (1.8). Questions are scenarios: constraints in, “choose the best architecture” out.
Quiz — answers after question 5.
-
Every Monday, a system pulls last week’s sales records, validates them against three fixed rules, and produces a report. The steps never vary. Best architecture?
- A) A single ReAct agent with data-access tools, to adapt to data issues
- B) A supervisor with extraction, validation, and reporting specialists
- C) A prompt chain: extract → validate → report, orchestrated by code
- D) A swarm, so validation and reporting can hand off flexibly
-
A bank’s loan-document system has distinct stages (classify, extract, cross-check, summarize); compliance requires a centralized audit trail of every decision and handoff. Best topology?
- A) A swarm — peer handoffs maximize flexibility between stages
- B) A supervisor routing to stage specialists, all handoffs through one point
- C) A single agent with 25 tools covering all four stages
- D) Four independent chains in parallel, no coordination
-
A team built five agents for FAQ answering: query, rephrase, search, answer, checker. It’s slow, expensive, and wrong answers are hard to localize — each agent builds on the previous one’s output. Diagnosis?
- A) Too few agents — a sixth should verify the checker
- B) Premature multi-agent: sequential specialists that never diverge are a pipeline, and errors compound across hops — collapse to a router plus retrieval, or a single agent
- C) Wrong framework — swapping frameworks would fix the latency
- D) The model is too small for five agents
-
An assistant must answer within a 3-second SLA on a fixed token budget; the task is open-ended within one domain and needs 3 tools. Which fits?
- A) A supervisor with three single-tool specialists, for cleaner separation
- B) A hierarchy, so future growth is already structured
- C) A single ReAct agent with the 3 tools and a tight iteration cap
- D) A swarm of two agents to parallelize tool calls
-
A system answers “which downstream services are affected if component X fails?” over thousands of interconnected components — multi-hop traversal across explicit dependency relationships. Which knowledge foundation?
- A) A vector store — embeddings capture the dependencies implicitly
- B) A bigger context window — paste the full dependency list per query
- C) Fine-tuning the model on the dependency data
- D) A knowledge graph — typed relationships support multi-hop relational reasoning that similarity search can’t follow
Answers. 1 — C. Fixed, enumerable steps are the signature of a workflow: deterministic, testable, three calls every time. A and B re-introduce variable cost and non-determinism for zero benefit — the trap is that they sound more capable. D is this module’s opening anecdote. 2 — B. “Centralized audit trail” decides it: a supervisor passes every handoff through one loggable point. A swarm scatters control — the worst audit story of all. C saturates one context with 25 tools; D has no end-to-end trail at all. 3 — B. Five sequential specialists whose routing never varies are a prompt chain wearing agent costumes — five compounding error points, five context bills. The fix is simplification, not another checking layer (A adds one more error point). 4 — C. The SLA and budget eliminate every multi-agent option — each supervisor hop bills a routing call. One domain + 3 tools is single-agent territory; the iteration cap makes worst-case latency predictable. 5 — D. Dependency questions are relational, multi-hop reasoning — you follow typed edges, not similarity. Vectors find similar text, not connected entities; B doesn’t scale; C bakes today’s dependencies into stale weights.
Traps to avoid:
- The impressive-architecture trap. Scenarios often include a plausible multi-agent option satisfying most constraints. The intended answer satisfies all of them, as simply as possible — complexity is never a tiebreak in its own favor.
- Supervisor vs. orchestration. Supervisor = the coordinating agent. Orchestration = the concept of coordinating multi-agent workflows. There’s no “orchestrator agent” in the blueprint’s vocabulary.
- “It calls an LLM, so it’s an agent.” An agent is defined by the LLM controlling
the flow — looping, choosing tools, deciding when to stop. An LLM call inside an
ifis a workflow step. “Agentic” systems described as fixed pipelines are testing exactly this boundary. - “Scalable architecture” ≠ handles more load. Objective 1.8’s scalability means absorbing the next capability without a rewrite — a new specialist node, not more requests per second. Load belongs to Domain 4.
- Vectors find similar text; knowledge graphs follow typed edges. A multi-hop, relational question (“which services depend on…”) is a graph question — no amount of top-k tuning makes similarity search hop.
Key takeaways
- Workflow first, agent if necessary: if you can write the steps down in advance, a chain or router beats an agent on cost, latency, determinism, and debuggability.
- An agent is defined by who controls the flow — the LLM looping and choosing actions — not by the presence of an LLM call.
- Know the six patterns and their trade-offs — and that supervisor wins when auditability matters.
- Every added agent adds latency, cost, and a failure mode — LLM error rates compose across dependent steps, and handoffs compound them.
- Reactive systems act on observations (ReAct); deliberative systems plan first; hybrids layer both — exam vocabulary, straight from the job description.
- Multi-hop relational questions need a knowledge graph; “find relevant passages” needs a vector store — a decision made before any RAG code exists.
- The exam rewards the simplest architecture that meets all the constraints — and a design doc with rejected alternatives is the artifact that scales.
Keep going
Want the full NCP-AAI question bank (150+ exam-style questions) and the next module in your inbox? Subscribe here — it’s free, like everything in this series.
Scout has an architecture on paper; next, we give it a brain — planning, reasoning, and self-correction.
Lab code · Course index · ← Module 2 · Module 4 →
References
- What Are Multi-Agent Systems? — NVIDIA glossary; official study-guide reading for Domain 1, with the centralized/decentralized/hierarchical orchestration comparison.
- Agentic AI in the Factory — NVIDIA white paper; official study-guide reading on agentic workflows as production services.
- Workflows and agents — LangGraph’s guide to prompt chaining, routing, and agents (1.x), with runnable examples of every pattern in this module.
- Building Effective AI Agents — Anthropic’s essay behind the workflows-vs-agents framing and the “start simple” discipline.
- NCP-AAI certification page — the official blueprint; Agent Architecture and Design is weighted at 15%.