Build Your First AI Agent: Tool Calling from Scratch to LangGraph (NCP-AAI Module 2)
This is Module 2 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.
Your Module 1 setup works: Nemotron answers in half a second, sounding confident. Now ask it what NVIDIA announced at GTC 2026. Either it admits its training data ends too early, or — worse — it invents a keynote. Confidently. With dates.
That’s not a flaw you can prompt away. A bare LLM is a frozen snapshot: it cannot look anything up, run anything, or check its claims against the world. It can only talk. Every “agent” demo you’ve ever seen rests on one mechanism — the model asking your code to act on its behalf, then reasoning over what came back.
Today you build that mechanism twice: by hand in about 80 lines, so you know what every piece does, then in LangGraph, so you know exactly what a framework saves you — and what it costs. By the end, Scout answers questions about last month with sources instead of guesses.
In this module
- You’ll learn to:
  - Implement the ReAct loop (reason → act → observe) from scratch with a raw OpenAI-compatible client.
  - Define tool schemas and execute the full tool-call lifecycle: request, execution, result, continuation.
  - Handle tool failures gracefully: surface errors to the model, retry with backoff, cap iterations.
  - Rebuild the same agent with LangGraph primitives (StateGraph, nodes, conditional edges, state) and stream loop events and answer tokens client-side.
  - Explain what a framework buys you, and recognize the deprecated patterns that exam-era tutorials still teach.
- You’ll build: Scout v0.2, a single-agent ReAct loop with a web_search tool, hand-rolled first, then as your first LangGraph graph.
- Exam domains covered: D2 (Agent Development), 15% of the exam.
- Prerequisites: Module 1 (NVIDIA API key, uv, the labs repo cloned) plus a free Tavily API key, which you’ll create in the lab.
Where you are
- ✅ Module 1 — What Is Agentic AI? — vocabulary, landscape, first NIM call
- 👉 Module 2 — Build Your First AI Agent (you are here)
- ⬜ Modules 3–14 — architecture, cognition, memory, RAG, multi-agent, evals, guardrails, deployment, and the exam
Scout before: config.py and one direct call — question in, answer out, the model
decides nothing but words. Scout after: an agent that decides to call web_search,
observes, loops, and answers with fresh facts and source URLs. ScoutState — the state
object that will carry Scout to Module 13 — exists.
From One Call to a Loop: The ReAct Pattern
Why isn’t one LLM call enough? Three structural reasons. The model’s knowledge is frozen at training time, so anything recent is invisible. It cannot act — no search, no API call, no file read. And it gets no feedback: whatever it says first is final, with no chance to notice it’s wrong.
The fix is a pattern, not a bigger model. ReAct is an agent pattern that interleaves
reasoning and action: the model thinks about what it needs, takes one action, observes
the result, and reasons again — looping until it can answer. The name comes from the 2022
paper by Yao et al. (arXiv 2210.03629), where the
loop ran on parsed text — the model literally wrote Thought:, Action:, and
Observation: lines. Today you don’t parse anything: modern APIs structure the same
loop as JSON through native tool calling, the mechanism this whole module is about.
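For contrast, here is a minimal sketch of what the text-parsing style demanded from application code. The `Action: tool_name[input]` line format and the `parse_action` helper are assumptions made for illustration; they are loosely modeled on the Thought/Action/Observation lines described above, not copied from the paper's prompts.

```python
import re

# Hypothetical text format for one step: "Action: tool_name[tool input]".
ACTION_RE = re.compile(r"^Action:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)


def parse_action(model_text: str):
    """Return (tool_name, tool_input) from a text-style ReAct step, or None."""
    match = ACTION_RE.search(model_text)
    if match is None:
        return None  # no Action line: treat the text as a final answer
    return match.group(1), match.group(2)


step = (
    "Thought: GTC 2026 is after my training data, so I should search.\n"
    "Action: web_search[NVIDIA GTC 2026 announcements]\n"
)
print(parse_action(step))
print(parse_action("Thought: I already know this.\nA mutex is a lock."))
```

Every formatting slip by the model (a missing bracket, a renamed prefix) breaks a parser like this, which is the fragility that structured tool calls remove.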
Here is one full turn of the loop you’re about to build:
sequenceDiagram
participant U as User
participant L as Your code (the loop)
participant M as LLM (Nemotron via NIM)
participant T as web_search (your function)
U->>L: "What did NVIDIA announce at GTC 2026?"
L->>M: messages + tool schemas
M-->>L: tool_calls: web_search({"query": "..."})
L->>T: execute the call
T-->>L: results (or an error string)
L->>M: messages + role:"tool" result
M-->>L: final answer, no tool_calls
L-->>U: answer with source URLs
One ReAct turn: the model never touches the tool — it asks, your code acts.
Look at who does what. The model only ever produces messages. Your code — the thing in the middle — sends the transcript, executes requests, appends results, and decides when to stop. That loop is the agent; the rest of this course is that loop growing more sophisticated.
Tool Calling: How the Model Asks Your Code to Act
Tool calling is the structured mechanism by which a model requests an action from your code: it emits a JSON request naming a function and its arguments; your code executes and returns the result. The model learns what it can request from a schema you pass on every call. Here is Scout’s first tool, exactly as the model sees it:
# module-02/scout/tools.py (the schema — the function itself is ~20 lines of httpx)
WEB_SEARCH_SCHEMA = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": (
            "Search the web for current, factual information. Use this for "
            "anything that may have happened after your training data, and "
            "for any fact you are not certain about. Returns the top results "
            "as title, URL, and content snippet."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "A plain-language search query, "
                    "e.g. 'NVIDIA GTC 2026 announcements'.",
                },
            },
            "required": ["query"],
        },
    },
}
Read that description again, because it’s the most underrated line in agent
development: the tool description is prompt engineering. The model decides when to
call web_search from that text and nothing else — it never sees your Python. When an
agent ignores a tool it should use, or hammers one it shouldn’t, the description is the
first thing to fix.
The lifecycle has four steps, and the exam tests the order:
- Request: you call the API with tools=[...]; the model replies with an assistant message carrying tool_calls, each with an id, a function name, and JSON arguments. Note the plural: one turn can request several calls, hence the lab’s for loops.
- Execution: your code dispatches each call to the matching Python function.
- Result: you append one message with role: "tool", the matching tool_call_id, and the result as content. Exactly one such reply per tool call.
- Continuation: you call the model again with the grown transcript. A reply without tool_calls is the signal: that’s the final answer, and the loop ends.
In the data, a complete turn is five messages — worth seeing once in full, because exam
scenarios poke at exactly this id plumbing:
[
{"role": "system", "content": "You are Scout, a research assistant. ..."},
{"role": "user", "content": "What did NVIDIA announce at GTC 2026?"},
{"role": "assistant", "content": null, "tool_calls": [
{"id": "call_1", "type": "function", "function": {
"name": "web_search",
"arguments": "{\"query\": \"NVIDIA GTC 2026 announcements\"}"}}]},
{"role": "tool", "tool_call_id": "call_1", "content": "[{\"title\": \"...\", ...}]"},
{"role": "assistant", "content": "At GTC 2026, NVIDIA announced ... Sources: ..."}
]
content is null when the model only requests tools; the tool message answers
call_1 by id; the reply with no tool_calls ends the loop.
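The id plumbing can be checked mechanically. The sketch below is a hypothetical helper (`dangling_tool_calls` is not part of the lab code) that scans an OpenAI-format transcript and reports any tool call that does not have exactly one role: "tool" reply.

```python
def dangling_tool_calls(messages: list[dict]) -> list[str]:
    """Return ids of tool calls that lack exactly one role:"tool" reply."""
    requested: list[str] = []
    replies: dict[str, int] = {}
    for message in messages:
        if message["role"] == "assistant":
            # tool_calls may be absent or None on a plain answer
            for call in message.get("tool_calls") or []:
                requested.append(call["id"])
        elif message["role"] == "tool":
            call_id = message["tool_call_id"]
            replies[call_id] = replies.get(call_id, 0) + 1
    return [call_id for call_id in requested if replies.get(call_id, 0) != 1]


transcript = [
    {"role": "user", "content": "What did NVIDIA announce at GTC 2026?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function", "function": {
            "name": "web_search", "arguments": '{"query": "GTC 2026 keynote"}'}},
        {"id": "call_2", "type": "function", "function": {
            "name": "web_search", "arguments": '{"query": "GTC 2026 Nemotron"}'}},
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": "[]"},
]
print(dangling_tool_calls(transcript))  # call_2 was requested but never answered
```

Running a check like this before each model call is one way to catch a forgotten reply in your own code rather than in an API error.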
Notice what step 4 implies: every iteration rebuilds the next prompt by appending
observations to messages. That makes the ReAct loop a dynamic prompt chain — each
prompt constructed at runtime from the results of the previous step, rather than fixed
in advance. The exam’s “dynamic prompt chains” objective sounds exotic; you’re looking
at it — the loop is the entire implementation.
Three names exist for one mechanism:
| Term | Where you’ll see it | Anything different? |
|---|---|---|
| tool calling | The NCP-AAI blueprint, LangGraph docs, this course | The canonical term — use it |
| “function calling” | OpenAI’s original API docs and older tutorials | Same mechanism, legacy name |
| “tools” | Anthropic and NIM API parameter names | Same mechanism (tools=, tool_calls) |
One adjacent dial: tool_choice. The default "auto" lets the model decide whether to
call a tool — that decision is the agentic part. "required" forces a call, "none"
forbids one, naming a function forces that one. Scout stays on auto.
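As a rough sketch, the four settings map onto the request like this in the OpenAI-style Chat Completions format. The helper name `tool_choice_value` is hypothetical, and whether a given endpoint or model honors every value is an assumption to verify against its documentation.

```python
from typing import Optional, Union


def tool_choice_value(mode: str, function_name: Optional[str] = None) -> Union[str, dict]:
    """Translate a readable mode into a Chat Completions tool_choice value."""
    if mode in ("auto", "required", "none"):
        return mode
    if mode == "function":
        if not function_name:
            raise ValueError("mode 'function' needs a function_name")
        # Naming a function forces the model to call exactly that tool.
        return {"type": "function", "function": {"name": function_name}}
    raise ValueError(f"unknown tool_choice mode: {mode!r}")


for mode in ("auto", "required", "none"):
    print(mode, "->", tool_choice_value(mode))
print("function ->", tool_choice_value("function", "web_search"))

# Passed alongside the schemas, for example:
# client.chat.completions.create(model=..., messages=messages,
#                                tools=[WEB_SEARCH_SCHEMA],
#                                tool_choice=tool_choice_value("auto"))
```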
When Tools Fail: Error Handling in the Loop
A loop that calls real APIs will hit real failures, and the exam’s error-handling objective is about exactly this. Four failure modes cover most of what you’ll see:
- The tool raises. Tavily times out, the network drops, the key is wrong.
- The model sends invalid arguments. Malformed JSON, a missing field, a parameter that doesn’t exist.
- An API rate-limits you. HTTP 429 — on the tool side or on the LLM side: the NVIDIA Inference Microservices (NIM) free tier allows 40 requests per minute as of June 2026.
- The model loops without concluding. Every iteration is a billed LLM call; an agent that never decides it’s done is a token furnace.
The strategies, in the order you should reach for them:
Return the error to the model as an observation. This is the counterintuitive one.
Instead of crashing, catch the exception and send the error text back as the tool
result. Given that feedback, the model usually repairs its own call on the next
iteration — or tells the user plainly that the search failed. This contract is small
enough to show whole; it’s the lab’s execute_tool(), the dispatcher both versions
of the agent share:
# module-02/scout/tools.py (the dispatcher — the error contract in code)
def execute_tool(name: str, arguments_json: str) -> str:
    """Run one tool call and ALWAYS return a string observation."""
    tool = TOOLS.get(name)
    if tool is None:
        return f"Error: unknown tool '{name}'. Available tools: {', '.join(TOOLS)}."
    try:
        arguments = json.loads(arguments_json)
        result = tool(**arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        # The model sent malformed or mismatched arguments — tell it so.
        return f"Error: invalid arguments for '{name}': {exc}"
    except (httpx.HTTPError, RuntimeError) as exc:
        # The tool itself failed (network, HTTP status, missing key) — surface it.
        return f"Error: '{name}' failed: {exc}"
    return json.dumps(result)
Retry with capped exponential backoff. For transient failures like 429, wait and retry — 1 s, then 2 s, then 4 s. But cap it at 2 retries or so: beyond that you’re burning tokens and latency on a dependency that’s telling you to go away. The official study guide points to the Azure Architecture Center’s Retry pattern here, alongside the circuit breaker pattern — stop calling a dependency entirely after repeated failures, probing occasionally until it recovers. Know the concept for the exam; Scout won’t need an implementation until it has real traffic.
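A minimal sketch of capped exponential backoff, under stated assumptions: `RateLimited` stands in for whatever exception your HTTP client raises on a 429, and `with_backoff` is a hypothetical helper, not the lab's call_model(). The sleep function is injectable so the demo runs instantly.

```python
import time


class RateLimited(Exception):
    """Stand-in for an HTTP 429 from the tool or the LLM endpoint (hypothetical)."""


def with_backoff(call, max_retries=2, base_delay=1.0, sleep=time.sleep):
    """Run call(); on RateLimited wait base_delay * 2**attempt, at most max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries:
                raise  # cap reached: let the caller turn this into an observation
            sleep(base_delay * 2 ** attempt)


# Demo: a fake dependency that is rate-limited twice, then succeeds.
attempts = {"count": 0}
waits = []


def flaky():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimited("429 Too Many Requests")
    return "ok"


result = with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

With the default cap of two retries the waits are 1 s and 2 s; a third failure propagates to the caller. Production versions often add random jitter to the delay, which this sketch leaves out.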
Cap the iterations. max_iterations is the seatbelt for failure mode four: a hard
upper bound on loop turns, after which the agent stops and says so. Both lab versions
set it to 6. The cap is the backstop, not the steering: Scout’s system prompt also says
one or two searches are usually enough — without that nudge, a reasoning model will
happily rephrase the same search six times. An unbounded loop attached to a paid API is
an incident report waiting to be filed.
What you should not do is silence anything. A bare except: pass around a tool call
doesn’t make the agent robust; it makes the model reason over a hole in its own
transcript.
Why a Framework? Your First LangGraph Graph
The hand-rolled loop works, and it’s honest work — every line visible. So why reach for
a framework? Because of what the manual version doesn’t have. Its state lives in a
local variable inside one function — nothing outside the loop can inspect or resume it.
Its routing is an if welded inside a while. None of it is reusable as the system
grows past one agent, nothing streams without extra plumbing, and when Scout later needs
to pause mid-run for a human or recover a crashed session, local variables have nothing
to offer.
LangGraph’s bet is to make those things explicit. You model the loop as a graph: a
StateGraph is a graph whose nodes read and update one shared, typed state object. A
node is a function that takes the state and returns an update. An edge declares
which node runs next; a conditional edge decides it at runtime from the state —
your if tool_calls: promoted to a declared, inspectable routing rule. Wire them up,
compile, and the framework runs the loop.
The state is the part with a future. ScoutState is born here with exactly two fields:
# module-02/scout/state.py
import operator
from typing import Annotated, TypedDict


class ScoutState(TypedDict):
    question: str
    # OpenAI-format message dicts; operator.add means nodes RETURN new
    # messages and LangGraph appends them — nodes never mutate state.
    messages: Annotated[list, operator.add]
This state object will grow module by module — fields are only ever added, never
renamed. A planner adds plan (Module 4), retrieval-augmented generation (RAG) adds
sources (Module 6), the agent team adds claims and report (Module 7). The
Annotated[list, operator.add] is a reducer: it tells LangGraph how to merge a
node’s returned messages into the state — append, here.
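To make the reducer idea concrete without running LangGraph, here is a plain-Python illustration. `apply_update` is a hypothetical helper that mimics the merge semantics described above; it is not LangGraph's implementation.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class ScoutState(TypedDict):
    question: str
    messages: Annotated[list, operator.add]


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's returned update into the state, honoring Annotated reducers."""
    hints = get_type_hints(ScoutState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata:
            merged[key] = metadata[0](state[key], value)  # reducer: old + new
        else:
            merged[key] = value  # no reducer declared: overwrite
    return merged


state = {"question": "What is new?", "messages": [{"role": "user", "content": "What is new?"}]}
update = {"messages": [{"role": "assistant", "content": "Let me search."}]}
new_state = apply_update(state, update)
print(len(state["messages"]), len(new_state["messages"]))
```

Because operator.add on lists builds a new list, the original state is left untouched, which matches the "nodes never mutate state" comment in state.py.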
The honest comparison:
| | Hand-rolled loop | LangGraph |
|---|---|---|
| State management | Local variables in one function | Typed ScoutState, shared and inspectable |
| Routing | if/else inside a while | Declared conditional edges |
| Loop control | for _ in range(MAX_ITERATIONS) | Routing decision + recursion limit |
| Persistence-ready | No — state dies with the process | Yes — a checkpointer plugs in (Module 5) |
| Streaming | Build it yourself | graph.stream() built in — used in this lab |
| Lines of code | ~80, all yours | ~150, but state, routing, and streaming are declared, not improvised |
And the graph you’ll build — same loop, drawn instead of nested:
flowchart LR
S((START)) --> A[agent<br/>LLM call with tool schemas]
A -->|tool_calls pending| T[tools<br/>hand-written executor]
T --> A
A -->|no tool_calls, or<br/>iteration cap hit| E((END))
Scout’s first graph: the conditional edge after agent is the ReAct loop’s exit door.
One sentence on what you just bought, then we build: explicit state and declarative routing are what turn persistence (Module 5), human interrupts (Module 9), and the move to a multi-agent team into plug-in changes instead of rewrites.
Hands-on lab: build it
Objective: write the same ReAct agent twice — raw SDK, then LangGraph — and verify
both produce the same trace. The full code lives in
module-02/ of the labs repo;
the article shows the load-bearing excerpts.
Observable result: uv run python -m scout.graph "What did NVIDIA announce at GTC 2026?" prints the loop trace, then a final answer grounded in fresh web results, with
source URLs.
Step 1 — Setup
From the repo root (uv sync is enough if you cloned after this module landed):
uv add langgraph~=1.2 langchain~=1.3 langchain-core~=1.4
(Only langgraph is imported here; the other two pins belong to the course’s frozen
stack — locked now so later modules don’t move uv.lock.)
Create a free API key at tavily.com — free tier, no credit
card — and add it to the repo-root .env next to your NVIDIA key:
NVIDIA_API_KEY=nvapi-...
TAVILY_API_KEY=tvly-...
Same rule as always: keys live in .env, which is gitignored — never in code.
Step 2 — The tool
scout/tools.py holds the function and the schema you saw above, side by side. The
function is deliberately a bare httpx wrapper — no integration package — because a
tool is just a function:
# module-02/scout/tools.py (excerpt)
def web_search(query: str, max_results: int = 5) -> list[dict]:
    """Search the web via the Tavily API and return the top results."""
    config.load_env()
    api_key = os.environ.get("TAVILY_API_KEY")
    if not api_key:
        raise RuntimeError("TAVILY_API_KEY is not set — get a free key "
                           "at tavily.com and add it to the repo-root .env")
    response = httpx.post(
        TAVILY_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"query": query, "max_results": max_results},
        timeout=30.0,
    )
    response.raise_for_status()
    return [
        # Snippets only: Scout does not fetch or ingest pages until module 06.
        {"title": r["title"], "url": r["url"], "content": r["content"][:500]}
        for r in response.json()["results"]
    ]
The file’s third piece is execute_tool() — the dispatcher you already read in the
error-handling section, sitting right under the function and the schema.
Step 3 — The loop by hand
scout/react_manual.py is the whole agent in ~80 lines. It opens with the other
reliability lever of exam objective 2.1 — the system prompt:
# module-02/scout/react_manual.py (the agent's standing orders)
SYSTEM_PROMPT = (
    "You are Scout, a research assistant. Use the web_search tool for current "
    "events and for any fact you are not certain about. One or two searches "
    "are usually enough — then answer with what you found instead of "
    "searching again. In your final answer, cite the sources you used as "
    "full URLs (starting with http), never as titles alone."
)
Two of those sentences exist because live runs failed without them: the search nudge you read about in the error-handling section, and the full-URL requirement that keeps answers verifiable instead of citing titles. The heart:
# module-02/scout/react_manual.py (the loop — full file in the repo)
def run_agent(question: str, client=None) -> str:
    """The ReAct loop: reason -> act -> observe, until a final answer."""
    client = client or llm.get_client()
    messages: list[dict] = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    for _ in range(MAX_ITERATIONS):
        choice = call_model(client, messages).choices[0]
        message = choice.message
        if not message.tool_calls:  # no tool requested = the final answer
            if not message.content:
                # Module 01's trap again: reasoning ate the whole token budget.
                raise RuntimeError(
                    "Empty final answer "
                    f"(finish_reason={choice.finish_reason}) — raise max_tokens."
                )
            return message.content
        # The model only ASKED. Recording its request, executing it, and
        # answering each tool_call_id is our job — one reply per call.
        messages.append(
            {"role": "assistant", "content": message.content,
             "tool_calls": [call.model_dump() for call in message.tool_calls]}
        )
        for call in message.tool_calls:
            print(f"[agent] tool_call: {call.function.name}({call.function.arguments})")
            observation = execute_tool(call.function.name, call.function.arguments)
            print(f"[tools] {describe_observation(observation)}")
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": observation}
            )
    return f"Stopped after {MAX_ITERATIONS} iterations without a final answer."
call_model() (not shown) is one tool-enabled completion with capped backoff on
HTTP 429, and generous max_tokens — Nemotron 3 spends reasoning tokens before
answering (Module 1 covered that trap).
Step 4 — Run it and read the trace
cd module-02
uv run python -m scout.react_manual "What did NVIDIA announce at GTC 2026?"
[agent] tool_call: web_search({"query": "NVIDIA GTC 2026 announcements"})
[tools] 5 results
At GTC 2026 (March 2026), NVIDIA announced the Nemotron 3 family ...
Sources:
1. https://...
2. https://...
That trace is the whole module in three lines: the model reasoned that GTC 2026 is
past its knowledge, acted by requesting web_search, observed five results, and
answered with citations. Ask something timeless (“What is a mutex?”) and you’ll see zero
tool calls — the decision is the model’s, which is what makes this an agent.
Step 5 — The same agent as a graph
scout/state.py defines ScoutState (shown earlier). scout/graph.py rebuilds the
loop as nodes and edges — same client from scout/llm.py, same tools, same system
prompt, imported from the manual version so the two can’t drift:
# module-02/scout/graph.py (structure — full file in the repo)
def tools_node(state: ScoutState) -> dict:
    """Execute every tool call the model just requested — our code, by hand."""
    last = state["messages"][-1]
    results = []
    for call in last["tool_calls"]:
        observation = execute_tool(call["function"]["name"],
                                   call["function"]["arguments"])
        results.append(
            {"role": "tool", "tool_call_id": call["id"], "content": observation}
        )
    return {"messages": results}


def route_after_agent(state: ScoutState) -> str:
    """Conditional edge: tools requested -> 'tools'; otherwise -> END."""
    llm_turns = sum(m["role"] == "assistant" for m in state["messages"])
    last = state["messages"][-1]
    if last.get("tool_calls") and llm_turns < MAX_ITERATIONS:
        return "tools"
    return END


def build_graph():
    builder = StateGraph(ScoutState)
    builder.add_node("agent", agent_node)
    builder.add_node("tools", tools_node)
    builder.add_edge(START, "agent")
    builder.add_conditional_edges("agent", route_after_agent,
                                  {"tools": "tools", END: END})
    builder.add_edge("tools", "agent")
    return builder.compile()
Three things to notice. The while is gone — the cycle agent → tools → agent is
the loop, and the conditional edge is its exit. The max_iterations guard became a
routing decision: at the cap, the edge routes to END even if the model wanted another
tool — and main() then prints the same Stopped after 6 iterations... line as the
manual version, so the two stay comparable even on failure. And the tool executor is
our own ten lines, not a prebuilt import.
Run it on the same question and compare with step 4 — the traces match line for line.
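The routing rule can also be exercised on hand-built states, with no graph and no network. In this sketch `route_after_agent` is copied from the excerpt above, MAX_ITERATIONS is the lab's value of 6, and `END = "__end__"` is a stand-in for langgraph.graph.END so the snippet runs without LangGraph installed; the `assistant` helper and the sample states are invented for illustration.

```python
END = "__end__"  # stand-in for langgraph.graph.END (assumed to be this string)
MAX_ITERATIONS = 6


def route_after_agent(state: dict) -> str:
    """Conditional edge: tools requested -> 'tools'; otherwise -> END."""
    llm_turns = sum(m["role"] == "assistant" for m in state["messages"])
    last = state["messages"][-1]
    if last.get("tool_calls") and llm_turns < MAX_ITERATIONS:
        return "tools"
    return END


def assistant(wants_tool: bool) -> dict:
    """Build a fake assistant message, with or without a pending tool call."""
    if not wants_tool:
        return {"role": "assistant", "content": "Final answer.", "tool_calls": None}
    call = {"id": "call_x", "type": "function",
            "function": {"name": "web_search", "arguments": "{}"}}
    return {"role": "assistant", "content": None, "tool_calls": [call]}


user = {"role": "user", "content": "What did NVIDIA announce at GTC 2026?"}
tool_reply = {"role": "tool", "tool_call_id": "call_x", "content": "[]"}

first_turn = {"messages": [user, assistant(True)]}
answered = {"messages": [user, assistant(False)]}
# Five completed search rounds, then a sixth assistant turn that wants yet another.
at_cap = {"messages": [user] + [assistant(True), tool_reply] * 5 + [assistant(True)]}

print(route_after_agent(first_turn), route_after_agent(answered), route_after_agent(at_cap))
```

The third state shows the guard at work: the model still wants a tool on its sixth turn, and the edge routes to END anyway.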
Step 6 — Watch the loop stream
The graph gives you streaming without extra plumbing. Two stream modes at once:
"updates" emits one event per node run (the loop events); "custom" carries whatever
nodes emit through get_stream_writer() — here, every answer token as it arrives:
# module-02/scout/graph.py (main — full file in the repo)
for mode, chunk in graph.stream(initial_state(question),
                                stream_mode=["updates", "custom"]):
    if mode == "custom":
        print(chunk["token"], end="", flush=True)  # answer tokens, live
        continue
    for update in chunk.values():
        for message in update["messages"]:
            ...  # print [agent] / [tools] trace lines
(One thing _collect() receives and deliberately drops: delta.reasoning_content,
Nemotron 3’s thinking stream — reasoning budgets are Module 4 territory.)
Run step 5’s command again: the [agent] and [tools] lines frame the run, then the
final answer types itself out token by token. That’s a dynamic conversation flow with
real-time streaming, client-side — the exam objective, covered. Serving those tokens to
end users over HTTP is a Module 10 conversation; the mechanism stays the same.
Step 7 — Smoke tests
uv run pytest module-02/tests/ # offline: contracts + loop logic
SCOUT_LIVE_TESTS=1 uv run pytest module-02/tests/ # + 2 real agent runs
The offline tests replay the whole loop against a scripted fake client — both versions,
zero network. The live tests run both agents on a real question and assert the contract:
at least one tool call, a cited non-empty answer, at most 6 LLM calls. The cumulative
rule holds: module-02/scout/ still passes Module 1’s tests.
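For a sense of how an offline test can replay the loop, here is a minimal sketch of a scripted fake client. `ScriptedClient` and `tool_call` are hypothetical names, not the lab's test helpers; the fake only imitates the client.chat.completions.create() shape that an OpenAI-compatible SDK exposes.

```python
from types import SimpleNamespace


class ScriptedClient:
    """Replays canned assistant messages through chat.completions.create()."""

    def __init__(self, scripted_messages):
        self._replies = iter(scripted_messages)
        self.calls = 0
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, **kwargs):
        # Ignore the request entirely and return the next scripted message.
        self.calls += 1
        message = next(self._replies)
        finish = "tool_calls" if message.tool_calls else "stop"
        return SimpleNamespace(
            choices=[SimpleNamespace(message=message, finish_reason=finish)]
        )


def tool_call(call_id, name, arguments):
    """Build an object shaped like one entry of message.tool_calls."""
    return SimpleNamespace(
        id=call_id, type="function",
        function=SimpleNamespace(name=name, arguments=arguments),
    )


script = [
    SimpleNamespace(content=None, tool_calls=[
        tool_call("call_1", "web_search", '{"query": "GTC 2026"}')]),
    SimpleNamespace(content="Answer. Sources: https://example.com", tool_calls=None),
]
client = ScriptedClient(script)

first = client.chat.completions.create(model="fake", messages=[], tools=[])
second = client.chat.completions.create(model="fake", messages=[], tools=[])
print(first.choices[0].message.tool_calls[0].function.name)
print(second.choices[0].message.content, client.calls)
```

A fake passed to the lab's run_agent() would also need a model_dump() method on each tool call, since the loop serializes the request that way; this sketch omits it.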
Try it yourself (no solution provided):
- Add a second tool, get_current_date: no arguments, returns today’s date as text. Register it in TOOLS and TOOL_SCHEMAS, ask “What day is it today?”, and watch the model pick the right tool purely from the descriptions.
- Break your Tavily key in .env, re-run the graph, and verify the agent explains the failure instead of crashing: that is execute_tool turning an exception into an observation.
Exam corner
What the exam tests here. Per the official study guide, Agent Development (D2, 15%
of the exam) covers: engineering prompts and dynamic prompt chains for reliability (2.1
— system prompt, tool descriptions, the loop’s accumulating messages); building and
connecting custom tools, APIs, and functions (2.3 — the heart of this module); error
handling with retry logic and graceful failure recovery (2.4); dynamic conversation
flows with real-time streaming (2.5); and evaluating and refining agent decision-making
(2.6 — at this stage: read the trace, adjust tool descriptions and iteration caps;
formal evaluation is Module 8). The reasoning-and-action framework itself (ReAct) is
objective 1.2 from the architecture domain — you’ve now implemented it.
Quiz — answers after question 5.
1. An agent’s transcript reads: (1) user question → (2) assistant message containing tool_calls → (3) ??? → (4) assistant message with the final answer. What happened at step 3?
   - A) The model executed the tool and read the result itself
   - B) Application code executed the tool and appended a role: "tool" message with the matching tool_call_id
   - C) The tool result was inserted into the system prompt
   - D) The transcript was reset so the model could start fresh
2. Mid-run, web_search raises a timeout exception. Which handling is best?
   - A) Let the exception propagate: failing fast prevents wrong answers
   - B) Catch it and return an empty result list so the run continues smoothly
   - C) Catch it, return the error text to the model as the tool result, and retry transient failures with backoff, capped
   - D) Catch it and retry the tool in a loop until it succeeds
3. Your agent answers a current-events question from its training data instead of calling the web_search tool sitting right there in tools=. The most likely cause?
   - A) The temperature is set too high
   - B) The tool’s description doesn’t tell the model when the tool applies
   - C) The model is too small to use tools
   - D) max_tokens is too low
4. An agent keeps searching (six, seven, eight tool calls), burning tokens without concluding. Which fix addresses the failure directly?
   - A) Increase the context window so it can hold more results
   - B) Remove the search tool so it has to answer
   - C) Cap iterations and give the loop an explicit termination condition (no tool_calls → END)
   - D) Raise max_tokens so it can think longer per turn
5. A ReAct agent starts with the working assumption that a product launched in 2025. Its first search returns documentation showing it launched in 2026. What does a correctly built ReAct agent do next?
   - A) Completes its original plan, then mentions the discrepancy at the end
   - B) Incorporates the observation into its next reasoning step and adjusts its next action accordingly
   - C) Discards the run and restarts with a corrected assumption
   - D) Stops and asks the user which year is correct
Answers.
1 — B. The model emits the request; application code executes it and replies with
role: "tool" plus the matching tool_call_id — one reply per call. (A is the classic
trap: the model never executes anything. C is wrong: results go into the conversation,
not the system prompt.)
2 — C. Surfacing the error as an observation lets the model adapt — rephrase, try
another approach, or explain the failure — and a capped backoff handles transience.
A turns a hiccup into an outage, B hides the failure (the model may conclude “no
results exist”), D is an infinite retry — a failure mode, not a strategy.
3 — B. Tool selection is driven by the schema text the model reads. A vague
description gives the model no reason to prefer the tool over its own parametric
knowledge. Temperature and model size can contribute; the description is the first,
cheapest, most likely fix.
4 — C. The direct fix for non-termination is a termination guard: max_iterations
plus an explicit stop condition — in the graph, the conditional edge routing to END.
A and D make each wander longer; B amputates the agent instead of bounding it.
5 — B. ReAct’s defining property is that each reasoning step conditions on the
latest observation — the agent adjusts mid-run rather than executing a fixed plan.
(C wastes the work; D abdicates a decision the evidence already settled.)
Traps to avoid:
- “The model executes the tool.” It emits a structured request; your code executes. If an answer option has the model running anything, it’s bait.
- “Function calling” vs “tool calling.” Same mechanism; the exam and the blueprint say tool calling.
- The dangling tool call. Every tool_call expects exactly one role: "tool" reply with its tool_call_id. Forget the reply, or the id, and the next API call rejects the transcript. “The loop worked once, then errored” scenarios often hinge on this.
- More prompt instructions ≠ more reliability. Every system-prompt edit is an unevaluated deploy: version it and re-run the evals (Module 8) before trusting it.
- Streaming improves perceived latency, not total latency. Tokens arrive as they are generated, so the user sees progress sooner — the run takes just as long. It’s a UX lever, not a performance one.
Key takeaways
- The LLM never executes tools. It emits a structured request; your code executes it and returns the result — you, not the model, control what can happen.
- The lifecycle: schemas in via tools= → tool_calls out → your code executes → result back via role: "tool" + tool_call_id → continue; a reply without tool_calls is the final answer.
- Tool descriptions are prompt engineering: when the agent picks the wrong tool or none at all, fix the description first.
- Tool errors go back to the model as observations; retries get backoff and a hard cap, and every agent loop gets max_iterations.
- The ReAct loop is a dynamic prompt chain: each iteration rebuilds the prompt by appending the latest observations to messages.
- LangGraph buys you explicit state, declarative routing, and built-in streaming: the same loop, its moving parts promoted to inspectable structure.
- ScoutState is born (question, messages); from here on it only ever grows by adding fields.
Keep going
Scout works — but should it even be an agent? Next module: architecture patterns, trade-offs, and when a plain workflow beats an agent.
References
- ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022, arXiv 2210.03629): the founding paper behind the loop you just built.
- LangGraph Graph API: official docs for StateGraph, nodes, edges, conditional edges, and reducers (LangGraph 1.x).
- LangGraph streaming: stream modes, including custom mode with get_stream_writer for any LLM client.
- NVIDIA NIM LLM APIs: the OpenAI-compatible endpoint at integrate.api.nvidia.com/v1 used by every Scout call.
- Tavily Search API: the endpoint behind Scout’s web_search tool.
- NCP-AAI certification page: the official blueprint; Agent Development is weighted at 15%.