Build Your First AI Agent: Tool Calling from Scratch to LangGraph (NCP-AAI Module 2)

This is Module 2 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.

Your Module 1 setup works: Nemotron answers in half a second, sounding confident. Now ask it what NVIDIA announced at GTC 2026. Either it admits its training data ends too early, or — worse — it invents a keynote. Confidently. With dates.

That’s not a flaw you can prompt away. A bare LLM is a frozen snapshot: it cannot look anything up, run anything, or check its claims against the world. It can only talk. Every “agent” demo you’ve ever seen rests on one mechanism — the model asking your code to act on its behalf, then reasoning over what came back.

Today you build that mechanism twice: by hand in about 80 lines, so you know what every piece does, then in LangGraph, so you know exactly what a framework saves you — and what it costs. By the end, Scout answers questions about last month with sources instead of guesses.

In this module

You’ll learn:
- Implement the ReAct loop (reason → act → observe) from scratch with a raw OpenAI-compatible client.
- Define tool schemas and execute the full tool-call lifecycle: request, execution, result, continuation.
- Handle tool failures gracefully: surface errors to the model, retry with backoff, cap iterations.
- Rebuild the same agent with LangGraph primitives — StateGraph, nodes, conditional edges, state — and stream loop events and answer tokens client-side.
- Explain what a framework buys you — and recognize the deprecated patterns that exam-era tutorials still teach.
You’ll build: Scout v0.2 — a single-agent ReAct loop with a web_search tool, hand-rolled first, then as your first LangGraph graph.
Exam domains covered: D2 — Agent Development — 15% of the exam.
Prerequisites: Module 1 (NVIDIA API key, uv, the labs repo cloned) plus a free Tavily API key — you’ll create it in the lab.

Where you are

✅ Module 1 — What Is Agentic AI? — vocabulary, landscape, first NIM call
👉 Module 2 — Build Your First AI Agent (you are here)
⬜ Modules 3–14 — architecture, cognition, memory, RAG, multi-agent, evals, guardrails, deployment, and the exam

Scout before: config.py and one direct call — question in, answer out, the model decides nothing but words. Scout after: an agent that decides to call web_search, observes, loops, and answers with fresh facts and source URLs. ScoutState — the state object that will carry Scout to Module 13 — exists.

From One Call to a Loop: The ReAct Pattern

Why isn’t one LLM call enough? Three structural reasons. The model’s knowledge is frozen at training time, so anything recent is invisible. It cannot act — no search, no API call, no file read. And it gets no feedback: whatever it says first is final, with no chance to notice it’s wrong.

The fix is a pattern, not a bigger model. ReAct is an agent pattern that interleaves reasoning and action: the model thinks about what it needs, takes one action, observes the result, and reasons again, looping until it can answer. The name comes from the 2022 paper by Yao et al. (arXiv 2210.03629), where the loop ran on parsed text: the model literally wrote Thought: and Action: lines, and the surrounding code parsed the action, ran it, and appended an Observation: line. Today you don’t parse anything: modern APIs structure the same loop as JSON through native tool calling, the mechanism this whole module is about.

Here is one full turn of the loop you’re about to build:

sequenceDiagram
    participant U as User
    participant L as Your code (the loop)
    participant M as LLM (Nemotron via NIM)
    participant T as web_search (your function)
    U->>L: "What did NVIDIA announce at GTC 2026?"
    L->>M: messages + tool schemas
    M-->>L: tool_calls: web_search({"query": "..."})
    L->>T: execute the call
    T-->>L: results (or an error string)
    L->>M: messages + role:"tool" result
    M-->>L: final answer, no tool_calls
    L-->>U: answer with source URLs

One ReAct turn: the model never touches the tool — it asks, your code acts.

Look at who does what. The model only ever produces messages. Your code — the thing in the middle — sends the transcript, executes requests, appends results, and decides when to stop. That loop is the agent; the rest of this course is that loop growing more sophisticated.

Tool Calling: How the Model Asks Your Code to Act

Tool calling is the structured mechanism by which a model requests an action from your code: it emits a JSON request naming a function and its arguments; your code executes and returns the result. The model learns what it can request from a schema you pass on every call. Here is Scout’s first tool, exactly as the model sees it:

# module-02/scout/tools.py (the schema — the function itself is ~20 lines of httpx)
WEB_SEARCH_SCHEMA = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": (
            "Search the web for current, factual information. Use this for "
            "anything that may have happened after your training data, and "
            "for any fact you are not certain about. Returns the top results "
            "as title, URL, and content snippet."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "A plain-language search query, "
                                   "e.g. 'NVIDIA GTC 2026 announcements'.",
                },
            },
            "required": ["query"],
        },
    },
}

Read that description again, because it’s the most underrated line in agent development: the tool description is prompt engineering. The model decides when to call web_search from that text and nothing else — it never sees your Python. When an agent ignores a tool it should use, or hammers one it shouldn’t, the description is the first thing to fix.

The lifecycle has four steps, and the exam tests the order:

1. Request: you call the API with tools=[...]; the model replies with an assistant message carrying tool_calls, each with an id, a function name, and JSON arguments. Note the plural: one turn can request several calls, hence the lab’s for loops.
2. Execution: your code dispatches each call to the matching Python function.
3. Result: you append one message with role: "tool", the matching tool_call_id, and the result as content. Each tool call gets exactly one such reply.
4. Continuation: you call the model again with the grown transcript. A reply without tool_calls is the signal: that’s the final answer, and the loop ends.

In the data, a complete turn is five messages — worth seeing once in full, because exam scenarios poke at exactly this id plumbing:

[
  {"role": "system", "content": "You are Scout, a research assistant. ..."},
  {"role": "user", "content": "What did NVIDIA announce at GTC 2026?"},
  {"role": "assistant", "content": null, "tool_calls": [
    {"id": "call_1", "type": "function", "function": {
      "name": "web_search",
      "arguments": "{\"query\": \"NVIDIA GTC 2026 announcements\"}"}}]},
  {"role": "tool", "tool_call_id": "call_1", "content": "[{\"title\": \"...\", ...}]"},
  {"role": "assistant", "content": "At GTC 2026, NVIDIA announced ... Sources: ..."}
]

content is null when the model only requests tools; the tool message answers call_1 by id; the reply with no tool_calls ends the loop.
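
Because the id plumbing is easy to get wrong, it can help to check a transcript before sending it. The sketch below is a hypothetical helper (find_unanswered_tool_calls is not part of the lab code) that assumes OpenAI-format message dicts like the ones above and returns the ids of requested tool calls that have no matching role: "tool" reply yet.

```python
def find_unanswered_tool_calls(messages: list[dict]) -> list[str]:
    """Return ids of requested tool calls that have no role:"tool" reply."""
    requested: list[str] = []
    answered: set[str] = set()
    for message in messages:
        if message.get("role") == "assistant":
            # tool_calls may be absent or None on a plain text reply.
            for call in message.get("tool_calls") or []:
                requested.append(call["id"])
        elif message.get("role") == "tool":
            answered.add(message["tool_call_id"])
    return [call_id for call_id in requested if call_id not in answered]


# One assistant turn requests two searches, but only one was answered.
transcript = [
    {"role": "user", "content": "What did NVIDIA announce at GTC 2026?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "web_search",
                      "arguments": "{\"query\": \"GTC 2026\"}"}},
        {"id": "call_2", "type": "function",
         "function": {"name": "web_search",
                      "arguments": "{\"query\": \"Nemotron 3\"}"}},
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": "[]"},
]
print(find_unanswered_tool_calls(transcript))  # ['call_2']
```

An empty list means every request has its reply, so the transcript is safe to send on the next continuation call.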

Notice what step 4 implies: every iteration rebuilds the next prompt by appending observations to messages. That makes the ReAct loop a dynamic prompt chain — each prompt constructed at runtime from the results of the previous step, rather than fixed in advance. The exam’s “dynamic prompt chains” objective sounds exotic; you’re looking at it — the loop is the entire implementation.

Three names exist for one mechanism:

Term	Where you’ll see it	Anything different?
tool calling	The NCP-AAI blueprint, LangGraph docs, this course	The canonical term — use it
“function calling”	OpenAI’s original API docs and older tutorials	Same mechanism, legacy name
“tools”	Anthropic and NIM API parameter names	Same mechanism (`tools=`, `tool_calls`)

One adjacent dial: tool_choice. The default "auto" lets the model decide whether to call a tool — that decision is the agentic part. "required" forces a call, "none" forbids one, naming a function forces that one. Scout stays on auto.
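
To make the four settings concrete, here is a minimal sketch of the values as they appear in an OpenAI-style chat completion request. The helper name tool_choice_for is hypothetical, and whether a given NIM-hosted model honors every value is an assumption to check against that model’s documentation.

```python
from typing import Optional, Union


def tool_choice_for(mode: str, function_name: Optional[str] = None) -> Union[str, dict]:
    """Map a readable mode to the tool_choice value of a chat completion."""
    if mode in ("auto", "required", "none"):
        return mode  # plain strings: model decides / must call / may not call
    if mode == "function":
        if not function_name:
            raise ValueError("mode 'function' needs a function_name")
        # Naming a function forces a call to exactly that tool.
        return {"type": "function", "function": {"name": function_name}}
    raise ValueError(f"unknown tool_choice mode: {mode!r}")


print(tool_choice_for("auto"))                    # auto
print(tool_choice_for("function", "web_search"))
```

The returned value would be passed as the tool_choice argument next to tools=[...] on the same request.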

When Tools Fail: Error Handling in the Loop

A loop that calls real APIs will hit real failures, and the exam’s error-handling objective is about exactly this. Four failure modes cover most of what you’ll see:

1. The tool raises. Tavily times out, the network drops, the key is wrong.
2. The model sends invalid arguments. Malformed JSON, a missing field, a parameter that doesn’t exist.
3. An API rate-limits you. HTTP 429, on the tool side or on the LLM side: the NVIDIA Inference Microservices (NIM) free tier allows 40 requests per minute as of June 2026.
4. The model loops without concluding. Every iteration is a billed LLM call; an agent that never decides it’s done is a token furnace.

The strategies, in the order you should reach for them:

Return the error to the model as an observation. This is the counterintuitive one. Instead of crashing, catch the exception and send the error text back as the tool result. Given that feedback, the model usually repairs its own call on the next iteration — or tells the user plainly that the search failed. This contract is small enough to show whole; it’s the lab’s execute_tool(), the dispatcher both versions of the agent share:

# module-02/scout/tools.py (the dispatcher — the error contract in code)
def execute_tool(name: str, arguments_json: str) -> str:
    """Run one tool call and ALWAYS return a string observation."""
    tool = TOOLS.get(name)
    if tool is None:
        return f"Error: unknown tool '{name}'. Available tools: {', '.join(TOOLS)}."
    try:
        arguments = json.loads(arguments_json)
        result = tool(**arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        # The model sent malformed or mismatched arguments — tell it so.
        return f"Error: invalid arguments for '{name}': {exc}"
    except (httpx.HTTPError, RuntimeError) as exc:
        # The tool itself failed (network, HTTP status, missing key) — surface it.
        return f"Error: '{name}' failed: {exc}"
    return json.dumps(result)

Retry with capped exponential backoff. For transient failures like 429, wait and retry with doubling delays: 1 s, then 2 s, then 4 s. But cap it at three retries or so: beyond that you’re burning tokens and latency on a dependency that’s telling you to go away. The official study guide points to the Azure Architecture Center’s Retry pattern here, alongside the circuit breaker pattern, which stops calling a dependency entirely after repeated failures and probes occasionally until it recovers. Know the concept for the exam; Scout won’t need an implementation until it has real traffic.
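
A minimal sketch of capped exponential backoff, using only the standard library. It is not the lab’s call_model(): TransientError is a hypothetical stand-in for whatever retryable error your client raises (an HTTP 429, for example), and the retry count and base delay are illustrative parameters. The sleep function is injectable so the schedule can be checked without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as HTTP 429."""


def retry_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on TransientError wait base_delay * 2**attempt, then retry.

    Once max_retries retries have failed, the last error propagates so the
    caller (a tool dispatcher, say) can surface it as an observation.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # cap reached: stop retrying
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable when max_retries >= 0")


# Usage: a call that fails twice, then succeeds. Delays are recorded, not slept.
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("429 Too Many Requests")
    return "ok"

waits: list[float] = []
print(retry_with_backoff(flaky, sleep=waits.append), waits)  # ok [1.0, 2.0]
```

Only the error type you consider transient is retried; anything else propagates immediately, which keeps a permanent failure such as a wrong key from being retried at all.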

Cap the iterations. max_iterations is the seatbelt for failure mode four: a hard upper bound on loop turns, after which the agent stops and says so. Both lab versions set it to 6. The cap is the backstop, not the steering: Scout’s system prompt also says one or two searches are usually enough — without that nudge, a reasoning model will happily rephrase the same search six times. An unbounded loop attached to a paid API is an incident report waiting to be filed.

What you should not do is silence anything. A bare except: pass around a tool call doesn’t make the agent robust; it makes the model reason over a hole in its own transcript.

Why a Framework? Your First LangGraph Graph

The hand-rolled loop works, and it’s honest work — every line visible. So why reach for a framework? Because of what the manual version doesn’t have. Its state lives in a local variable inside one function — nothing outside the loop can inspect or resume it. Its routing is an if welded inside a while. None of it is reusable as the system grows past one agent, nothing streams without extra plumbing, and when Scout later needs to pause mid-run for a human or recover a crashed session, local variables have nothing to offer.

LangGraph’s bet is to make those things explicit. You model the loop as a graph: a StateGraph is a graph whose nodes read and update one shared, typed state object. A node is a function that takes the state and returns an update. An edge declares which node runs next; a conditional edge decides it at runtime from the state — your if tool_calls: promoted to a declared, inspectable routing rule. Wire them up, compile, and the framework runs the loop.

The state is the part with a future. ScoutState is born here with exactly two fields:

# module-02/scout/state.py
import operator
from typing import Annotated, TypedDict

class ScoutState(TypedDict):
    question: str
    # OpenAI-format message dicts; operator.add means nodes RETURN new
    # messages and LangGraph appends them; nodes never mutate state.
    messages: Annotated[list, operator.add]

This state object will grow module by module — fields are only ever added, never renamed. A planner adds plan (Module 4), retrieval-augmented generation (RAG) adds sources (Module 6), the agent team adds claims and report (Module 7). The Annotated[list, operator.add] is a reducer: it tells LangGraph how to merge a node’s returned messages into the state — append, here.
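
To see what the reducer does without running LangGraph, here is a simplified, standard-library illustration. apply_update is a hypothetical helper written for this sketch, not LangGraph’s implementation: it only mimics the rule that a field annotated with a reducer is merged as reducer(old, new), while any other field is overwritten.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class ScoutState(TypedDict):
    question: str
    messages: Annotated[list, operator.add]


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's returned update into state the way a reducer would."""
    hints = get_type_hints(ScoutState, include_extras=True)
    merged = dict(state)  # shallow copy: the old state object is left alone
    for key, value in update.items():
        # Annotated[...] types carry their extra arguments in __metadata__.
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata:
            merged[key] = metadata[0](state[key], value)  # operator.add here
        else:
            merged[key] = value
    return merged


state = {"question": "q", "messages": [{"role": "user", "content": "q"}]}
update = {"messages": [{"role": "assistant", "content": "a"}]}
new_state = apply_update(state, update)
print(len(state["messages"]), len(new_state["messages"]))  # 1 2
```

For lists, operator.add is concatenation, so a node that returns one new message ends up appended to the history rather than replacing it.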

The honest comparison:

	Hand-rolled loop	LangGraph
State management	Local variables in one function	Typed `ScoutState`, shared and inspectable
Routing	`if`/`else` inside a `while`	Declared conditional edges
Loop control	`for _ in range(MAX_ITERATIONS)`	Routing decision + recursion limit
Persistence-ready	No — state dies with the process	Yes — a checkpointer plugs in (Module 5)
Streaming	Build it yourself	`graph.stream()` built in — used in this lab
Lines of code	~80, all yours	~150, but state, routing, and streaming are declared, not improvised

And the graph you’ll build — same loop, drawn instead of nested:

flowchart LR
    S((START)) --> A[agent<br/>LLM call with tool schemas]
    A -->|tool_calls pending| T[tools<br/>hand-written executor]
    T --> A
    A -->|no tool_calls, or<br/>iteration cap hit| E((END))

Scout’s first graph: the conditional edge after agent is the ReAct loop’s exit door.

One sentence on what you just bought, then we build: explicit state and declarative routing are what turn persistence (Module 5), human interrupts (Module 9), and the move to a multi-agent team into plug-in changes instead of rewrites.

Hands-on lab: build it

Objective: write the same ReAct agent twice — raw SDK, then LangGraph — and verify both produce the same trace. The full code lives in module-02/ of the labs repo; the article shows the load-bearing excerpts.

Observable result: uv run python -m scout.graph "What did NVIDIA announce at GTC 2026?" prints the loop trace, then a final answer grounded in fresh web results, with source URLs.

Step 1 — Setup

From the repo root (uv sync is enough if you cloned after this module landed):

uv add langgraph~=1.2 langchain~=1.3 langchain-core~=1.4

(Only langgraph is imported here; the other two pins belong to the course’s frozen stack — locked now so later modules don’t move uv.lock.)

Create a free API key at tavily.com — free tier, no credit card — and add it to the repo-root .env next to your NVIDIA key:

NVIDIA_API_KEY=nvapi-...
TAVILY_API_KEY=tvly-...

Same rule as always: keys live in .env, which is gitignored — never in code.

Step 2 — The tool

scout/tools.py holds the function and the schema you saw above, side by side. The function is deliberately a bare httpx wrapper — no integration package — because a tool is just a function:

# module-02/scout/tools.py (excerpt)
def web_search(query: str, max_results: int = 5) -> list[dict]:
    """Search the web via the Tavily API and return the top results."""
    config.load_env()
    api_key = os.environ.get("TAVILY_API_KEY")
    if not api_key:
        raise RuntimeError("TAVILY_API_KEY is not set — get a free key "
                           "at tavily.com and add it to the repo-root .env")
    response = httpx.post(
        TAVILY_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"query": query, "max_results": max_results},
        timeout=30.0,
    )
    response.raise_for_status()
    return [
        # Snippets only: Scout does not fetch or ingest pages until module 06.
        {"title": r["title"], "url": r["url"], "content": r["content"][:500]}
        for r in response.json()["results"]
    ]

The file’s third piece is execute_tool() — the dispatcher you already read in the error-handling section, sitting right under the function and the schema.

Step 3 — The loop by hand

scout/react_manual.py is the whole agent in ~80 lines. It opens with the other reliability lever of exam objective 2.1 — the system prompt:

# module-02/scout/react_manual.py (the agent's standing orders)
SYSTEM_PROMPT = (
    "You are Scout, a research assistant. Use the web_search tool for current "
    "events and for any fact you are not certain about. One or two searches "
    "are usually enough — then answer with what you found instead of "
    "searching again. In your final answer, cite the sources you used as "
    "full URLs (starting with http), never as titles alone."
)

Two of those sentences exist because live runs failed without them: the search nudge you read about in the error-handling section, and the full-URL requirement that keeps answers verifiable instead of citing titles. The heart:

# module-02/scout/react_manual.py (the loop — full file in the repo)
def run_agent(question: str, client=None) -> str:
    """The ReAct loop: reason -> act -> observe, until a final answer."""
    client = client or llm.get_client()
    messages: list[dict] = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    for _ in range(MAX_ITERATIONS):
        choice = call_model(client, messages).choices[0]
        message = choice.message

        if not message.tool_calls:  # no tool requested = the final answer
            if not message.content:
                # Module 01's trap again: reasoning ate the whole token budget.
                raise RuntimeError(
                    "Empty final answer "
                    f"(finish_reason={choice.finish_reason}) — raise max_tokens."
                )
            return message.content

        # The model only ASKED. Recording its request, executing it, and
        # answering each tool_call_id is our job — one reply per call.
        messages.append(
            {"role": "assistant", "content": message.content,
             "tool_calls": [call.model_dump() for call in message.tool_calls]}
        )
        for call in message.tool_calls:
            print(f"[agent] tool_call: {call.function.name}({call.function.arguments})")
            observation = execute_tool(call.function.name, call.function.arguments)
            print(f"[tools] {describe_observation(observation)}")
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": observation}
            )
    return f"Stopped after {MAX_ITERATIONS} iterations without a final answer."

call_model() (not shown) is one tool-enabled completion with capped backoff on HTTP 429, and generous max_tokens — Nemotron 3 spends reasoning tokens before answering (Module 1 covered that trap).

Step 4 — Run it and read the trace

cd module-02
uv run python -m scout.react_manual "What did NVIDIA announce at GTC 2026?"

[agent] tool_call: web_search({"query": "NVIDIA GTC 2026 announcements"})
[tools] 5 results

At GTC 2026 (March 2026), NVIDIA announced the Nemotron 3 family ...
Sources:
1. https://...
2. https://...

That trace is the whole module in three lines: the model reasoned that GTC 2026 is past its knowledge, acted by requesting web_search, observed five results, and answered with citations. Ask something timeless (“What is a mutex?”) and you’ll see zero tool calls — the decision is the model’s, which is what makes this an agent.

Step 5 — The same agent as a graph

scout/state.py defines ScoutState (shown earlier). scout/graph.py rebuilds the loop as nodes and edges — same client from scout/llm.py, same tools, same system prompt, imported from the manual version so the two can’t drift:

# module-02/scout/graph.py (structure — full file in the repo)
def tools_node(state: ScoutState) -> dict:
    """Execute every tool call the model just requested — our code, by hand."""
    last = state["messages"][-1]
    results = []
    for call in last["tool_calls"]:
        observation = execute_tool(call["function"]["name"],
                                   call["function"]["arguments"])
        results.append(
            {"role": "tool", "tool_call_id": call["id"], "content": observation}
        )
    return {"messages": results}

def route_after_agent(state: ScoutState) -> str:
    """Conditional edge: tools requested and under the cap -> 'tools'; otherwise -> END."""
    llm_turns = sum(m["role"] == "assistant" for m in state["messages"])
    last = state["messages"][-1]
    if last.get("tool_calls") and llm_turns < MAX_ITERATIONS:
        return "tools"
    return END

def build_graph():
    builder = StateGraph(ScoutState)
    builder.add_node("agent", agent_node)
    builder.add_node("tools", tools_node)
    builder.add_edge(START, "agent")
    builder.add_conditional_edges("agent", route_after_agent,
                                  {"tools": "tools", END: END})
    builder.add_edge("tools", "agent")
    return builder.compile()

Three things to notice. The while is gone — the cycle agent → tools → agent is the loop, and the conditional edge is its exit. The max_iterations guard became a routing decision: at the cap, the edge routes to END even if the model wanted another tool — and main() then prints the same Stopped after 6 iterations... line as the manual version, so the two stay comparable even on failure. And the tool executor is our own ten lines, not a prebuilt import.

Run it on the same question and compare with step 4 — the traces match line for line.

Step 6 — Watch the loop stream

The graph gives you streaming without extra plumbing. Two stream modes at once: "updates" emits one event per node run (the loop events); "custom" carries whatever nodes emit through get_stream_writer() — here, every answer token as it arrives:

# module-02/scout/graph.py (main — full file in the repo)
for mode, chunk in graph.stream(initial_state(question),
                                stream_mode=["updates", "custom"]):
    if mode == "custom":
        print(chunk["token"], end="", flush=True)  # answer tokens, live
        continue
    for update in chunk.values():
        for message in update["messages"]:
            ...  # print [agent] / [tools] trace lines

(One thing _collect(), a helper in the full file that is not shown here, receives and deliberately drops: delta.reasoning_content, Nemotron 3’s thinking stream. Reasoning budgets are Module 4 territory.)

Run the graph again (uv run python -m scout.graph "What did NVIDIA announce at GTC 2026?"): the [agent] and [tools] lines frame the run, then the final answer types itself out token by token. That’s a dynamic conversation flow with real-time streaming, client-side: the exam objective, covered. Serving those tokens to end users over HTTP is a Module 10 conversation; the mechanism stays the same.

Step 7 — Smoke tests

# Run from the repo root (cd .. first if you are still inside module-02).
uv run pytest module-02/tests/                       # offline: contracts + loop logic
SCOUT_LIVE_TESTS=1 uv run pytest module-02/tests/    # + 2 real agent runs

The offline tests replay the whole loop against a scripted fake client — both versions, zero network. The live tests run both agents on a real question and assert the contract: at least one tool call, a cited non-empty answer, at most 6 LLM calls. The cumulative rule holds: module-02/scout/ still passes Module 1’s tests.
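
The lab’s real tests live in module-02/tests/ and are not reproduced here. As a rough idea of what “a scripted fake client” can look like, the sketch below replays canned replies into a compact loop shaped like run_agent. Every name in it (scripted_client, tool_request, final_answer, run_loop) is hypothetical, and the loop is trimmed to what the test needs.

```python
import json
from types import SimpleNamespace


def scripted_client(replies):
    """Return a call_model stand-in that replays canned assistant messages."""
    queue = list(replies)
    seen = []  # message count observed on each call

    def call_model(messages):
        seen.append(len(messages))
        return queue.pop(0)

    return call_model, seen


def tool_request(call_id, name, arguments):
    call = SimpleNamespace(
        id=call_id,
        function=SimpleNamespace(name=name, arguments=json.dumps(arguments)),
    )
    return SimpleNamespace(content=None, tool_calls=[call])


def final_answer(text):
    return SimpleNamespace(content=text, tool_calls=None)


def run_loop(call_model, execute_tool, question, max_iterations=6):
    """Compact ReAct loop with the model and the tools injected."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_iterations):
        message = call_model(messages)
        if not message.tool_calls:
            return message.content, messages
        messages.append({"role": "assistant", "content": message.content,
                         "tool_calls": [{"id": c.id} for c in message.tool_calls]})
        for call in message.tool_calls:
            observation = execute_tool(call.function.name, call.function.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": observation})
    return None, messages  # cap hit without a final answer


def test_one_search_then_answer():
    call_model, seen = scripted_client([
        tool_request("call_1", "web_search", {"query": "GTC 2026"}),
        final_answer("Answer. Sources: https://example.com"),
    ])
    answer, messages = run_loop(call_model, lambda name, args: "[]", "q")
    assert answer.startswith("Answer")
    assert seen == [1, 3]  # second call saw user + assistant + tool
    assert messages[-1] == {"role": "tool", "tool_call_id": "call_1", "content": "[]"}


test_one_search_then_answer()
```

Because both the model and the tool are injected, the same script can drive the iteration cap too: queue six tool requests and assert that the loop stops without an answer.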

Try it yourself (no solution provided):

Add a second tool, get_current_date — no arguments, returns today’s date as text. Register it in TOOLS and TOOL_SCHEMAS, ask “What day is it today?”, and watch the model pick the right tool purely from the descriptions.
Break your Tavily key in .env, re-run the graph, and verify the agent explains the failure instead of crashing — execute_tool turning an exception into an observation.

In production

What changes at scale? First, tools are your attack surface: every tool is code the model can trigger with arguments the model chose — sandbox execution, allowlist what each tool may reach, and put a timeout on every call. Second, the loop multiplies cost: each iteration is a billed LLM call, so max_iterations is a budget cap, not just a safety guard. Third, make tools idempotent where you can: a retry after a timeout must not post the same order twice; searches are naturally safe, mutations need idempotency keys. Fourth, rate limits live on both sides — your LLM endpoint and each tool’s API. Fifth, prompts are config: version every system-prompt and tool-description edit and re-run the evals on each change — Module 11 stamps a PROMPT_VERSION onto every trace for exactly this reason. Finally, trace every tool call: name, arguments, latency, outcome. When an agent misbehaves, the tool-call trace is where the answer lives — Module 11 builds that observability properly.
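
As a small illustration of the last point, here is one way to record name, arguments, latency, and outcome for every tool call. It is a sketch under stated assumptions, not the observability Module 11 builds: traced is a hypothetical wrapper, the log is a plain in-memory list, and the stub tool stands in for web_search.

```python
import json
import time
from typing import Callable


def traced(name: str, tool: Callable[..., object], log: list) -> Callable[..., object]:
    """Wrap a tool so each call appends one trace record to log."""
    def wrapper(**arguments):
        started = time.perf_counter()
        record = {"tool": name, "arguments": arguments}
        try:
            result = tool(**arguments)
            record["outcome"] = "ok"
            return result
        except Exception as exc:
            record["outcome"] = f"error: {type(exc).__name__}"
            raise  # tracing must never swallow the failure
        finally:
            # Runs on success and on failure, so every call is logged once.
            record["latency_ms"] = round((time.perf_counter() - started) * 1000, 1)
            log.append(record)
    return wrapper


trace: list = []
stub_search = lambda query: [{"title": "stub", "url": "https://example.com"}]
search = traced("web_search", stub_search, trace)
search(query="NVIDIA GTC 2026 announcements")
print(json.dumps(trace[0]["arguments"]), trace[0]["outcome"])
```

The wrapper re-raises after logging, so it composes with a dispatcher that turns exceptions into observations: the trace records the failure, and the model still sees the error text.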

Exam corner

What the exam tests here. Per the official study guide, Agent Development (D2, 15% of the exam) covers: engineering prompts and dynamic prompt chains for reliability (2.1 — system prompt, tool descriptions, the loop’s accumulating messages); building and connecting custom tools, APIs, and functions (2.3 — the heart of this module); error handling with retry logic and graceful failure recovery (2.4); dynamic conversation flows with real-time streaming (2.5); and evaluating and refining agent decision-making (2.6 — at this stage: read the trace, adjust tool descriptions and iteration caps; formal evaluation is Module 8). The reasoning-and-action framework itself (ReAct) is objective 1.2 from the architecture domain — you’ve now implemented it.

Quiz — answers after question 5.

1. An agent’s transcript reads: (1) user question → (2) assistant message containing tool_calls → (3) ??? → (4) assistant message with the final answer. What happened at step 3?
- A) The model executed the tool and read the result itself
- B) Application code executed the tool and appended a role: "tool" message with the matching tool_call_id
- C) The tool result was inserted into the system prompt
- D) The transcript was reset so the model could start fresh
2. Mid-run, web_search raises a timeout exception. Which handling is best?
- A) Let the exception propagate: failing fast prevents wrong answers
- B) Catch it and return an empty result list so the run continues smoothly
- C) Catch it, return the error text to the model as the tool result, and retry transient failures with backoff, capped
- D) Catch it and retry the tool in a loop until it succeeds
3. Your agent answers a current-events question from its training data instead of calling the web_search tool sitting right there in tools=. The most likely cause?
- A) The temperature is set too high
- B) The tool’s description doesn’t tell the model when the tool applies
- C) The model is too small to use tools
- D) max_tokens is too low
4. An agent keeps searching (six, seven, eight tool calls), burning tokens without concluding. Which fix addresses the failure directly?
- A) Increase the context window so it can hold more results
- B) Remove the search tool so it has to answer
- C) Cap iterations and give the loop an explicit termination condition (no tool_calls → END)
- D) Raise max_tokens so it can think longer per turn
5. A ReAct agent starts with the working assumption that a product launched in 2025. Its first search returns documentation showing it launched in 2026. What does a correctly built ReAct agent do next?
- A) Completes its original plan, then mentions the discrepancy at the end
- B) Incorporates the observation into its next reasoning step and adjusts its next action accordingly
- C) Discards the run and restarts with a corrected assumption
- D) Stops and asks the user which year is correct

Answers. 1 — B. The model emits the request; application code executes it and replies with role: "tool" plus the matching tool_call_id — one reply per call. (A is the classic trap: the model never executes anything. C is wrong: results go into the conversation, not the system prompt.) 2 — C. Surfacing the error as an observation lets the model adapt — rephrase, try another approach, or explain the failure — and a capped backoff handles transience. A turns a hiccup into an outage, B hides the failure (the model may conclude “no results exist”), D is an infinite retry — a failure mode, not a strategy. 3 — B. Tool selection is driven by the schema text the model reads. A vague description gives the model no reason to prefer the tool over its own parametric knowledge. Temperature and model size can contribute; the description is the first, cheapest, most likely fix. 4 — C. The direct fix for non-termination is a termination guard: max_iterations plus an explicit stop condition — in the graph, the conditional edge routing to END. A and D make each wander longer; B amputates the agent instead of bounding it. 5 — B. ReAct’s defining property is that each reasoning step conditions on the latest observation — the agent adjusts mid-run rather than executing a fixed plan. (C wastes the work; D abdicates a decision the evidence already settled.)

Traps to avoid:

“The model executes the tool.” It emits a structured request; your code executes. If an answer option has the model running anything, it’s bait.
“Function calling” vs “tool calling.” Same mechanism; the exam and the blueprint say tool calling.
The dangling tool call. Every tool_call expects exactly one role: "tool" reply with its tool_call_id. Forget the reply — or the id — and the next API call rejects the transcript. “The loop worked once, then errored” scenarios often hinge on this.
More prompt instructions ≠ more reliability. Every system-prompt edit is an unevaluated deploy: version it and re-run the evals (Module 8) before trusting it.
Streaming improves perceived latency, not total latency. Tokens arrive as they are generated, so the user sees progress sooner — the run takes just as long. It’s a UX lever, not a performance one.

Key takeaways

The LLM never executes tools. It emits a structured request; your code executes it and returns the result — you, not the model, control what can happen.
The lifecycle: schemas in via tools= → tool_calls out → your code executes → result back via role: "tool" + tool_call_id → continue; a reply without tool_calls is the final answer.
Tool descriptions are prompt engineering — when the agent picks the wrong tool or none at all, fix the description first.
Tool errors go back to the model as observations; retries get backoff and a hard cap, and every agent loop gets max_iterations.
The ReAct loop is a dynamic prompt chain: each iteration rebuilds the prompt by appending the latest observations to messages.
LangGraph buys you explicit state, declarative routing, and built-in streaming — the same loop, its moving parts promoted to inspectable structure.
ScoutState is born (question, messages) — from here on it only ever grows by adding fields.

Keep going

Scout works — but should it even be an agent? Next module: architecture patterns, trade-offs, and when a plain workflow beats an agent.

References

ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al., 2022; the founding paper behind the loop you just built.
LangGraph Graph API — official docs for StateGraph, nodes, edges, conditional edges, and reducers (LangGraph 1.x).
LangGraph streaming — stream modes, including custom mode with get_stream_writer for any LLM client.
NVIDIA NIM LLM APIs — the OpenAI-compatible endpoint at integrate.api.nvidia.com/v1 used by every Scout call.
Tavily Search API — the endpoint behind Scout’s web_search tool.
NCP-AAI certification page — the official blueprint; Agent Development is weighted at 15%.