The NVIDIA Agentic Stack: NIM, NeMo, and Nemotron in Practice (NCP-AAI Module 12)
This is Module 12 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.
You have been building on the NVIDIA stack since Scout’s first API call, eleven modules ago — and you have probably never seen the whole map. Here is the exam question that punishes that: “Which NVIDIA product curates training data?” If you just hesitated between Curator, Customizer, and Evaluator, you left points on the table. NVIDIA Platform Implementation is only 7% of the NCP-AAI exam, but these are the cheapest points on the paper: pure product mapping, no scenario gymnastics — if you have the map. This module names every brick Scout already stands on, fills the gaps you haven’t touched (TensorRT-LLM, Triton, the NeMo microservices), then replaces feelings with numbers: you’ll profile Scout with NVIDIA’s own toolkit and benchmark two Nemotron models on your Module 8 golden set. By the end, “which product does what” is the easiest part of your exam.
In this module
- You’ll learn:
  - Map the NVIDIA agentic platform end to end (hardware → inference engine → serving → NIM → Nemotron → NeMo suite → NeMo Agent Toolkit) and place every brick Scout already uses.
  - Deploy a self-hosted NIM (NVIDIA Inference Microservices) as a guided concept: what the container ships, and when it beats the hosted endpoints.
  - Optimize an agentic workflow with the NeMo Agent Toolkit: profile tokens and latency per step, name the bottleneck.
  - Leverage TensorRT-LLM and Triton conceptually: which cuts latency, how, and which serves.
  - Compare two Nemotron 3 models on the Module 8 golden set and assign a model to each role with data.
  - Build agents on multimodal generative models (text, vision, audio), with an optional vision call to Nemotron 3 Omni.
- You’ll build: Profile Scout with the NeMo Agent Toolkit and benchmark two Nemotron 3 models on your Module 8 golden set.
- Exam domains covered: D7 — NVIDIA Platform Implementation — 7% of the exam; D2 (partial) — Agent Development — objective 2.2 (multimodal models).
- Prerequisites: Modules 1–11; the Module 8 golden set and judge (the comparison runs on them); NVIDIA API key configured. New setup: nvidia-nat via uv (laptop only, nothing paid).
Where you are
- ✅ Modules 1–7 — from a first NIM call to a supervisor team with Planner, Searcher, Reader, Fact-checker, Writer
- ✅ Modules 8–11 — evals and golden set, guardrails + HITL, FastAPI + Docker, tracing and cost observability
- 👉 Module 12 — The NVIDIA Agentic Stack (you are here)
- ⬜ Module 13 — capstone: assemble, harden, ship Scout v1.0
- ⬜ Module 14 — the exam: strategy, mock exam, debrief
Scout before: complete and observable, but never profiled — running on the model it started with, chosen by default. Scout after: profiled with NVIDIA’s official toolkit, bottleneck named, a model per role chosen with numbers — and you hold the platform map that Domain 7 tests.
The NVIDIA agentic platform, mapped
The platform is a layer cake, and you have been eating from the middle of it since Module 1. Read it bottom-up:
flowchart BT
HW["Hardware — DGX systems, cloud GPUs<br/>(Brev: GPUs on demand · LaunchPad: NVIDIA's guided-lab environment)"]
TRT["TensorRT-LLM — compiles & optimizes the model<br/>(quantization, KV cache, fused kernels)"]
TRITON["Triton Inference Server — serves models at scale<br/>(in-flight batching, multi-model, metrics)"]
NIM["NIM — model + engine + server in one container,<br/>OpenAI-compatible API · Scout: every LLM call since M1"]
MODELS["Nemotron 3 — nano / super / ultra (+ nano-omni)<br/>Scout: worker since M1, judge since M8"]
NEMO["NeMo suite — Guardrails (M9) · Retriever (M6) ·<br/>Curator · Customizer · Evaluator (concept)"]
NAT["NeMo Agent Toolkit — profile, evaluate, connect<br/>Scout: M12 (this module)"]
HW --> TRT --> TRITON --> NIM --> MODELS --> NEMO --> NAT
Bottom-up: silicon; the engine that optimizes a model for it; the server that serves it; the container that packages all three; the models; the product suite around them; the toolkit that measures the whole thing.
The table below is the highest-value Domain 7 asset — the exam asks “which product does X” in almost exactly this shape:
| Product | What it does (one line) | Where Scout uses it |
|---|---|---|
| NIM | Containerized inference microservice: model + optimized engine + server behind an OpenAI-compatible API | Every LLM call since Module 1 (hosted) |
| Nemotron | NVIDIA’s open model family built for agentic work (reasoning, tool calling) | Worker since M1; judge since Module 8 |
| NeMo Guardrails | Programmable input/dialog/output rails for LLM apps | Module 9 |
| NeMo Retriever | Embedding and reranking NIMs for retrieval pipelines | Module 6 |
| NeMo Curator | Curates training data: dedup, filtering, cleaning at scale (pip library, partly CPU) | Concept (this module) |
| NeMo Customizer | Managed fine-tuning of models (microservice; Kubernetes + GPU) | Concept (this module) |
| NeMo Evaluator | Managed evaluation of models and pipelines (microservice; Kubernetes + GPU) | Concept — Scout uses its own M8 harness |
| NeMo Agent Toolkit | Framework-agnostic profiling, evaluation, and connection layer for agent workflows | This module’s lab |
| TensorRT-LLM | Compiles and optimizes a model for a target GPU (quantization, KV cache, fused kernels) | Concept — inside every NIM Scout calls |
| Triton Inference Server | Serves models in production: in-flight batching, multi-model, metrics | Concept — inside every NIM Scout calls |
| Brev / DGX | GPU access: on-demand cloud GPUs / NVIDIA’s AI infrastructure systems | Concept — the lab’s self-host walkthrough |
Two reading rules defuse most D7 traps. First, NeMo is a family of distinct products, not a product — “we used NeMo” is as precise as “we used AWS”. Second, the layers don’t substitute: Triton doesn’t optimize, TensorRT-LLM doesn’t serve, and a NIM is neither a model nor a website.
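The mapping skill can be drilled as a plain lookup. A minimal sketch that only restates the one-line jobs from the table above; `PRODUCT_FOR_NEED` and `product_for` are illustrative names invented here, not part of any NVIDIA SDK.

```python
# Study aid: need -> product, restated from the product table above.
# The dictionary and helper names are hypothetical.
PRODUCT_FOR_NEED = {
    "curate training data": "NeMo Curator",
    "fine-tune a model": "NeMo Customizer",
    "evaluate models and pipelines": "NeMo Evaluator",
    "block forbidden topics at runtime": "NeMo Guardrails",
    "embed and rerank for retrieval": "NeMo Retriever",
    "profile an agent workflow": "NeMo Agent Toolkit",
    "optimize a model for a target gpu": "TensorRT-LLM",
    "serve models under concurrent load": "Triton Inference Server",
    "package model, engine and server behind one api": "NIM",
}


def product_for(need: str) -> str:
    """Return the product for a need; raises KeyError if the need is unmapped."""
    return PRODUCT_FOR_NEED[need.strip().lower()]


if __name__ == "__main__":
    print(product_for("Curate training data"))
```

Keeping one need per product mirrors the second reading rule: the layers don’t substitute, so each need should resolve to exactly one name.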
NIM: what’s actually in the box — hosted vs self-hosted
NVIDIA Inference Microservices (NIM) are containerized inference
services: one Docker container packaging a model, an inference engine
pre-optimized for the GPU it detects at startup, a production server, and
an OpenAI-compatible API on top — the same /v1/chat/completions surface
you’ve been calling all course.
Two components inside have names the exam expects you to keep straight. TensorRT-LLM is the compiler and optimizer: it takes model weights and builds a GPU-specific engine, applying quantization (lower-precision weights and activations — smaller, faster math at a controlled accuracy cost), KV-cache management (reusing the attention keys and values already computed instead of recomputing them for every token), and fused kernels (several GPU operations merged into a single launch). It makes each forward pass cheaper. Triton Inference Server is the server: it loads engines and handles production serving — in-flight batching (new requests join a running batch instead of waiting for it to finish, lifting GPU utilization and throughput), multi-model hosting, metrics. One optimizes the model; the other runs the traffic. A latency question is usually a TensorRT-LLM question; a throughput-under-concurrency question is usually a Triton question.
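To make the quantization point concrete, here is back-of-the-envelope arithmetic for weight storage at different precisions. It is a rough sketch under stated assumptions: it counts weights only (no KV cache, activations, or runtime overhead), and the 30-billion-parameter size is just an example input, not a measured NIM footprint.

```python
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Example size only; plug in the parameter count of the model you care about.
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label}: {weight_gb(30, bits):.0f} GB of weights")
```

Halving the bits halves the bytes the GPU has to move per forward pass, which is why a lower-precision engine tends to help single-request latency, at the accuracy cost the paragraph above mentions.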
So what is build.nvidia.com? NVIDIA hosting these same NIMs and renting you the API: identical surface, zero infrastructure, free-tier rate limits (40 req/min as of June 2026). Module 10 already gave you the decision framework; here it is with NVIDIA names:
| Criterion | Hosted endpoints (build.nvidia.com) | Self-hosted NIM |
|---|---|---|
| Cost model | Variable: per-call/credits, zero idle cost | Fixed: GPU runs whether busy or not |
| Data residency | Prompts and outputs leave your network | Everything stays inside your perimeter |
| Latency | Internet round-trip, shared capacity | Local network, dedicated GPU |
| Throughput | Rate-limited (40 req/min free tier, June 2026) | Your hardware is the limit |
| Ops burden | None | Containers, drivers, GPU capacity, upgrades |
| Best when | Prototyping, spiky/low volume, no residency constraint | Compliance, air-gapped networks, sustained volume |
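The throughput row is easy to quantify. A small sketch, assuming the 40 req/min free-tier limit quoted above and roughly 13 LLM calls per Scout question (the call count in this module’s profiled run); both are inputs to replace with your own numbers, and the function names are illustrative.

```python
def max_questions_per_minute(rate_limit_rpm: int, calls_per_question: int) -> float:
    """Upper bound on agent runs per minute under a request-rate limit."""
    if calls_per_question <= 0:
        raise ValueError("calls_per_question must be positive")
    return rate_limit_rpm / calls_per_question


def minutes_for_batch(questions: int, rate_limit_rpm: int, calls_per_question: int) -> float:
    """Minimum wall-clock minutes imposed by the rate limit alone."""
    if rate_limit_rpm <= 0:
        raise ValueError("rate_limit_rpm must be positive")
    return questions * calls_per_question / rate_limit_rpm


if __name__ == "__main__":
    print(f"{max_questions_per_minute(40, 13):.2f} questions/min at most")
    print(f"{minutes_for_batch(100, 40, 13):.1f} min floor for 100 questions")
```

This is only a ceiling from the rate limit; model latency usually binds first at low volume, which is why hosted endpoints stay a good fit for prototyping.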
The Nemotron family: picking the right model for each role
Nemotron is NVIDIA’s open model family tuned for agentic work —
reasoning, tool calling, long context. The current generation is
Nemotron 3, a Mixture-of-Experts (MoE) lineup: only a fraction of the
weights — the active parameters, the a3b in the IDs — fires per
token, so cost and latency track a dense model of that active size, not
the headline one. Two of its hosted NIMs have lived in config.py all
course; ultra completes the family:
| Model | Shape | Strengths | Typical agent role | Cost / latency |
|---|---|---|---|---|
| nvidia/nemotron-3-nano-30b-a3b | MoE, ~3B active params | Fast, cheap, strong tool calling, 1M context | High-volume routing, tool loops, workers | Lowest |
| nvidia/nemotron-3-super-120b-a12b | MoE, ~12B active params | Deeper multi-step reasoning, synthesis | Hard reasoning steps; LLM-as-judge | Mid |
| nvidia/nemotron-3-ultra-550b-a55b | MoE, ~55B active params | Strongest reasoning in the family | Hardest problems, offline/batch work | Highest |
The lineup is role-based, not “good/better/best for everything” — exactly
how the exam frames model-selection questions, and exactly how the lab
makes you decide: nano has been Scout’s worker since Module 1, super the
judge since Module 8, and today you’ll test whether those defaults survive
contact with your golden set. Because model names live only in config.py
(the contract frozen in Module 1), the swap is a one-line change — twelve
modules of discipline paying off in one lab.
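The naming convention in those IDs can be read mechanically. A minimal sketch, assuming the `<total>b-a<active>b` pattern seen in the three IDs above holds; `moe_sizes` is a hypothetical helper, not something the labs repo is known to ship.

```python
import re

# Matches the "-30b-a3b" style suffix: headline size, then active size.
_SIZES = re.compile(r"-(?P<total>\d+)b-a(?P<active>\d+)b")


def moe_sizes(model_id: str) -> tuple[int, int]:
    """Return (headline billions, active billions) parsed from a model ID."""
    match = _SIZES.search(model_id)
    if match is None:
        raise ValueError(f"no '<N>b-a<M>b' size suffix in {model_id!r}")
    return int(match["total"]), int(match["active"])


if __name__ == "__main__":
    for model_id in (
        "nvidia/nemotron-3-nano-30b-a3b",
        "nvidia/nemotron-3-super-120b-a12b",
        "nvidia/nemotron-3-ultra-550b-a55b",
    ):
        total, active = moe_sizes(model_id)
        print(f"{model_id}: {active}B of {total}B active per token")
```

The active figure is the one to compare when estimating cost and latency per token; the headline figure still matters for how much memory the weights occupy.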
Market context, as of June 2026: at GTC 2026 NVIDIA announced the Nemotron Coalition — Nemotron 4 co-developed with partners including Mistral, LangChain, Cursor, and Perplexity — alongside native Nemotron-plus-NAT (NeMo Agent Toolkit) integration in LangChain. For the exam, generation 3 is what the hosted catalog serves.
NeMo Agent Toolkit: profiling your agent like NVIDIA does
The NeMo Agent Toolkit (NAT) is NVIDIA’s framework-agnostic layer for
working across agent stacks: it connects to workflows written in
LangGraph, CrewAI, LlamaIndex and others, and gives you a profiler
(runtimes, token usage, bottleneck analysis per step), an evaluator, and
MCP client/server support. It doesn’t replace your framework — Scout stays
pure LangGraph — it wraps it and measures it. Install is pip install nvidia-nat; the labs pin ~=1.7.0 (i.e. >=1.7.0,<1.8) because the
toolkit releases monthly and the API moves.
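What the `~=1.7.0` pin accepts can be checked by hand. The sketch below is a simplified illustration of the compatible-release rule for plain numeric versions only; real resolvers follow PEP 440 in full (pre-releases, post-releases, epochs), and the function names here are invented for the example.

```python
def _parts(version: str) -> tuple[int, ...]:
    """Split a plain numeric version such as '1.7.0' into integers."""
    return tuple(int(p) for p in version.split("."))


def _pad(parts: tuple[int, ...], width: int) -> tuple[int, ...]:
    """Right-pad with zeros so 1.7 and 1.7.0 compare as equal."""
    return parts + (0,) * (width - len(parts))


def satisfies_compatible_release(candidate: str, pinned: str) -> bool:
    """True if `candidate` satisfies `~=pinned` (numeric versions only)."""
    pin = _parts(pinned)
    if len(pin) < 2:
        raise ValueError("'~=' needs at least two version components")
    cand = _parts(candidate)
    width = max(len(cand), len(pin))
    cand, pin_padded = _pad(cand, width), _pad(pin, width)
    prefix = pin[:-1]  # ~=1.7.0 locks the '1.7' prefix
    return cand >= pin_padded and cand[: len(prefix)] == prefix


if __name__ == "__main__":
    for v in ("1.6.9", "1.7.0", "1.7.4", "1.8.0"):
        print(v, satisfies_compatible_release(v, "1.7.0"))
```

Patch releases in the 1.7 line pass and the next minor release is rejected, which is the behavior you want from a toolkit whose API moves between releases.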
One naming warning, genuinely exam-relevant: NAT used to be called AgentIQ, then the Agent Intelligence Toolkit (AIQ) — and the official study guide’s recommended readings still use the old names and old doc URLs. Same product, renamed. Don’t let the rename cost you a known answer.
Draw the boundary with Module 11 precisely, because the two tools look superficially similar. Langfuse is continuous production observability: always on, every request, dashboards, alerts — the flight recorder. The NAT profiler is point-in-time optimization analysis: a controlled batch, run on purpose, to find the bottleneck before you change something — the wind tunnel. You need both; Domain 7’s objective 7.3 (“optimize workflows with the NeMo Agent Toolkit”) is about the second.
flowchart LR
DS["profile_dataset.json<br/>(2 golden-set questions)"] --> EVAL["nat eval<br/>(profiler enabled)"]
EVAL --> FN["scout_research<br/>(M12 wrapper)"]
FN --> G["supervisor loop (M7): Planner (+critic) → auto-approve →<br/>Searcher → Reader → Fact-checker → Writer"]
G --> OUT["profile_output/<br/>workflow runtimes · bottleneck report"]
G -."per-node tokens & latency (plan B)".-> LF["Langfuse + usage log (M11)"]
The profiled run: NAT drives Scout through a wrapper and writes the execution profile; the per-node numbers come from NAT when it can see them, from the Module 11 instrumentation when it can’t.
That dotted line is an engineering note, not a footnote. NAT’s automatic
LLM instrumentation hooks framework layers — its LangChain handler sees
LLM and tool calls made through LangChain (likewise for LlamaIndex,
CrewAI…). Scout’s nodes call the raw openai SDK directly — a
deliberate course choice — so NAT profiles the workflow boundary (runtime
per question, concurrency, forecast) and sees nothing inside: no per-step
spans, no tokens. Plan B, assumed from the start: NAT for the
workflow-level execution profile, Module 11 for the per-node breakdown —
tokens and latency from the local usage log, the same numbers on every
Langfuse trace. The lab prints which path applied to your run. Knowing
where your instrumentation sits in the stack is half of observability.
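What a per-node usage log amounts to can be sketched in a few lines. This is a hypothetical illustration, not Scout’s actual Module 11 code: `USAGE_LOG`, `logged_node`, and the fake `searcher` node are invented names, and the token counts are hard-coded where a real node would read them from the API response’s usage field.

```python
import time
from functools import wraps

USAGE_LOG: list[dict] = []  # one entry per node call


def logged_node(name: str):
    """Record tokens and wall-clock seconds for a node that returns (result, usage)."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result, usage = fn(*args, **kwargs)
            USAGE_LOG.append({
                "node": name,
                "in_tok": usage["prompt_tokens"],
                "out_tok": usage["completion_tokens"],
                "sec": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@logged_node("searcher")
def searcher(question: str):
    # Stand-in for a raw-SDK LLM call; a real node would return the
    # model output plus the usage numbers reported by the API.
    return f"results for {question!r}", {"prompt_tokens": 729, "completion_tokens": 171}


if __name__ == "__main__":
    searcher("What is a NIM?")
    print(USAGE_LOG[-1])
```

Because the measurement sits at the node itself, it works no matter which SDK the node calls, which is exactly the gap a framework-level hook cannot cover.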
The rest of the map: NeMo microservices and when this stack wins
Three NeMo products complete the map, and you should know them as concepts — they are generally available (GA), but NeMo Customizer (managed fine-tuning) and NeMo Evaluator (managed evaluation) run as microservices on Kubernetes with GPUs: real enterprise infrastructure, not laptop material. NeMo Curator (training-data curation: deduplication, filtering, quality scoring at corpus scale) is the partial exception — it ships as a pip library and much of it runs on CPU, though it’s still a data-pipeline tool, not an agent-runtime one.
Together they form NVIDIA’s data flywheel — a loop where production data improves the models that produce it: Curator prepares the data your deployed agents generate, Customizer fine-tunes on it, Evaluator validates the result, and the improved model redeploys behind the same NIM API. One paragraph here; one mapping question on the exam.
So when does the NVIDIA stack win? My honest read, criteria not cheerleading:
- Data residency or air-gapped requirements — self-hosted NIMs are the cleanest “same API, inside the perimeter” story available.
- Owned GPUs or committed volume — TensorRT-LLM engines extract real performance from hardware you’re already paying for.
- Enterprise support, one-vendor accountability — the suite is built to be bought and supported together.
- Already on Nemotron-class open models — integration cost near zero, as Scout proves.
And when it doesn’t: a small team on hosted APIs with no residency constraint gains little from self-hosting. The decision is the Module 10 framework, not loyalty.
Multimodal agents with Nemotron 3 Omni
Domain 2’s objective 2.2 — “Integrate generative and multimodal models (text, vision, audio)” — lands in this module, because the platform answer is an NVIDIA one. A multimodal model accepts more than text — images, audio, video — in its input messages and reasons over them jointly with text. The key architectural insight: the agent pattern does not change. Messages gain image or audio parts; tool calling, state, the supervisor loop, guardrails — all identical. Multimodality is an input-type upgrade, not a new architecture.
Where it touches a pipeline like Scout’s (objective 7.5’s territory): ingestion. A PDF with diagrams defeats a text-only Reader — the platform play is a vision-capable model interpreting figures at ingestion or query time, while the NeMo Retriever embedding NIM indexes the text. Speech adds audio NIMs at the edges (recognition in, synthesis out). The mapping — which component handles which modality where — is the exam skill.
The hosted catalog serves one Omni model as of June 2026, and the exact
ID matters — there is no bare “nemotron-3-omni”:
nvidia/nemotron-3-nano-omni-30b-a3b-reasoning (text, image, audio, video
in; text out). An optional taste, outside Scout’s flow, laptop only — the
snippet ships standalone as module-12/extras/vision_demo.py in the labs
repo (uv run resolves openai from the project). Put a PNG next to it
(a screenshot works — swap architecture.png for its name), then from
module-12/extras/ run uv run python vision_demo.py with
NVIDIA_API_KEY exported in your shell — this standalone snippet doesn’t
read .env:
# vision_demo.py: one vision call to the hosted Omni NIM (optional)
import base64
import os

from openai import OpenAI

OMNI_MODEL = "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"  # the ONLY hosted Omni ID (June 2026)

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

# Read the PNG and base64-encode it so it can travel inside a data URL.
with open("architecture.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model=OMNI_MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this architecture diagram in two sentences."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=2048,  # Omni reasons before answering, so keep headroom (Module 1's trap)
)
print(response.choices[0].message.content)
Same client, same endpoint, same message shape with one extra content part. That’s the whole point.
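If you want to feed the demo something other than a PNG, the image part can be built by a small helper. A minimal sketch: `image_part` is a hypothetical name, and whether the hosted endpoint accepts formats other than PNG is an assumption to verify, not something this module confirms.

```python
import base64
import mimetypes
from pathlib import Path


def image_part(path: str) -> dict:
    """Build an OpenAI-style image_url content part from a local image file."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    # Same shape as the image part in vision_demo.py, with the MIME type inferred.
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}}
```

The returned dictionary drops into the `content` list next to the text part; nothing else in the request changes.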
Hands-on lab: build it
Objective: profile Scout with the NeMo Agent Toolkit, then compare two
Nemotron 3 models on the Module 8 golden set so each role runs on a model
you chose with numbers. The full lab lives in
module-12/
of the labs repo.
Observable result: profile_scout.py prints the per-step runtime
breakdown and names the bottleneck; compare_models.py prints a
nano-vs-super table (judge score, tokens, seconds per question) with the
fixed-judge and bias caveats. Everything runs on a laptop against hosted
NIMs.
Step 1 — Install the toolkit
uv add --dev "nvidia-nat[eval,profiler,langchain]~=1.7.0" # strict pin: monthly releases move the API
uv add --dev --editable module-12/nvidia-variant # registers scout_research with the nat CLI
(If you cloned the labs repo, uv sync already did both.) nvidia-nat is
a meta-package: nat eval, the profiler, and the LangChain hooks each
live in an extra — hence the bracket list. The second line installs a
tiny plugin package as an editable dev dependency — the documented
way the nat CLI discovers custom workflows (entry point group
nat.components). Both lines target the dev group, kept out of the
runtime dependencies so the Module 10 Docker image (uv sync --no-dev)
never ships a profiler.
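A quick way to confirm the install landed in the environment you are actually running is to ask the standard library for the installed distribution’s version. A minimal sketch; `installed_version` is an illustrative helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("nvidia-nat")
    print("nvidia-nat:", found or "not installed in this environment")
```

Run it with `uv run python` so the check uses the project environment rather than your system interpreter.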
Step 2 — Register Scout without touching it
The entire NAT↔Scout bridge is one config class and one decorated
function in nvidia-variant/nat_scout/register.py:
# Import paths as published for nvidia-nat 1.x; verify against your pinned version.
from nat.builder.builder import Builder
from nat.builder.framework_enum import LLMFrameworkEnum
from nat.cli.register_workflow import register_function
from nat.data_models.function import FunctionBaseConfig


class ScoutResearchConfig(FunctionBaseConfig, name="scout_research"):
    """`workflow._type: scout_research` in nat_config.yml maps to this."""


@register_function(
    config_type=ScoutResearchConfig,
    # Scout is LangGraph: this hooks NAT's profiler into the LangChain
    # callback layer. That layer only sees LLM/tool calls made THROUGH
    # LangChain. Scout's nodes use the raw `openai` SDK, so per-node
    # numbers come from Module 11. Plan B, by design.
    framework_wrappers=[LLMFrameworkEnum.LANGCHAIN],
)
async def scout_research(config: ScoutResearchConfig, builder: Builder):
    # Imported lazily so registering the plugin does not load Scout itself.
    from nat_scout.runner import run_research

    async def _respond(question: str) -> str:
        state = run_research(question)
        return state.get("report") or "(no report produced)"

    yield _respond
run_research() runs the frozen graph, auto-approves the Module 9 plan
interrupt with the frozen payload (approve, never edit, so you profile the
trajectory a user would approve), and isolates each run in a fresh
temp-dir knowledge base — the Module 8 harness’s hygiene, reused.
scout/ is not modified; nat_config.yml has no llms: section
because Scout reads its model from config.py — the one-file rule
survives a new toolkit.
Step 3 — Profile two golden-set questions
cd module-12
uv run python nvidia-variant/profile_scout.py --limit 2
The script builds a tiny dataset from your local evals/golden_set.json
(the frozen Module 8 copy — never re-authored), runs nat eval with the
profiler enabled, and prints what it found. A real run, June 2026:
-- Top 5 Calls by Bottleneck Score (subtree_time) --
1) UUID=..., FUNCTION '<workflow>', dur=244.00, self_time=244.00, subtree_time=244.00, concurrency=1.0, score=244.00
2) UUID=..., FUNCTION '<workflow>', dur=95.72, self_time=95.72, subtree_time=95.72, concurrency=1.0, score=95.72
Plan B (expected with raw-SDK nodes): NAT profiled the workflow
boundary but saw neither tokens nor per-step spans — its framework
hooks instrument LangChain calls, and Scout's nodes use the raw
`openai` SDK. Per-node numbers come from Module 11 instead:
--- per-node profile of the LAST question (Module 11 usage log) --------
node calls in tok out tok est. $ sec
----------------------------------------------------------
supervisor 5 4850 3508 0.0038 27.1
fact_checker 2 2717 2157 0.0023 13.2
planner 2 1106 2752 0.0024 15.5
writer 1 1192 2659 0.0024 13.0
critic 1 458 818 0.0007 4.9
reader 1 613 618 0.0006 4.5
searcher 1 729 171 0.0003 1.3
----------------------------------------------------------
TOTAL 13 11665 12683 0.0125 79.5
NAT’s rows are workflow-level — one span per question — because nothing inside Scout goes through the layer NAT instruments; the per-node table is Module 11 earning its keep. Note what the data did to the obvious guess: the expected bottleneck was the Reader or the Writer; on this run the biggest LLM-time spender is the supervisor — five routing turns at ~5 s each. Profiling exists to confirm, before you optimize.
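Naming the bottleneck from that table is a few lines of arithmetic. The sketch below copies the per-node rows from the run above and ranks by LLM seconds; the `ROWS` layout and `bottleneck` helper are illustrative, not the lab script’s actual code.

```python
# (node, calls, in_tok, out_tok, seconds), copied from the per-node table above.
ROWS = [
    ("supervisor", 5, 4850, 3508, 27.1),
    ("fact_checker", 2, 2717, 2157, 13.2),
    ("planner", 2, 1106, 2752, 15.5),
    ("writer", 1, 1192, 2659, 13.0),
    ("critic", 1, 458, 818, 4.9),
    ("reader", 1, 613, 618, 4.5),
    ("searcher", 1, 729, 171, 1.3),
]


def bottleneck(rows):
    """Return the node with the most LLM seconds, its share, and seconds per call."""
    total_sec = sum(row[4] for row in rows)
    node, calls, _in_tok, _out_tok, sec = max(rows, key=lambda row: row[4])
    return {"node": node, "share": sec / total_sec, "sec_per_call": sec / calls}


if __name__ == "__main__":
    top = bottleneck(ROWS)
    print(f"{top['node']}: {top['share']:.0%} of LLM time, "
          f"{top['sec_per_call']:.1f} s per call")
```

Seconds per call separates two different fixes: a node that is slow per call needs a faster model or a shorter prompt, while a node that is called often needs fewer turns.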
Step 4 — nano vs super, judged
uv run python nvidia-variant/compare_models.py --limit 5
Five golden-set questions, one full Scout run per question per candidate.
The swap happens through config.CANDIDATE_MODELS — model names still
live in exactly one file. The Module 8 judge stays fixed for both runs —
our June 2026 run, in full:
model question grounding coverage citations tokens seconds
---------------------------------------------------------------------
nano r01 3 3 3 26378 102.1
nano f01 5 3 5 25216 96.7
nano c01 1 3 1 30570 144.5
nano m01 3 1 5 29680 154.0
nano f02 5 1 5 27773 133.4
nano MEAN 3.40 2.20 3.80 27923 126.1
super r01 5 3 5 22884 944.4
super f01 3 3 3 20189 487.1
super c01 3 3 3 23123 481.8
super m01 5 1 5 29064 850.7
super f02 5 3 5 29399 1038.6
super MEAN 4.20 2.60 4.20 24932 760.5
judge fixed for both runs: nvidia/nemotron-3-super-120b-a12b
caveat (Module 8): the judge shares weights with the 'super' candidate —
read that column with self-preference bias in mind.
That caveat is methodology, not decoration: the judge is the super model, so read its +0.8 grounding lift through that lens. Two findings survive the bias. Super used fewer tokens than nano (~25 k vs ~28 k — fewer wasted turns): “bigger model = bigger bill” is not automatic. And it paid sixfold in latency: 760 s mean per run against nano’s 126. Our deliverable sentence: “nano stays the worker — six times faster for most of the quality; super’s bias-tinted lift is worth it only where minutes don’t matter, and it keeps the judge seat.” Yours may differ — if the data says so.
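The headline numbers in that paragraph can be re-derived from the table. A minimal sketch using the grounding, token, and seconds columns from the run above; the `RUNS` structure and `summarize` helper are illustrative names, not the lab’s code.

```python
from statistics import mean

# (grounding, tokens, seconds) per question, copied from the comparison table above.
RUNS = {
    "nano": [(3, 26378, 102.1), (5, 25216, 96.7), (1, 30570, 144.5),
             (3, 29680, 154.0), (5, 27773, 133.4)],
    "super": [(5, 22884, 944.4), (3, 20189, 487.1), (3, 23123, 481.8),
              (5, 29064, 850.7), (5, 29399, 1038.6)],
}


def summarize(rows):
    """Mean grounding score, tokens, and seconds over a model's runs."""
    return {
        "grounding": mean(r[0] for r in rows),
        "tokens": mean(r[1] for r in rows),
        "seconds": mean(r[2] for r in rows),
    }


if __name__ == "__main__":
    nano, sup = summarize(RUNS["nano"]), summarize(RUNS["super"])
    print(f"grounding lift: {sup['grounding'] - nano['grounding']:+.2f}")
    print(f"token ratio (super/nano): {sup['tokens'] / nano['tokens']:.2f}")
    print(f"latency ratio (super/nano): {sup['seconds'] / nano['seconds']:.1f}x")
```

With five questions per model these means are indicative at best; treat small gaps as noise until a larger sample confirms them.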
Step 5 — Guided concept: self-hosting a NIM (read, don’t pay)
The lab’s final section walks through docker login nvcr.io (NGC,
NVIDIA’s container registry), running a
NIM container on a rented GPU (an NVIDIA Brev L4 runs ~$0.44–0.80/h as of
June 2026), and pointing Scout at it — which is
BASE_URL = "http://localhost:8000/v1" in config.py and nothing else.
Marked optional and paid; read it in
lab.md
so you can answer 7.2 questions from understanding, not memory.
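The fixed-cost side of the decision is simple arithmetic. A small sketch using the hourly range quoted above and assuming an always-on GPU over a 30-day month; `monthly_gpu_cost` is an illustrative helper, and storage, egress, and ops time are left out.

```python
def monthly_gpu_cost(hourly_usd: float, hours_per_day: float = 24, days: int = 30) -> float:
    """GPU rental cost for a month; the GPU bills whether it is busy or idle."""
    if hourly_usd < 0 or hours_per_day < 0 or days < 0:
        raise ValueError("inputs must be non-negative")
    return hourly_usd * hours_per_day * days


if __name__ == "__main__":
    low, high = monthly_gpu_cost(0.44), monthly_gpu_cost(0.80)
    print(f"always-on: ${low:.2f} to ${high:.2f} per month")
    print(f"8 h/day:   ${monthly_gpu_cost(0.44, 8):.2f} to ${monthly_gpu_cost(0.80, 8):.2f}")
```

Compare that figure with what the same traffic would cost on hosted endpoints; if residency is not a constraint, volume is what decides.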
Try it yourself (no solution provided):
- Probe the self-preference bias: set JUDGE_MODEL = "nvidia/nemotron-3-ultra-550b-a55b" (one line in config.py) and re-run with --limit 2. Do the nano/super gaps move? Revert the JUDGE_MODEL line afterwards: the smoke test pins it to super (test_candidate_models_contract).
- Close the plan-B loop: in Langfuse, put each node’s wall-clock span beside its LLM seconds from the per-node table. Which node is slow because of the model, and which because of the network?
Exam corner
What the exam tests here. Per the official blueprint, NVIDIA Platform Implementation is 7% of the exam. The study guide’s objectives: integrate NeMo Guardrails for compliance and safety (7.1 — practiced in Module 9; here it’s placed on the map); deploy NIM microservices for high-performance inference (7.2); optimize workflows with the NeMo Agent Toolkit (7.3 — this lab); leverage TensorRT-LLM and Triton for latency reduction (7.4); manage multimodal input pipelines on NVIDIA hardware (7.5). This module also owns Domain 2’s objective 2.2 (multimodal models). Expect mapping questions: a need, four products, one right pairing.
Quiz — answers after question 5.
1. A team needs to: (a) block forbidden topics in agent output at runtime, (b) find which step of an agentic workflow burns the most tokens, (c) deduplicate and clean a 2 TB crawl before fine-tuning. Which products, in order?
   - A) NeMo Evaluator, NeMo Customizer, NeMo Curator
   - B) NeMo Guardrails, NeMo Agent Toolkit, NeMo Curator
   - C) NeMo Guardrails, NeMo Evaluator, NeMo Customizer
   - D) NeMo Agent Toolkit, Triton Inference Server, NeMo Curator
2. A hospital network requires that prompts and outputs never leave its own network. Its agents serve internal staff at moderate volume. Best inference setup?
   - A) build.nvidia.com hosted endpoints with PII-masking guardrails
   - B) NeMo Customizer, since fine-tuned models don’t need external calls
   - C) Self-hosted NIM containers on the hospital’s GPUs: same OpenAI-compatible API, data stays inside the perimeter
   - D) Triton Inference Server alone, since it has no external dependencies
3. A self-hosted GPU inference service shows p95 latency far above target on single, sequential requests. Which lever acts at the right level?
   - A) Enable in-flight batching in Triton
   - B) Serve a TensorRT-LLM engine with a quantized (lower-precision, e.g., 8-bit FP8) profile
   - C) Add more API worker processes in front of the GPU
   - D) Switch to a larger model for better answers per call
4. Assign Nemotron 3 models under a cost ceiling: (1) high-volume intent router, (2) LLM-as-judge for the eval harness, (3) latency-tolerant multi-document synthesis requiring hard reasoning.
   - A) ultra for all three: strongest model, fewest surprises
   - B) super for the router; nano as judge; ultra for the synthesis
   - C) nano for all three: cheapest always wins
   - D) nano for the router; super as judge; super for the synthesis
5. A pipeline ingests PDFs full of diagrams; the agent must answer questions about the figures. Which components go where?
   - A) NeMo Curator extracts the diagrams at query time
   - B) TensorRT-LLM converts the images to text before indexing
   - C) A vision-capable Nemotron (Omni) NIM interprets figures at ingestion/query; the NeMo Retriever embedding NIM indexes the text
   - D) Triton’s image backend makes a vision model unnecessary
Answers.
- 1: B. Runtime topic control is Guardrails; per-step token profiling is the NeMo Agent Toolkit; training-data curation is Curator. The distractors test the classic blur: Evaluator evaluates models/pipelines, Customizer fine-tunes; neither blocks topics nor profiles workflows.
- 2: C. Data residency decides it: self-hosted NIMs keep traffic inside while preserving the exact API. A still sends data out (masking ≠ residency). B confuses fine-tuning with serving. D would rebuild what a NIM already packages: engine selection, model, API surface.
- 3: B. Quantization makes each forward pass cheaper; that’s single-request latency, TensorRT-LLM’s level. In-flight batching (A) lifts throughput under concurrency; sequential p95 barely moves. More workers (C) helps queueing, not GPU compute time. D raises latency.
- 4: D. Role-based assignment: the router needs cheap-and-fast at volume (nano); the judge should outclass the workers it grades (super, Module 8’s rule); the synthesis earns the bigger model where reasoning is hard and latency tolerated. A and B ignore cost or invert roles; C ignores that judging and hard reasoning have a quality floor.
- 5: C. Vision understanding needs a vision-capable model; indexing needs the embedding NIM. Curator (A) prepares training data, not query pipelines. TensorRT-LLM (B) is an engine, not an OCR system. Triton (D) serves models; it doesn’t replace one.
Traps to avoid:
- “NIM is a model.” A NIM packages a model with its optimized engine and serving stack behind a standard API. When an option treats NIM and model as interchangeable — or equates NIM with build.nvidia.com — it’s testing exactly this.
- Triton vs TensorRT-LLM, inverted. TensorRT-LLM compiles and optimizes (quantization, KV cache, fused kernels); Triton serves (in-flight batching, multi-model, metrics). Latency-per-request points at the engine; throughput-under-load points at the server.
- The old names. “AgentIQ” / “Agent Intelligence Toolkit (AIQ)” in the official readings = today’s NeMo Agent Toolkit. And “NeMo” alone names a family — Guardrails, Retriever, Curator, Customizer, Evaluator are distinct products with distinct jobs.
- Multimodality changes the message parts, not the agent pattern. An Omni model adds image or audio parts to messages; tool calling, state, rails, and the supervisor loop stay identical. An option demanding a “new architecture” for vision input is testing exactly this.
Key takeaways
- A NIM is a packaged inference microservice (model + TensorRT-LLM engine + Triton + OpenAI-compatible API in one container); hosted build.nvidia.com is just one way to consume it.
- TensorRT-LLM optimizes the model (quantization, KV cache, fused kernels); Triton serves it (in-flight batching, multi-model, metrics) — latency vs throughput, engine vs server.
- Nemotron models are chosen by role: nano for volume and tool loops, super for hard reasoning and judging, ultra for the hardest problems.
- The NeMo Agent Toolkit (ex-AgentIQ/AIQ — keep both names) is point-in-time profiling; Langfuse is continuous production observability. You need both.
- NeMo is a family: Guardrails (rails), Retriever (embed/rerank), Curator (training data), Customizer (fine-tuning), Evaluator (managed evals).
- Customizer and Evaluator require Kubernetes + GPUs — concept-level for laptops; Curator is a pip library, partly CPU.
- Hosted vs self-hosted is an equation (residency, cost model, volume, ops), and the config.py contract makes either answer a one-line change.
Keep going
Every piece is now on the table — built, evaluated, guarded, deployed, traced, profiled. Next: assemble, harden, and ship Scout v1.0.
References
- NCP-AAI certification page: the official blueprint; NVIDIA Platform Implementation is weighted at 7%.
- NVIDIA NIM documentation: the official docs hub for NIM microservices, from “how NIM works” to deployment guides.
- NVIDIA NeMo Agent Toolkit (GitHub): the toolkit’s current home (v1.7.0, May 2026); pip install nvidia-nat.
- NeMo Agent Toolkit documentation: current docs (1.7), including the profiler and custom-function guides.
- Nemotron 3 Nano (API reference): the model card behind Scout’s default worker; the Omni variant is documented alongside it.
- NeMo Guardrails (GitHub): note the repo’s new home under the NVIDIA-NeMo org (v0.22); older links redirect.