The NVIDIA Agentic Stack: NIM, NeMo, and Nemotron in Practice (NCP-AAI Module 12)

This is Module 12 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified.

You have been building on the NVIDIA stack since Scout’s first API call, eleven modules ago — and you have probably never seen the whole map. Here is the exam question that punishes that: “Which NVIDIA product curates training data?” If you just hesitated between Curator, Customizer, and Evaluator, you left points on the table. NVIDIA Platform Implementation is only 7% of the NCP-AAI exam, but these are the cheapest points on the paper: pure product mapping, no scenario gymnastics — if you have the map. This module names every brick Scout already stands on, fills the gaps you haven’t touched (TensorRT-LLM, Triton, the NeMo microservices), then replaces feelings with numbers: you’ll profile Scout with NVIDIA’s own toolkit and benchmark two Nemotron models on your Module 8 golden set. By the end, “which product does what” is the easiest part of your exam.

In this module

You’ll learn:
- Map the NVIDIA agentic platform end to end — hardware → inference engine → serving → NIM → Nemotron → NeMo suite → NeMo Agent Toolkit — and place every brick Scout already uses.
- Deploy a self-hosted NIM (NVIDIA Inference Microservices) in guided concept: what the container ships, and when it beats the hosted endpoints.
- Optimize an agentic workflow with the NeMo Agent Toolkit: profile tokens and latency per step, name the bottleneck.
- Leverage TensorRT-LLM and Triton conceptually: which cuts latency, how, and which serves.
- Compare two Nemotron 3 models on the Module 8 golden set and assign a model to each role with data.
- Build agents on multimodal generative models (text, vision, audio) — with an optional vision call to Nemotron 3 Omni.
You’ll build: Profile Scout with the NeMo Agent Toolkit and benchmark two Nemotron 3 models on your Module 8 golden set.
Exam domains covered: D7 — NVIDIA Platform Implementation — 7% of the exam; D2 (partial) — Agent Development — objective 2.2 (multimodal models).
Prerequisites: Modules 1–11; the Module 8 golden set and judge (the comparison runs on them); NVIDIA API key configured. New setup: nvidia-nat via uv — laptop only, nothing paid.

Where you are

✅ Modules 1–7 — from a first NIM call to a supervisor team with Planner, Searcher, Reader, Fact-checker, Writer
✅ Modules 8–11 — evals and golden set, guardrails + HITL, FastAPI + Docker, tracing and cost observability
👉 Module 12 — The NVIDIA Agentic Stack (you are here)
⬜ Module 13 — capstone: assemble, harden, ship Scout v1.0
⬜ Module 14 — the exam: strategy, mock exam, debrief

Scout before: complete and observable, but never profiled — running on the model it started with, chosen by default. Scout after: profiled with NVIDIA’s official toolkit, bottleneck named, a model per role chosen with numbers — and you hold the platform map that Domain 7 tests.

The NVIDIA agentic platform, mapped

The platform is a layer cake, and you have been eating from the middle of it since Module 1. Read it bottom-up:

flowchart BT
    HW["Hardware — DGX systems, cloud GPUs<br/>(Brev: GPUs on demand · LaunchPad: NVIDIA's guided-lab environment)"]
    TRT["TensorRT-LLM — compiles & optimizes the model<br/>(quantization, KV cache, fused kernels)"]
    TRITON["Triton Inference Server — serves models at scale<br/>(in-flight batching, multi-model, metrics)"]
    NIM["NIM — model + engine + server in one container,<br/>OpenAI-compatible API · Scout: every LLM call since M1"]
    MODELS["Nemotron 3 — nano / super / ultra (+ nano-omni)<br/>Scout: worker since M1, judge since M8"]
    NEMO["NeMo suite — Guardrails (M9) · Retriever (M6) ·<br/>Curator · Customizer · Evaluator (concept)"]
    NAT["NeMo Agent Toolkit — profile, evaluate, connect<br/>Scout: M12 (this module)"]
    HW --> TRT --> TRITON --> NIM --> MODELS --> NEMO --> NAT

Bottom-up: silicon; the engine that optimizes a model for it; the server that serves it; the container that packages all three; the models; the product suite around them; the toolkit that measures the whole thing.

The table below is the highest-value Domain 7 asset — the exam asks “which product does X” in almost exactly this shape:

Product	What it does (one line)	Where Scout uses it
NIM	Containerized inference microservice: model + optimized engine + server behind an OpenAI-compatible API	Every LLM call since Module 1 (hosted)
Nemotron	NVIDIA’s open model family built for agentic work (reasoning, tool calling)	Worker since M1; judge since Module 8
NeMo Guardrails	Programmable input/dialog/output rails for LLM apps	Module 9
NeMo Retriever	Embedding and reranking NIMs for retrieval pipelines	Module 6
NeMo Curator	Curates training data: dedup, filtering, cleaning at scale (pip library, partly CPU)	Concept (this module)
NeMo Customizer	Managed fine-tuning of models (microservice; Kubernetes + GPU)	Concept (this module)
NeMo Evaluator	Managed evaluation of models and pipelines (microservice; Kubernetes + GPU)	Concept — Scout uses its own M8 harness
NeMo Agent Toolkit	Framework-agnostic profiling, evaluation, and connection layer for agent workflows	This module’s lab
TensorRT-LLM	Compiles and optimizes a model for a target GPU (quantization, KV cache, fused kernels)	Concept — inside every NIM Scout calls
Triton Inference Server	Serves models in production: in-flight batching, multi-model, metrics	Concept — inside every NIM Scout calls
Brev / DGX	GPU access: on-demand cloud GPUs / NVIDIA’s AI infrastructure systems	Concept — the lab’s self-host walkthrough

Two reading rules defuse most D7 traps. First, NeMo is a family of distinct products, not a product — “we used NeMo” is as precise as “we used AWS”. Second, the layers don’t substitute: Triton doesn’t optimize, TensorRT-LLM doesn’t serve, and a NIM is neither a model nor a website.
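The table reads naturally as a lookup, which is also how the exam uses it. Below is a minimal study-aid sketch that copies the one-line jobs from the table; `PRODUCT_ROLES` and `product_for` are illustrative names, not part of any NVIDIA package.

```python
# Product -> one-line job, condensed from the table above (study aid only).
PRODUCT_ROLES = {
    "NIM": "containerized inference microservice",
    "Nemotron": "open model family for agentic work",
    "NeMo Guardrails": "programmable input/dialog/output rails",
    "NeMo Retriever": "embedding and reranking for retrieval",
    "NeMo Curator": "training-data curation",
    "NeMo Customizer": "managed fine-tuning",
    "NeMo Evaluator": "managed evaluation",
    "NeMo Agent Toolkit": "profiling, evaluation, and connection layer for agent workflows",
    "TensorRT-LLM": "compiles and optimizes a model for a target GPU",
    "Triton Inference Server": "serves models in production",
}


def product_for(need: str) -> list[str]:
    """Return every product whose one-line job mentions the given keyword."""
    needle = need.lower()
    return [name for name, job in PRODUCT_ROLES.items() if needle in job.lower()]


print(product_for("curation"))
print(product_for("serves"))
```

A keyword that matches two products (try "evaluation") is a useful reminder of where the blur between neighbours sits.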

NIM: what’s actually in the box — hosted vs self-hosted

NVIDIA Inference Microservices (NIM) are containerized inference services: one Docker container packaging a model, an inference engine pre-optimized for the GPU it detects at startup, a production server, and an OpenAI-compatible API on top — the same /v1/chat/completions surface you’ve been calling all course.

Two components inside have names the exam expects you to keep straight. TensorRT-LLM is the compiler and optimizer: it takes model weights and builds a GPU-specific engine, applying quantization (lower-precision weights and activations — smaller, faster math at a controlled accuracy cost), KV-cache management (reusing the attention keys and values already computed instead of recomputing them for every token), and fused kernels (several GPU operations merged into a single launch). It makes each forward pass cheaper. Triton Inference Server is the server: it loads engines and handles production serving — in-flight batching (new requests join a running batch instead of waiting for it to finish, lifting GPU utilization and throughput), multi-model hosting, metrics. One optimizes the model; the other runs the traffic. A latency question is usually a TensorRT-LLM question; a throughput-under-concurrency question is usually a Triton question.
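Two of those optimizations can be made concrete with back-of-envelope arithmetic. The sketch below is not how TensorRT-LLM measures anything: it assumes 1 GB = 10^9 bytes, counts weights only (no activations, KV-cache memory, or runtime overhead), and models the KV cache simply as "each token position is processed once". Both function names are hypothetical.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: parameter count times bytes per parameter."""
    return params_billions * bits_per_weight / 8


def tokens_processed(prompt_len: int, new_tokens: int, kv_cache: bool) -> int:
    """Token positions pushed through the model to generate `new_tokens` tokens."""
    if new_tokens <= 0:
        return 0
    if kv_cache:
        # Prefill the prompt once, then feed only the newest token at each step.
        return prompt_len + (new_tokens - 1)
    # No cache: every decoding step re-processes the whole sequence so far.
    return sum(prompt_len + step for step in range(new_tokens))


# Quantization: the same 30B-parameter weights at 16-bit versus 8-bit precision.
print(weight_footprint_gb(30, 16), weight_footprint_gb(30, 8))  # 60.0 30.0

# KV cache: a 1,000-token prompt followed by 200 generated tokens.
print(tokens_processed(1000, 200, kv_cache=False))  # 219900
print(tokens_processed(1000, 200, kv_cache=True))   # 1199
```

Both numbers describe the cost of a single request, which is why they belong on the engine side of the engine-versus-server split.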

So what is build.nvidia.com? NVIDIA hosting these same NIMs and renting you the API: identical surface, zero infrastructure, free-tier rate limits (40 req/min as of June 2026). Module 10 already gave you the decision framework; here it is with NVIDIA names:

Criterion	Hosted endpoints (build.nvidia.com)	Self-hosted NIM
Cost model	Variable: per-call/credits, zero idle cost	Fixed: GPU runs whether busy or not
Data residency	Prompts and outputs leave your network	Everything stays inside your perimeter
Latency	Internet round-trip, shared capacity	Local network, dedicated GPU
Throughput	Rate-limited (40 req/min free tier, June 2026)	Your hardware is the limit
Ops burden	None	Containers, drivers, GPU capacity, upgrades
Best when	Prototyping, spiky/low volume, no residency constraint	Compliance, air-gapped networks, sustained volume
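The table can be read as a short decision procedure. The sketch below is one hedged way to encode it: the `Workload` fields, the return labels, and the ordering of the checks are assumptions made for illustration, and only the 40 req/min figure comes from the text above.

```python
from dataclasses import dataclass

FREE_TIER_REQ_PER_MIN = 40  # hosted free-tier limit quoted above (June 2026)


@dataclass
class Workload:
    residency_required: bool     # must prompts and outputs stay on your network?
    air_gapped: bool             # no outbound internet at all?
    peak_llm_calls_per_min: int  # sustained peak across all agents


def recommend_deployment(w: Workload) -> str:
    """Apply the table's criteria in order: hard constraints first, then volume."""
    if w.residency_required or w.air_gapped:
        return "self-hosted"
    if w.peak_llm_calls_per_min > FREE_TIER_REQ_PER_MIN:
        return "beyond-free-tier"  # revisit the cost model before deciding
    return "hosted"


print(recommend_deployment(Workload(False, False, 12)))
print(recommend_deployment(Workload(True, False, 12)))
```

Volume is counted in LLM calls, not user questions: the Step 3 profile in the lab records 13 LLM calls for a single Scout question, so 40 req/min covers roughly three questions per minute.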

The Nemotron family: picking the right model for each role

Nemotron is NVIDIA’s open model family tuned for agentic work: reasoning, tool calling, long context. The current generation is Nemotron 3, a Mixture-of-Experts (MoE) lineup in which only a fraction of the weights fires per token. That fraction is the active parameter count, the a3b, a12b, or a55b suffix in the model IDs, so cost and latency track a dense model of that active size, not the headline one. Two of its hosted NIMs have lived in config.py all course; ultra completes the family:

Model	Shape	Strengths	Typical agent role	Cost / latency
`nvidia/nemotron-3-nano-30b-a3b`	MoE, ~3B active params	Fast, cheap, strong tool calling, 1M context	High-volume routing, tool loops, workers	Lowest
`nvidia/nemotron-3-super-120b-a12b`	MoE, ~12B active params	Deeper multi-step reasoning, synthesis	Hard reasoning steps; LLM-as-judge	Mid
`nvidia/nemotron-3-ultra-550b-a55b`	MoE, ~55B active params	Strongest reasoning in the family	Hardest problems, offline/batch work	Highest

The lineup is role-based, not “good/better/best for everything” — exactly how the exam frames model-selection questions, and exactly how the lab makes you decide: nano has been Scout’s worker since Module 1, super the judge since Module 8, and today you’ll test whether those defaults survive contact with your golden set. Because model names live only in config.py (the contract frozen in Module 1), the swap is a one-line change — twelve modules of discipline paying off in one lab.
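To make the one-line swap concrete, here is a hypothetical layout for such a config file. It is not the labs repo's actual config.py: `BASE_URL` and `JUDGE_MODEL` are names this module uses elsewhere, while `WORKER_MODEL`, `MODEL_BY_ROLE`, and `model_for` are invented for the sketch.

```python
# config.py (illustrative layout, not the labs repo's actual file)
BASE_URL = "https://integrate.api.nvidia.com/v1"   # hosted NIM endpoint

WORKER_MODEL = "nvidia/nemotron-3-nano-30b-a3b"    # hypothetical setting name
JUDGE_MODEL = "nvidia/nemotron-3-super-120b-a12b"  # judge since Module 8

MODEL_BY_ROLE = {"worker": WORKER_MODEL, "judge": JUDGE_MODEL}


def model_for(role: str) -> str:
    """Single lookup point: nodes ask for a role and never hard-code a model ID."""
    return MODEL_BY_ROLE[role]


print(model_for("worker"))
```

With this shape, trying super as the worker means editing the `WORKER_MODEL` line and nothing else.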

Market context, as of June 2026: at GTC 2026 NVIDIA announced the Nemotron Coalition — Nemotron 4 co-developed with partners including Mistral, LangChain, Cursor, and Perplexity — alongside native Nemotron-plus-NAT (NeMo Agent Toolkit) integration in LangChain. For the exam, generation 3 is what the hosted catalog serves.

NeMo Agent Toolkit: profiling your agent like NVIDIA does

The NeMo Agent Toolkit (NAT) is NVIDIA’s framework-agnostic layer for working across agent stacks: it connects to workflows written in LangGraph, CrewAI, LlamaIndex and others, and gives you a profiler (runtimes, token usage, bottleneck analysis per step), an evaluator, and MCP client/server support. It doesn’t replace your framework — Scout stays pure LangGraph — it wraps it and measures it. Install is pip install nvidia-nat; the labs pin ~=1.7.0 (i.e. >=1.7.0,<1.8) because the toolkit releases monthly and the API moves.
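The `~=` pin is worth a second look, since it decides which monthly releases a fresh install will accept. The sketch below is a simplified reimplementation of the compatible-release rule for plain X.Y.Z versions only (no pre-releases, no epochs); pip and uv apply the full PEP 440 rules, and both function names are hypothetical.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '1.7.0' into (1, 7, 0). Plain numeric versions only."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(candidate: str, pinned: str) -> bool:
    """Simplified `~=` check: same prefix as the pin minus its last part, and not older."""
    cand, pin = parse(candidate), parse(pinned)
    return cand[: len(pin) - 1] == pin[:-1] and cand >= pin


for version in ("1.6.9", "1.7.0", "1.7.3", "1.8.0"):
    print(version, satisfies_compatible_release(version, "1.7.0"))
```

Patch releases within 1.7 are accepted; the next minor release is not, which is the point of the pin.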

One naming warning, genuinely exam-relevant: NAT used to be called AgentIQ, then the Agent Intelligence Toolkit (AIQ) — and the official study guide’s recommended readings still use the old names and old doc URLs. Same product, renamed. Don’t let the rename cost you a known answer.

Draw the boundary with Module 11 precisely, because the two tools look superficially similar. Langfuse is continuous production observability: always on, every request, dashboards, alerts — the flight recorder. The NAT profiler is point-in-time optimization analysis: a controlled batch, run on purpose, to find the bottleneck before you change something — the wind tunnel. You need both; Domain 7’s objective 7.3 (“optimize workflows with the NeMo Agent Toolkit”) is about the second.

flowchart LR
    DS["profile_dataset.json<br/>(2 golden-set questions)"] --> EVAL["nat eval<br/>(profiler enabled)"]
    EVAL --> FN["scout_research<br/>(M12 wrapper)"]
    FN --> G["supervisor loop (M7): Planner (+critic) → auto-approve →<br/>Searcher → Reader → Fact-checker → Writer"]
    G --> OUT["profile_output/<br/>workflow runtimes · bottleneck report"]
    G -."per-node tokens & latency (plan B)".-> LF["Langfuse + usage log (M11)"]

The profiled run: NAT drives Scout through a wrapper and writes the execution profile; the per-node numbers come from NAT when it can see them, from the Module 11 instrumentation when it can’t.

That dotted line is an engineering note, not a footnote. NAT’s automatic LLM instrumentation hooks framework layers — its LangChain handler sees LLM and tool calls made through LangChain (likewise for LlamaIndex, CrewAI…). Scout’s nodes call the raw openai SDK directly — a deliberate course choice — so NAT profiles the workflow boundary (runtime per question, concurrency, forecast) and sees nothing inside: no per-step spans, no tokens. Plan B, assumed from the start: NAT for the workflow-level execution profile, Module 11 for the per-node breakdown — tokens and latency from the local usage log, the same numbers on every Langfuse trace. The lab prints which path applied to your run. Knowing where your instrumentation sits in the stack is half of observability.
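The plan-B path depends on each node recording its own tokens and seconds. The sketch below shows the general pattern under stated assumptions: `USAGE_LOG`, `record_node`, and `per_node_totals` are hypothetical names, the log is an in-memory list rather than the Module 11 usage log, and the token counts are hard-coded where a real run would read them from the SDK response.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

USAGE_LOG: list[dict] = []  # hypothetical in-memory stand-in for the usage log


@contextmanager
def record_node(node: str):
    """Time one LLM call and let the caller attach token counts to the entry."""
    entry = {"node": node, "in_tok": 0, "out_tok": 0, "sec": 0.0}
    start = time.perf_counter()
    try:
        yield entry
    finally:
        entry["sec"] = time.perf_counter() - start
        USAGE_LOG.append(entry)


def per_node_totals(log: list[dict]) -> dict[str, dict]:
    """Aggregate calls, tokens, and seconds per node name."""
    totals = defaultdict(lambda: {"calls": 0, "in_tok": 0, "out_tok": 0, "sec": 0.0})
    for e in log:
        t = totals[e["node"]]
        t["calls"] += 1
        t["in_tok"] += e["in_tok"]
        t["out_tok"] += e["out_tok"]
        t["sec"] += e["sec"]
    return dict(totals)


with record_node("planner") as entry:
    # The raw-SDK call would go here; with the openai client, the counts are
    # available as response.usage.prompt_tokens and response.usage.completion_tokens.
    entry["in_tok"], entry["out_tok"] = 1106, 2752  # placeholder values

print(per_node_totals(USAGE_LOG))
```

Because the measurement wraps the call site itself, it works regardless of which framework layer, if any, sits above the SDK.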

The rest of the map: NeMo microservices and when this stack wins

Three NeMo products complete the map, and you should know them as concepts — they are generally available (GA), but NeMo Customizer (managed fine-tuning) and NeMo Evaluator (managed evaluation) run as microservices on Kubernetes with GPUs: real enterprise infrastructure, not laptop material. NeMo Curator (training-data curation: deduplication, filtering, quality scoring at corpus scale) is the partial exception — it ships as a pip library and much of it runs on CPU, though it’s still a data-pipeline tool, not an agent-runtime one.

Together they form NVIDIA’s data flywheel — a loop where production data improves the models that produce it: Curator prepares the data your deployed agents generate, Customizer fine-tunes on it, Evaluator validates the result, and the improved model redeploys behind the same NIM API. One paragraph here; one mapping question on the exam.

So when does the NVIDIA stack win? My honest read, criteria not cheerleading:

Data residency or air-gapped requirements — self-hosted NIMs are the cleanest “same API, inside the perimeter” story available.
Owned GPUs or committed volume — TensorRT-LLM engines extract real performance from hardware you’re already paying for.
Enterprise support, one-vendor accountability — the suite is built to be bought and supported together.
Already on Nemotron-class open models — integration cost near zero, as Scout proves.

And when it doesn’t: a small team on hosted APIs with no residency constraint gains little from self-hosting. The decision is the Module 10 framework, not loyalty.

Multimodal agents with Nemotron 3 Omni

Domain 2’s objective 2.2 — “Integrate generative and multimodal models (text, vision, audio)” — lands in this module, because the platform answer is an NVIDIA one. A multimodal model accepts more than text — images, audio, video — in its input messages and reasons over them jointly with text. The key architectural insight: the agent pattern does not change. Messages gain image or audio parts; tool calling, state, the supervisor loop, guardrails — all identical. Multimodality is an input-type upgrade, not a new architecture.

Where it touches a pipeline like Scout’s (objective 7.5’s territory): ingestion. A PDF with diagrams defeats a text-only Reader — the platform play is a vision-capable model interpreting figures at ingestion or query time, while the NeMo Retriever embedding NIM indexes the text. Speech adds audio NIMs at the edges (recognition in, synthesis out). The mapping — which component handles which modality where — is the exam skill.

The hosted catalog serves one Omni model as of June 2026, and the exact ID matters — there is no bare “nemotron-3-omni”: nvidia/nemotron-3-nano-omni-30b-a3b-reasoning (text, image, audio, video in; text out). An optional taste, outside Scout’s flow, laptop only — the snippet ships standalone as module-12/extras/vision_demo.py in the labs repo (uv run resolves openai from the project). Put a PNG next to it (a screenshot works — swap architecture.png for its name), then from module-12/extras/ run uv run python vision_demo.py with NVIDIA_API_KEY exported in your shell — this standalone snippet doesn’t read .env:

# vision_demo.py — one vision call to the hosted Omni NIM (optional)
import base64, os
from openai import OpenAI

OMNI_MODEL = "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"  # the ONLY hosted Omni ID (June 2026)

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)
with open("architecture.png", "rb") as f:  # context manager closes the file handle
    image_b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
    model=OMNI_MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this architecture diagram in two sentences."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=2048,  # Omni reasons before answering — keep headroom (Module 1's trap)
)
print(response.choices[0].message.content)

Same client, same endpoint, same message shape with one extra content part. That’s the whole point.
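The "one extra content part" can be factored into a small helper so text-only and multimodal calls share one message builder. This is a sketch, not code from the labs repo: `image_part` and `user_message` are hypothetical names, and the demo above only establishes that PNG input works, so treat other image formats as untested.

```python
import base64
import mimetypes


def image_part(filename: str, data: bytes) -> dict:
    """Build an OpenAI-style `image_url` content part from raw image bytes."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def user_message(text: str, *parts: dict) -> dict:
    """A user message: one text part followed by any number of extra parts."""
    return {"role": "user", "content": [{"type": "text", "text": text}, *parts]}


# Dummy bytes stand in for a real file read; the structure is what matters here.
msg = user_message("Describe this diagram.", image_part("architecture.png", b"abc"))
print(msg["content"][1]["image_url"]["url"])  # data:image/png;base64,YWJj
```

Calling `user_message` with no extra parts yields an ordinary text message, which is the agent-pattern argument in miniature.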

Hands-on lab: build it

Objective: profile Scout with the NeMo Agent Toolkit, then compare two Nemotron 3 models on the Module 8 golden set so each role runs on a model you chose with numbers. The full lab lives in module-12/ of the labs repo.

Observable result: profile_scout.py prints the per-step runtime breakdown and names the bottleneck; compare_models.py prints a nano-vs-super table (judge score, tokens, seconds per question) with the fixed-judge and bias caveats. Everything runs on a laptop against hosted NIMs.

Step 1 — Install the toolkit

uv add --dev "nvidia-nat[eval,profiler,langchain]~=1.7.0"  # strict pin: monthly releases move the API
uv add --dev --editable module-12/nvidia-variant    # registers scout_research with the nat CLI

(If you cloned the labs repo, uv sync already did both.) nvidia-nat is a meta-package: nat eval, the profiler, and the LangChain hooks each live in an extra — hence the bracket list. The second line installs a tiny plugin package as an editable dev dependency — the documented way the nat CLI discovers custom workflows (entry point group nat.components). Both lines target the dev group, kept out of the runtime dependencies so the Module 10 Docker image (uv sync --no-dev) never ships a profiler.

Step 2 — Register Scout without touching it

The entire NAT↔Scout bridge is one config class and one decorated function in nvidia-variant/nat_scout/register.py:

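# Import paths as found in recent nvidia-nat releases (an assumption for your
# pinned version; check them against the installed package).
from nat.builder.builder import Builder
from nat.builder.framework_enum import LLMFrameworkEnum
from nat.cli.register_workflow import register_function
from nat.data_models.function import FunctionBaseConfig

class ScoutResearchConfig(FunctionBaseConfig, name="scout_research"):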
    """`workflow._type: scout_research` in nat_config.yml maps to this."""

@register_function(
    config_type=ScoutResearchConfig,
    # Scout is LangGraph: this hooks NAT's profiler into the LangChain
    # callback layer. That layer only sees LLM/tool calls made THROUGH
    # LangChain — Scout's nodes use the raw `openai` SDK, so per-node
    # numbers come from Module 11. Plan B, by design.
    framework_wrappers=[LLMFrameworkEnum.LANGCHAIN],
)
async def scout_research(config: ScoutResearchConfig, builder: Builder):
    from nat_scout.runner import run_research

    async def _respond(question: str) -> str:
        state = run_research(question)
        return state.get("report") or "(no report produced)"

    yield _respond

run_research() runs the frozen graph, auto-approves the Module 9 plan interrupt with the frozen payload (approve, never edit, so you profile the trajectory a user would approve), and isolates each run in a fresh temp-dir knowledge base — the Module 8 harness’s hygiene, reused. scout/ is not modified; nat_config.yml has no llms: section because Scout reads its model from config.py — the one-file rule survives a new toolkit.
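The isolation idea, a fresh knowledge base per run, can be sketched with the standard library alone. This is not the labs repo's runner: `run_isolated` is a hypothetical helper, and the two-argument `run_fn` is an assumed shape chosen to make the temporary directory visible.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_isolated(question: str, run_fn: Callable[[str, Path], dict]) -> dict:
    """Run one question against a throwaway knowledge-base directory."""
    with tempfile.TemporaryDirectory(prefix="scout-kb-") as tmp:
        kb_dir = Path(tmp)
        # The directory exists only for this call and is deleted on exit,
        # so one run cannot leak documents into the next.
        return run_fn(question, kb_dir)


def fake_run(question: str, kb_dir: Path) -> dict:
    """Stand-in for a real graph run: writes one file, returns a tiny state."""
    (kb_dir / "note.txt").write_text(question)
    return {"report": f"{len(list(kb_dir.iterdir()))} file(s) indexed"}


print(run_isolated("What is a NIM?", fake_run))
```

Isolation is what makes two profiled runs comparable: neither starts with the other's retrieved documents.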

Step 3 — Profile two golden-set questions

cd module-12
uv run python nvidia-variant/profile_scout.py --limit 2

The script builds a tiny dataset from your local evals/golden_set.json (the frozen Module 8 copy — never re-authored), runs nat eval with the profiler enabled, and prints what it found. A real run, June 2026:

-- Top 5 Calls by Bottleneck Score (subtree_time) --
1) UUID=..., FUNCTION '<workflow>', dur=244.00, self_time=244.00, subtree_time=244.00, concurrency=1.0, score=244.00
2) UUID=..., FUNCTION '<workflow>', dur=95.72,  self_time=95.72,  subtree_time=95.72,  concurrency=1.0, score=95.72

Plan B (expected with raw-SDK nodes): NAT profiled the workflow
boundary but saw neither tokens nor per-step spans — its framework
hooks instrument LangChain calls, and Scout's nodes use the raw
`openai` SDK. Per-node numbers come from Module 11 instead:

--- per-node profile of the LAST question (Module 11 usage log) --------
node             calls   in tok  out tok    est. $     sec
----------------------------------------------------------
supervisor           5     4850     3508    0.0038    27.1
fact_checker         2     2717     2157    0.0023    13.2
planner              2     1106     2752    0.0024    15.5
writer               1     1192     2659    0.0024    13.0
critic               1      458      818    0.0007     4.9
reader               1      613      618    0.0006     4.5
searcher             1      729      171    0.0003     1.3
----------------------------------------------------------
TOTAL               13    11665    12683    0.0125    79.5

NAT’s rows are workflow-level — one span per question — because nothing inside Scout goes through the layer NAT instruments; the per-node table is Module 11 earning its keep. Note what the data did to the obvious guess: the expected bottleneck was the Reader or the Writer; on this run the biggest LLM-time spender is the supervisor — five routing turns at ~5 s each. Profiling exists to confirm, before you optimize.
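Naming the bottleneck from the per-node table is a few lines of arithmetic. The sketch below copies the calls and seconds columns from the run above; the variable names are illustrative.

```python
# (node, calls, seconds) copied from the per-node table above
ROWS = [
    ("supervisor", 5, 27.1), ("fact_checker", 2, 13.2), ("planner", 2, 15.5),
    ("writer", 1, 13.0), ("critic", 1, 4.9), ("reader", 1, 4.5), ("searcher", 1, 1.3),
]

total_sec = sum(sec for _, _, sec in ROWS)

# Bottleneck by total LLM seconds spent in the node.
bottleneck, calls, sec = max(ROWS, key=lambda row: row[2])
share = sec / total_sec
per_call = sec / calls
print(f"{bottleneck}: {sec:.1f}s of {total_sec:.1f}s ({share:.0%}), {per_call:.1f}s per call")
# supervisor: 27.1s of 79.5s (34%), 5.4s per call

# A different lens: the slowest single call.
slowest_call = max(ROWS, key=lambda row: row[2] / row[1])[0]
print(slowest_call)  # writer
```

The two lenses point at different fixes: fewer supervisor turns on one side, a shorter writer generation on the other.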

Step 4 — nano vs super, judged

uv run python nvidia-variant/compare_models.py --limit 5

Five golden-set questions, one full Scout run per question per candidate. The swap happens through config.CANDIDATE_MODELS — model names still live in exactly one file. The Module 8 judge stays fixed for both runs — our June 2026 run, in full:

model   question     grounding  coverage citations    tokens  seconds
---------------------------------------------------------------------
nano    r01                  3         3         3     26378    102.1
nano    f01                  5         3         5     25216     96.7
nano    c01                  1         3         1     30570    144.5
nano    m01                  3         1         5     29680    154.0
nano    f02                  5         1         5     27773    133.4
nano    MEAN              3.40      2.20      3.80     27923    126.1

super   r01                  5         3         5     22884    944.4
super   f01                  3         3         3     20189    487.1
super   c01                  3         3         3     23123    481.8
super   m01                  5         1         5     29064    850.7
super   f02                  5         3         5     29399   1038.6
super   MEAN              4.20      2.60      4.20     24932    760.5

judge fixed for both runs: nvidia/nemotron-3-super-120b-a12b
caveat (Module 8): the judge shares weights with the 'super' candidate —
read that column with self-preference bias in mind.

That caveat is methodology, not decoration: the judge is the super model, so read its +0.8 grounding lift through that lens. Two findings survive the bias. Super used fewer tokens than nano (~25 k vs ~28 k — fewer wasted turns): “bigger model = bigger bill” is not automatic. And it paid sixfold in latency: 760 s mean per run against nano’s 126. Our deliverable sentence: “nano stays the worker — six times faster for most of the quality; super’s bias-tinted lift is worth it only where minutes don’t matter, and it keeps the judge seat.” Yours may differ — if the data says so.
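The three headline figures in that paragraph can be recomputed from the table. The sketch below copies the grounding, tokens, and seconds columns from the run above; `summarize` and the variable names are illustrative, and five questions remain a small sample.

```python
from statistics import mean

# Per-question (grounding, tokens, seconds) copied from the comparison table.
RUNS = {
    "nano": [(3, 26378, 102.1), (5, 25216, 96.7), (1, 30570, 144.5),
             (3, 29680, 154.0), (5, 27773, 133.4)],
    "super": [(5, 22884, 944.4), (3, 20189, 487.1), (3, 23123, 481.8),
              (5, 29064, 850.7), (5, 29399, 1038.6)],
}


def summarize(rows):
    """Column means for one candidate model."""
    grounding, tokens, seconds = zip(*rows)
    return {"grounding": mean(grounding), "tokens": mean(tokens), "seconds": mean(seconds)}


nano, sup = summarize(RUNS["nano"]), summarize(RUNS["super"])
grounding_lift = sup["grounding"] - nano["grounding"]
latency_ratio = sup["seconds"] / nano["seconds"]
token_ratio = sup["tokens"] / nano["tokens"]

print(f"grounding lift: {grounding_lift:+.1f}")  # +0.8
print(f"latency ratio:  {latency_ratio:.1f}x")   # 6.0x
print(f"token ratio:    {token_ratio:.2f}")      # 0.89
```

Keeping the raw rows next to the summary makes it easy to rerun the same arithmetic on your own table.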

Step 5 — Guided concept: self-hosting a NIM (read, don’t pay)

The lab’s final section walks through docker login nvcr.io (NGC, NVIDIA’s container registry), running a NIM container on a rented GPU (an NVIDIA Brev L4 runs ~$0.44–0.80/h as of June 2026), and pointing Scout at it — which is BASE_URL = "http://localhost:8000/v1" in config.py and nothing else. Marked optional and paid; read it in lab.md so you can answer 7.2 questions from understanding, not memory.
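The hourly rate turns into a fixed monthly bill as soon as the GPU stays on. A minimal sketch, assuming an always-on instance and a 30-day month; only the hourly range comes from the text above, and `monthly_gpu_cost` is a hypothetical helper.

```python
HOURS_PER_MONTH = 24 * 30  # always-on, 30-day month (assumption)


def monthly_gpu_cost(rate_per_hour: float, hours: int = HOURS_PER_MONTH) -> float:
    """Fixed cost of keeping one rented GPU running, busy or idle."""
    return round(rate_per_hour * hours, 2)


low, high = monthly_gpu_cost(0.44), monthly_gpu_cost(0.80)
print(f"${low:.2f} to ${high:.2f} per month")  # $316.80 to $576.00 per month
```

That fixed figure is the number to set against a hosted, per-call bill when answering the cost-model row of the hosted-versus-self-hosted table.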

Try it yourself (no solution provided):

Probe the self-preference bias: set JUDGE_MODEL = "nvidia/nemotron-3-ultra-550b-a55b" (one line in config.py) and re-run with --limit 2. Do the nano/super gaps move? Revert the JUDGE_MODEL line afterwards — the smoke test pins it to super (test_candidate_models_contract).
Close the plan-B loop: in Langfuse, put each node’s wall-clock span beside its LLM seconds from the per-node table. Which node is slow because of the model, and which because of the network?

In production

An enterprise NVIDIA deployment is this module at fleet scale: dozens of self-hosted NIMs on Kubernetes (the GPU operator handles drivers and device plumbing), images pulled from a private registry on NGC, autoscaling on GPU utilization rather than CPU. Inside each NIM, the TensorRT-LLM engine profile (quantization included) and Triton’s in-flight batching are what you tune fleet-wide; multimodal pipelines add audio and vision NIMs at the edges of the same fleet. NAT profiling stays point-in-time — before an optimization push, or in CI — while Langfuse watches continuously. The data flywheel stops being a diagram: production traces feed Curator, Customizer fine-tunes a domain-specific Nemotron, Evaluator gates the release, and the improved model rolls out behind the same NIM API — no application change. Support and licensing run through NVIDIA AI Enterprise (check current terms — they evolve). The honest economics: this stack earns its keep at scale — sustained volume, compliance constraints, owned GPUs. For a side project or a low-traffic internal tool, hosted endpoints remain the rational answer. The skill the exam (and your CTO) wants is knowing which situation you’re in.

Exam corner

What the exam tests here. Per the official blueprint, NVIDIA Platform Implementation is 7% of the exam. The study guide’s objectives: integrate NeMo Guardrails for compliance and safety (7.1 — practiced in Module 9; here it’s placed on the map); deploy NIM microservices for high-performance inference (7.2); optimize workflows with the NeMo Agent Toolkit (7.3 — this lab); leverage TensorRT-LLM and Triton for latency reduction (7.4); manage multimodal input pipelines on NVIDIA hardware (7.5). This module also owns Domain 2’s objective 2.2 (multimodal models). Expect mapping questions: a need, four products, one right pairing.

Quiz — answers after question 5.

1. A team needs to: (a) block forbidden topics in agent output at runtime, (b) find which step of an agentic workflow burns the most tokens, (c) deduplicate and clean a 2 TB crawl before fine-tuning. Which products, in order?
- A) NeMo Evaluator, NeMo Customizer, NeMo Curator
- B) NeMo Guardrails, NeMo Agent Toolkit, NeMo Curator
- C) NeMo Guardrails, NeMo Evaluator, NeMo Customizer
- D) NeMo Agent Toolkit, Triton Inference Server, NeMo Curator
2. A hospital network requires that prompts and outputs never leave its own network. Its agents serve internal staff at moderate volume. Best inference setup?
- A) build.nvidia.com hosted endpoints with PII-masking guardrails
- B) NeMo Customizer, since fine-tuned models don’t need external calls
- C) Self-hosted NIM containers on the hospital’s GPUs: same OpenAI-compatible API, data stays inside the perimeter
- D) Triton Inference Server alone, since it has no external dependencies
3. A self-hosted GPU inference service shows p95 latency far above target on single, sequential requests. Which lever acts at the right level?
- A) Enable in-flight batching in Triton
- B) Serve a TensorRT-LLM engine with a quantized (lower-precision, e.g., FP8) profile
- C) Add more API worker processes in front of the GPU
- D) Switch to a larger model for better answers per call
4. Assign Nemotron 3 models under a cost ceiling: (1) high-volume intent router, (2) LLM-as-judge for the eval harness, (3) latency-tolerant multi-document synthesis requiring hard reasoning.
- A) ultra for all three: strongest model, fewest surprises
- B) super for the router; nano as judge; ultra for the synthesis
- C) nano for all three: cheapest always wins
- D) nano for the router; super as judge; super for the synthesis
5. A pipeline ingests PDFs full of diagrams; the agent must answer questions about the figures. Which components go where?
- A) NeMo Curator extracts the diagrams at query time
- B) TensorRT-LLM converts the images to text before indexing
- C) A vision-capable Nemotron (Omni) NIM interprets figures at ingestion/query; the NeMo Retriever embedding NIM indexes the text
- D) Triton’s image backend makes a vision model unnecessary

Answers.
1. B. Runtime topic control is Guardrails; per-step token profiling is the NeMo Agent Toolkit; training-data curation is Curator. The distractors test the classic blur: Evaluator evaluates models/pipelines and Customizer fine-tunes, so neither blocks topics nor profiles workflows.
2. C. Data residency decides it: self-hosted NIMs keep traffic inside while preserving the exact API. A still sends data out (masking ≠ residency). B confuses fine-tuning with serving. D would rebuild what a NIM already packages: engine selection, model, API surface.
3. B. Quantization makes each forward pass cheaper, and that is single-request latency, TensorRT-LLM’s level. In-flight batching (A) lifts throughput under concurrency; sequential p95 barely moves. More workers (C) help queueing, not GPU compute time. D raises latency.
4. D. Role-based assignment: the router needs cheap-and-fast at volume (nano); the judge should outclass the workers it grades (super, per Module 8’s rule); the synthesis earns the bigger model where reasoning is hard and latency tolerated. A and B ignore cost or invert roles; C ignores that judging and hard reasoning have a quality floor.
5. C. Vision understanding needs a vision-capable model; indexing needs the embedding NIM. Curator (A) prepares training data, not query pipelines. TensorRT-LLM (B) is an engine, not an OCR system. Triton (D) serves models; it doesn’t replace one.

Traps to avoid:

“NIM is a model.” A NIM packages a model with its optimized engine and serving stack behind a standard API. When an option treats NIM and model as interchangeable — or equates NIM with build.nvidia.com — it’s testing exactly this.
Triton vs TensorRT-LLM, inverted. TensorRT-LLM compiles and optimizes (quantization, KV cache, fused kernels); Triton serves (in-flight batching, multi-model, metrics). Latency-per-request points at the engine; throughput-under-load points at the server.
The old names. “AgentIQ” / “Agent Intelligence Toolkit (AIQ)” in the official readings = today’s NeMo Agent Toolkit. And “NeMo” alone names a family — Guardrails, Retriever, Curator, Customizer, Evaluator are distinct products with distinct jobs.
Multimodality changes the message parts, not the agent pattern. An Omni model adds image or audio parts to messages; tool calling, state, rails, and the supervisor loop stay identical. An option demanding a “new architecture” for vision input is testing exactly this.

Key takeaways

A NIM is a packaged inference microservice: model + TensorRT-LLM engine + Triton + OpenAI-compatible API in one container; hosted build.nvidia.com is just one way to consume it.
TensorRT-LLM optimizes the model (quantization, KV cache, fused kernels); Triton serves it (in-flight batching, multi-model, metrics) — latency vs throughput, engine vs server.
Nemotron models are chosen by role: nano for volume and tool loops, super for hard reasoning and judging, ultra for the hardest problems.
The NeMo Agent Toolkit (ex-AgentIQ/AIQ — keep both names) is point-in-time profiling; Langfuse is continuous production observability. You need both.
NeMo is a family: Guardrails (rails), Retriever (embed/rerank), Curator (training data), Customizer (fine-tuning), Evaluator (managed evals).
Customizer and Evaluator require Kubernetes + GPUs — concept-level for laptops; Curator is a pip library, partly CPU.
Hosted vs self-hosted is an equation — residency, cost model, volume, ops — and the config.py contract makes either answer a one-line change.

Keep going

Every piece is now on the table — built, evaluated, guarded, deployed, traced, profiled. Next: assemble, harden, and ship Scout v1.0.

References

NCP-AAI certification page — the official blueprint; NVIDIA Platform Implementation is weighted at 7%.
NVIDIA NIM documentation — the official docs hub for NIM microservices, from “how NIM works” to deployment guides.
NVIDIA NeMo Agent Toolkit — GitHub — the toolkit’s current home (v1.7.0, May 2026); pip install nvidia-nat.
NeMo Agent Toolkit documentation — current docs (1.7), including the profiler and custom-function guides.
Nemotron 3 Nano — API reference — the model card behind Scout’s default worker; the Omni variant is documented alongside it.
NeMo Guardrails — GitHub — note the repo’s new home under the NVIDIA-NeMo org (v0.22) — older links redirect.