Glossary

Exam terms are added here as modules are published, with the module that introduces each term. Terminology follows the official NVIDIA blueprint vocabulary.

A2A (Agent2Agent) — the open standard for making independent agents — different vendors, different frameworks — talk to each other. (Module 1)

Agent / agentic workflow — a system where the model decides the next step in a loop toward a goal: which action, with what input, and whether it’s done. (Module 1)

Agent card — An A2A agent’s published description of its skills and endpoint, used by other agents for discovery. (Module 7)

Agent-washing — misnaming in either direction: calling a fixed workflow an “agent”, or building a real agent for a task whose steps could have been written down in advance. (Module 3)

Agentic RAG — RAG in which retrieval is a tool the agent decides to call, skip, retry, or fall back from — not a fixed pipeline stage. (Module 6)

Anchored rubric — A scoring rubric in which every dimension carries concrete per-level descriptions, preventing judge-score drift. (Module 8)

Approximate nearest neighbor (ANN) — Index structures (e.g., HNSW graphs) that trade a sliver of exactness for sub-linear similarity lookup at scale. (Module 6)

Architecture Decision Record (ADR) — a design doc kept in the repo: the decision, the rejected alternatives, and why — superseded rather than edited. (Module 3)

Asymmetric embedding model — An embedding model with two projection modes (input_type="passage" for indexing documents, input_type="query" for searching) that silently degrades retrieval if the two modes are mixed up. (Module 6)

Asynchronous execution — Serving pattern where the server accepts work and returns immediately; the work completes on its own schedule, decoupled from any request. (Module 10)

Audit trail — An append-only, timestamped record of who decided what, when — every rail trigger, plan presented, and human decision — kept for compliance and accountability. (Module 9)

Autonomy levels — The spectrum from full autonomy through human-on-the-loop and human-in-the-loop to human-only, chosen per task by reversibility, error cost, regulation, and frequency. (Module 9)

Autoscaling — Automatically adjusting replica count to follow load. (Module 10)

Certiverse — The third-party platform that sells, schedules, and delivers the NCP-AAI exam online. (Module 14)

Chain-of-thought (CoT) — Prompting a model to reason step by step in text before committing to an answer — reasoning only, no actions. (Module 4)

Checkpointer — A persistence layer that saves a snapshot of the full graph state after every super-step, so a run can be resumed, inspected, or replayed instead of restarted. (Module 5)

Chunking — Splitting documents into retrieval-sized pieces that are embedded and indexed individually. (Module 6)
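
As a rough illustration of chunking, the sketch below splits text into fixed-size character windows with overlap. The function name `chunk_text` and the `size` and `overlap` parameters are assumptions made for this example; real pipelines often split on tokens or sentence boundaries instead.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into character windows of `size`, each sharing `overlap` chars with the previous one.

    Hypothetical helper for illustration; not taken from any module.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


example_chunks = chunk_text("abcdefghij", size=4, overlap=2)
```

Each chunk would then be embedded and indexed on its own; the overlap is one common way to avoid cutting a sentence's context exactly at a boundary.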

Circuit breaker — after repeated failures, stop calling a dependency entirely and probe it occasionally until it recovers. (Module 2)
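
A minimal sketch of the circuit-breaker idea, assuming a failure threshold and a cool-down after which a single probe call is allowed through. The class name, the parameters `max_failures` and `reset_after`, and the injectable `clock` are all hypothetical choices for this example.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: closed, then open after repeated failures, then one probe after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the behaviour can be tested without sleeping
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: do not touch the dependency at all.
                raise RuntimeError("circuit open: dependency not called")
            # Cool-down elapsed: fall through and let one probe call run.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed probe.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of piling more load onto a dependency that is already struggling.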

Conditional edge — a LangGraph edge whose target is decided at runtime from the state — a routing rule, declared instead of buried in an if. (Module 2)

Containerization — Packaging an application with its entire runtime into one immutable image that runs identically on any machine with a container engine. (Module 10)

Context engineering — Treating what enters the model’s context as a deliberate, budgeted design decision rather than letting history accumulate by default. (Module 5)

Context isolation — Designing each agent to see only the state its step needs, so instructions don’t drown in irrelevant context. (Module 7)

Continuous evaluation — Running the evaluation harness against the live, deployed agent on a schedule — same golden set, same judge — with scores tracked over time and alarmed against a threshold. (Module 11)

Coordination costs — The extra LLM calls, latency, and failure modes every handoff adds — the counterweight to going multi-agent. (Module 7)

Data exfiltration — An attack where the agent is made to leak what it knows through an output channel, such as a fetched URL or a published report. (Module 9)

Data flywheel — The loop where production data is curated (Curator), used for fine-tuning (Customizer), validated (Evaluator), and redeployed so deployed models keep improving. (Module 12)

Deliberative system — builds an internal representation of the situation and plans several steps before acting. (Module 1)

Dense retrieval — Retrieval by embedding similarity — matches meaning, but can miss exact tokens like names and version strings. (Module 6)

Deterministic checks — Pure-code assertions on an agent’s output that cost zero tokens and run before any LLM-judge call. (Module 8)
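
To make the idea concrete, here is a sketch of a few pure-code checks on a report string. The specific checks (non-empty, length cap, bracketed numeric citations like [1]) and the name `deterministic_checks` are assumptions for illustration, not the module's actual check list.

```python
import re


def deterministic_checks(report, min_citations=1, max_chars=5000):
    """Return a list of failed check names; an empty list means every check passed.

    Hypothetical checks: no LLM is called, so running them costs zero tokens.
    """
    failures = []
    if not report.strip():
        failures.append("empty")
    if len(report) > max_chars:
        failures.append("too_long")
    # Assumes citations are written as bracketed numbers such as [1] or [12].
    if len(re.findall(r"\[\d+\]", report)) < min_citations:
        failures.append("missing_citations")
    return failures
```

A harness could run these first and only send outputs with no failures on to the more expensive LLM judge.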

Dialog rail — A guardrail that constrains the conversation’s path — which topics and flows are allowed. (Module 9)

Distractor — A plausible wrong answer option engineered to attract a specific misunderstanding. (Module 14)

Drift — Quality change without a deployment: the hosted model, the data sources, or third-party tools moved while your code did not. (Module 11)

Dynamic prompt chain — a chain of prompts where each one is constructed at runtime from the results of the previous step; the ReAct loop’s accumulating message list is the canonical example. (Module 2)

Edge — declares which node runs next in a graph. (Module 2)

Embedding — A vector representation of text in which semantic similarity becomes geometric proximity. (Module 6)
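
"Geometric proximity" is commonly measured with cosine similarity. The sketch below uses hand-made three-dimensional toy vectors, which are an assumption for illustration only; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for this example (not produced by any model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

near = cosine_similarity(cat, kitten)    # semantically close pair
far = cosine_similarity(cat, invoice)    # semantically distant pair
```

In this toy setup `near` comes out larger than `far`, which is the property a retrieval index relies on.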

Enterprise policy rails — Rails encoding what an organization refuses to do regardless of user requests — forbidden topics, regulated advice, off-limits targets. (Module 9)

Episodic memory — Long-term memory of experiences — the trace of what the agent did and what happened, consulted to adapt future behavior. (Module 5)

Escalation protocol — The defined response when a rail fires: block, reformulate, or escalate to a human reviewer. (Module 9)

ETL (extract, transform, load) — The data-engineering pipeline pattern that pulls raw data from a source, cleans and reshapes it, and loads it into a queryable store. (Module 6)

EU AI Act — Regulation (EU) 2024/1689, the EU’s AI regulatory framework — in force since 2024, mandating among other things human oversight and record-keeping for high-risk AI systems. (Module 9)

Execution rail — A guardrail that validates tool and action calls before and after they run. (Module 9)

Flag & return — Marking uncertain questions to revisit on a second pass instead of grinding on them. (Module 14)

Golden set — A versioned collection of representative questions with expected results, re-run after every change to detect improvement or regression. (Module 8)

Graceful failure — Ending an over-budget or faulted run cleanly with the work completed so far (a partial, still-cited report) instead of a hang or an empty crash. (Module 13)

Grounding — Ensuring every claim in a generated answer is supported by retrieved source material that can be checked. (Module 6)

Guardrails — Programmable controls that enforce constraints on an LLM application from outside the model, inspecting inputs, outputs, and intermediate steps rather than asking the model to police itself. (Module 9)

Handoff — The transfer of control plus context from one agent to another. (Module 7)

Hierarchical orchestration — Supervisors of supervisors — domain subtrees coordinated by a top-level supervisor, justified only at the scale of tens of specialists. (Module 7)

Hierarchy (multi-agent) — supervisors of supervisors: each subtree owns a domain; a top-level supervisor delegates between subtrees. Earns its keep only at scale. (Module 3)

High availability — Property that a service survives the failure of any single component, via redundancy (≥2 replicas), health checks, and routing around failures. (Module 10)

Human-in-the-loop (HITL) — An oversight pattern where a human decision is a blocking step inside the agent’s workflow: the run cannot proceed until a person approves, edits, or rejects. (Module 9)

Human-on-the-loop — An autonomy level where the agent acts immediately while a human monitors and can intervene after the fact. (Module 9)

Hybrid search — Running dense and lexical retrieval together and fusing their rankings (typically reciprocal rank fusion) for robustness on names, versions, and identifiers. (Module 6)
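
Reciprocal rank fusion can be sketched in a few lines: each document scores 1 / (k + rank) in every ranking it appears in, and the scores are summed. The constant k = 60 is the conventional default, and the document IDs and function name below are made up for this example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["doc_a", "doc_b", "doc_c"]      # hypothetical dense results
lexical_ranking = ["doc_c", "doc_a", "doc_d"]    # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_ranking, lexical_ranking])
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale.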

Hybrid system — deliberative planning layered over reactive execution — what most production agents end up being. (Module 1)

Idempotency — Property that performing the same operation twice has the same effect as once — enforced over HTTP with an Idempotency-Key header. (Module 10)
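
One way a server might honor an idempotency key is to remember the result of the first request per key and replay it on repeats. The in-memory `JobService` below is a hypothetical sketch; a real service would keep the mapping in its persistence layer.

```python
class JobService:
    """Hypothetical job-creation endpoint that deduplicates by idempotency key."""

    def __init__(self):
        self._jobs_by_key = {}   # idempotency key -> job created for it
        self._next_id = 1

    def create_job(self, idempotency_key, payload):
        # A repeated key returns the original job instead of creating a second one.
        if idempotency_key in self._jobs_by_key:
            return self._jobs_by_key[idempotency_key]
        job = {"id": self._next_id, "payload": payload}
        self._next_id += 1
        self._jobs_by_key[idempotency_key] = job
        return job


service = JobService()
first = service.create_job("key-123", {"question": "What is NIM?"})
retry = service.create_job("key-123", {"question": "What is NIM?"})
other = service.create_job("key-456", {"question": "What is NIM?"})
```

A client that times out and resends the same request with the same key therefore cannot start the job twice.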

In-flight batching — Triton serving technique where new requests join a running batch instead of waiting for it to finish, lifting GPU utilization and throughput. (Module 12)

Indirect prompt injection — Prompt injection hidden inside content the agent processes (a web page, document, or email), planted by someone who never touches the agent’s interface. (Module 9)

Input rail — A guardrail that screens user or content input before the LLM sees it. (Module 9)

Interrupt — LangGraph’s HITL primitive: a call inside a node that pauses the graph, surfaces a payload, and resumes with the value carried by Command(resume=…); requires a checkpointer. (Module 9)

Jitter — A random factor added to each backoff delay so concurrent clients don’t retry in lockstep and re-create the failure burst. (Module 13)

Job queue — A holding line for pending jobs that controls how many run concurrently, regardless of how many arrive. (Module 10)
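
The concurrency cap can be sketched with an asyncio semaphore. This is a simplified in-process stand-in for a real job queue; `run_jobs`, `max_concurrent`, and the returned `peak` counter are hypothetical names used only to show that the limit holds.

```python
import asyncio


async def run_jobs(jobs, max_concurrent=2):
    """Run coroutine factories with at most `max_concurrent` active at once.

    Returns (results in submission order, peak number of jobs running together).
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def worker(job):
        nonlocal running, peak
        async with semaphore:          # waits here while the cap is reached
            running += 1
            peak = max(peak, running)
            try:
                return await job()
            finally:
                running -= 1

    results = await asyncio.gather(*(worker(job) for job in jobs))
    return results, peak


async def sample_job():
    await asyncio.sleep(0.01)          # stands in for a slow agent run
    return "ok"


results, peak = asyncio.run(run_jobs([sample_job] * 5, max_concurrent=2))
```

Five jobs arrive at once, yet no more than two run at the same time; the rest wait their turn.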

Knowledge graph — a knowledge store of entities and typed relationships; enables multi-hop relational reasoning that similarity search can’t follow. (Module 3)

Latency — How long one request takes, tracked at percentiles like p50 and p95. (Module 10)

Layered safety frameworks — Defense in depth for agents: multiple stacked guardrails where each layer catches what the previous one missed. (Module 9)

Leniency drift — Judge scores inflating over time or after judge updates; mitigated by anchored rubrics and re-judging a fixed sample on every judge change. (Module 8)

Lexical search — Keyword-based ranking (BM25 being the standard algorithm) that matches exact terms rather than meaning. (Module 6)

LLM workflow — a chain or router where your code fixes the sequence of steps and LLM calls fill in the slots. (Module 1)

LLM-as-judge — Using an LLM (usually a stronger one) to score another model’s output against an explicit rubric. (Module 8)

Load balancing — Distributing incoming requests across replicas so no instance saturates while others idle. (Module 10)

Logic tree — a branching if/else decision structure written in code — what a router’s dispatch table is. (Module 3)

MCP (Model Context Protocol) — the open standard for connecting an agent to tools and data sources. (Module 1)

Multi-agent system — several specialized agents coordinating on one goal. (Module 1)

Multimodal model — A model that accepts more than text — images, audio, video — in its input messages and reasons over the modalities jointly. (Module 12)

NeMo Agent Toolkit (NAT) — NVIDIA’s framework-agnostic toolkit for profiling, evaluating, and connecting agent workflows — formerly AgentIQ / Agent Intelligence Toolkit (AIQ). (Module 12)

NeMo Curator — NVIDIA’s training-data curation library (deduplication, filtering, quality scoring at corpus scale) — pip-installable and partly CPU. (Module 12)

NeMo Customizer — NVIDIA microservice for managed model fine-tuning; requires Kubernetes and GPUs. (Module 12)

NeMo Evaluator — NVIDIA microservice for managed evaluation of models and pipelines; requires Kubernetes and GPUs. (Module 12)

NeMo Retriever — NVIDIA’s family of retrieval NIMs (embedding, reranking) that productizes the ingestion/retrieval pipeline shape. (Module 6)

Nemotron — NVIDIA’s open model family, tuned for agentic workloads. (Module 1)

NIM (NVIDIA Inference Microservices) — packaged model endpoints: the same container and API whether NVIDIA hosts the model or you self-host it. (Module 1)

Node — a function that takes the graph state and returns an update. (Module 2)

Observability — The property of a system whose internal behavior can be understood from the telemetry it emits — logs, traces, metrics — without shipping new code to ask new questions. (Module 11)

Orchestration — the concept of coordinating multi-agent workflows; not to be confused with the supervisor, which is the agent doing the coordinating. (Module 3)

Outcome evaluation — Judging the final artifact of an agent run: is the result correct, grounded, and cited? (Module 8)

Output rail — A guardrail that screens the model’s response before the user sees it (e.g. PII masking, toxicity self-check). (Module 9)

p95 latency — The latency below which 95% of runs finish — the dashboard percentile of choice because it reflects what the slowest-served users actually experience. (Module 11)
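
A small worked example, using the nearest-rank method (one of several percentile definitions) and invented latency values, shows why p95 exposes a slow tail that the median hides.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)   # 1-based rank
    return ordered[max(rank, 1) - 1]


# Invented run latencies in seconds; one run hit a slow tail.
latencies = [1.2, 0.9, 1.1, 1.0, 9.5, 1.3, 1.0, 1.1, 0.8, 1.2]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```

Here the median looks healthy while p95 lands on the 9.5-second outlier, which is the run a user would actually complain about.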

Parallelism — Running independent subtasks simultaneously across agents (LangGraph supports fan-out with the Send API). (Module 7)

Permanent failure — A fault no amount of waiting fixes (revoked key, 404, malformed request) — never retried; fail fast and surface the real bug. (Module 13)

PII leakage — Personally identifiable information scooped from sources or conversations and re-emitted in outputs and logs. (Module 9)

PII masking — Detecting and replacing personally identifiable information — applied at the input, the output, and in logs/audit trails. (Module 9)
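
As a deliberately narrow illustration, the sketch below masks only email addresses with a simple regular expression. The pattern and the `[EMAIL]` placeholder are assumptions for this example; production masking typically covers many more PII types and uses dedicated detectors.

```python
import re

# Simplified pattern: catches common addresses, not every valid email form.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_pii(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


masked = mask_pii("Contact jane.doe@example.com today")
```

The same function could be applied at each of the three points the definition names: incoming text, outgoing text, and log lines.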

Plan-and-execute — A planning pattern where the agent drafts a complete multi-step plan up front, then executes the steps, replanning only if something forces a change. (Module 4)

Pointwise vs. pairwise judging — Scoring one output against a rubric (pointwise) versus comparing two outputs and picking a winner (pairwise). (Module 8)

Position bias — A pairwise judge favoring an answer because of its presentation order; mitigated by judging both orders and only counting agreement. (Module 8)
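
The "judge both orders and only count agreement" mitigation might look like the sketch below. The `judge` callable is a hypothetical stand-in for an LLM judge and is assumed to return "first" or "second" for whichever presented answer it prefers.

```python
def debiased_pairwise(judge, answer_a, answer_b):
    """Ask the judge twice with the order swapped; count a win only if both verdicts agree."""
    a_wins_when_first = judge(answer_a, answer_b) == "first"
    a_wins_when_second = judge(answer_b, answer_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"   # verdict flipped with the order: treat as position bias, not a real preference


# Toy judges for illustration only.
def always_first(x, y):
    return "first"                      # pure position bias


def prefers_longer(x, y):
    return "first" if len(x) >= len(y) else "second"


biased_result = debiased_pairwise(always_first, "answer one", "answer two")
consistent_result = debiased_pairwise(prefers_longer, "long answer", "short")
```

A judge that always picks the first slot produces a tie under this scheme, so its bias cannot move the win counts.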

Procedural memory — Long-term memory of how to do things — for an LLM agent, the prompts and code themselves, versioned with the code. (Module 5)

Prompt chaining — the output of one LLM call becomes the input of the next; the order of steps is fixed in code. (Module 3)

Prompt injection — Content crafted to override an agent’s instructions and steer its behavior. (Module 9)

Qualifier — The modifier in a question stem — best, most cost-effective, first step — that selects among technically true options. (Module 14)

Quality gate — An automated set of checks (smoke tests plus eval regression) that must be green before a release is tagged. (Module 13)

Quantization — Representing weights and activations at lower numeric precision for smaller, faster inference at a controlled accuracy cost. (Module 12)

ReAct — an agent pattern that interleaves reasoning and action: think, act, observe, repeat until the goal is met. (Defined in Module 1; implemented in Module 2)

Reactive system — maps perception directly to action — no internal model, no planning. (Module 1)

Reasoning budget — The tokens, latency, and money you allow a system to spend thinking before it acts — at the model level (thinking tokens) and the system level (deliberation calls). (Module 4)

Reducer — tells LangGraph how to merge a node’s returned update into the shared state (e.g., append for messages). (Module 2)
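
The merge behavior can be imitated in plain Python. The sketch below is not LangGraph's implementation; `apply_update` and the `reducers` mapping are hypothetical. In LangGraph itself, a reducer is usually declared on the state type, for example with `Annotated[list, operator.add]`.

```python
import operator


def apply_update(state, update, reducers):
    """Merge a node's partial update into the state.

    Keys with a reducer are combined with the old value; all other keys are overwritten.
    """
    merged = dict(state)
    for key, value in update.items():
        reducer = reducers.get(key)
        if reducer is not None and key in state:
            merged[key] = reducer(state[key], value)
        else:
            merged[key] = value
    return merged


reducers = {"messages": operator.add}            # append-style merge for the message list
state = {"messages": ["hi"], "step": 1}
new_state = apply_update(state, {"messages": ["hello"], "step": 2}, reducers)
```

The message list grows while `step` is simply replaced, which is the difference a reducer makes.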

Reflection — A self-correction pattern where a model (or a separate critic role) reviews an output and produces a revision informed by the critique. (Module 4)

Remote proctoring — Live-monitored, recorded supervision of an online exam session via webcam, microphone, and screen. (Module 14)

Replicas — Multiple identical copies of a container scheduled across machines to add capacity and survive failures. (Module 10)

Reranker — A cross-encoder that reads each query-passage pair together and re-scores a small candidate list with higher precision than vector comparison. (Module 6)

Retrieval rail — A guardrail that screens retrieved chunks before they enter the prompt. (Module 9)

Retrieval-augmented generation (RAG) — Architecture that retrieves relevant external content at query time and injects it into the model’s context before generation. (Module 6)

Retry with exponential backoff — Retrying a failed call after a delay that doubles on each attempt (1s, 2s, 4s…), with a hard cap on total attempts. (Module 13)
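
A minimal sketch of the pattern, with a jitter factor folded in. The function name, the jitter range of 0.5 to 1.5, and the injectable `sleep` and `rng` parameters are assumptions for this example. For brevity it retries on any exception, whereas a real client would retry only transient failures.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep, rng=random.random):
    """Call fn until it succeeds or max_attempts is reached, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                  # hard cap reached: surface the error
            delay = base_delay * (2 ** attempt)        # 1s, 2s, 4s, ...
            delay *= 0.5 + rng()                       # jitter factor in [0.5, 1.5)
            sleep(delay)
```

Passing `sleep` and `rng` as parameters keeps the function testable without real waiting or real randomness.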

Review queue — An asynchronous worklist of pending approvals with context — the approval surface for high-volume or team settings. (Module 9)

Rolling updates — Replacing replicas one at a time during a deploy so the service never goes down. (Module 10)

Routing — a classifier (often a small, cheap model) dispatches the input to one of several specialized paths. (Module 3)

Runbook — The versioned document that turns an alert into action — symptom, checks in order, actions, escalation — written before the incident. (Module 11)

Self-preference bias — A judge favoring output from its own model or family; mitigated by judging with a different, stronger model. (Module 8)

Semantic memory — Long-term memory of facts, e.g. “this user wants concise English reports”. (Module 5)

Short-term / long-term memory — the working state of one session vs. what survives across sessions. (Module 1)

Span — One unit of work inside a trace — a node execution, LLM call, or tool call — with a start time, duration, and parent, so spans nest into a tree. (Module 11)

Specialization — Giving each agent one focused prompt for one job, instead of one prompt juggling every responsibility. (Module 7)

Stateful orchestration — multi-step coordination through a typed state object threaded through the steps. (Module 3)

StateGraph — a graph whose nodes read and update one shared, typed state object — LangGraph’s core primitive. (Module 2)

Statelessness — Property of a server that holds no run content, so any process reaching the persistence layer can serve any request about any job. (Module 10)

Store — A namespaced, cross-thread key-value memory ((namespace, key) → document) that nodes and application code read and write deliberately. (Module 5)

Streaming — Pushing incremental results to the client over a held-open channel (SSE or WebSocket) instead of waiting to be polled. (Module 10)

Structured user feedback — Feedback collected through a rubric with specific questions rather than a bare thumbs-up, so each answer converts into an evaluation case. (Module 8)

Summarization — Compressing older message history into a rolling summary that replaces the messages it covers. (Module 5)

Supervisor — a central agent that routes work to specialist agents and collects their results; every handoff passes through one auditable point. (Module 3)

Swarm — a multi-agent topology with no central coordinator: agents hand control directly to each other, peer to peer. (Module 3)

Task decomposition — Breaking a complex task into smaller, independently verifiable subtasks before solving any of them. (Module 4)

Task success rate — The fraction of tasks where the agent achieved the user’s actual goal, scored against the task rather than a reference string. (Module 8)

TensorRT-LLM — NVIDIA’s compiler/optimizer that builds a GPU-specific inference engine from model weights via quantization, KV-cache management, and fused kernels. (Module 12)

Thread — One persisted sequence of graph runs identified by a thread_id passed via config["configurable"]; the unit of “same conversation”. (Module 5)

Throughput — How many requests a system completes per unit of time — for an agent service, jobs per minute. (Module 10)

Time travel — Reading a thread’s checkpoint history and resuming from any past checkpoint, forking the run from that point — debugging and replay. (Module 5)

Tool calling — the structured mechanism by which a model requests an action from your code: it emits a JSON request naming a function and arguments; your code executes and returns the result. (Module 2)
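
The "your code executes" half of the exchange can be sketched as a small dispatcher. The request shape below (a JSON object with `name` and `arguments`) is a simplification assumed for this example, and `get_weather` is a made-up tool; real provider APIs wrap the request in their own envelope.

```python
import json


def get_weather(city):
    """Hypothetical tool; a real one would call a weather service."""
    return f"Sunny in {city}"


# Registry of the functions the model is allowed to request.
TOOLS = {"get_weather": get_weather}


def execute_tool_call(raw_request):
    """Parse the model's JSON request, run the named function, and return its result."""
    request = json.loads(raw_request)
    name = request["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**request["arguments"])


tool_result = execute_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
```

The model never runs anything itself: it only names a function and arguments, and the registry decides what is actually callable.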

Tool misuse — An agent’s own tools turned against the operator — e.g. a search tool probing internal hosts or a write-capable tool used destructively. (Module 9)

tool_choice — the API parameter governing whether the model may call a tool: "auto" (model decides — the default), "required", "none", or a specific function. (Module 2)

Trace — The complete record of one request’s path through a system — for an agent, one full run from question to final output. (Module 11)

Trajectory — The sequence of steps an agent took during a run: which nodes ran, which tools were called, in what order, at what cost. (Module 8)

Trajectory evaluation — Judging the path of an agent run — skipped stages, redundant tool calls, node order. (Module 8)

Transient failure — A fault that fixes itself if you wait — a 429 rate limit, a 5xx from an overloaded service, a network timeout — and is therefore worth retrying. (Module 13)
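
A retry policy needs a rule for telling the two failure kinds apart. The sketch below classifies by HTTP status; the exact status set and the `is_transient` name are assumptions for illustration, and some services document their own retryable codes.

```python
# Assumed set of retryable statuses: rate limiting plus common server-side errors.
TRANSIENT_STATUSES = {429, 500, 502, 503, 504}


def is_transient(status_code=None, timed_out=False):
    """Return True if waiting and retrying has a chance of succeeding."""
    if timed_out:
        return True                      # network timeouts are worth another attempt
    return status_code in TRANSIENT_STATUSES


should_retry_rate_limit = is_transient(429)
should_retry_not_found = is_transient(404)
```

Anything the function rejects, such as a 401 or a 404, is treated as a permanent failure and surfaced immediately rather than retried.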

Triton Inference Server — NVIDIA’s production model server: in-flight batching, multi-model hosting, and metrics endpoints. (Module 12)

Vector database — A store that indexes embeddings for fast approximate similarity search. (Module 6)

Vector memory — Storing memories as embeddings so the agent retrieves them by semantic similarity instead of by exact key. (Module 5)

Verbosity bias — A judge scoring longer answers higher at equal content; mitigated by density-rewarding rubrics or length normalization. (Module 8)