Agentic RAG: Build a Citation-Grounded Knowledge Pipeline with NVIDIA NIM (NCP-AAI Module 6)
This is Module 6 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.
Run Scout on Tuesday: “What did NVIDIA announce at GTC 2026?” It plans, searches, answers. Run it Wednesday on “Which companies are co-developing Nemotron 4?” — a question one inch to the left — and it starts from zero: same searches, same tokens, same dent in the free tier’s 40 requests per minute (as of June 2026).
Worse: everything Scout “read” on Tuesday was a search snippet — a few hundred characters per result. Module 2 stopped there deliberately; Scout has never opened a single page behind its results. Two problems wearing one trench coat: your agent never reads its sources, and nothing ties a claim to a source you could check.
By the end, Scout fetches the pages, keeps what it read in a persistent
knowledge base, decides when to consult it — and cites it with [n]
markers that resolve to URLs.
In this module
- You’ll learn:
- Build an ingestion pipeline — fetch → clean → chunk → embed → store — over the pages behind Scout’s search results: ETL for agents.
- Configure a local vector store (persistent Chroma): distance, top-k, metadata — what you tune when retrieval disappoints.
- Implement two-stage retrieval — dense top-k, then NIM reranking — and know when a reranker or hybrid search earns its cost.
- Design an agentic RAG: retrieval as a tool the agent chooses to call, not a fixed pipeline stage.
- Ground answers with [n] citations traceable to sources, and decide RAG vs. fine-tuning vs. long context for a given need.
- You’ll build: a fetch_page function and an ingestion pipeline that turn the pages behind Scout’s search results into a persistent Chroma knowledge base, plus a search_sources tool that returns reranked, citable chunks.
- Exam domains covered: D6, Knowledge Integration and Data Handling (10% of the exam).
- Prerequisites: Modules 1–5; NVIDIA_API_KEY and TAVILY_API_KEY configured; uv.
Where you are
- ✅ Modules 1–5: first NIM call, ReAct + web_search, architecture, the Planner, memory
- 👉 Module 6: Knowledge Integration, RAG (you are here)
- ⬜ Modules 7–14: multi-agent, evals, guardrails, deployment, the exam
Scout before: plans, searches the web (snippets only), remembers users across sessions. Scout after: reads full pages, owns a queryable knowledge base, answers with citations. The last brick in the single-agent story — in Module 7, Scout gets coworkers.
Why your agent needs a knowledge base (not just a search tool)
Retrieval-augmented generation (RAG) is the architecture that retrieves relevant external content at query time and injects it into the model’s context before generation. Three things live poorly inside model weights: knowledge fresher than the training cutoff, knowledge private to your company, and knowledge that must be attributable — answered with a source you can point at.
For an agent there’s a fourth reason: economics. Once Scout reads full pages, every page costs a fetch and an embedding bill — without a store, tomorrow’s neighboring question pays it all again; with one, retrieval costs one embedding call (plus a rerank request).
One distinction before we build, because the exam likes it: this is not what Module 5 built. The Module 5 store holds facts about the user — preferences, covered topics; memory. The RAG store indexes external content — the sources themselves; knowledge. Memory personalizes how Scout researches; knowledge grounds what it answers.
RAG is not the only way to make a model know things — the decision grid:
| | RAG | Fine-tuning | Long context |
|---|---|---|---|
| When to use | Fresh, private, or attributable knowledge | Behavior, style, format, domain vocabulary | Small, stable corpus, one task |
| Freshness | Re-ingest a page and it’s current | Frozen at training; refresh = retrain | Current, but re-pasted every call |
| Cost shape | Ingest once, retrieve cheap | High upfront, cheap per call | Full corpus in tokens, every call |
| Citability | Natural — you know which chunk answered | None — knowledge dissolves into weights | Weak — “somewhere in the prompt” |
| Example | “What changed in the EU AI Act this quarter?” | “Answer in our support-ticket format” | “Summarize this 200-page contract” |
Fine-tuning’s tooling on the NVIDIA stack (NeMo Customizer) is covered in Module 12; here it stays a row in the table. What matters for Domain 6 is the diagnosis: changing knowledge + traceability requirement → RAG. The distractors will be a fine-tune or a bigger context window.
The ingestion pipeline: ETL for agent knowledge
The study guide names the pattern directly (objective 6.3): build ETL — extract, transform, load — pipelines to integrate external data sources. Scout’s version:
- Extract: fetch_page(url) follows the URLs in web_search results and pulls the actual pages (new this module).
- Transform: clean the HTML, deduplicate, chunk the text.
- Load: embed each chunk and upsert it (insert-or-update: re-writing an existing id replaces it) into the vector store, with metadata.
The transform stage is where pipelines are won or lost — exactly what
objective 6.4 tests (data quality checks and preprocessing). A raw web
page is mostly not content: scripts, navigation, cookie banners, footers.
Embed all that and you’ll retrieve all that. Before any embedding is paid:
strip <script>/<style> blocks and page chrome, normalize whitespace,
and deduplicate with a hash of url + content — re-ingesting an unchanged
page costs zero API requests. What survives keeps its provenance: url,
title, fetched_at — the Source contract fields (Step 2) — ride along
as metadata on every chunk, because a chunk that can’t say where it came
from can’t be cited.
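A minimal sketch of the clean-then-hash idea, using only the standard library. The names clean_html and source_id_for match the lab, but the regexes, the 16-character truncation, and the whitespace handling here are assumptions made for illustration, not the repo's implementation.

```python
import hashlib
import html
import re

# Hypothetical patterns: drop whole blocks that are never content.
_DROP_BLOCKS = re.compile(
    r"<(script|style|nav|footer|header)\b.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
_TAGS = re.compile(r"<[^>]+>")


def clean_html(raw: str) -> str:
    """Strip non-content blocks and tags, decode entities, collapse whitespace."""
    text = _DROP_BLOCKS.sub(" ", raw)
    text = _TAGS.sub(" ", text)
    text = html.unescape(text)
    # Simplification: this flattens paragraph breaks too. A pipeline that
    # chunks on paragraphs would keep them.
    return " ".join(text.split())


def source_id_for(url: str, content: str) -> str:
    """Dedup key: same url + same cleaned content gives the same id."""
    digest = hashlib.sha256(f"{url}\n{content}".encode("utf-8"))
    return digest.hexdigest()[:16]


page = (
    "<html><head><title>T</title><script>var x = 1;</script></head>"
    "<body><nav>Home</nav><p>GPU &amp; prices</p></body></html>"
)
cleaned = clean_html(page)
```

Because the hash is computed on the cleaned text, a page whose only change is a rotated ad script or a new tracking snippet still deduplicates.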
Chunking — splitting documents into retrieval-sized pieces that are embedded and indexed individually — is the transform step with the most opinions and the fewest universal answers:
| Strategy | Principle | Strengths | Weaknesses | Use when |
|---|---|---|---|---|
| Fixed-size | Cut every N tokens/chars, fixed overlap | Trivial, predictable | Cuts mid-sentence, mid-thought | Uniform text, quick baselines |
| Recursive / structure-aware | Split on paragraphs, then lines, then sentences, until pieces fit | Respects structure; rarely cuts thoughts | More code; ragged sizes | Default for HTML/Markdown — what Scout uses |
| Semantic | Split where embedding similarity between sentences drops | Topically coherent chunks | Costs embeddings during chunking; fussier | High-value corpora that justify it |
Internalize the size/overlap trade-off — the exam tests it as a diagnosis. Small chunks → precise embeddings, but the retrieved fragment may lack the context to be usable. Large chunks → more context per hit, but the embedding averages over several topics and noise rides along. Overlap (~10–20%) keeps a thought alive across a cut. Scout chunks at ~500 tokens with 15% overlap (the lab assumes ~4 characters per token, so 2000 characters ≈ 500 tokens). The embedding NIM’s 8192-token input limit (model card, June 2026) is a different, far-off ceiling — we chunk at 500 for retrieval precision, not because the model forces us to.
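The sizing arithmetic above, worked out as a short sketch. It assumes a plain sliding window, so treat the chunk count as an estimate: the recursive splitter produces ragged sizes. The function name estimate_chunks is illustrative.

```python
import math

CHARS_PER_TOKEN = 4    # the lab's rough assumption
CHUNK_TOKENS = 500
OVERLAP_RATIO = 0.15

chunk_chars = CHUNK_TOKENS * CHARS_PER_TOKEN        # 2000 characters
overlap_chars = round(chunk_chars * OVERLAP_RATIO)  # 300 characters
stride = chunk_chars - overlap_chars                # each new chunk advances 1700


def estimate_chunks(doc_chars: int) -> int:
    """Sliding-window estimate of how many chunks a document produces."""
    if doc_chars <= chunk_chars:
        return 1
    return 1 + math.ceil((doc_chars - chunk_chars) / stride)
```

A 10,000-character page yields about 6 chunks under these settings, so overlap costs roughly one extra embedding per five chunks.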
Loading turns chunks into vectors. An embedding is a vector
representation of text in which semantic similarity becomes geometric
proximity — “GPU prices” and “graphics card costs” land close together,
closeness measured by cosine similarity (the angle between
vectors — magnitude-blind). Scout uses the hosted
llama-nemotron-embed-1b-v2 NIM (2048 dimensions, standard
/v1/embeddings route — the same openai client as Module 1). One trap,
straight from the model card: it is an asymmetric embedding model —
documents are embedded with input_type="passage", questions with
input_type="query"; the two modes project into the same space
differently, by design. Mix them up and nothing errors — retrieval just
quietly gets worse.
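To make "magnitude-blind" concrete, here is cosine similarity in plain Python on toy two-dimensional vectors (real embeddings have 2048 dimensions; the numbers below are invented for illustration).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: 1.0 same direction, 0.0 orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [10.0, 20.0])  # scaled copy
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Scaling a vector by ten leaves the score unchanged, which is why only direction matters to the metric.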
Vector stores and retrieval quality
A vector database is a store that indexes embeddings for fast similarity search. Comparing a query against every stored vector dies at scale, so vector databases build approximate nearest neighbor (ANN) indexes — graph structures like HNSW (Hierarchical Navigable Small World) that trade a sliver of exactness for sub-linear lookup.
This course uses Chroma — embedded, persistent, zero infrastructure: a directory on disk. The alternatives, one line each: Qdrant is the production-shaped version of the same idea (a real server, replication, quotas); FAISS is an indexing library, not a database — blazing, but persistence and metadata are your problem.
What you configure (objective 6.2):
- Distance metric: cosine for this embedding model (Chroma defaults to L2, straight-line Euclidean distance; we override it at collection creation).
- Top-k: how many candidates a query returns. It is a recall knob (how many of the truly relevant chunks make it into the candidate set), not a quality knob.
- Metadata filters: chunks carry url, title, fetched_at, source_id; filters scope retrieval (“only chunks fetched this week”) without touching the vectors.
- Persistence: PersistentClient writes to disk; the knowledge base survives restarts.
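What those knobs do, shown on a toy in-memory stand-in. This is deliberately not Chroma's API: it is a brute-force scan over three invented records, with a numeric fetched_at_ts metadata field assumed for the filter.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def query(records, query_vector, top_k, min_fetched_at_ts=None):
    """Filter on metadata first, then rank by distance and keep top_k."""
    candidates = [
        r for r in records
        if min_fetched_at_ts is None
        or r["metadata"]["fetched_at_ts"] >= min_fetched_at_ts
    ]
    candidates.sort(key=lambda r: cosine_distance(r["vector"], query_vector))
    return candidates[:top_k]


records = [
    {"id": "old-exact", "vector": [1.0, 0.0], "metadata": {"fetched_at_ts": 100}},
    {"id": "new-close", "vector": [0.9, 0.1], "metadata": {"fetched_at_ts": 900}},
    {"id": "new-far", "vector": [0.0, 1.0], "metadata": {"fetched_at_ts": 950}},
]
unfiltered = [r["id"] for r in query(records, [1.0, 0.0], top_k=2)]
recent_only = [r["id"] for r in query(records, [1.0, 0.0], top_k=2, min_fetched_at_ts=500)]
```

The filter removes the best vector match because it is too old, and the vectors themselves never change: scoping happens on the metadata alone.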
Dense retrieval (embedding similarity) understands meaning — and that
is its blind spot: it happily treats nemotron-3-nano and
nemotron-3-super as near-twins, catastrophic when the version string is
the question. Lexical search — keyword ranking, BM25 being the standard
algorithm — matches exact terms and never makes that mistake, but knows
nothing about synonyms. Hybrid search runs both and fuses the rankings
(typically reciprocal rank fusion), buying robustness on names, versions,
and identifiers. Know the concept and the failure mode it fixes; the lab
stays dense + rerank, hybrid is an exercise.
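Reciprocal rank fusion fits in a few lines. The sketch below scores each document as the sum of 1/(k + rank) over the rankings it appears in, with the commonly used k = 60; the two rankings are invented ids, and the BM25 side is left out.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings into one, best-first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["a", "b", "c"]    # ranked by embedding similarity
lexical = ["c", "a", "d"]  # ranked by keyword match
fused = reciprocal_rank_fusion([dense, lexical])
```

Documents that both retrievers rank well ("a" and "c") rise above documents only one of them found, and no score calibration between the two systems is needed because only ranks are used.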
The upgrade the lab does ship is two-stage retrieval. A reranker is a cross-encoder: instead of comparing two precomputed vectors, it reads the query and a candidate passage together and scores the pair — far more accurate, far too slow to run on a whole collection. So you compose: a fast dense search casts a wide net (top-10, built for recall), then the reranking NIM re-scores those ten pairs and keeps the best three (built for precision). The reranker earns its latency when wrong-but-similar chunks outrank right ones — in practice, most knowledge bases past a few dozen documents.
The whole machine, both paths:
flowchart TB
subgraph ing ["Ingestion path — after every web_search"]
direction LR
WS["web_search results"] --> FP["fetch_page (httpx)"]
FP --> CL["clean + dedup"]
CL --> CK["chunk ~500 tokens, 15% overlap"]
CK --> EP["embed input_type=passage"]
EP --> DB[("Chroma collection scout_sources")]
end
subgraph qry ["Query path — when the agent calls search_sources"]
direction LR
AG["agent"] --> SS["search_sources(query)"]
SS --> EQ["embed input_type=query"]
EQ --> TK["dense top-10"]
TK --> RR["rerank NIM (cross-encoder)"]
RR --> T3["top 3 + citation markers"]
T3 --> ANS["grounded, cited answer"]
end
DB -.-> TK
Two paths, one collection — passage on the way in, query on the way
out.
Agentic RAG: retrieval as a tool, not a pipeline
Classic RAG is a pipeline: every question gets embedded, retrieval always runs, the top-k always lands in the prompt. Fine for a documentation chatbot; Scout’s questions aren’t like that — some need the web, some its knowledge base, some no retrieval at all.
Agentic RAG makes retrieval a decision: the knowledge base is exposed
as a tool the agent chooses to call — or skip, or retry with a reformulated
query, or fall back from. In Scout that tool is search_sources, next to
web_search, and the system prompt states the policy: knowledge base
first — even when the research plan lists web queries; web for what’s new;
pages behind new results get ingested for next time.
| | Classic RAG pipeline | Agentic RAG |
|---|---|---|
| Who decides to retrieve | Your code — always runs | The agent — per question |
| Query reformulation | None — the user’s words are the query | The agent rewrites, splits, retries |
| “Nothing found” | Irrelevant top-k lands in the prompt anyway | The agent falls back (web search) or says so |
| Cost / latency | One retrieval, fixed, cheap | One extra LLM decision per hop |
| Sufficient when | Homogeneous questions, one corpus | Open-ended questions, several knowledge sources |
flowchart TD
Q["question"] --> D{"agent decides"}
D -->|"may overlap past research"| S["search_sources"]
D -->|"new topic / needs fresh data"| W["web_search + fetch + ingest"]
D -->|"no retrieval needed"| A["answer directly"]
S -->|"nothing relevant"| W
S --> ANS["cited answer"]
W --> ANS
The decision loop the system prompt encodes.
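To make the policy inspectable, here is the knowledge-base-first fallback written as plain code. In Scout the model makes these choices through tool calls; this sketch hard-codes one path of the loop, and the function name, the trace list, and the fake tools are all hypothetical.

```python
from typing import Callable


def answer_with_fallback(
    question: str,
    search_sources: Callable[[str], list[dict]],
    web_search: Callable[[str], None],
) -> tuple[list[dict], list[str]]:
    """Knowledge base first; web search plus ingestion only when it is empty."""
    trace = ["search_sources"]
    passages = search_sources(question)
    if not passages:
        web_search(question)  # assumed to fetch and ingest the pages it finds
        trace.append("web_search")
        passages = search_sources(question)
        trace.append("search_sources")
    return passages, trace


# Fake tools over a shared in-memory knowledge base.
knowledge_base: list[dict] = []


def fake_search_sources(query: str) -> list[dict]:
    return list(knowledge_base)


def fake_web_search(query: str) -> None:
    knowledge_base.append({"text": "ingested page", "url": "https://example.com"})


first_passages, first_trace = answer_with_fallback(
    "q", fake_search_sources, fake_web_search
)
second_passages, second_trace = answer_with_fallback(
    "q", fake_search_sources, fake_web_search
)
```

The first call pays for the web; the second is served from the store. What the real agent adds on top is judgment: rewriting the query, deciding a hit is not relevant enough, or skipping retrieval entirely.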
Objective 6.5 lives here: real-time access and reasoning over structured
and unstructured knowledge. The unstructured half is the
chunks; the structured half is the metadata riding on them — fetched_at
lets the agent (or a filter) judge whether its knowledge is fresh enough
and re-search the web when it isn’t: knowledge with a timestamp is
knowledge you can distrust on schedule. When knowledge is relational
rather than textual — “which services depend on X?” — the right structure
is a knowledge graph, placed in Module 3 and left unbuilt in this course.
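A sketch of "distrust on schedule": a staleness check over the ISO timestamp that fetch_page writes into fetched_at. The function name is_stale and the 7-day window are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    fetched_at: str,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """True when the source was fetched longer ago than max_age."""
    fetched = datetime.fromisoformat(fetched_at)  # e.g. 2026-06-01T12:00:00+00:00
    current = now or datetime.now(timezone.utc)
    return current - fetched > max_age


reference_now = datetime(2026, 6, 10, tzinfo=timezone.utc)
week = timedelta(days=7)
old_page = is_stale("2026-06-01T12:00:00+00:00", week, now=reference_now)
fresh_page = is_stale("2026-06-08T00:00:00+00:00", week, now=reference_now)
```

Passing now explicitly keeps the check testable; in a live run it defaults to the current UTC time.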
My opinion, stated as one: retrieval-as-tool wins the moment questions are open-ended, but you pay one extra LLM decision per hop — and on homogeneous questions the boring pipeline is faster, cheaper, and right. Match the structure to the questions, not to the demo.
Grounding and citations: answers you can verify
Grounding means every claim in a generated answer is supported by retrieved source material — the answer stands on chunks, not vibes.
Scout’s grounding contract is mechanical and course-wide: answers carry
Markdown markers [n], where n is the 1-based position of the source in
ScoutState.sources — the state field born in this module. The difference
four lines make:
Before: NVIDIA announced the Nemotron Coalition at GTC 2026, with
Mistral, LangChain, Cursor, and Perplexity among the members.
After: NVIDIA announced the Nemotron Coalition at GTC 2026 [1];
members include Mistral, LangChain, Cursor, and Perplexity [1][2].
References:
[1] NVIDIA Newsroom — https://nvidianews.nvidia.com/...
[2] ...
The first answer asks for trust. The second can be checked, claim by claim, against numbered sources.
In practice the contract is three pieces of plumbing, none clever: the
retrieval tool returns each chunk with its citation number and source
metadata; the system prompt requires the markers; post-processing resolves
the markers actually used into a References list mapped to URLs.
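The post-processing step can be sketched in a few lines: find the markers the answer actually used and map each to the 1-based position in the sources list. The function name, the output separator, and the sample data are illustrative, not the lab's code.

```python
import re

_MARKER = re.compile(r"\[(\d+)\]")


def resolve_references(answer: str, sources: list[dict]) -> list[str]:
    """Return one reference line per marker used, in numeric order."""
    used = sorted({int(n) for n in _MARKER.findall(answer)})
    lines = []
    for n in used:
        # Markers are 1-based; skip any number with no matching source.
        if 1 <= n <= len(sources):
            source = sources[n - 1]
            lines.append(f"[{n}] {source['title']} - {source['url']}")
    return lines


sources = [
    {"title": "One", "url": "https://example.com/1"},
    {"title": "Two", "url": "https://example.com/2"},
]
answer = "Claim A [1]; claim B [1][2]. Claim C [7]."
references = resolve_references(answer, sources)
```

The out-of-range [7] is dropped silently here; a stricter version might log it, since a marker with no source is exactly the kind of defect the contract is meant to expose.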
The honest limitation: the model can still cite the wrong source, or
decorate an unsupported sentence with a plausible [2]. The contract
makes claims checkable; measuring how often they check out is an
evaluation problem — Module 8’s job.
Hands-on lab: build it
Scout learns to read. The full code lives in
module-06/
of the labs repo; one new dependency (uv add "chromadb~=1.5");
NVIDIA_API_KEY and TAVILY_API_KEY in the repo-root .env as usual
(the retrieval NIMs bill to the same key).
Objective: fetch the pages behind Scout’s search results, ingest them
into a persistent Chroma knowledge base, and expose a search_sources
tool that returns reranked, citable passages.
Observable result: a first session fetches and ingests pages; a second
session on a neighboring question answers the core question from
search_sources — the pages it needs are read from Chroma, not re-fetched —
and both answers end with a References: list resolving their [n]
markers. uv run pytest module-06/tests/ is green.
Step 1 — Embeddings: one thin, asymmetric function
scout/embeddings.py (new). The embeddings NIM speaks the standard
/v1/embeddings route, so llm.get_client() already talks to it.
NVIDIA-specific parts travel in extra_body; input_type is required, not
defaulted — forgetting it is the silent killer from the theory section:
import time

from openai import RateLimitError

# config and llm are Scout's own modules (see the repo).


def embed_texts(texts: list[str], input_type: str) -> list[list[float]]:
    """Embed texts in batches; input_type must be 'passage' or 'query'."""
    if input_type not in ("passage", "query"):
        raise ValueError(f"input_type must be 'passage' or 'query', got {input_type!r}")
    client = llm.get_client()
    vectors: list[list[float]] = []
    for start in range(0, len(texts), config.EMBED_BATCH_SIZE):
        batch = texts[start : start + config.EMBED_BATCH_SIZE]
        for attempt in range(config.MAX_RETRIES + 1):
            try:
                response = client.embeddings.create(
                    model=config.EMBED_MODEL,
                    input=batch,
                    extra_body={"input_type": input_type, "truncate": "END"},
                )
                break
            except RateLimitError:
                if attempt == config.MAX_RETRIES:
                    raise
                time.sleep(2**attempt)  # NIM free tier: 40 req/min, so back off here
        vectors.extend(item.embedding for item in response.data)
    return vectors
Batching is the quiet win: the free tier limits requests, not tokens —
32 chunks per request is a 32× saving on your scarcest resource. The model
IDs join config.py (EMBED_MODEL, RERANK_MODEL, RERANK_URL) and
nowhere else; the smoke tests grep for leaks.
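The batching arithmetic, as a sketch. The batch size of 32 comes from the text above; the helper names are illustrative.

```python
import math


def requests_needed(n_chunks: int, batch_size: int = 32) -> int:
    """How many embedding requests a set of chunks costs."""
    return math.ceil(n_chunks / batch_size)


def batches(items: list, batch_size: int = 32) -> list[list]:
    """Split items into consecutive batches; the last one may be shorter."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


chunk_ids = list(range(100))
grouped = batches(chunk_ids)
```

One hundred chunks cost 4 requests instead of 100, which is what keeps a multi-page ingestion inside a per-minute request limit.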
Step 2 — Extract: fetch_page
scout/ingest.py (new). The function Module 2 refused to write — httpx
again:
import html
from datetime import datetime, timezone

import httpx

# _TITLE, clean_html, and NEUTRAL_RELIABILITY live in this file; Source comes from scout/state.py.


def fetch_page(url: str, timeout: float = 15.0) -> Source:
    """Extract: fetch one page and shape it into the frozen Source contract."""
    response = httpx.get(
        url,
        timeout=timeout,
        follow_redirects=True,
        headers={"User-Agent": "Scout/0.6 (NCP-AAI course lab)"},
    )
    response.raise_for_status()
    raw = response.text
    match = _TITLE.search(raw)
    # Titles end up in one-line reports and reference lists: flatten them.
    title = " ".join(html.unescape(match.group(1)).split()) if match else url
    return {
        "url": url,
        "title": title or url,
        "fetched_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "content": clean_html(raw),
        "reliability_score": NEUTRAL_RELIABILITY,
    }
_TITLE — a <title> regex — and clean_html (next step) live in
scout/ingest.py in the repo. The return shape is the course-wide
Source contract, added to scout/state.py this module:
{url, title, fetched_at, content, reliability_score}. Reliability stays
at a neutral 0.5 — the Reader starts computing it in Module 7.
Step 3 — Transform: clean, dedup, chunk
Three functions in scout/ingest.py, in pipeline order. clean_html()
drops <script>/<style> and page chrome, strips tags, normalizes
whitespace. source_id_for() hashes url + content — the dedup key.
chunk_text() is the recursive splitter from the theory section:
def chunk_text(
    text: str,
    size: int = config.CHUNK_CHARS,
    overlap: int = config.CHUNK_OVERLAP_CHARS,
) -> list[str]:
    """Pack recursively split pieces into chunks of roughly `size` characters."""
    chunks: list[str] = []
    current = ""
    for piece in _split(text, ["\n\n", "\n", ". "], size):
        if current and len(current) + len(piece) + 1 > size:
            chunks.append(current.strip())
            # The tail survives the cut. Guard overlap == 0: current[-0:]
            # would keep the whole chunk instead of nothing.
            current = current[-overlap:] if overlap > 0 else ""
        current = f"{current}\n{piece}" if current else piece
    if current.strip():
        chunks.append(current.strip())
    return chunks
_split (in the repo) recurses through the separators — paragraphs, then
lines, then sentences — and hard-cuts only when a blob has no boundaries.
CHUNK_CHARS = 2000 and CHUNK_OVERLAP_CHARS = 300 live in config.py.
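For readers without the repo open, a recursive splitter along those lines might look like the sketch below. It is an assumption about the shape of _split, not a copy of it, hence the different name; note that str.split drops the separators, which a production version would likely preserve.

```python
def split_recursive(text: str, separators: list[str], size: int) -> list[str]:
    """Split on the coarsest separator first; recurse only on pieces too big."""
    if len(text) <= size:
        return [text] if text.strip() else []
    if not separators:
        # No boundaries left: hard-cut the blob into size-long slices.
        return [text[i : i + size] for i in range(0, len(text), size)]
    head, *rest = separators
    pieces: list[str] = []
    for part in text.split(head):
        pieces.extend(split_recursive(part, rest, size))
    return pieces


sample = "aaaa\n\nbbbb bbbb. cccc cccc\n\n" + "x" * 25
pieces = split_recursive(sample, ["\n\n", "\n", ". "], size=10)
```

The short paragraph passes through untouched, the 20-character paragraph is cut at its sentence boundary, and only the boundary-free run of x characters gets a hard cut.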
Step 4 — Load: the persistent collection
Chroma 1.x, embedded and persistent — a gitignored directory inside the
module (module-06/scout_knowledge/). Cosine distance, because the
embedding model is trained for it;
Chroma defaults to L2:
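from pathlib import Path

import chromadb

# DB_DIR and COLLECTION_NAME are module-level constants in the repo.


def get_collection(db_dir: Path | None = None) -> "chromadb.Collection":
    """Open (or create) the persistent collection with cosine distance."""
    client = chromadb.PersistentClient(path=str(db_dir or DB_DIR))
    return client.get_or_create_collection(
        name=COLLECTION_NAME,
        configuration={"hnsw": {"space": "cosine"}},
    )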
ingest_source() chains the stages: dedup check first (an unchanged page
costs zero embedding requests — the smoke test asserts it), then chunk,
then embed_texts(chunks, input_type="passage"), then upsert with
metadata {url, title, fetched_at, source_id, chunk_index} — what makes a
chunk citable and filterable.
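The chain can be dry-run without Chroma or an API key. The sketch below swaps the collection for a plain dict and injects fake chunk and embed functions; the signature and the storage layout are assumptions for illustration, while the stage order and the metadata fields follow the description above.

```python
import hashlib
from typing import Callable


def ingest_source(
    source: dict,
    store: dict,
    chunk: Callable[[str], list[str]],
    embed: Callable[[list[str], str], list[list[float]]],
) -> int:
    """Return the number of chunks written (0 when the page is unchanged)."""
    key = f"{source['url']}\n{source['content']}".encode("utf-8")
    source_id = hashlib.sha256(key).hexdigest()[:16]
    # Dedup check first: an unchanged page must never reach the embedder.
    if any(rec["metadata"]["source_id"] == source_id for rec in store.values()):
        return 0
    chunks = chunk(source["content"])
    vectors = embed(chunks, "passage")  # documents are embedded as passages
    for index, (text, vector) in enumerate(zip(chunks, vectors)):
        # Upsert: writing the same id again replaces the record.
        store[f"{source_id}:{index}"] = {
            "text": text,
            "vector": vector,
            "metadata": {
                "url": source["url"],
                "title": source["title"],
                "fetched_at": source["fetched_at"],
                "source_id": source_id,
                "chunk_index": index,
            },
        }
    return len(chunks)


embed_calls: list[str] = []


def fake_embed(texts: list[str], input_type: str) -> list[list[float]]:
    embed_calls.append(input_type)
    return [[float(len(t))] for t in texts]


def fake_chunk(text: str) -> list[str]:
    return text.split("\n\n")


page = {
    "url": "https://example.com/a",
    "title": "A",
    "fetched_at": "2026-06-01T12:00:00+00:00",
    "content": "first\n\nsecond",
}
store: dict = {}
first_run = ingest_source(page, store, fake_chunk, fake_embed)
second_run = ingest_source(page, store, fake_chunk, fake_embed)
```

The second run returns 0 and never calls the embedder, which is the property the lab's smoke test is described as asserting.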
Step 5 — Two-stage retrieval
scout/retriever.py (new). Stage 1 for recall, stage 2 for precision:
def search(query: str, collection=None) -> list[dict]:
    collection = collection if collection is not None else ingest.get_collection()
    if collection.count() == 0:
        return []
    # Stage 1: dense retrieval, wide net for recall.
    vector = embeddings.embed_texts([query], input_type="query")[0]
    dense = collection.query(
        query_embeddings=[vector],
        n_results=min(config.TOP_K, collection.count()),
        include=["documents", "metadatas"],
    )
    documents: list[str] = dense["documents"][0]
    metadatas: list[dict] = dense["metadatas"][0]
    # Stage 2: rerank the candidates and keep the best few for precision.
    best = rerank(query, documents)[: config.RERANK_TOP_N]
    return [
        {
            "text": documents[i],
            "url": metadatas[i]["url"],
            "title": metadatas[i]["title"],
            "fetched_at": metadatas[i]["fetched_at"],
            "source_id": metadatas[i]["source_id"],
        }
        for i in best
    ]
rerank() is a plain httpx POST — the reranking NIM has no route in the
openai SDK, so it gets a dedicated REST endpoint, pinned in config.py.
The payload pairs the query with each passage; the response’s rankings
come back best-first and we keep the top 3.
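The two halves of that call, payload in and indices out, can be sketched without touching the network. The field names below (query.text, passages[].text, rankings[].index, logit) are an assumption based on the shape NVIDIA's hosted reranking endpoints commonly use; check them against the API reference for the model you call. The model id string and the sample response are placeholders.

```python
def build_rerank_payload(model: str, query: str, passages: list[str]) -> dict:
    """Pair one query with every candidate passage (assumed request shape)."""
    return {
        "model": model,
        "query": {"text": query},
        "passages": [{"text": passage} for passage in passages],
    }


def indices_from_response(body: dict, top_n: int) -> list[int]:
    """Rankings arrive best-first; return the original passage indices."""
    return [item["index"] for item in body["rankings"][:top_n]]


payload = build_rerank_payload(
    "placeholder-rerank-model",
    "who founded the coalition?",
    ["passage zero", "passage one", "passage two"],
)
# Hand-written stand-in for a response body, not real model output.
fake_body = {
    "rankings": [
        {"index": 2, "logit": 3.1},
        {"index": 0, "logit": 0.4},
        {"index": 1, "logit": -2.0},
    ]
}
best_two = indices_from_response(fake_body, top_n=2)
```

Returning indices rather than texts is what lets search() look up the matching metadata for each surviving passage.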
Step 6 — Retrieval becomes a tool
scout/tools.py (modified): search_sources joins web_search in TOOLS
and TOOL_SCHEMAS. The schema description does the steering — prompt
engineering, as Module 2 taught:
"description": (
"Search the sources Scout has already read — full pages fetched "
"and indexed during this and past sessions. Much faster and "
"cheaper than a new web search. Try it FIRST whenever the "
"question may overlap earlier research; fall back to web_search "
"if it returns nothing relevant. Returns the most relevant "
"passages with citation markers, source URL, and fetch date."
),
web_search’s description is rewritten too — it must now defer to the
knowledge base (“Use this when the knowledge base has nothing relevant, or
when the question needs information fresher than what Scout has already
read”): Module 2’s “any fact you are not certain about” would fight the
knowledge-base-first policy head-on. The same file gains INGEST_MARKER,
so describe_observation() can print the [tools] 5 results — 3 pages processed trace line.
Step 7 — Sources and citations in the state
ScoutState gains its module-6 field — sources, append-only like
messages — and tools_node (scout/nodes.py) becomes the bookkeeper:
after a web_search it fetches and ingests the pages behind the top
results, appends a numbered report to the observation, and compacts the
snippets of the pages it indexed (their full text now lives in Chroma);
search_sources hits get their passages prefixed with markers:
import json


def _number_chunks(observation: str, registry: _SourceRegistry) -> str:
    """Module 06: turn retriever hits into numbered, citable passages."""
    try:
        chunks = json.loads(observation)
    except json.JSONDecodeError:
        return observation  # the retrieval failed; surface the error text
    if not chunks:
        return (
            "No relevant passages in the knowledge base. "
            "Use web_search to research this on the web."
        )
    lines = []
    for chunk in chunks:
        source: Source = {
            "url": chunk["url"],
            "title": chunk["title"],
            "fetched_at": chunk["fetched_at"],
            "content": chunk["text"],
            "reliability_score": ingest.NEUTRAL_RELIABILITY,
        }
        n = registry.number(source)  # same URL, same number for the whole run
        lines.append(
            f"[{n}] {chunk['title']} — {chunk['url']} "
            f"(fetched {chunk['fetched_at']})\n{chunk['text']}"
        )
    return "\n\n".join(lines)
The _SourceRegistry (15 lines, in the repo) assigns each new URL the next
number and reuses numbers for known URLs — [n] always means the n-th
source of this run. The system prompt closes the loop — place the EXACT
bracketed markers shown in the tool results; never renumber them, and do
not write your own reference list. scout/run.py prints the real References: list by
resolving the markers actually used. Two budgets move in config.py
(MAX_TOKENS 8192 → 16384, MAX_ITERATIONS 6 → 8 — RAG-fed turns reason
longer; the knowledge-base probe costs a turn), but graph.py is
untouched — a Module 3 point proven: ingestion and citations are node and
tool plumbing, not topology.
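A registry with that behavior could be as small as the sketch below. It is a guess at the shape, not the repo's _SourceRegistry: the class name and attributes are hypothetical, and only the numbering rule (new URL gets the next number, known URL keeps its number) comes from the description above.

```python
class SourceRegistry:
    """Assign stable 1-based citation numbers to sources, keyed by URL."""

    def __init__(self) -> None:
        self.sources: list[dict] = []
        self._number_by_url: dict[str, int] = {}

    def number(self, source: dict) -> int:
        url = source["url"]
        if url not in self._number_by_url:
            self.sources.append(source)
            self._number_by_url[url] = len(self.sources)  # next free number
        return self._number_by_url[url]


registry = SourceRegistry()
n_first = registry.number({"url": "https://example.com/a", "title": "A"})
n_second = registry.number({"url": "https://example.com/b", "title": "B"})
n_repeat = registry.number({"url": "https://example.com/a", "title": "A again"})
```

Because position in sources and citation number stay in lockstep, resolving [n] later is a plain list lookup at index n - 1.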
Step 8 — Run it twice
cd module-06
uv run python -m scout.run "What is the Nemotron Coalition announced at GTC 2026?" --thread coalition
Real output, after the Planner’s two passes (your URLs and marker numbers will differ; answers trimmed to the load-bearing lines):
[agent] tool_call: search_sources({"query": "Nemotron Coalition"})
[tools] No relevant passages in the knowledge base. Use web_search to research this on the web.
[agent] tool_call: web_search({"query": "Nemotron Coalition GTC 2026 announcement transcript"})
[tools] 5 results — 3 pages processed
[agent] tool_call: search_sources({"query": "founding members of Nemotron Coalition"})
[tools] [2] NVIDIA GTC 2026 Recap: AI Factory, OpenClaw, and Nemotron | Nowel Pitt, MBA posted on the topic
[agent] tool_call: web_search({"query": "NVIDIA newsroom Nemotron Coalition"})
[tools] 5 results — 3 pages processed
The Nemotron Coalition was announced at GTC 2026 as a global collaboration …
to advance open frontier AI models through shared research, data, expertise
and compute [3]. Its founding members … are Black Forest Labs, Cursor,
LangChain, Mistral AI, Perplexity, Reflection AI, Sarvam and Thinking
Machines [3]. …
References:
[2] NVIDIA GTC 2026 Recap: AI Factory, OpenClaw, and Nemotron | … — https://www.linkedin.com/posts/nowelpitt_…
[3] NVIDIA Launches Nemotron Coalition of Leading Global AI Labs to Advance Open Frontier Models | NVIDIA Newsroom — http://nvidianews.nvidia.com/news/nvidia-launches-nemotron-coalition-of-leading-global-ai-labs-to-advance-open-frontier-models
[4] NVIDIA forms Nemotron coalition to advance open AI - Engineering.com — https://www.engineering.com/nvidia-forms-nemotron-coalition-to-advance-open-ai
Scout probed its empty knowledge base, fell back to the web, ingested what it found — then retrieved from the pages it had just indexed. Every decision in that sequence was the model’s. Now the question one inch to the left:
uv run python -m scout.run "Which companies joined the Nemotron Coalition?" --thread coalition-2
[agent] tool_call: search_sources({"query": "Nemotron Coalition GTC 2026 press release membership list"})
[tools] [1] NVIDIA Launches Nemotron Coalition of Leading Global AI Labs to Advance Open Frontier Models | N
[agent] tool_call: search_sources({"query": "site:nemotroncoalition.org member directory"})
[tools] [1] NVIDIA Launches Nemotron Coalition of Leading Global AI Labs to Advance Open Frontier Models | N
[agent] tool_call: web_search({"query": "Nemotron Coalition member roster GTC 2026 news"})
[tools] 5 results — 3 pages processed
The official GTC 2026 press release announces the founding members of the
Nemotron Coalition as Black Forest Labs, Cursor, LangChain, Mistral AI,
Perplexity, Reflection AI, Sarvam and Thinking Machines Lab [1]. … On
June 4 2026, H Company announced that it has joined the coalition [4]. …
References:
[1] NVIDIA Launches Nemotron Coalition of Leading Global AI Labs to Advance Open Frontier Models | NVIDIA Newsroom — http://nvidianews.nvidia.com/news/…
[2] NVIDIA forms Nemotron coalition to advance open AI - Engineering.com — https://www.engineering.com/…
[4] H Company, France computer-use AI champion, joins the NVIDIA Nemotron Coalition - H Company — https://hcompany.ai/h-joins-nemotron-coalition
The roster came straight from the knowledge base — the press release was read from Chroma, not re-fetched. The one web check the Planner (fed Module 5’s covered-topics memory) ordered is what found H Company’s June 4 announcement: each tool did what it is good at. Finally, the tests:
cd .. # back to the repo root
uv run pytest module-06/tests/ # offline: fixtures + fakes
SCOUT_LIVE_TESTS=1 uv run pytest module-06/tests/ # + NIM roundtrip + 1 real run
Try it yourself (no solution provided):
- Freshness filter: make search() ignore chunks older than 7 days. Chroma’s where filters compare numbers, not ISO strings, so add a numeric fetched_at_ts (epoch seconds) to chunk metadata at ingestion and filter {"fetched_at_ts": {"$gte": cutoff}} at query time.
- Hybrid search, the simple version: BM25 over the collection’s documents (pure Python, ~30 lines), fuse its top-10 with the dense top-10 by reciprocal rank fusion (score = Σ 1/(k + rank) across both rankings, k ≈ 60), then rerank. Test on a query full of exact identifiers, where dense-only retrieval misses.
Exam corner
What the exam tests here. Per the official study guide, Domain 6 (10%) expects you to: implement retrieval pipelines — RAG, embedded search, hybrid approaches (6.1); configure and optimize vector databases for fast retrieval (6.2); build ETL pipelines to integrate external data sources (6.3); conduct data quality checks, augmentation, and preprocessing (6.4); and enable real-time access and reasoning over structured and unstructured knowledge (6.5). Questions are scenarios: a symptom in, a pipeline decision out.
Quiz — answers after question 5.
1. A knowledge base is chunked at a fixed 100 characters, no overlap. Users complain that answers quote sentence fragments and miss context that sits right next to the quoted text in the source document. Best fix?
   - A) Increase top-k from 3 to 10 so more fragments arrive
   - B) Re-chunk: structure-aware splitting at a larger size with ~15% overlap, then re-embed the corpus
   - C) Switch to a larger LLM with a bigger context window
   - D) Add a reranker on top of the existing chunks
2. A firm’s policy documents change weekly, and compliance requires every answer to point at the exact passage it came from. Which approach fits?
   - A) Fine-tune the model on the policy corpus every quarter
   - B) Paste the full policy corpus into a 1M-token context per query
   - C) A RAG pipeline over the documents, re-ingesting on change, with citations to retrieved chunks
   - D) Train a custom model from scratch on company data
3. For most failing queries the correct passage is in the dense top-10, but answers keep relying on a similar-sounding wrong passage that ranks higher. Raising top-k to 50 made answers worse. The right move?
   - A) Raise top-k further; eventually the right passage will dominate
   - B) Add a cross-encoder reranking stage: re-score the top candidates against the query, keep the best few
   - C) Lower the chunk size so passages are more precise
   - D) Swap the LLM for one with a longer context window
4. A research assistant must answer only from sources fetched in the last 7 days, keep its index across restarts, and use its embedding model’s training metric. Which configuration delivers all three?
   - A) An in-memory collection, default metric, re-ingest everything daily
   - B) A persistent collection with cosine distance and a numeric fetched_at metadata filter at query time
   - C) A persistent collection with a higher top-k so recent chunks appear
   - D) Prepend the fetch date to each chunk’s text so the model can judge freshness
5. A team’s RAG answers keep quoting cookie banners and navigation menus, and the same article appears three times in retrieval results. Which pipeline stage is missing?
   - A) A reranking stage after dense retrieval
   - B) A bigger embedding model
   - C) Cleaning and deduplication in the transform stage, before embedding
   - D) Higher chunk overlap so menus get diluted
Answers. 1 — B. Fragments and lost neighboring context are a chunking diagnosis: too small, no overlap. A retrieves more fragments; D reranks fragments — neither restores context destroyed at ingestion. 2 — C. Weekly change kills A (stale between retrains — fine-tuning shapes behavior, not citable knowledge) and D (costlier, same staleness); B re-pays the corpus per query with no passage-level citation. Fresh + traceable is RAG’s home turf. 3 — B. The candidates are retrieved but mis-ranked — precisely the reranker’s job. A and D add noise; C re-chunks a corpus whose recall is already fine. 4 — B. Three requirements, three knobs: persistence, cosine, metadata filter. A re-pays ingestion daily; C hopes instead of filters; D pollutes embeddings with date strings. The filter is the structured (metadata) half of objective 6.5’s reasoning over structured + unstructured knowledge. 5 — C. Boilerplate in results and duplicates are ingestion defects: clean and dedup before embedding is paid. A reranker (A) ranks garbage more carefully; B embeds it more accurately; D spreads it around.
Traps to avoid:
- “Fine-tuning adds fresh knowledge.” It shapes behavior — style, format, vocabulary. Fresh, citable knowledge is RAG’s job; when a scenario pairs “changes often” with “must cite sources,” fine-tune options are distractors.
- Two token budgets, not one. The embedding model’s input limit (8192 tokens here) and the LLM’s context window (1M for Nemotron 3 Nano) are independent. Chunk size respects the embedder — chosen well below its ceiling, for precision; the context budget constrains generation.
- “Raise top-k to improve answers.” Top-k buys recall; past a point it pollutes the context. Precision comes from reranking and cleaner ingestion, not volume — if the right chunk is retrieved but ignored, more chunks make it worse.
- Boilerplate in retrieval results is an ingestion defect. Fix the ETL transform stage — clean and dedup before embedding — not the prompt or the reranker: they rank and phrase garbage, they don’t remove it.
- “Structured + unstructured” is one query, two halves. Objective 6.5’s real-time reasoning pairs unstructured chunks with the structured metadata riding on them — timestamps, sources, filters. An option that handles the text but ignores the metadata (or the reverse) answers half the question.
Key takeaways
- RAG earns its place with what weights can’t offer: fresh, private, citable knowledge — a 1M-token context window doesn’t change that.
- Memory (Module 5) stores facts about the user; the knowledge base stores external content. Different store, different role — the exam tests the distinction.
- Ingestion is ETL, and quality is decided in the transform: clean and dedup before embedding, or retrieve boilerplate forever.
- Chunking is a trade-off, not a dogma: small = precise but contextless, large = contextual but noisy; structure-aware with ~15% overlap is the sane default.
- Retrieve wide, rerank narrow (dense top-10 for recall, cross-encoder top-3 for precision), and embed asymmetrically: passage to index, query to search; mixing them fails silently.
- Agentic RAG means the agent decides when to retrieve, reformulate, or fall back, which is worth one extra LLM decision per hop on open-ended questions.
- Citations [n] are a contract, not decoration: each marker resolves mechanically to a source in state, which is what makes answers checkable at all.
Keep going
Scout has knowledge and memory. Next, it gets coworkers: a supervisor, a Searcher, a Reader, a Fact-checker, and a Writer.
References
- NCP-AAI certification page: the official blueprint; Knowledge Integration and Data Handling is 10%.
- llama-nemotron-embed-1b-v2, API reference: the embedding NIM (2048 dims, 8192-token input limit, input_type).
- llama-nemotron-rerank-1b-v2, API reference: the reranking NIM and its dedicated retrieval endpoint.
- Chroma docs, configuring collections: the 1.x configuration block, including HNSW distance settings.
- Lewis et al., 2020, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks: the paper that named RAG.
- Building RAG Agents with LLMs: the NVIDIA DLI course the study guide recommends for this domain.