Agentic RAG: Build a Citation-Grounded Knowledge Pipeline with NVIDIA NIM (NCP-AAI Module 6)

This is Module 6 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.

Run Scout on Tuesday: “What did NVIDIA announce at GTC 2026?” It plans, searches, answers. Run it Wednesday on “Which companies are co-developing Nemotron 4?” — a question one inch to the left — and it starts from zero: same searches, same tokens, same dent in the free tier’s 40 requests per minute (as of June 2026).

Worse: everything Scout “read” on Tuesday was a search snippet — a few hundred characters per result. Module 2 stopped there deliberately; Scout has never opened a single page behind its results. Two problems wearing one trench coat: your agent never reads its sources, and nothing ties a claim to a source you could check.

By the end, Scout fetches the pages, keeps what it read in a persistent knowledge base, decides when to consult it — and cites it with [n] markers that resolve to URLs.

In this module

You’ll learn:
- Build an ingestion pipeline — fetch → clean → chunk → embed → store — over the pages behind Scout’s search results: ETL for agents.
- Configure a local vector store (persistent Chroma): distance, top-k, metadata — what you tune when retrieval disappoints.
- Implement two-stage retrieval — dense top-k, then NIM reranking — and know when a reranker or hybrid search earns its cost.
- Design an agentic RAG: retrieval as a tool the agent chooses to call, not a fixed pipeline stage.
- Ground answers with [n] citations traceable to sources — and decide RAG vs. fine-tuning vs. long context for a given need.
You’ll build: a fetch_page function and an ingestion pipeline that turn the pages behind Scout’s search results into a persistent Chroma knowledge base, plus a search_sources tool that returns reranked, citable chunks.
Exam domains covered: D6 — Knowledge Integration and Data Handling — 10% of the exam.
Prerequisites: Modules 1–5; NVIDIA_API_KEY and TAVILY_API_KEY configured; uv.

Where you are

✅ Modules 1–5 — first NIM call, ReAct + web_search, architecture, the Planner, memory
👉 Module 6 — Knowledge Integration: RAG (you are here)
⬜ Modules 7–14 — multi-agent, evals, guardrails, deployment, the exam

Scout before: plans, searches the web (snippets only), remembers users across sessions. Scout after: reads full pages, owns a queryable knowledge base, answers with citations. The last brick in the single-agent story — in Module 7, Scout gets coworkers.

Why your agent needs a knowledge base (not just a search tool)

Retrieval-augmented generation (RAG) is the architecture that retrieves relevant external content at query time and injects it into the model’s context before generation. Three things live poorly inside model weights: knowledge fresher than the training cutoff, knowledge private to your company, and knowledge that must be attributable — answered with a source you can point at.

For an agent there’s a fourth reason: economics. Once Scout reads full pages, every page costs a fetch and an embedding bill — without a store, tomorrow’s neighboring question pays it all again; with one, retrieval costs one embedding call (plus a rerank request).

One distinction before we build, because the exam likes it: this is not what Module 5 built. The Module 5 store holds facts about the user — preferences, covered topics; memory. The RAG store indexes external content — the sources themselves; knowledge. Memory personalizes how Scout researches; knowledge grounds what it answers.

RAG is not the only way to make a model know things — the decision grid:

	RAG	Fine-tuning	Long context
When to use	Fresh, private, or attributable knowledge	Behavior, style, format, domain vocabulary	Small, stable corpus, one task
Freshness	Re-ingest a page and it’s current	Frozen at training; refresh = retrain	Current, but re-pasted every call
Cost shape	Ingest once, retrieve cheap	High upfront, cheap per call	Full corpus in tokens, every call
Citability	Natural — you know which chunk answered	None — knowledge dissolves into weights	Weak — “somewhere in the prompt”
Example	“What changed in the EU AI Act this quarter?”	“Answer in our support-ticket format”	“Summarize this 200-page contract”

Fine-tuning’s tooling on the NVIDIA stack (NeMo Customizer) is covered in Module 12; here it stays a row in the table. What matters for Domain 6 is the diagnosis: changing knowledge + traceability requirement → RAG. The distractors will be a fine-tune or a bigger context window.

The ingestion pipeline: ETL for agent knowledge

The study guide names the pattern directly (objective 6.3): build ETL — extract, transform, load — pipelines to integrate external data sources. Scout’s version:

Extract — fetch_page(url): follow the URLs in web_search results and pull the actual pages (new this module).
Transform — clean the HTML, deduplicate, chunk the text.
Load — embed each chunk and upsert it (insert-or-update: re-writing an existing id replaces it) into the vector store, with metadata.

The transform stage is where pipelines are won or lost — exactly what objective 6.4 tests (data quality checks and preprocessing). A raw web page is mostly not content: scripts, navigation, cookie banners, footers. Embed all that and you’ll retrieve all that. Before any embedding is paid: strip <script>/<style> blocks and page chrome, normalize whitespace, and deduplicate with a hash of url + content — re-ingesting an unchanged page costs zero API requests. What survives keeps its provenance: url, title, fetched_at — the Source contract fields (Step 2) — ride along as metadata on every chunk, because a chunk that can’t say where it came from can’t be cited.
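A minimal sketch of the cleaning pass described above, using only the standard library. The name clean_html_sketch and the list of tags treated as page chrome are assumptions for illustration; the repo's clean_html may handle more cases, and a regex pass like this is a baseline, not a full HTML parser.

```python
import html
import re

# Hypothetical choice of "page chrome" tags; extend it for your own sources.
_DROP_BLOCKS = re.compile(
    r"<(script|style|nav|header|footer)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
_TAGS = re.compile(r"<[^>]+>")


def clean_html_sketch(raw: str) -> str:
    """Drop script/style/chrome blocks, strip remaining tags, normalize whitespace."""
    text = _DROP_BLOCKS.sub(" ", raw)   # remove whole blocks, including their text
    text = _TAGS.sub(" ", text)         # remove the remaining tags, keep their text
    text = html.unescape(text)          # &amp; -> &, &lt; -> <, ...
    # Simplification: this flattens all whitespace, paragraph breaks included.
    return " ".join(text.split())
```

This version collapses every run of whitespace to one space. A cleaner that feeds a paragraph-aware chunker would probably keep blank lines between block elements instead.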

Chunking — splitting documents into retrieval-sized pieces that are embedded and indexed individually — is the transform step with the most opinions and the fewest universal answers:

Strategy	Principle	Strengths	Weaknesses	Use when
Fixed-size	Cut every N tokens/chars, fixed overlap	Trivial, predictable	Cuts mid-sentence, mid-thought	Uniform text, quick baselines
Recursive / structure-aware	Split on paragraphs, then lines, then sentences, until pieces fit	Respects structure; rarely cuts thoughts	More code; ragged sizes	Default for HTML/Markdown — what Scout uses
Semantic	Split where embedding similarity between sentences drops	Topically coherent chunks	Costs embeddings during chunking; fussier	High-value corpora that justify it

Internalize the size/overlap trade-off — the exam tests it as a diagnosis. Small chunks → precise embeddings, but the retrieved fragment may lack the context to be usable. Large chunks → more context per hit, but the embedding averages over several topics and noise rides along. Overlap (~10–20%) keeps a thought alive across a cut. Scout chunks at ~500 tokens with 15% overlap (the lab assumes ~4 characters per token, so 2000 characters ≈ 500 tokens). The embedding NIM’s 8192-token input limit (model card, June 2026) is a different, far-off ceiling — we chunk at 500 for retrieval precision, not because the model forces us to.
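The size and overlap arithmetic can be checked with a small sketch. It uses the fixed-size strategy from the table as a baseline (not Scout's recursive splitter), and the function name fixed_size_chunks is hypothetical; the 500-token, 4-characters-per-token, and 15% figures come from the paragraph above.

```python
CHARS_PER_TOKEN = 4    # the lab's rough estimate
CHUNK_TOKENS = 500
OVERLAP_PERCENT = 15

CHUNK_CHARS = CHUNK_TOKENS * CHARS_PER_TOKEN           # 2000 characters
OVERLAP_CHARS = CHUNK_CHARS * OVERLAP_PERCENT // 100   # 300 characters


def fixed_size_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Cut every `size` characters; each chunk repeats the last `overlap` of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + size])
        if start + size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks
```

With these settings each new chunk starts 1700 characters after the previous one, so a 5000-character page yields three chunks starting at offsets 0, 1700, and 3400.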

Loading turns chunks into vectors. An embedding is a vector representation of text in which semantic similarity becomes geometric proximity — “GPU prices” and “graphics card costs” land close together, closeness measured by cosine similarity (the angle between vectors — magnitude-blind). Scout uses the hosted llama-nemotron-embed-1b-v2 NIM (2048 dimensions, standard /v1/embeddings route — the same openai client as Module 1). One trap, straight from the model card: it is an asymmetric embedding model — documents are embedded with input_type="passage", questions with input_type="query"; the two modes project into the same space differently, by design. Mix them up and nothing errors — retrieval just quietly gets worse.
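To make "magnitude-blind" concrete, here is a small cosine-similarity sketch in pure Python. The vectors are toy values chosen for illustration, not real embeddings from the NIM.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])  # scaled copy
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                   # unrelated
```

Scaling a vector by ten leaves its direction unchanged, so same_direction is 1.0 up to floating-point error, while orthogonal is 0.0.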

Vector stores and retrieval quality

A vector database is a store that indexes embeddings for fast similarity search. Comparing a query against every stored vector dies at scale, so vector databases build approximate nearest neighbor (ANN) indexes — graph structures like HNSW (Hierarchical Navigable Small World) that trade a sliver of exactness for sub-linear lookup.

This course uses Chroma — embedded, persistent, zero infrastructure: a directory on disk. The alternatives, one line each: Qdrant is the production-shaped version of the same idea (a real server, replication, quotas); FAISS is an indexing library, not a database — blazing, but persistence and metadata are your problem.

What you configure (objective 6.2):

Distance metric — cosine for this embedding model (Chroma defaults to L2, straight-line Euclidean distance; we override it at collection creation).
Top-k — how many candidates a query returns: a recall knob (how many of the truly relevant chunks make it into the candidate set), not a quality knob.
Metadata filters — chunks carry url, title, fetched_at, source_id; filters scope retrieval (“only chunks fetched this week”) without touching the vectors.
Persistence — PersistentClient writes to disk; the knowledge base survives restarts.

Dense retrieval (embedding similarity) understands meaning — and that is its blind spot: it happily treats nemotron-3-nano and nemotron-3-super as near-twins, catastrophic when the version string is the question. Lexical search — keyword ranking, BM25 being the standard algorithm — matches exact terms and never makes that mistake, but knows nothing about synonyms. Hybrid search runs both and fuses the rankings (typically reciprocal rank fusion), buying robustness on names, versions, and identifiers. Know the concept and the failure mode it fixes; the lab stays dense + rerank, hybrid is an exercise.
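Reciprocal rank fusion itself is only a few lines. The sketch below assumes each retriever returns a best-first list of chunk ids; the function name, the ids, and the default k = 60 (a commonly used constant) are illustrative choices, and this is not code from the lab.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings: score(id) = sum of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["a", "b", "c"]     # hypothetical dense top-3
lexical_ranking = ["c", "a", "d"]   # hypothetical BM25 top-3
fused = reciprocal_rank_fusion([dense_ranking, lexical_ranking])
# fused == ["a", "c", "b", "d"]
```

Chunks that appear in both lists ("a" and "c") rise above chunks that only one retriever found, which is the robustness hybrid search is buying.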

The upgrade the lab does ship is two-stage retrieval. A reranker is a cross-encoder: instead of comparing two precomputed vectors, it reads the query and a candidate passage together and scores the pair — far more accurate, far too slow to run on a whole collection. So you compose: a fast dense search casts a wide net (top-10, built for recall), then the reranking NIM re-scores those ten pairs and keeps the best three (built for precision). The reranker earns its latency when wrong-but-similar chunks outrank right ones — in practice, most knowledge bases past a few dozen documents.

The whole machine, both paths:

flowchart TB
    subgraph ing ["Ingestion path — after every web_search"]
        direction LR
        WS["web_search results"] --> FP["fetch_page (httpx)"]
        FP --> CL["clean + dedup"]
        CL --> CK["chunk ~500 tokens, 15% overlap"]
        CK --> EP["embed input_type=passage"]
        EP --> DB[("Chroma collection scout_sources")]
    end
    subgraph qry ["Query path — when the agent calls search_sources"]
        direction LR
        AG["agent"] --> SS["search_sources(query)"]
        SS --> EQ["embed input_type=query"]
        EQ --> TK["dense top-10"]
        TK --> RR["rerank NIM (cross-encoder)"]
        RR --> T3["top 3 + citation markers"]
        T3 --> ANS["grounded, cited answer"]
    end
    DB -.-> TK

Two paths, one collection — passage on the way in, query on the way out.

Agentic RAG: retrieval as a tool, not a pipeline

Classic RAG is a pipeline: every question gets embedded, retrieval always runs, the top-k always lands in the prompt. Fine for a documentation chatbot; Scout’s questions aren’t like that — some need the web, some its knowledge base, some no retrieval at all.

Agentic RAG makes retrieval a decision: the knowledge base is exposed as a tool the agent chooses to call — or skip, or retry with a reformulated query, or fall back from. In Scout that tool is search_sources, next to web_search, and the system prompt states the policy: knowledge base first — even when the research plan lists web queries; web for what’s new; pages behind new results get ingested for next time.

	Classic RAG pipeline	Agentic RAG
Who decides to retrieve	Your code — always runs	The agent — per question
Query reformulation	None — the user’s words are the query	The agent rewrites, splits, retries
“Nothing found”	Irrelevant top-k lands in the prompt anyway	The agent falls back (web search) or says so
Cost / latency	One retrieval, fixed, cheap	One extra LLM decision per hop
Sufficient when	Homogeneous questions, one corpus	Open-ended questions, several knowledge sources

flowchart TD
    Q["question"] --> D{"agent decides"}
    D -->|"may overlap past research"| S["search_sources"]
    D -->|"new topic / needs fresh data"| W["web_search + fetch + ingest"]
    D -->|"no retrieval needed"| A["answer directly"]
    S -->|"nothing relevant"| W
    S --> ANS["cited answer"]
    W --> ANS

The decision loop the system prompt encodes.

Objective 6.5 lives here: real-time access and reasoning over structured and unstructured knowledge. The unstructured half is the chunks; the structured half is the metadata riding on them — fetched_at lets the agent (or a filter) judge whether its knowledge is fresh enough and re-search the web when it isn’t: knowledge with a timestamp is knowledge you can distrust on schedule. When knowledge is relational rather than textual — “which services depend on X?” — the right structure is a knowledge graph, placed in Module 3 and left unbuilt in this course.

My opinion, stated as one: retrieval-as-tool wins the moment questions are open-ended, but you pay one extra LLM decision per hop — and on homogeneous questions the boring pipeline is faster, cheaper, and right. Match the structure to the questions, not to the demo.

Grounding and citations: answers you can verify

Grounding means every claim in a generated answer is supported by retrieved source material — the answer stands on chunks, not vibes.

Scout’s grounding contract is mechanical and course-wide: answers carry Markdown markers [n], where n is the 1-based position of the source in ScoutState.sources — the state field born in this module. The difference four lines make:

Before: NVIDIA announced the Nemotron Coalition at GTC 2026, with
        Mistral, LangChain, Cursor, and Perplexity among the members.

After:  NVIDIA announced the Nemotron Coalition at GTC 2026 [1];
        members include Mistral, LangChain, Cursor, and Perplexity [1][2].
        References:
          [1] NVIDIA Newsroom — https://nvidianews.nvidia.com/...
          [2] ...

The first answer asks for trust. The second can be checked, claim by claim, against numbered sources.

In practice the contract is three pieces of plumbing, none clever: the retrieval tool returns each chunk with its citation number and source metadata; the system prompt requires the markers; post-processing resolves the markers actually used into a References list mapped to URLs. The honest limitation: the model can still cite the wrong source, or decorate an unsupported sentence with a plausible [2]. The contract makes claims checkable; measuring how often they check out is an evaluation problem — Module 8’s job.
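The third piece of plumbing, resolving the markers actually used, might look like the sketch below. The function name resolve_references, the output format, and the handling of out-of-range markers are assumptions for illustration; the course's scout/run.py may do it differently.

```python
import re

_MARKER = re.compile(r"\[(\d+)\]")


def resolve_references(answer: str, sources: list[dict]) -> list[str]:
    """Build one reference line per [n] marker used in the answer, in numeric order.

    `sources` is the 1-based source list: marker [n] points at sources[n - 1].
    """
    used = sorted({int(number) for number in _MARKER.findall(answer)})
    lines: list[str] = []
    for n in used:
        if 1 <= n <= len(sources):
            source = sources[n - 1]
            lines.append(f"[{n}] {source['title']} - {source['url']}")
        else:
            # A marker with no matching source is surfaced, not silently dropped.
            lines.append(f"[{n}] (no such source)")
    return lines
```

Flagging a marker that points at no source is a cheap check, but it only catches invented numbers. A plausible marker on an unsupported sentence still passes, which is why the measurement question remains.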

Hands-on lab: build it

Scout learns to read. The full code lives in module-06/ of the labs repo; one new dependency (uv add "chromadb~=1.5"); NVIDIA_API_KEY and TAVILY_API_KEY in the repo-root .env as usual (the retrieval NIMs bill to the same key).

Objective: fetch the pages behind Scout’s search results, ingest them into a persistent Chroma knowledge base, and expose a search_sources tool that returns reranked, citable passages.

Observable result: a first session fetches and ingests pages; a second session on a neighboring question answers the core question from search_sources — the pages it needs are read from Chroma, not re-fetched — and both answers end with a References: list resolving their [n] markers. uv run pytest module-06/tests/ is green.

Step 1 — Embeddings: one thin, asymmetric function

scout/embeddings.py (new). The embeddings NIM speaks the standard /v1/embeddings route, so llm.get_client() already talks to it. NVIDIA-specific parts travel in extra_body; input_type is required, not defaulted — forgetting it is the silent killer from the theory section:

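import time

from openai import RateLimitError

from . import config, llm  # assumption: sibling modules scout/config.py and scout/llm.py


def embed_texts(texts: list[str], input_type: str) -> list[list[float]]: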
    if input_type not in ("passage", "query"):
        raise ValueError(f"input_type must be 'passage' or 'query', got {input_type!r}")
    client = llm.get_client()
    vectors: list[list[float]] = []
    for start in range(0, len(texts), config.EMBED_BATCH_SIZE):
        batch = texts[start : start + config.EMBED_BATCH_SIZE]
        for attempt in range(config.MAX_RETRIES + 1):
            try:
                response = client.embeddings.create(
                    model=config.EMBED_MODEL,
                    input=batch,
                    extra_body={"input_type": input_type, "truncate": "END"},
                )
                break
            except RateLimitError:
                if attempt == config.MAX_RETRIES:
                    raise
                time.sleep(2**attempt)  # NIM free tier: 40 req/min — back off here
        vectors.extend(item.embedding for item in response.data)
    return vectors

Batching is the quiet win: the free tier limits requests, not tokens — 32 chunks per request is a 32× saving on your scarcest resource. The model IDs join config.py (EMBED_MODEL, RERANK_MODEL, RERANK_URL) and nowhere else; the smoke tests grep for leaks.

Step 2 — Extract: `fetch_page`

scout/ingest.py (new). The function Module 2 refused to write — httpx again:

import html
from datetime import datetime, timezone

import httpx


def fetch_page(url: str, timeout: float = 15.0) -> Source:
    """Extract: fetch one page and shape it into the frozen Source contract."""
    response = httpx.get(
        url,
        timeout=timeout,
        follow_redirects=True,
        headers={"User-Agent": "Scout/0.6 (NCP-AAI course lab)"},
    )
    response.raise_for_status()
    raw = response.text
    match = _TITLE.search(raw)
    # Titles end up in one-line reports and reference lists: flatten them.
    title = " ".join(html.unescape(match.group(1)).split()) if match else url
    return {
        "url": url,
        "title": title or url,
        "fetched_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "content": clean_html(raw),
        "reliability_score": NEUTRAL_RELIABILITY,
    }

_TITLE — a <title> regex — and clean_html (next step) live in scout/ingest.py in the repo. The return shape is the course-wide Source contract, added to scout/state.py this module: {url, title, fetched_at, content, reliability_score}. Reliability stays at a neutral 0.5 — the Reader starts computing it in Module 7.

Step 3 — Transform: clean, dedup, chunk

Three functions in scout/ingest.py, in pipeline order. clean_html() drops <script>/<style> and page chrome, strips tags, normalizes whitespace. source_id_for() hashes url + content — the dedup key. chunk_text() is the recursive splitter from the theory section:

def chunk_text(
    text: str,
    size: int = config.CHUNK_CHARS,
    overlap: int = config.CHUNK_OVERLAP_CHARS,
) -> list[str]:
    chunks: list[str] = []
    current = ""
    for piece in _split(text, ["\n\n", "\n", ". "], size):
        if current and len(current) + len(piece) + 1 > size:
            chunks.append(current.strip())
            # The tail survives the cut. Guard overlap == 0: a bare [-0:] keeps the whole chunk.
            current = current[-overlap:] if overlap > 0 else ""
        current = f"{current}\n{piece}" if current else piece
    if current.strip():
        chunks.append(current.strip())
    return chunks

_split (in the repo) recurses through the separators — paragraphs, then lines, then sentences — and hard-cuts only when a blob has no boundaries. CHUNK_CHARS = 2000 and CHUNK_OVERLAP_CHARS = 300 live in config.py.
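For readers without the repo open, a recursive splitter along those lines could be sketched as follows. This is a hypothetical reconstruction from the description above, named split_recursive to avoid confusion with the repo's _split, which may differ in details such as whether separators are kept.

```python
def split_recursive(text: str, separators: list[str], size: int) -> list[str]:
    """Return pieces of at most `size` characters, trying coarse separators first."""
    if len(text) <= size:
        return [text] if text.strip() else []  # drop whitespace-only pieces
    if not separators:
        # No boundary left to try: hard-cut the blob into size-wide slices.
        return [text[i : i + size] for i in range(0, len(text), size)]
    head, *rest = separators
    pieces: list[str] = []
    for part in text.split(head):
        # Parts that are still too long fall through to the next, finer separator.
        pieces.extend(split_recursive(part, rest, size))
    return pieces
```

One simplification to be aware of: str.split discards the separator, so sentence-final periods are lost when the ". " level is reached. A production splitter would usually re-attach them.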

Step 4 — Load: the persistent collection

Chroma 1.x, embedded and persistent — a gitignored directory inside the module (module-06/scout_knowledge/). Cosine distance, because the embedding model is trained for it; Chroma defaults to L2:

from pathlib import Path

import chromadb


def get_collection(db_dir: Path | None = None) -> "chromadb.Collection":
    client = chromadb.PersistentClient(path=str(db_dir or DB_DIR))
    return client.get_or_create_collection(
        name=COLLECTION_NAME,
        configuration={"hnsw": {"space": "cosine"}},
    )

ingest_source() chains the stages: dedup check first (an unchanged page costs zero embedding requests — the smoke test asserts it), then chunk, then embed_texts(chunks, input_type="passage"), then upsert with metadata {url, title, fetched_at, source_id, chunk_index} — what makes a chunk citable and filterable.
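The ordering matters more than any single stage, so here is a self-contained sketch of that chain with an in-memory dict standing in for Chroma and the embedder and chunker passed in as plain callables. Every name here (ingest_sketch, the store layout, the 16-character id) is an assumption for illustration, not the repo's ingest_source.

```python
import hashlib
from typing import Callable


def ingest_sketch(
    source: dict,
    store: dict,
    embed: Callable[[list[str]], list[list[float]]],
    chunk: Callable[[str], list[str]],
) -> int:
    """Run dedup, chunk, embed, upsert in that order; return the chunks written."""
    key = f"{source['url']}\n{source['content']}".encode("utf-8")
    source_id = hashlib.sha256(key).hexdigest()[:16]
    if f"{source_id}:0" in store:
        return 0  # unchanged page: stop before any embedding request is made
    chunks = chunk(source["content"])
    if not chunks:
        return 0
    vectors = embed(chunks)  # one batched call, passage mode in the real pipeline
    for index, (text, vector) in enumerate(zip(chunks, vectors)):
        # Writing an existing id replaces it, which is what upsert means.
        store[f"{source_id}:{index}"] = {
            "document": text,
            "embedding": vector,
            "metadata": {
                "url": source["url"],
                "title": source["title"],
                "fetched_at": source["fetched_at"],
                "source_id": source_id,
                "chunk_index": index,
            },
        }
    return len(chunks)
```

Because the hash covers url plus content, an edited page gets a new source_id and is embedded again, while an identical re-fetch returns before the embedder is ever called.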

Step 5 — Two-stage retrieval

scout/retriever.py (new). Stage 1 for recall, stage 2 for precision:

def search(query: str, collection=None) -> list[dict]:
    collection = collection if collection is not None else ingest.get_collection()
    if collection.count() == 0:
        return []
    vector = embeddings.embed_texts([query], input_type="query")[0]
    dense = collection.query(
        query_embeddings=[vector],
        n_results=min(config.TOP_K, collection.count()),
        include=["documents", "metadatas"],
    )
    documents: list[str] = dense["documents"][0]
    metadatas: list[dict] = dense["metadatas"][0]
    best = rerank(query, documents)[: config.RERANK_TOP_N]
    return [
        {
            "text": documents[i],
            "url": metadatas[i]["url"],
            "title": metadatas[i]["title"],
            "fetched_at": metadatas[i]["fetched_at"],
            "source_id": metadatas[i]["source_id"],
        }
        for i in best
    ]

rerank() is a plain httpx POST — the reranking NIM has no route in the openai SDK, so it gets a dedicated REST endpoint, pinned in config.py. The payload pairs the query with each passage; the response’s rankings come back best-first and we keep the top 3.
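A sketch of the two pure halves of such a call: building the request body and reading the rankings. The field names ("query", "passages", "rankings", "index", "logit") reflect my understanding of NVIDIA's hosted reranking API and should be treated as assumptions to verify against the model's API reference; the function names are hypothetical and the HTTP call itself is left out.

```python
def build_rerank_payload(model: str, query: str, passages: list[str]) -> dict:
    """Pair one query with every candidate passage (assumed request shape)."""
    return {
        "model": model,
        "query": {"text": query},
        "passages": [{"text": passage} for passage in passages],
    }


def best_indices(response_json: dict, top_n: int) -> list[int]:
    """Return positions into the candidate list, best first (assumed response shape)."""
    # Sort by score defensively instead of trusting the order of the response.
    rankings = sorted(
        response_json["rankings"], key=lambda item: item["logit"], reverse=True
    )
    return [item["index"] for item in rankings[:top_n]]
```

The returned indices point back into the dense candidates, which is how the retriever can look up documents[i] and metadatas[i] for each reranked hit.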

Step 6 — Retrieval becomes a tool

scout/tools.py (modified): search_sources joins web_search in TOOLS and TOOL_SCHEMAS. The schema description does the steering — prompt engineering, as Module 2 taught:

"description": (
    "Search the sources Scout has already read — full pages fetched "
    "and indexed during this and past sessions. Much faster and "
    "cheaper than a new web search. Try it FIRST whenever the "
    "question may overlap earlier research; fall back to web_search "
    "if it returns nothing relevant. Returns the most relevant "
    "passages with citation markers, source URL, and fetch date."
),

web_search’s description is rewritten too — it must now defer to the knowledge base (“Use this when the knowledge base has nothing relevant, or when the question needs information fresher than what Scout has already read”): Module 2’s “any fact you are not certain about” would fight the knowledge-base-first policy head-on. The same file gains INGEST_MARKER, so describe_observation() can print the [tools] 5 results — 3 pages processed trace line.

Step 7 — Sources and citations in the state

ScoutState gains its module-6 field — sources, append-only like messages — and tools_node (scout/nodes.py) becomes the bookkeeper: after a web_search it fetches and ingests the pages behind the top results, appends a numbered report to the observation, and compacts the snippets of the pages it indexed (their full text now lives in Chroma); search_sources hits get their passages prefixed with markers:

import json


def _number_chunks(observation: str, registry: _SourceRegistry) -> str:
    """Module 06: turn retriever hits into numbered, citable passages."""
    try:
        chunks = json.loads(observation)
    except json.JSONDecodeError:
        return observation  # the retrieval failed; surface the error text
    if not chunks:
        return (
            "No relevant passages in the knowledge base. "
            "Use web_search to research this on the web."
        )
    lines = []
    for chunk in chunks:
        source: Source = {
            "url": chunk["url"],
            "title": chunk["title"],
            "fetched_at": chunk["fetched_at"],
            "content": chunk["text"],
            "reliability_score": ingest.NEUTRAL_RELIABILITY,
        }
        n = registry.number(source)
        lines.append(
            f"[{n}] {chunk['title']} — {chunk['url']} "
            f"(fetched {chunk['fetched_at']})\n{chunk['text']}"
        )
    return "\n\n".join(lines)

The _SourceRegistry (15 lines, in the repo) assigns each new URL the next number and reuses numbers for known URLs — [n] always means the n-th source of this run. The system prompt closes the loop — place the EXACT bracketed markers shown in the tool results; never renumber them, and do not write your own reference list. scout/run.py prints the real References: list by resolving the markers actually used. Two budgets move in config.py (MAX_TOKENS 8192 → 16384, MAX_ITERATIONS 6 → 8 — RAG-fed turns reason longer; the knowledge-base probe costs a turn), but graph.py is untouched — a Module 3 point proven: ingestion and citations are node and tool plumbing, not topology.

Step 8 — Run it twice

cd module-06
uv run python -m scout.run "What is the Nemotron Coalition announced at GTC 2026?" --thread coalition

Real output, after the Planner’s two passes (your URLs and marker numbers will differ; answers trimmed to the load-bearing lines):

[agent] tool_call: search_sources({"query": "Nemotron Coalition"})
[tools] No relevant passages in the knowledge base. Use web_search to research this on the web.
[agent] tool_call: web_search({"query": "Nemotron Coalition GTC 2026 announcement transcript"})
[tools] 5 results — 3 pages processed
[agent] tool_call: search_sources({"query": "founding members of Nemotron Coalition"})
[tools] [2] NVIDIA GTC 2026 Recap: AI Factory, OpenClaw, and Nemotron | Nowel Pitt, MBA posted on the topic
[agent] tool_call: web_search({"query": "NVIDIA newsroom Nemotron Coalition"})
[tools] 5 results — 3 pages processed
The Nemotron Coalition was announced at GTC 2026 as a global collaboration …
to advance open frontier AI models through shared research, data, expertise
and compute [3]. Its founding members … are Black Forest Labs, Cursor,
LangChain, Mistral AI, Perplexity, Reflection AI, Sarvam and Thinking
Machines [3]. …

References:
  [2] NVIDIA GTC 2026 Recap: AI Factory, OpenClaw, and Nemotron | … — https://www.linkedin.com/posts/nowelpitt_…
  [3] NVIDIA Launches Nemotron Coalition of Leading Global AI Labs to Advance Open Frontier Models | NVIDIA Newsroom — http://nvidianews.nvidia.com/news/nvidia-launches-nemotron-coalition-of-leading-global-ai-labs-to-advance-open-frontier-models
  [4] NVIDIA forms Nemotron coalition to advance open AI - Engineering.com — https://www.engineering.com/nvidia-forms-nemotron-coalition-to-advance-open-ai

Scout probed its empty knowledge base, fell back to the web, ingested what it found — then retrieved from the pages it had just indexed. Every decision in that sequence was the model’s. Now the question one inch to the left:

uv run python -m scout.run "Which companies joined the Nemotron Coalition?" --thread coalition-2

[agent] tool_call: search_sources({"query": "Nemotron Coalition GTC 2026 press release membership list"})
[tools] [1] NVIDIA Launches Nemotron Coalition of Leading Global AI Labs to Advance Open Frontier Models | N
[agent] tool_call: search_sources({"query": "site:nemotroncoalition.org member directory"})
[tools] [1] NVIDIA Launches Nemotron Coalition of Leading Global AI Labs to Advance Open Frontier Models | N
[agent] tool_call: web_search({"query": "Nemotron Coalition member roster GTC 2026 news"})
[tools] 5 results — 3 pages processed
The official GTC 2026 press release announces the founding members of the
Nemotron Coalition as Black Forest Labs, Cursor, LangChain, Mistral AI,
Perplexity, Reflection AI, Sarvam and Thinking Machines Lab [1]. … On
June 4 2026, H Company announced that it has joined the coalition [4]. …

References:
  [1] NVIDIA Launches Nemotron Coalition of Leading Global AI Labs to Advance Open Frontier Models | NVIDIA Newsroom — http://nvidianews.nvidia.com/news/…
  [2] NVIDIA forms Nemotron coalition to advance open AI - Engineering.com — https://www.engineering.com/…
  [4] H Company, France computer-use AI champion, joins the NVIDIA Nemotron Coalition - H Company — https://hcompany.ai/h-joins-nemotron-coalition

The roster came straight from the knowledge base — the press release was read from Chroma, not re-fetched. The one web check the Planner (fed Module 5’s covered-topics memory) ordered is what found H Company’s June 4 announcement: each tool did what it is good at. Finally, the tests:

cd ..                                                # back to the repo root
uv run pytest module-06/tests/                       # offline: fixtures + fakes
SCOUT_LIVE_TESTS=1 uv run pytest module-06/tests/    # + NIM roundtrip + 1 real run

Try it yourself (no solution provided):

Freshness filter: make search() ignore chunks older than 7 days. Chroma’s where filters compare numbers, not ISO strings — add a numeric fetched_at_ts (epoch seconds) to chunk metadata at ingestion and filter {"fetched_at_ts": {"$gte": cutoff}} at query time.
Hybrid search, the simple version: BM25 over the collection’s documents (pure Python, ~30 lines), fuse its top-10 with the dense top-10 by reciprocal rank fusion (score = Σ 1/(k + rank) across both rankings, k ≈ 60), then rerank. Test on a query full of exact identifiers — where dense-only retrieval misses.

Exam corner

What the exam tests here. Per the official study guide, Domain 6 (10%) expects you to: implement retrieval pipelines — RAG, embedded search, hybrid approaches (6.1); configure and optimize vector databases for fast retrieval (6.2); build ETL pipelines to integrate external data sources (6.3); conduct data quality checks, augmentation, and preprocessing (6.4); and enable real-time access and reasoning over structured and unstructured knowledge (6.5). Questions are scenarios: a symptom in, a pipeline decision out.

Quiz — answers after question 5.

1. A knowledge base is chunked at a fixed 100 characters, no overlap. Users complain that answers quote sentence fragments and miss context that sits right next to the quoted text in the source document. Best fix?
- A) Increase top-k from 3 to 10 so more fragments arrive
- B) Re-chunk: structure-aware splitting at a larger size with ~15% overlap, then re-embed the corpus
- C) Switch to a larger LLM with a bigger context window
- D) Add a reranker on top of the existing chunks
2. A firm’s policy documents change weekly, and compliance requires every answer to point at the exact passage it came from. Which approach fits?
- A) Fine-tune the model on the policy corpus every quarter
- B) Paste the full policy corpus into a 1M-token context per query
- C) A RAG pipeline over the documents, re-ingesting on change, with citations to retrieved chunks
- D) Train a custom model from scratch on company data
3. For most failing queries the correct passage is in the dense top-10, but answers keep relying on a similar-sounding wrong passage that ranks higher. Raising top-k to 50 made answers worse. The right move?
- A) Raise top-k further — eventually the right passage will dominate
- B) Add a cross-encoder reranking stage: re-score the top candidates against the query, keep the best few
- C) Lower the chunk size so passages are more precise
- D) Swap the LLM for one with a longer context window
4. A research assistant must answer only from sources fetched in the last 7 days, keep its index across restarts, and use its embedding model’s training metric. Which configuration delivers all three?
- A) An in-memory collection, default metric, re-ingest everything daily
- B) A persistent collection with cosine distance and a numeric fetched_at metadata filter at query time
- C) A persistent collection with a higher top-k so recent chunks appear
- D) Prepend the fetch date to each chunk’s text so the model can judge freshness
5. A team’s RAG answers keep quoting cookie banners and navigation menus, and the same article appears three times in retrieval results. Which pipeline stage is missing?
- A) A reranking stage after dense retrieval
- B) A bigger embedding model
- C) Cleaning and deduplication in the transform stage, before embedding
- D) Higher chunk overlap so menus get diluted

Answers.
- 1: B. Fragments and lost neighboring context are a chunking diagnosis: chunks too small, no overlap. A retrieves more fragments, C hands the same fragments to a bigger model, and D reranks fragments; none restores context destroyed at ingestion.
- 2: C. Weekly change kills A (stale between retrains, and fine-tuning shapes behavior, not citable knowledge) and D (costlier, same staleness); B re-pays the corpus per query with no passage-level citation. Fresh + traceable is RAG’s home turf.
- 3: B. The candidates are retrieved but mis-ranked, which is precisely the reranker’s job. A adds more noise and D only makes room for it; C re-chunks a corpus whose recall is already fine.
- 4: B. Three requirements, three knobs: persistence, cosine, metadata filter. A loses the index on restart and re-pays ingestion daily; C hopes instead of filters; D pollutes embeddings with date strings. The filter is the structured (metadata) half of objective 6.5’s reasoning over structured + unstructured knowledge.
- 5: C. Boilerplate in results and duplicates are ingestion defects: clean and dedup before you pay for embedding. A reranker (A) ranks garbage more carefully; B embeds it more accurately; D spreads it around.
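
To make answer 1 concrete, here is a minimal sketch of sentence-aware chunking with overlap. It is an illustration, not the module's lab code: `chunk_text`, the 800-character default, and the regex sentence splitter are assumptions, and a real pipeline would measure size in tokens against the embedder's limit rather than in characters.

```python
import re


def chunk_text(text: str, size: int = 800, overlap_ratio: float = 0.15) -> list[str]:
    """Pack whole sentences into chunks of roughly `size` characters.

    Each new chunk starts with the trailing sentences of the previous one
    (up to size * overlap_ratio characters), so neighboring context survives.
    A single sentence longer than `size` becomes its own oversized chunk.
    """
    budget = int(size * overlap_ratio)
    # Naive sentence split (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as overlap, within the budget.
            tail: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + tail)) > budget:
                    break
                tail.insert(0, prev)
            current = tail
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlap is carried as whole sentences, so a chunk never starts mid-sentence; if the overlap budget is smaller than the previous sentence, no overlap is carried. Re-chunking changes every chunk's text, so the corpus has to be re-embedded afterwards.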

Traps to avoid:

“Fine-tuning adds fresh knowledge.” It shapes behavior — style, format, vocabulary. Fresh, citable knowledge is RAG’s job; when a scenario pairs “changes often” with “must cite sources,” fine-tune options are distractors.
Two token budgets, not one. The embedding model’s input limit (8192 tokens here) and the LLM’s context window (1M for Nemotron 3 Nano) are independent. The embedder’s limit caps chunk size, and chunks are chosen well below that ceiling for precision; the LLM’s context window caps how much retrieved text fits into the generation prompt.
“Raise top-k to improve answers.” Top-k buys recall; past a point it pollutes the context. Precision comes from reranking and cleaner ingestion, not volume — if the right chunk is retrieved but ignored, more chunks make it worse.
Boilerplate in retrieval results is an ingestion defect. Fix the ETL transform stage — clean and dedup before embedding — not the prompt or the reranker: they rank and phrase garbage, they don’t remove it.
“Structured + unstructured” is one query, two halves. Objective 6.5’s real-time reasoning pairs unstructured chunks with the structured metadata riding on them — timestamps, sources, filters. An option that handles the text but ignores the metadata (or the reverse) answers half the question.
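
The ingestion fix from the boilerplate trap can be sketched with the standard library alone. This is a hedged illustration: the function names, the page dictionaries with `url` and `text` keys, and the three boilerplate patterns are assumptions made for the example, not Scout's actual pipeline.

```python
import hashlib
import re

# Illustrative patterns only (assumption); tune per site or use an HTML content extractor.
BOILERPLATE = re.compile(
    r"accept (all )?cookies|subscribe to our newsletter|skip to (main )?content",
    re.IGNORECASE,
)


def clean(text: str) -> str:
    """Drop empty lines and lines that look like banners or menus."""
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not BOILERPLATE.search(line)
    ]
    return "\n".join(kept)


def fingerprint(text: str) -> str:
    """Hash case- and whitespace-normalized text so trivial variants collide."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def transform(pages: list[dict]) -> list[dict]:
    """Clean every page and keep only the first copy of identical content."""
    seen: set[str] = set()
    unique: list[dict] = []
    for page in pages:
        text = clean(page["text"])
        if not text:
            continue  # nothing but boilerplate: do not embed it
        key = fingerprint(text)
        if key in seen:
            continue  # same content already queued for embedding
        seen.add(key)
        unique.append({"url": page["url"], "text": text})
    return unique
```

Exact-match hashing only catches copies that are identical after normalization; near-duplicates, such as the same article with a different footer, would need a fuzzier technique such as shingling.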

Key takeaways

RAG earns its place with what weights can’t offer: fresh, private, citable knowledge — a 1M-token context window doesn’t change that.
Memory (Module 5) stores facts about the user; the knowledge base stores external content. Different store, different role — the exam tests the distinction.
Ingestion is ETL, and quality is decided in the transform: clean and dedup before embedding, or retrieve boilerplate forever.
Chunking is a trade-off, not a dogma: small = precise but contextless, large = contextual but noisy; structure-aware with ~15% overlap is the sane default.
Retrieve wide, rerank narrow — dense top-10 for recall, cross-encoder top-3 for precision — and embed asymmetrically: passage to index, query to search; mixing them fails silently.
Agentic RAG means the agent decides when to retrieve, reformulate, or fall back — worth one extra LLM decision per hop on open-ended questions.
Citations [n] are a contract, not decoration: each marker resolves mechanically to a source in state — what makes answers checkable at all.
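
The last takeaway is easy to check in code. A minimal sketch, assuming the agent's state holds an ordered list of sources and that markers are 1-indexed into it; `resolve_citations` and the source dictionary shape are hypothetical names for illustration.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def resolve_citations(answer: str, sources: list[dict]) -> tuple[dict[int, str], list[int]]:
    """Map each [n] marker in `answer` to a source URL.

    Returns (resolved, dangling): `resolved` maps marker number to URL,
    `dangling` lists marker numbers that point at no source in state.
    """
    resolved: dict[int, str] = {}
    dangling: list[int] = []
    for match in CITATION.finditer(answer):
        n = int(match.group(1))
        if 1 <= n <= len(sources):
            resolved[n] = sources[n - 1]["url"]  # markers are 1-indexed
        elif n not in dangling:
            dangling.append(n)
    return resolved, dangling
```

A non-empty `dangling` list means the answer cites something that was never retrieved, which is a reason to reject or regenerate the answer rather than show it.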

Keep going


Scout has knowledge and memory. Next, it gets coworkers: a supervisor, a Searcher, a Reader, a Fact-checker, and a Writer.


References

NCP-AAI certification page — the official blueprint; Knowledge Integration and Data Handling is 10%.
llama-nemotron-embed-1b-v2 — API reference — the embedding NIM: 2048 dims, 8192-token input limit, input_type.
llama-nemotron-rerank-1b-v2 — API reference — the reranking NIM and its dedicated retrieval endpoint.
Chroma docs — configuring collections — the 1.x configuration block, including HNSW distance settings.
Lewis et al., 2020 — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks: the paper that named RAG.
Building RAG Agents with LLMs — the NVIDIA DLI course the study guide recommends for this domain.