Evaluating AI Agents: Metrics, LLM-as-Judge, and Tuning (NCP-AAI Module 8)
This is Module 8 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.
By the end of Module 7, Scout produces genuinely impressive reports: a supervisor coordinates four specialists, claims get cross-checked, citations resolve. Then you change one line of the Writer’s prompt, re-run, and the report comes out… different. Better? You read both versions twice. You hesitate. You ask a colleague; they hesitate too, then pick the one with nicer headings.
That’s demo-driven development: every “improvement” is an opinion, and every regression is invisible until a user finds it. The exam dedicates an entire domain to the fix — Evaluation and Tuning, 13% of your score, covered only by this module. By the end, one command prints before/after scores for Scout and settles the argument with numbers instead of vibes.
In this module
- You’ll learn:
- Implement an evaluation pipeline: golden set, deterministic checks, LLM-as-judge, regression script.
- Distinguish model evaluation from agent evaluation, and choose metrics per use case.
- Calibrate an LLM-as-judge: rubric design, structured scoring, known biases and their mitigations.
- Tune with discipline: prompts, inference parameters, model and architecture levers — one change at a time.
- Analyze results by category to target the next fix, and grow the golden set from structured user feedback.
- You’ll build: an eval harness for Scout — a 15-question golden set, an LLM-as-judge scoring grounding, coverage, and citations, and a regression script that proves a prompt improvement.
- Exam domains covered: D3 — Evaluation and Tuning — 13% of the exam.
- Prerequisites: Modules 1–7 (Scout’s multi-agent team runs end to end); NVIDIA_API_KEY and Tavily key configured.
Where you are
- ✅ Modules 1–3 — foundations: first NIM call, ReAct + tool calling, architecture
- ✅ Modules 4–6 — cognition and knowledge: Planner + reflection, memory, RAG with citations
- ✅ Module 7 — Multi-Agent Systems — supervisor + Searcher, Reader, Fact-checker, Writer
- 👉 Module 8 — Evaluating AI Agents (you are here)
- ⬜ Modules 9–14 — guardrails, deployment, observability, NVIDIA stack, capstone, the exam
Scout before: a complete multi-agent team producing cited reports — and no
way to measure their quality. Scout after: an evals/ harness — golden
set, judge, regression script — the safety net every change rides on from here
to v1.0.
Why agent evaluation is not model evaluation
You’ve seen model benchmarks: MMLU-style suites where a model answers thousands of fixed questions against an answer key. That machinery measures a model — one input, one completion, one verifiable answer, on a dataset that never changes. It tells you the engine is good — and nothing about the car.
An agent is the car. Scout’s value isn’t a single completion; it’s a multi-step task — plan, search a live web that changes between runs, fetch, cross-check, write — performed by a planner and four coordinated specialists with tool calls in the loop and nondeterminism at every step. A model that benchmarks beautifully can still power an agent that searches in circles, skips its Fact-checker, and cites sources it never read. “The model is good” and “the agent is good” are different claims — and the exam tests which machinery validates which.
| | Model evaluation | Agent evaluation |
|---|---|---|
| Unit evaluated | One model, one completion | A whole system on an end-to-end task |
| Data | Static benchmark dataset | Live environment (the web changes under you) |
| Determinism | Mostly repeatable | Multi-step and nondeterministic — same input, different paths |
| What can break | The model’s knowledge or reasoning | Routing, tool calls, handoffs, any node, the environment |
| Cost of a run | One inference per item | 10–20 LLM calls + tool calls per item |
Agent evaluation therefore needs its own vocabulary. Task success rate is the fraction of tasks where the agent achieved the user’s actual goal — the headline agent metric, scored against the task, not a reference string. And every run has two things worth grading: the result and the path. The path is the trajectory — the sequence of steps the agent took: which nodes ran, which tools were called, in what order, at what cost. Outcome evaluation judges the final artifact (is the report correct, grounded, cited?); trajectory evaluation judges the path (did the Fact-checker run? were three of the seven searches redundant?). You need both lenses on the same run:
flowchart LR
run["One Scout run"] --> outcome["Outcome lens:<br/>the final report —<br/>correct, grounded, cited?"]
run --> traj["Trajectory lens:<br/>the path — node order,<br/>tool calls, redundancy,<br/>skipped stages"]
The concrete failure that makes the second lens non-optional: in Scout’s
graph, the supervisor decides which specialist runs next. A routing bug — or
a lazy supervisor turn — can skip the Fact-checker entirely, and
Module 7’s Writer still produces a fluent,
confident, nicely cited report. Outcome-only evaluation grades it and smiles;
the trajectory shows claims[] empty and the verification stage missing.
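A minimal sketch of what the trajectory lens could look like in code, assuming a run records the ordered node names it visited and the search queries it issued. The `Trajectory` class, its field names, and the `"fact_checker"` node label are illustrative assumptions, not Scout's actual state schema.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one run's path (not Scout's real schema)."""
    nodes_visited: list[str] = field(default_factory=list)
    search_queries: list[str] = field(default_factory=list)


def trajectory_findings(traj: Trajectory, required_nodes: set[str]) -> list[str]:
    """Return human-readable path defects; an empty list means none were found."""
    findings = []
    # Skipped stages: required nodes that never appear in the path.
    for node in sorted(required_nodes - set(traj.nodes_visited)):
        findings.append(f"skipped stage: {node}")
    # Redundant searches: the same normalized query issued more than once.
    counts = Counter(q.strip().lower() for q in traj.search_queries)
    for query, n in sorted(counts.items()):
        if n > 1:
            findings.append(f"redundant search x{n}: {query}")
    return findings


# Made-up run: the Fact-checker never ran and one query was repeated.
run = Trajectory(
    nodes_visited=["supervisor", "searcher", "reader", "writer"],
    search_queries=["vector db vs graph", "Vector DB vs graph ", "knowledge graph use cases"],
)
print(trajectory_findings(run, required_nodes={"fact_checker", "writer"}))
```

Both findings come from counting and set difference, so this lens costs zero tokens; the report itself is never read.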
Golden sets: your agent’s regression suite
You already trust regression suites for code: fixed tests that must stay green
after every change. A golden set is the same idea for agent behavior — a
versioned collection of representative questions, with expected results, that
you re-run after every change to detect improvement or regression. Scout’s
lives in evals/golden_set.json: 15 authored questions, committed to the
repo, because a golden set is an executable specification.
Three design rules make a golden set worth its tokens:
- Representative and stratified. Scout’s 15 questions split across four categories (factual, comparative, recent, multi-hop) because aggregate scores hide category-level failures.
- Expected points, not expected answers. A research agent has no single correct string: two good reports on one question can share zero sentences. So each entry lists 3–5 facets a good answer must cover, and the judge scores facet coverage, not string similarity.
- Versioned like code. Change the questions and every previous score becomes incomparable. Treat edits like schema migrations — deliberate, and visible in the diff.
One entry, in the frozen schema the lab uses (and later modules reuse as-is):
{
"id": "c02",
"question": "When should an AI system use a knowledge graph instead of a vector database?",
"category": "comparative",
"expected_points": [
"Vector databases retrieve by semantic similarity over embeddings",
"Knowledge graphs store entities and typed relationships",
"Multi-hop relational questions follow edges, which similarity search cannot do",
"Many production systems combine both"
],
"min_sources": 2
}
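To make "executable specification" concrete, here is a hedged sketch of a validator for entries in this schema, plus a per-category count to check stratification. The function names are hypothetical; the key names, the four categories, and the 3–5 facet range come from this section. The lab's own schema test may work differently.

```python
import json
from collections import Counter

CATEGORIES = {"factual", "comparative", "recent", "multi-hop"}
REQUIRED_KEYS = {"id", "question", "category", "expected_points", "min_sources"}


def validate_entry(entry: dict) -> list[str]:
    """Return the problems found in one golden-set entry (empty list = valid)."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    if entry["category"] not in CATEGORIES:
        problems.append(f"unknown category: {entry['category']}")
    if not 3 <= len(entry["expected_points"]) <= 5:
        problems.append("expected_points should list 3-5 facets")
    if not isinstance(entry["min_sources"], int) or entry["min_sources"] < 1:
        problems.append("min_sources must be a positive integer")
    return problems


def category_counts(entries: list[dict]) -> Counter:
    """How many questions each category holds, to eyeball stratification."""
    return Counter(e["category"] for e in entries)


# The example entry from this section, parsed from its JSON form.
entry = json.loads("""
{
  "id": "c02",
  "question": "When should an AI system use a knowledge graph instead of a vector database?",
  "category": "comparative",
  "expected_points": [
    "Vector databases retrieve by semantic similarity over embeddings",
    "Knowledge graphs store entities and typed relationships",
    "Multi-hop relational questions follow edges, which similarity search cannot do",
    "Many production systems combine both"
  ],
  "min_sources": 2
}
""")
print(validate_entry(entry), category_counts([entry]))
```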
Where do new entries come from once you’re live? The best source is structured user feedback — feedback collected through a rubric with specific questions (Was the report on topic? Do you trust the sources? Right depth? What was missing?) rather than a bare thumbs-up. Per the official study guide, objective 3.3 is exactly this loop: a 👍/👎 tells you that something is wrong; a rubric answer like “off topic for comparative questions” tells you what — and converts directly into a new golden-set entry that reproduces the failure, so re-running proves the fix.
Metrics that matter: task success, grounding, cost, latency
With a golden set in hand, what do you measure per run? Domain 3 expects you to pick metrics per use case, so internalize this table — which metric catches what, and what it costs to measure. Several rows lean on an LLM-as-judge — an LLM grading outputs against a rubric, dissected in the next section:
| Metric | What it catches | How it’s measured |
|---|---|---|
| Task success rate | The agent doesn’t achieve the goal | LLM-as-judge or human, against the task |
| Grounding | Statements not supported by any source | LLM-as-judge (sources vs. report) |
| Citation accuracy | Markers that point at the wrong/no source | Code check (resolution) + judge (precision) |
| Coverage | Expected facets missing from the answer | LLM-as-judge, against expected_points |
| Cost per run | Token burn creeping up per change | Code: count calls/tokens |
| Latency | Runs drifting from 40s to 4 minutes | Code: wall-clock per run |
| Tool-call efficiency | Redundant searches, skipped stages (trajectory) | Code: count/inspect tool calls and node visits |
Grounding — the degree to which the output’s statements are supported by
the sources actually retrieved — is the flagship metric for a research agent:
an ungrounded report is a liability with citations on it. It differs from task
success (a grounded report can still miss the question) and from citation
accuracy (a well-grounded statement can carry the wrong [n] marker). For a
customer-support agent you’d weight task success and latency instead; choosing
the metric for the use case is precisely what exam scenarios test.
Now the cost discipline: the table’s “code check” rows cost zero tokens; the
judge rows bill a strong model per question. In production you’d gate judge
calls on the cheap layer — no tokens spent grading a report whose citations
don’t even resolve. Scout’s harness instead judges every run, flunked or not,
so all rows of the results schema stay comparable; what it keeps from the
discipline is the order. It runs deterministic checks — pure-code
assertions on the output: do all [n] citations resolve into sources[]?
does every Claim carry a verdict? are there at least min_sources sources?
is the length sane? — before any judge call, every time: free catches
first, then the paid verdict.
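The difference between the production gate and Scout's judge-everything policy fits in one flag. The sketch below is an illustration under stated assumptions: `evaluate`, `stub_checks`, and `stub_judge` are hypothetical stand-ins, not the lab's runner, checks, or judge.

```python
from typing import Callable, Optional


def evaluate(
    report: str,
    run_checks: Callable[[str], list[str]],
    judge: Callable[[str], dict],
    gate_on_checks: bool,
) -> dict:
    """Run free checks first; optionally skip the paid judge when a check fails."""
    failures = run_checks(report)          # pure code, zero tokens
    score: Optional[dict] = None
    if not (gate_on_checks and failures):  # the production-style gate
        score = judge(report)              # the only line that would bill tokens
    return {"failed_checks": failures, "score": score}


judge_calls: list[str] = []


def stub_checks(report: str) -> list[str]:
    # Toy rule: a report with no [1] marker fails citation resolution.
    return [] if "[1]" in report else ["citations_resolve"]


def stub_judge(report: str) -> dict:
    judge_calls.append(report)  # count calls instead of spending tokens
    return {"grounding": 3, "coverage": 3, "citations": 3}


gated = evaluate("no citations here", stub_checks, stub_judge, gate_on_checks=True)
always = evaluate("no citations here", stub_checks, stub_judge, gate_on_checks=False)
print(gated["score"], always["score"], len(judge_calls))
```

Either way the checks run first; the flag only decides whether a flunked report still gets a judge verdict, which is what keeps every row of a results file comparable.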
Notice what makes those checks possible: the course’s frozen contracts. Reports
cite [n] markers into a typed sources[] array
(Module 6 built that), and the Fact-checker
fills claims[] with verdicts (Module 7). Because the output is structured,
“are the citations broken?” is a regex, not a judgment call — schema discipline
pays its rent in evaluation.
LLM-as-judge: power and pitfalls
Deterministic checks can’t tell you whether a report is good — whether its statements follow from the sources, whether it covered what matters. Those are semantic judgments with no single right answer. LLM-as-judge is the technique of using an LLM (usually a stronger one) to score another model’s output against an explicit rubric — it’s how agent outputs get evaluated at scale, because nobody pays humans to read 15 research reports after every prompt tweak.
Three practices separate a usable judge from a random-number generator. First,
an anchored rubric: every dimension gets concrete level descriptions (“5:
every substantive statement traces to a listed source… 1: invents facts
freely”) — “rate 1–5” with no anchors invites drift. Second, structured
output: the judge returns a typed object (a Pydantic JudgeScore with
grounding, coverage, citations, each 1–5, plus a rationale), so scores
parse, validate, and aggregate — and a malformed verdict retries with the
validation error instead of silently passing. Third, temperature 0: a
measuring instrument should not be creative.
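The parse, validate, retry-with-the-error loop can be sketched with the standard library alone. This is a stand-in under stated assumptions: plain `json` validation replaces the Pydantic model, and `ask` is a hypothetical callable wrapping the judge model call. The dimension names and the 1–5 range come from this section.

```python
import json
from typing import Callable

DIMENSIONS = ("grounding", "coverage", "citations")


def parse_verdict(raw: str) -> dict:
    """Parse and validate a judge reply; raise ValueError with a specific message."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
    if not isinstance(data.get("rationale"), str):
        raise ValueError("rationale must be a string")
    return data


def judge_with_retry(ask: Callable[[str], str], prompt: str, retries: int = 1) -> dict:
    """Call the judge; on a malformed verdict, retry with the error appended."""
    for attempt in range(retries + 1):
        raw = ask(prompt)
        try:
            return parse_verdict(raw)
        except ValueError as exc:
            if attempt == retries:
                raise  # out of retries: fail loudly rather than pass silently
            prompt = f"{prompt}\n\nYour previous reply was rejected: {exc}. Reply with valid JSON only."
    raise RuntimeError("unreachable")


# Scripted replies: one out-of-range verdict, then a valid one.
replies = iter([
    '{"grounding": 9}',
    '{"grounding": 4, "coverage": 3, "citations": 5, "rationale": "ok"}',
])
prompts_seen: list[str] = []


def scripted_ask(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)


verdict = judge_with_retry(scripted_ask, "Score this report.")
print(verdict["grounding"], len(prompts_seen))
```

The second prompt carries the validation message, so the model is told what was wrong rather than simply asked again.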
Judges come in two shapes: pointwise (score one output against the rubric) and pairwise (compare two outputs and pick a winner). Pointwise scores trend over time; pairwise is more sensitive for A/B decisions but imports the nastiest bias below. A judge is a model with model failure modes — Zheng et al. measured the first three systematically in the MT-Bench paper:
| Judge bias | What it looks like | Mitigation |
|---|---|---|
| Position bias | In pairwise mode, the first (or last) answer wins more | Judge both orders; only count agreement |
| Verbosity bias | Longer answers score higher, content equal | Rubric rewards density; length-cap or normalize |
| Self-preference bias | A judge favors output from its own model/family | Judge ≠ agent’s model — different family or size |
| Leniency drift | Scores inflate over time or after judge updates | Anchored rubric; re-judge a fixed sample on every judge change |
The course’s choice makes one mitigation structural: Scout runs on
nvidia/nemotron-3-nano-30b-a3b, while the judge is
nvidia/nemotron-3-super-120b-a12b — a stronger model (better judgment)
that is also a different size class (self-preference mitigated by
construction). Never let an agent’s own model grade its homework. My working
rule: hand-label about 10% of judged samples and check agreement before
believing any judge — if you and the judge disagree on 4 reports out of 10,
fix the rubric, not the agent.
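One way to turn that calibration rule into a number is an agreement rate between your hand labels and the judge's scores on the same reports. The sketch below assumes a tolerance of one point counts as agreement; that threshold, the function name, and the sample scores are illustrative, not measurements from Scout.

```python
def agreement_rate(human: list[int], judge: list[int], tolerance: int = 1) -> float:
    """Fraction of samples where judge and human scores differ by at most `tolerance`."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    agree = sum(abs(h - j) <= tolerance for h, j in zip(human, judge))
    return agree / len(human)


# Made-up grounding scores for ten hand-labeled reports.
human_labels = [5, 4, 2, 3, 5, 1, 4, 4, 2, 5]
judge_scores = [5, 4, 4, 3, 4, 3, 4, 1, 5, 5]
print(agreement_rate(human_labels, judge_scores))
```

On this made-up sample the judge and the human disagree on four of ten reports, which is the situation where the rubric needs work before any score is believed.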
The ecosystem, briefly: openevals ships ready-made judge evaluators on the
LangChain side, and ragas (0.4.x) provides RAG metrics — its faithfulness
is essentially our grounding. The lab builds the judge by hand anyway: a
hundred-odd lines that teach you what those libraries wrap.
One sentence to tattoo somewhere: an LLM judge is not an objective referee — it’s a measuring instrument, and instruments get calibrated or they get distrusted.
The tuning loop: from scores to targeted fixes
Scores exist to cause correct action, and the first rule of acting on them is: never read only the aggregate. Suppose Scout’s overall grounding is 4.1/5 — fine, apparently. Split by category and you might find 4.6 on factual questions and 2.8 on comparative ones: Scout grounds single-fact answers well but free-styles when synthesizing two positions. Now you know where to act (the Writer’s synthesis behavior, or the Planner’s decomposition of comparative questions) instead of “improve prompts somewhere.” That category-level analysis is objective 3.5 verbatim — analyze evaluation results to guide targeted optimization — and it’s why the golden set was stratified in the first place.
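Splitting an aggregate by category is a few lines of grouping. In the sketch below, the row shape (`category` plus one metric per row) is an assumption for illustration, not the lab's results schema, and the scores are made up.

```python
from collections import defaultdict
from statistics import mean


def mean_by_category(rows: list[dict], metric: str) -> dict[str, float]:
    """Average one metric per category from per-question result rows."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for row in rows:
        buckets[row["category"]].append(row[metric])
    return {cat: round(mean(vals), 2) for cat, vals in sorted(buckets.items())}


# Made-up per-question grounding scores.
rows = [
    {"category": "factual", "grounding": 5},
    {"category": "factual", "grounding": 4},
    {"category": "comparative", "grounding": 3},
    {"category": "comparative", "grounding": 2},
]
overall = round(mean(r["grounding"] for r in rows), 2)
print(overall, mean_by_category(rows, "grounding"))
```

The overall mean looks middling; the split shows one category carrying the score and another dragging it down.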
Once you know where, choose the cheapest lever that can plausibly fix it:
| Tuning lever | Cost of change | Typical gain | When to use |
|---|---|---|---|
| Prompt (instructions, rubric, examples) | Minutes; no infra | Often large | First, almost always |
| Inference parameters (temperature, max output tokens, reasoning budget) | Minutes; no infra | Moderate; big latency/cost wins | Accuracy vs. latency-efficiency trade-offs |
| Model swap (per node) | Hours; re-validate everything | Large either direction | A node under- or over-powered for its job |
| Architecture (add/remove/rewire nodes) | Days; state and evals churn | Large | When the structure itself is the bottleneck |
An exam-critical clarification on row two, because objective 3.4 says “tune model parameters” and trips people up: in Domain 3 that means inference-time parameters — temperature, output budgets, reasoning effort, which model serves which node — and the accuracy vs. latency-efficiency trade-offs they control. It does not mean training or fine-tuning weights. Concretely: moving every Scout node to a larger model might lift grounding by 0.2 and triple your p95 (95th-percentile) latency; the tuned answer is usually a small, fast model for routing turns and the stronger model only where judgment lives.
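Since the trade-off is stated in p95 terms, it helps to see how that figure is computed from per-run wall-clock times. This is a small sketch using the nearest-rank definition; the latency samples and configuration names are invented for illustration.

```python
def p95(latencies_s: list[float]) -> float:
    """95th percentile by nearest rank: the smallest sample with >= 95% of samples at or below it."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


# Made-up wall-clock samples (seconds) for two hypothetical configurations.
small_router = [38, 41, 40, 44, 39, 42, 47, 40, 43, 95]
large_everywhere = [110, 118, 121, 115, 130, 125, 119, 122, 127, 290]
print(p95(small_router), p95(large_everywhere))
```

A mean would hide the slow tail in both lists; p95 is the number a user waiting on the worst runs experiences.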
Then the discipline that turns levers into proof — the loop the lab builds:
flowchart TD
GS["Golden set<br/>(15 questions, versioned)"] --> RUN["Run Scout<br/>per question"]
RUN --> DC["Deterministic checks<br/>(free, always first)"]
DC --> J["LLM-as-judge<br/>grounding · coverage · citations"]
J --> CAT["Scores by category"]
CAT --> DECIDE{"Tuning decision:<br/>prompt? params?<br/>model? architecture?"}
DECIDE -->|"change ONE variable"| RERUN["Re-run the golden set"]
RERUN --> CMP["Compare vs. baseline<br/>(flag regressions)"]
CMP --> CAT
Change one variable at a time, re-run the same golden set, compare against the stored baseline, and treat any metric that drops beyond a threshold as a regression to investigate. Two changes per run means zero attributable results — demo-driven development with extra steps.
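The comparison step reduces to per-metric deltas plus a threshold. A minimal sketch follows, with the caveats that the 0.25 threshold and the function name are assumptions (the lab's `compare_runs.py` may gate differently); the input means are the reference overall scores reported later in this module.

```python
def compare(
    baseline: dict[str, float], tuned: dict[str, float], threshold: float = 0.25
) -> list[tuple]:
    """Per-metric delta; a drop larger than `threshold` is flagged as a regression."""
    rows = []
    for metric in baseline:
        delta = round(tuned[metric] - baseline[metric], 2)
        rows.append((metric, baseline[metric], tuned[metric], delta, delta < -threshold))
    return rows


baseline = {"grounding": 4.00, "coverage": 2.60, "citations": 4.07}
tuned = {"grounding": 4.20, "coverage": 2.73, "citations": 4.33}
for metric, before, after, delta, regressed in compare(baseline, tuned):
    flag = "  REGRESSION" if regressed else ""
    print(f"{metric:<10} {before:>5.2f} {after:>5.2f} {delta:>+6.2f}{flag}")
```

The threshold exists because judge scores are noisy: a drop of a few hundredths on a small set is not evidence of anything.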
Hands-on lab: build it
You’ll build the full harness, then use it as intended: prove that one prompt
improvement to the Writer raises citation quality without degrading anything
else. The complete code lives in
module-08/
of the labs repo.
Objective: build Scout’s eval harness (golden set, checks, judge, runner, comparator) and ship a proven prompt fix.
Observable result: at the end, cd module-08 && uv run evals/compare_runs.py evals/results/run-baseline.json evals/results/run-tuned.json prints a per-metric delta table where the
overall citations delta is positive and nothing is flagged REGRESSION.
Step 1 — Read the golden set
Open evals/golden_set.json: 15 authored questions in the schema you saw
above — stratified across the four categories, with expected_points facets
and a min_sources floor per question. Committed, never generated at
runtime: a golden set you can’t diff is a golden set you can’t trust.
Step 2 — Deterministic checks
evals/checks.py asserts the frozen contracts on a finished run. Its heart —
citation resolution against sources[]:
import re

# The real Writer emits single markers [2] AND combined ones [1,3];
# both are part of the report contract, so both must parse.
CITATION_PATTERN = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def cited_numbers(report: str) -> set[int]:
    """Every source number cited in the report, [2] and [1,3] alike."""
    numbers: set[int] = set()
    for group in CITATION_PATTERN.findall(report):
        numbers.update(int(part) for part in group.split(","))
    return numbers


# CheckResult (name, passed, detail) is defined elsewhere in the lab code.
def _citations_resolve(report: str, sources: list[dict]) -> CheckResult:
    """Every cited [n] must point inside sources[] (the frozen report contract)."""
    cited = cited_numbers(report)
    # Valid markers are 1-based positions in sources[].
    dangling = sorted(n for n in cited if n < 1 or n > len(sources))
    return CheckResult(
        name="citations_resolve",
        passed=not dangling,
        detail=(
            f"all citations resolve into {len(sources)} source(s)"
            if not dangling
            else f"dangling citations {dangling} with only {len(sources)} source(s)"
        ),
    )
The other checks follow the same shape: every claim has a valid verdict (an
empty claims[] fails — the bypassed-Fact-checker smell), source count meets
min_sources, report length within bounds. Test them against the two committed
fixtures — one good report, one with planted defects:
# from the repo root
uv run pytest module-08/tests/ -k checks
The broken fixture must fail exactly four named checks. If your checks pass it, your checks are decorative.
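For orientation, the three remaining checks could look roughly like the sketch below. It is not the repo's code: the verdict labels, the character bounds, and the `(name, passed)` return shape are placeholder assumptions standing in for whatever `evals/checks.py` actually defines.

```python
# Assumption: placeholder verdict labels; the real contract lives in the lab code.
ASSUMED_VERDICTS = {"supported", "contradicted", "unverified"}


def claims_have_verdicts(claims: list[dict]) -> tuple[str, bool]:
    """Fail on an empty claims list or on any claim with a missing or unknown verdict."""
    ok = bool(claims) and all(c.get("verdict") in ASSUMED_VERDICTS for c in claims)
    return ("claims_have_verdicts", ok)


def enough_sources(sources: list[dict], min_sources: int) -> tuple[str, bool]:
    """The run must have retrieved at least the golden-set entry's floor."""
    return ("enough_sources", len(sources) >= min_sources)


def length_in_bounds(report: str, low: int = 200, high: int = 20_000) -> tuple[str, bool]:
    """Character bounds are placeholders; pick ones that fit your reports."""
    return ("length_in_bounds", low <= len(report) <= high)


# A bypassed Fact-checker leaves claims empty, so the first check fails.
print(claims_have_verdicts([]))
print(enough_sources([{"url": "https://example.com"}], min_sources=2))
print(length_in_bounds("too short"))
```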
Step 3 — The judge
The score schema and the rubric’s anchoring style, from evals/judge.py:
from pydantic import BaseModel, Field


class JudgeScore(BaseModel):
    """The judge's verdict on one report.

    Field names are FROZEN (06 §2): they flow unchanged into
    evals/results/*.json and are reused later in the series.
    """

    # Each dimension is an integer score constrained to 1..5.
    grounding: int = Field(ge=1, le=5)
    coverage: int = Field(ge=1, le=5)
    citations: int = Field(ge=1, le=5)
    rationale: str
RUBRIC = """\
...
GROUNDING — are the report's statements supported by the listed sources?
5: every substantive statement traces to a listed source; nothing invented
3: most statements supported; one or two go beyond what the sources say
1: the report contradicts the sources or invents facts freely
...
Style is NOT a dimension: a short report in bullet points is fine. Combined
markers like [1,3] count as precise citations. A statement honestly flagged
as unverified or contested is good practice — never punish that honesty
under grounding; only punish UNFLAGGED statements the sources do not back.
Answer with ONLY a JSON object — no markdown fences, no commentary — with
exactly these keys:
{"grounding": <1-5>, "coverage": <1-5>, "citations": <1-5>, "rationale": "<two sentences>"}
"""
Without that style paragraph, the judge docks grounding points for the exact
honesty the Writer is instructed to produce — flagged unverified claims. A
rubric tells your instrument what to punish and what not to.
The call runs config.JUDGE_MODEL at temperature 0 with the same 429 backoff
as every NIM call in this course, and a failed Pydantic validation retries
once, carrying the error message back to the model — the Module 4 pattern.
Step 4 — Baseline run
evals/run_evals.py ties it together: for each question, run the full Scout
graph against a fresh, isolated knowledge base (otherwise the Fact-checker
starts retrieving evidence left over from the previous question), print cheap
trajectory stats (nodes visited, worker turns, wall-clock), run the free
checks, then pay for the judge:
# Deterministic checks first — they are free; the judge bills tokens.
check_results = checks.run_checks(report, sources, claims, entry["min_sources"])
flunked = checks.failed(check_results)
score = judge.judge_report(entry["question"], entry["expected_points"], report, sources)
Module 08 ships the tuned Writer prompt, so restore Module 7’s baseline prompt first, then run:
cd module-08
cp ../module-07/scout/agents/writer.py scout/agents/writer.py
uv run evals/run_evals.py --limit 5 --out evals/results/run-baseline-mine.json
--limit 5 evaluates the first five questions — deliberately one of each
category, plus one. Expect several minutes: each question is a full
multi-agent run of 10–20 LLM calls under the free tier’s 40 req/min (as of
June 2026), plus a judge call. The -mine suffix is deliberate: the
committed run-baseline.json / run-tuned.json are the reference
15-question runs my numbers below come from — results are versioned like
code, so write yours next to them, not over them.
Step 5 — Tune one variable
Bring back the tuned prompt (git checkout -- scout/agents/writer.py). The
Writer’s system prompt gains exactly one block over Module 7’s:
Citation discipline:
- Place each [n] marker on the exact statement it supports — never
collect the citations at the end of a paragraph.
- The cited source must actually make the statement; never decorate a
sentence with a number whose source says something else.
- Never cite a number that is not in the numbered source list.
- If no listed source supports a statement, drop it or mark it
explicitly as unverified — an honest gap beats a decorated guess.
- List each cited source exactly once in References, as '[n] url'.
Nothing else changes. One variable, or the comparison proves nothing.
Step 6 — Re-run and prove it
uv run evals/run_evals.py --limit 5 --out evals/results/run-tuned-mine.json
uv run evals/compare_runs.py evals/results/run-baseline-mine.json evals/results/run-tuned-mine.json
Your command prints Comparing 'baseline-mine' -> 'tuned-mine', computed over
5-question means. The block below is the reference comparison instead: the committed
run-baseline.json / run-tuned.json, full 15-question runs on both prompts:
Comparing 'baseline' -> 'tuned' (model: nvidia/nemotron-3-nano-30b-a3b)
Overall (gated)
metric baseline tuned delta
grounding 4.00 4.20 +0.20
coverage 2.60 2.73 +0.13
citations 4.07 4.33 +0.26
... (comparative and factual tables omitted) ...
Category: multi-hop (diagnostic, small n)
metric baseline tuned delta
grounding 4.33 3.00 -1.33
coverage 2.33 2.33 +0.00
citations 4.00 3.67 -0.33
... (recent table omitted) ...
No regression flagged.
Those are my real numbers (my own 5-question quick pass swung on one
category, so I ran the whole set).
citations rose +0.26 with grounding and coverage up too — the fix is real,
and provably not a trade. Now read the multi-hop diagnostic like an
engineer: a -1.33 grounding swing on a 3-question slice traces to one
degenerate run (re-judged, that question scores 5/5 on grounding). That is
why the REGRESSION gate reads the overall aggregates while the category
tables are targeting diagnostics — they say “look at multi-hop next”, not
“reject this change”. If your own 5-question delta is flat or one category
looks wild, run the full set on both prompts (omit --limit; the baseline
half needs Step 4’s cp again first, the tuned half Step 5’s
git checkout; fifteen multi-agent runs under 40 req/min is a long
lunch — start it and walk away).
Step 7 — Verify the module
Back at the repo root (cd ..):
uv run pytest module-08/tests/ # offline: schemas, checks, judge contract
SCOUT_LIVE_TESTS=1 uv run pytest module-08/tests/ # + 1 judge call + 1 full eval run
The offline tests validate the golden-set schema, the checks against both fixtures, the judge’s structured-output contract (scripted client), and the comparison logic — zero API calls, so CI stays free. Modules 01–07’s suites stay green on the carried-over code: the repo’s cumulative rule.
Try it yourself (no solution provided):
- Make the golden set yours: add 5 questions from your own domain with honest expected_points, then re-run the comparison. Does the prompt fix hold, or did it overfit to the course’s questions? (While your questions are in place, test_golden_set_schema, which expects exactly 15 questions, is red; that is expected. Restore the frozen file with git checkout -- evals/golden_set.json.)
- Go pairwise: write a judge_pairwise(question, report_a, report_b) variant and call it twice with the order swapped. If the winner flips, you have reproduced position bias on your own instrument; only count a win when both orders agree.
Exam corner
What the exam tests here. Per the official study guide, Domain 3 (13%) has five objectives, each covered in a section of this module:
| Objective | Where it’s covered |
|---|---|
| 3.1 Implement evaluation pipelines and task benchmarks | “Why agent evaluation…”, “Metrics that matter”, the lab |
| 3.2 Compare agent performance across tasks and datasets | Golden-set stratification + compare_runs |
| 3.3 Collect and integrate structured user feedback | “Golden sets”: the feedback → eval-case loop |
| 3.4 Tune model parameters (accuracy, latency-efficiency) | “The tuning loop”: the levers table |
| 3.5 Analyze evaluation results to guide targeted optimization | “The tuning loop”: scores by category |
Quiz — answers after question 5.
1. A team runs two agents: a customer-support agent resolving tickets in chat, and a research agent producing sourced market briefs. They want to pick primary metrics per agent. Which pairing fits best?
   - A) Support: grounding; research: latency
   - B) Support: task success rate and latency; research: grounding and citation accuracy
   - C) Both: tokens per run, because cost is what management reads
   - D) Both: an MMLU-style benchmark on the underlying models
2. Your pairwise LLM-as-judge compares report A and report B and picks A. You re-run with the two reports in the opposite order, and it picks B. What’s happening, and what’s the correct response?
   - A) The reports are equal; record a tie and move on
   - B) The judge’s temperature is too low; raise it for more decisive verdicts
   - C) Position bias; evaluate both orders and only count a result when the verdicts agree
   - D) Self-preference bias; switch the judge to the same model the agent uses
3. An agent returns a correct final answer, but the run made 12 tool calls where 4 would do, and the verification stage never executed. Which evaluation approach detects this where outcome-only evaluation would miss it?
   - A) Trajectory evaluation: inspect the path (node visits, tool-call counts, skipped stages)
   - B) A larger golden set: more questions surface more failures
   - C) Pairwise judging against a reference report
   - D) Re-running the same question at temperature 0 to remove nondeterminism
4. After switching every node of a multi-agent system to a much larger model, quality ticks up slightly but p95 latency triples and costs spike. Within Domain 3’s “tune model parameters,” what’s the best corrective?
   - A) Fine-tune the large model on the team’s data to make it faster
   - B) Keep the large model only where judgment matters; use a small, fast model for routing turns, and cap output/reasoning budgets per node
   - C) Remove the iteration cap so runs can finish in fewer, longer turns
   - D) Disable evaluation runs to recover the spent latency budget
5. Users flag, through a feedback rubric, that reports on comparative questions are frequently off topic. What’s the correct way to integrate this feedback?
   - A) Immediately rewrite the Planner prompt, because users have spoken
   - B) Add the flagged questions to the golden set with expected points, reproduce the low scores, fix one variable, and re-run to prove the gain
   - C) Lower the judge’s coverage threshold so the reports pass again
   - D) Ask users to rephrase comparative questions as factual ones
Answers. 1 — B. Metrics follow the use case: a support agent is judged on resolving the task fast (task success, latency); a research agent on output trustworthiness (grounding, citation accuracy). C measures cost, not quality; D benchmarks the model — the model-vs-agent confusion. 2 — C. A winner that flips with presentation order is position bias, the classic pairwise failure; the mitigation is order-swapping with agreement. D creates self-preference bias rather than fixing anything. 3 — A. Redundant calls and a skipped verification stage are path defects: only trajectory evaluation (node visits, tool-call counts) sees them, because the outcome is correct by assumption. B, C, and D all grade outcomes harder. 4 — B. Domain 3’s “tune model parameters” means inference-time choices and the accuracy vs. latency-efficiency trade-off: right-size the model per node, budget output and reasoning tokens. A is weight training — out of scope; C worsens worst-case latency; D is giving up measurement. 5 — B. Structured feedback’s job is to become evaluation cases: reproduce, fix one variable, re-run, prove. A is a blind fix you can’t verify; C is gaming the instrument; D blames the user.
Traps to avoid:
- Model benchmark ≠ agent evaluation. MMLU scores and leaderboards grade a model’s knowledge on static data; agent evaluation grades a system on end-to-end tasks, trajectories included, in a live environment. The exam loves offering a model benchmark as a distractor for an agent-quality question.
- “Tune model parameters” ≠ retraining. In Domain 3, tuning means inference parameters and accuracy/latency-efficiency trade-offs — prompts, temperature, budgets, model selection per node. Fine-tuning weights is a different discipline.
- Judge scores ≠ ground truth. Without human-labeled calibration samples and bias mitigations, an LLM judge’s scores are precise-looking noise. Any answer that treats judge output as objective truth is bait.
- A bare 👍/👎 is not structured feedback. Objectives 3.3 and 10.2 mean feedback a pipeline can act on — rubric dimensions, missing points, wrong citations — routed into the golden set. A thumbs-down says that something failed, never what.
Key takeaways
- Agent evaluation is not model evaluation: it grades an end-to-end task in a live environment, trajectories included — a great model can still power a broken agent.
- A golden set is your agent’s regression suite: stratified, versioned, expected points rather than answers — and grown from structured user feedback.
- Deterministic checks run before the judge: free code catches broken citations and missing verdicts before any judge token is spent.
- An LLM judge is an instrument, not a referee: anchored rubric, typed output, temperature 0, a stronger different model — calibrated against human labels.
- Evaluate outcome and trajectory: a beautiful report can hide a skipped Fact-checker and twelve redundant tool calls.
- Read scores by category — aggregates hide exactly the failures you need to find.
- Tune one variable at a time and prove it: baseline, change, re-run, compare, flag regressions.
Keep going
Scout reads arbitrary web content — which means arbitrary web content can talk to Scout. Next module: guardrails against prompt injection, and a human approval gate on the research plan.
References
- NCP-AAI certification page — the official blueprint; Evaluation and Tuning weighs 13%, and the study guide’s objectives 3.1–3.5 structure this module.
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al.; measured position, verbosity, and self-preference bias and validated LLM judges against human preferences.
- Mastering Agentic Techniques: AI Agent Evaluation — NVIDIA Technical Blog on why agent evaluation differs from model evaluation.
- Ragas documentation — RAG evaluation metrics (0.4.x); its faithfulness corresponds to this module’s grounding.
- openevals — ready-made LLM judge evaluators from the LangChain ecosystem.
- Evaluating RAG and Semantic Search Systems — the NVIDIA DLI course officially recommended for Domain 3.