Evaluating AI Agents: Metrics, LLM-as-Judge, and Tuning (NCP-AAI Module 8)

Module 8 of 14 · 22 min read · D3 · 13% · Lab code ↗

This is Module 8 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.

By the end of Module 7, Scout produces genuinely impressive reports: a supervisor coordinates four specialists, claims get cross-checked, citations resolve. Then you change one line of the Writer’s prompt, re-run, and the report comes out… different. Better? You read both versions twice. You hesitate. You ask a colleague; they hesitate too, then pick the one with nicer headings.

That’s demo-driven development: every “improvement” is an opinion, and every regression is invisible until a user finds it. The exam dedicates an entire domain to the fix — Evaluation and Tuning, 13% of your score, covered only by this module. By the end, one command prints before/after scores for Scout and settles the argument with numbers instead of vibes.

In this module

  • You’ll learn:
    • Implement an evaluation pipeline: golden set, deterministic checks, LLM-as-judge, regression script.
    • Distinguish model evaluation from agent evaluation, and choose metrics per use case.
    • Calibrate an LLM-as-judge: rubric design, structured scoring, known biases and their mitigations.
    • Tune with discipline: prompts, inference parameters, model and architecture levers — one change at a time.
    • Analyze results by category to target the next fix, and grow the golden set from structured user feedback.
  • You’ll build: an eval harness for Scout — a 15-question golden set, an LLM-as-judge scoring grounding, coverage, and citations, and a regression script that proves a prompt improvement.
  • Exam domains covered: D3 — Evaluation and Tuning — 13% of the exam.
  • Prerequisites: Modules 1–7 (Scout’s multi-agent team runs end to end); NVIDIA_API_KEY and Tavily key configured.

Where you are

  • ✅ Modules 1–3 — foundations: first NIM call, ReAct + tool calling, architecture
  • ✅ Modules 4–6 — cognition and knowledge: Planner + reflection, memory, RAG with citations
  • ✅ Module 7 — Multi-Agent Systems — supervisor + Searcher, Reader, Fact-checker, Writer
  • 👉 Module 8 — Evaluating AI Agents (you are here)
  • ⬜ Modules 9–14 — guardrails, deployment, observability, NVIDIA stack, capstone, the exam

Scout before: a complete multi-agent team producing cited reports — and no way to measure their quality. Scout after: an evals/ harness — golden set, judge, regression script — the safety net every change rides on from here to v1.0.

Why agent evaluation is not model evaluation

You’ve seen model benchmarks: MMLU-style suites where a model answers thousands of fixed questions against an answer key. That machinery measures a model — one input, one completion, one verifiable answer, on a dataset that never changes. It tells you the engine is good — and nothing about the car.

An agent is the car. Scout’s value isn’t a single completion; it’s a multi-step task — plan, search a live web that changes between runs, fetch, cross-check, write — performed by a planner and four coordinated specialists with tool calls in the loop and nondeterminism at every step. A model that benchmarks beautifully can still power an agent that searches in circles, skips its Fact-checker, and cites sources it never read. “The model is good” and “the agent is good” are different claims — and the exam tests which machinery validates which.

| | Model evaluation | Agent evaluation |
| --- | --- | --- |
| Unit evaluated | One model, one completion | A whole system on an end-to-end task |
| Data | Static benchmark dataset | Live environment (the web changes under you) |
| Determinism | Mostly repeatable | Multi-step and nondeterministic: same input, different paths |
| What can break | The model’s knowledge or reasoning | Routing, tool calls, handoffs, any node, the environment |
| Cost of a run | One inference per item | 10–20 LLM calls + tool calls per item |

Agent evaluation therefore needs its own vocabulary. Task success rate is the fraction of tasks where the agent achieved the user’s actual goal — the headline agent metric, scored against the task, not a reference string. And every run has two things worth grading: the result and the path. The path is the trajectory — the sequence of steps the agent took: which nodes ran, which tools were called, in what order, at what cost. Outcome evaluation judges the final artifact (is the report correct, grounded, cited?); trajectory evaluation judges the path (did the Fact-checker run? were three of the seven searches redundant?). You need both lenses on the same run:

flowchart LR
    run["One Scout run"] --> outcome["Outcome lens:<br/>the final report —<br/>correct, grounded, cited?"]
    run --> traj["Trajectory lens:<br/>the path — node order,<br/>tool calls, redundancy,<br/>skipped stages"]

The concrete failure that makes the second lens non-optional: in Scout’s graph, the supervisor decides which specialist runs next. A routing bug — or a lazy supervisor turn — can skip the Fact-checker entirely, and Module 7’s Writer still produces a fluent, confident, nicely cited report. Outcome-only evaluation grades it and smiles; the trajectory shows claims[] empty and the verification stage missing.
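A trajectory check along these lines can be sketched in a few lines of Python. This is an illustration, not the lab's code: it assumes the run's path is available as a list of step dictionaries, and the key names (`node`, `query`) and node names (`searcher`, `fact_checker`) are hypothetical.

```python
from collections import Counter

# Hypothetical trajectory format: one dict per step, in execution order.
# Key names and node names are assumptions; adapt them to however your
# graph records its steps.
REQUIRED_NODES = {"fact_checker"}


def trajectory_findings(steps: list[dict]) -> list[str]:
    """Return path defects: required nodes that never ran, repeated searches."""
    findings = []
    visited = {step["node"] for step in steps}
    for node in sorted(REQUIRED_NODES - visited):
        findings.append(f"required node never ran: {node}")

    # A search counts as redundant when the same normalized query repeats.
    queries = Counter(
        step["query"].strip().lower()
        for step in steps
        if step["node"] == "searcher" and "query" in step
    )
    for query, count in sorted(queries.items()):
        if count > 1:
            findings.append(f"query issued {count} times: {query!r}")
    return findings


run = [
    {"node": "searcher", "query": "vector database vs knowledge graph"},
    {"node": "reader"},
    {"node": "searcher", "query": "Vector database vs knowledge graph "},
    {"node": "writer"},
]
for finding in trajectory_findings(run):
    print(finding)
```

On this made-up run the outcome could still look fine, yet the path shows both defects: the verification node never ran and one search was repeated.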

Golden sets: your agent’s regression suite

You already trust regression suites for code: fixed tests that must stay green after every change. A golden set is the same idea for agent behavior — a versioned collection of representative questions, with expected results, that you re-run after every change to detect improvement or regression. Scout’s lives in evals/golden_set.json: 15 authored questions, committed to the repo, because a golden set is executable specification.

Three design rules make a golden set worth its tokens:

  1. Representative and stratified. Scout’s 15 questions split across four categories — factual, comparative, recent, multi-hop — because aggregate scores hide category-level failures.
  2. Expected points, not expected answers. A research agent has no single correct string — two good reports on one question can share zero sentences. So each entry lists 3–5 facets a good answer must cover, and the judge scores facet coverage, not string similarity.
  3. Versioned like code. Change the questions and every previous score becomes incomparable. Treat edits like schema migrations — deliberate, and visible in the diff.

One entry, in the frozen schema the lab uses (and later modules reuse as-is):

{
  "id": "c02",
  "question": "When should an AI system use a knowledge graph instead of a vector database?",
  "category": "comparative",
  "expected_points": [
    "Vector databases retrieve by semantic similarity over embeddings",
    "Knowledge graphs store entities and typed relationships",
    "Multi-hop relational questions follow edges, which similarity search cannot do",
    "Many production systems combine both"
  ],
  "min_sources": 2
}
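Because the golden set is a specification, it can be validated like one. The sketch below checks a single entry against the rules stated above (the four categories, 3–5 expected points). The function names are hypothetical, and the requirement that `min_sources` be a positive integer is an assumption rather than something the lab schema is known to enforce.

```python
import json
from collections import Counter

CATEGORIES = {"factual", "comparative", "recent", "multi-hop"}
REQUIRED_KEYS = {"id", "question", "category", "expected_points", "min_sources"}


def validate_entry(entry: dict) -> list[str]:
    """Return the problems found in one golden-set entry (empty list = valid)."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    if entry["category"] not in CATEGORIES:
        problems.append(f"unknown category: {entry['category']}")
    if not 3 <= len(entry["expected_points"]) <= 5:
        problems.append("expected_points should list 3 to 5 facets")
    # Assumption: a floor below 1 source would make the check meaningless.
    if not isinstance(entry["min_sources"], int) or entry["min_sources"] < 1:
        problems.append("min_sources must be a positive integer")
    return problems


def category_counts(entries: list[dict]) -> Counter:
    """Count entries per category to confirm the set is stratified."""
    return Counter(entry["category"] for entry in entries)


entry = json.loads("""
{
  "id": "c02",
  "question": "When should an AI system use a knowledge graph instead of a vector database?",
  "category": "comparative",
  "expected_points": [
    "Vector databases retrieve by semantic similarity over embeddings",
    "Knowledge graphs store entities and typed relationships",
    "Multi-hop relational questions follow edges, which similarity search cannot do",
    "Many production systems combine both"
  ],
  "min_sources": 2
}
""")
print(validate_entry(entry))
print(category_counts([entry]))
```

Running a validator like this in CI keeps accidental edits to the set visible before they make scores incomparable.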

Where do new entries come from once you’re live? The best source is structured user feedback — feedback collected through a rubric with specific questions (Was the report on topic? Do you trust the sources? Right depth? What was missing?) rather than a bare thumbs-up. Per the official study guide, objective 3.3 is exactly this loop: a 👍/👎 tells you that something is wrong; a rubric answer like “off topic for comparative questions” tells you what — and converts directly into a new golden-set entry that reproduces the failure, so re-running proves the fix.

Metrics that matter: task success, grounding, cost, latency

With a golden set in hand, what do you measure per run? Domain 3 expects you to pick metrics per use case, so internalize this table — which metric catches what, and what it costs to measure. Several rows lean on an LLM-as-judge — an LLM grading outputs against a rubric, dissected in the next section:

| Metric | What it catches | How it’s measured |
| --- | --- | --- |
| Task success rate | The agent doesn’t achieve the goal | LLM-as-judge or human, against the task |
| Grounding | Statements not supported by any source | LLM-as-judge (sources vs. report) |
| Citation accuracy | Markers that point at the wrong/no source | Code check (resolution) + judge (precision) |
| Coverage | Expected facets missing from the answer | LLM-as-judge, against expected_points |
| Cost per run | Token burn creeping up per change | Code: count calls/tokens |
| Latency | Runs drifting from 40s to 4 minutes | Code: wall-clock per run |
| Tool-call efficiency | Redundant searches, skipped stages (trajectory) | Code: count/inspect tool calls and node visits |

Grounding — the degree to which the output’s statements are supported by the sources actually retrieved — is the flagship metric for a research agent: an ungrounded report is a liability with citations on it. It differs from task success (a grounded report can still miss the question) and from citation accuracy (a well-grounded statement can carry the wrong [n] marker). For a customer-support agent you’d weight task success and latency instead; choosing the metric for the use case is precisely what exam scenarios test.

Now the cost discipline: the table’s “code check” rows cost zero tokens; the judge rows bill a strong model per question. In production you’d gate judge calls on the cheap layer — no tokens spent grading a report whose citations don’t even resolve. Scout’s harness instead judges every run, flunked or not, so all rows of the results schema stay comparable; what it keeps from the discipline is the order. It runs deterministic checks — pure-code assertions on the output: do all [n] citations resolve into sources[]? does every Claim carry a verdict? are there at least min_sources sources? is the length sane? — before any judge call, every time: free catches first, then the paid verdict.
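The two policies differ by a single flag. The sketch below is a simplified stand-in, not the harness itself: `evaluate_run`, `fake_checks`, and `fake_judge` are hypothetical names, and the check and judge are passed in as plain callables so the ordering logic stays visible.

```python
from typing import Callable, Optional


def evaluate_run(
    run_checks: Callable[[], list[str]],
    run_judge: Callable[[], dict],
    gate_on_checks: bool,
) -> dict:
    """Run the free checks first, then decide whether to pay for the judge.

    run_checks returns the names of failed checks; run_judge returns scores.
    """
    failed = run_checks()  # zero tokens, so it always runs first
    score: Optional[dict] = None
    if not (gate_on_checks and failed):
        score = run_judge()  # the paid call
    return {"failed_checks": failed, "score": score}


judge_calls: list[str] = []


def fake_checks() -> list[str]:
    return ["citations_resolve"]  # pretend one check flunked


def fake_judge() -> dict:
    judge_calls.append("called")
    return {"grounding": 2, "coverage": 3, "citations": 1}


gated = evaluate_run(fake_checks, fake_judge, gate_on_checks=True)
always = evaluate_run(fake_checks, fake_judge, gate_on_checks=False)
print(gated["score"], always["score"], len(judge_calls))
```

With gating on, the flunked run costs nothing but leaves a hole in the results; with gating off, every row has a score, which is the trade the harness makes.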

Notice what makes those checks possible: the course’s frozen contracts. Reports cite [n] markers into a typed sources[] array (Module 6 built that), and the Fact-checker fills claims[] with verdicts (Module 7). Because the output is structured, “are the citations broken?” is a regex, not a judgment call — schema discipline pays its rent in evaluation.

LLM-as-judge: power and pitfalls

Deterministic checks can’t tell you whether a report is good — whether its statements follow from the sources, whether it covered what matters. Those are semantic judgments with no single right answer. LLM-as-judge is the technique of using an LLM (usually a stronger one) to score another model’s output against an explicit rubric — it’s how agent outputs get evaluated at scale, because nobody pays humans to read 15 research reports after every prompt tweak.

Three practices separate a usable judge from a random-number generator. First, an anchored rubric: every dimension gets concrete level descriptions (“5: every substantive statement traces to a listed source… 1: invents facts freely”) — “rate 1–5” with no anchors invites drift. Second, structured output: the judge returns a typed object (a Pydantic JudgeScore with grounding, coverage, citations, each 1–5, plus a rationale), so scores parse, validate, and aggregate — and a malformed verdict retries with the validation error instead of silently passing. Third, temperature 0: a measuring instrument should not be creative.

Judges come in two shapes: pointwise (score one output against the rubric) and pairwise (compare two outputs and pick a winner). Pointwise scores trend over time; pairwise is more sensitive for A/B decisions but imports the nastiest bias below. A judge is a model with model failure modes — Zheng et al. measured the first three systematically in the MT-Bench paper:

| Judge bias | What it looks like | Mitigation |
| --- | --- | --- |
| Position bias | In pairwise mode, the first (or last) answer wins more | Judge both orders; only count agreement |
| Verbosity bias | Longer answers score higher, content equal | Rubric rewards density; length-cap or normalize |
| Self-preference bias | A judge favors output from its own model/family | Judge ≠ agent’s model: different family or size |
| Leniency drift | Scores inflate over time or after judge updates | Anchored rubric; re-judge a fixed sample on every judge change |

The course’s choice makes one mitigation structural: Scout runs on nvidia/nemotron-3-nano-30b-a3b, while the judge is nvidia/nemotron-3-super-120b-a12b — a stronger model (better judgment) that is also a different size class (self-preference mitigated by construction). Never let an agent’s own model grade its homework. My working rule: hand-label about 10% of judged samples and check agreement before believing any judge — if you and the judge disagree on 4 reports out of 10, fix the rubric, not the agent.
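One way to make that agreement check concrete is sketched below. The definition of "agree" is an assumption here: scores within one point of each other on the 1–5 scale count as agreement, and the sample scores are invented for illustration.

```python
def agreement_rate(human: list[int], judge: list[int], tolerance: int = 1) -> float:
    """Fraction of samples where judge and human scores differ by <= tolerance."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    agreed = sum(abs(h - j) <= tolerance for h, j in zip(human, judge))
    return agreed / len(human)


# Invented grounding scores for ten hand-labeled reports.
human_scores = [5, 4, 2, 5, 3, 4, 1, 5, 4, 3]
judge_scores = [5, 5, 4, 5, 3, 2, 3, 5, 5, 1]

rate = agreement_rate(human_scores, judge_scores)
disagreements = round((1 - rate) * len(human_scores))
print(f"agreement: {rate:.0%}, disagreements: {disagreements}")
```

Here the judge and the human disagree on 4 of 10 reports, which under the rule above points at the rubric rather than the agent. A stricter tolerance of 0 would demand exact matches.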

The ecosystem, briefly: openevals ships ready-made judge evaluators on the LangChain side, and ragas (0.4.x) provides RAG metrics — its faithfulness is essentially our grounding. The lab builds the judge by hand anyway: a hundred-odd lines that teach you what those libraries wrap.

One sentence to tattoo somewhere: an LLM judge is not an objective referee — it’s a measuring instrument, and instruments get calibrated or they get distrusted.

The tuning loop: from scores to targeted fixes

Scores exist to cause correct action, and the first rule of acting on them is: never read only the aggregate. Suppose Scout’s overall grounding is 4.1/5 — fine, apparently. Split by category and you might find 4.6 on factual questions and 2.8 on comparative ones: Scout grounds single-fact answers well but free-styles when synthesizing two positions. Now you know where to act (the Writer’s synthesis behavior, or the Planner’s decomposition of comparative questions) instead of “improve prompts somewhere.” That category-level analysis is objective 3.5 verbatim — analyze evaluation results to guide targeted optimization — and it’s why the golden set was stratified in the first place.
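The split itself is a group-by. A minimal sketch, assuming each result row carries a `category` plus the judge's metric fields; the row shape and the scores are invented for illustration and are not the lab's results schema.

```python
from collections import defaultdict
from statistics import mean


def mean_by_category(rows: list[dict], metric: str) -> dict[str, float]:
    """Average one metric per category, rounded to two decimals."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for row in rows:
        buckets[row["category"]].append(row[metric])
    return {cat: round(mean(vals), 2) for cat, vals in sorted(buckets.items())}


# Invented scores: the aggregate looks acceptable, one category does not.
rows = [
    {"category": "factual", "grounding": 5},
    {"category": "factual", "grounding": 4},
    {"category": "factual", "grounding": 5},
    {"category": "comparative", "grounding": 3},
    {"category": "comparative", "grounding": 2},
    {"category": "comparative", "grounding": 3},
]
overall = round(mean(row["grounding"] for row in rows), 2)
print("overall:", overall)
print("by category:", mean_by_category(rows, "grounding"))
```

The overall mean hides a two-point gap between the categories, which is the signal that tells you which node to look at next.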

Once you know where, choose the cheapest lever that can plausibly fix it:

| Tuning lever | Cost of change | Typical gain | When to use |
| --- | --- | --- | --- |
| Prompt (instructions, rubric, examples) | Minutes; no infra | Often large | First, almost always |
| Inference parameters (temperature, max output tokens, reasoning budget) | Minutes; no infra | Moderate; big latency/cost wins | Accuracy vs. latency-efficiency trade-offs |
| Model swap (per node) | Hours; re-validate everything | Large either direction | A node under- or over-powered for its job |
| Architecture (add/remove/rewire nodes) | Days; state and evals churn | Large | When the structure itself is the bottleneck |

An exam-critical clarification on row two, because objective 3.4 says “tune model parameters” and trips people up: in Domain 3 that means inference-time parameters — temperature, output budgets, reasoning effort, which model serves which node — and the accuracy vs. latency-efficiency trade-offs they control. It does not mean training or fine-tuning weights. Concretely: moving every Scout node to a larger model might lift grounding by 0.2 and triple your p95 (95th-percentile) latency; the tuned answer is usually a small, fast model for routing turns and the stronger model only where judgment lives.
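To put a number on the latency side of that trade-off, p95 can be computed from per-run wall-clock times. The sketch below uses the nearest-rank method, which is one of several percentile definitions; the latency samples are invented, and the tripling is constructed to mirror the hypothetical above, not measured.

```python
def p95(latencies_s: list[float]) -> float:
    """95th-percentile latency using the nearest-rank method."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Nearest rank = ceil(0.95 * n), computed with integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Invented wall-clock times in seconds for twenty runs per configuration.
mixed_models = [40 + i for i in range(20)]      # small model for routing turns
all_large = [3 * seconds for seconds in mixed_models]

print("p95 mixed:", p95(mixed_models))
print("p95 all-large:", p95(all_large))
print("ratio:", p95(all_large) / p95(mixed_models))
```

Tracking this next to the judge scores makes the trade explicit: a quality gain is only worth keeping if the latency it costs is acceptable for the use case.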

Then the discipline that turns levers into proof — the loop the lab builds:

flowchart TD
    GS["Golden set<br/>(15 questions, versioned)"] --> RUN["Run Scout<br/>per question"]
    RUN --> DC["Deterministic checks<br/>(free, always first)"]
    DC --> J["LLM-as-judge<br/>grounding · coverage · citations"]
    J --> CAT["Scores by category"]
    CAT --> DECIDE{"Tuning decision:<br/>prompt? params?<br/>model? architecture?"}
    DECIDE -->|"change ONE variable"| RERUN["Re-run the golden set"]
    RERUN --> CMP["Compare vs. baseline<br/>(flag regressions)"]
    CMP --> CAT

Change one variable at a time, re-run the same golden set, compare against the stored baseline, and treat any metric that drops beyond a threshold as a regression to investigate. Two changes per run means zero attributable results — demo-driven development with extra steps.
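The comparison step reduces to a delta per metric and a threshold. The sketch below is not the lab's compare_runs.py: the function name and the 0.25 threshold are assumptions, and the input values are the overall means from this module's reference comparison.

```python
def compare(baseline: dict[str, float], tuned: dict[str, float],
            threshold: float = 0.25) -> dict[str, dict]:
    """Per-metric delta, flagged as a regression when it drops past the threshold."""
    report = {}
    for metric, before in baseline.items():
        delta = round(tuned[metric] - before, 2)
        report[metric] = {"delta": delta, "regression": delta < -threshold}
    return report


# Overall means from the reference 15-question runs shown in the lab.
baseline = {"grounding": 4.00, "coverage": 2.60, "citations": 4.07}
tuned = {"grounding": 4.20, "coverage": 2.73, "citations": 4.33}

report = compare(baseline, tuned)
for metric, row in report.items():
    flag = "REGRESSION" if row["regression"] else "ok"
    print(f"{metric:<10} {row['delta']:+.2f}  {flag}")
```

The threshold exists because judge scores are noisy: a small negative delta is not evidence of a regression, while a drop past the threshold is worth investigating before the change ships.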

Hands-on lab: build it

You’ll build the full harness, then use it as intended: prove that one prompt improvement to the Writer raises citation quality without degrading anything else. The complete code lives in module-08/ of the labs repo.

Objective: build Scout’s eval harness (golden set, checks, judge, runner, comparator) and ship a proven prompt fix.

Observable result: at the end, cd module-08 && uv run evals/compare_runs.py evals/results/run-baseline.json evals/results/run-tuned.json prints a per-metric delta table where the overall citations delta is positive and nothing is flagged REGRESSION.

Step 1 — Read the golden set

Open evals/golden_set.json: 15 authored questions in the schema you saw above — stratified across the four categories, with expected_points facets and a min_sources floor per question. Committed, never generated at runtime: a golden set you can’t diff is a golden set you can’t trust.

Step 2 — Deterministic checks

evals/checks.py asserts the frozen contracts on a finished run. Its heart — citation resolution against sources[]:

import re

# CheckResult (fields: name, passed, detail) is defined outside this excerpt.
# The real Writer emits single markers [2] AND combined ones [1,3];
# both are part of the report contract, so both must parse.
CITATION_PATTERN = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")

def cited_numbers(report: str) -> set[int]:
    """Every source number cited in the report, [2] and [1,3] alike."""
    numbers: set[int] = set()
    for group in CITATION_PATTERN.findall(report):
        numbers.update(int(part) for part in group.split(","))
    return numbers

def _citations_resolve(report: str, sources: list[dict]) -> CheckResult:
    """Every cited [n] must point inside sources[] — the frozen report contract."""
    cited = cited_numbers(report)
    dangling = sorted(n for n in cited if n < 1 or n > len(sources))
    return CheckResult(
        name="citations_resolve",
        passed=not dangling,
        detail=(
            f"all citations resolve into {len(sources)} source(s)"
            if not dangling
            else f"dangling citations {dangling} with only {len(sources)} source(s)"
        ),
    )

The other checks follow the same shape: every claim has a valid verdict (an empty claims[] fails — the bypassed-Fact-checker smell), source count meets min_sources, report length within bounds. Test them against the two committed fixtures — one good report, one with planted defects:

# from the repo root
uv run pytest module-08/tests/ -k checks

The broken fixture must fail exactly four named checks. If your checks pass it, your checks are decorative.
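For orientation, the three remaining checks could look roughly like this. It is a sketch under stated assumptions, not the lab's checks.py: the `CheckResult` dataclass is a stand-in with the same three fields as the excerpt above, and the check names, verdict vocabulary, and length bounds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Stand-in result type: the same three fields the citation check uses."""
    name: str
    passed: bool
    detail: str


# Assumed values: the lab's real verdict vocabulary and length bounds may differ.
VALID_VERDICTS = {"supported", "contradicted", "unverified"}
MIN_CHARS, MAX_CHARS = 200, 20_000


def claims_have_verdicts(claims: list[dict]) -> CheckResult:
    """Every claim needs a valid verdict; an empty list fails (Fact-checker bypassed)."""
    invalid = [c for c in claims if c.get("verdict") not in VALID_VERDICTS]
    passed = bool(claims) and not invalid
    if passed:
        detail = f"{len(claims)} claim(s), all with valid verdicts"
    elif not claims:
        detail = "claims[] is empty"
    else:
        detail = f"{len(invalid)} claim(s) without a valid verdict"
    return CheckResult("claims_have_verdicts", passed, detail)


def enough_sources(sources: list[dict], min_sources: int) -> CheckResult:
    """The run must have collected at least min_sources sources."""
    passed = len(sources) >= min_sources
    return CheckResult("enough_sources", passed,
                       f"{len(sources)} source(s), need {min_sources}")


def length_in_bounds(report: str) -> CheckResult:
    """Reject reports that are suspiciously short or runaway long."""
    passed = MIN_CHARS <= len(report) <= MAX_CHARS
    return CheckResult("length_in_bounds", passed, f"{len(report)} characters")


results = [
    claims_have_verdicts([]),
    enough_sources([{"url": "https://example.com"}], min_sources=2),
    length_in_bounds("too short"),
]
print([r.name for r in results if not r.passed])
```

Each check returns a named result instead of raising, so a single run can report every defect at once, which is what lets a fixture assert an exact set of failures.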

Step 3 — The judge

The score schema and the rubric’s anchoring style, from evals/judge.py:

from pydantic import BaseModel, Field


class JudgeScore(BaseModel):
    """The judge's verdict on one report.

    Field names are FROZEN (06 §2): they flow unchanged into
    evals/results/*.json and are reused later in the series.
    """

    grounding: int = Field(ge=1, le=5)
    coverage: int = Field(ge=1, le=5)
    citations: int = Field(ge=1, le=5)
    rationale: str


RUBRIC = """\
...
GROUNDING — are the report's statements supported by the listed sources?
  5: every substantive statement traces to a listed source; nothing invented
  3: most statements supported; one or two go beyond what the sources say
  1: the report contradicts the sources or invents facts freely
...
Style is NOT a dimension: a short report in bullet points is fine. Combined
markers like [1,3] count as precise citations. A statement honestly flagged
as unverified or contested is good practice — never punish that honesty
under grounding; only punish UNFLAGGED statements the sources do not back.

Answer with ONLY a JSON object — no markdown fences, no commentary — with
exactly these keys:
{"grounding": <1-5>, "coverage": <1-5>, "citations": <1-5>, "rationale": "<two sentences>"}
"""

Without that style paragraph, the judge docks grounding points for the exact honesty the Writer is instructed to produce — flagged unverified claims. A rubric tells your instrument what to punish and what not to.

The call runs config.JUDGE_MODEL at temperature 0 with the same 429 backoff as every NIM call in this course, and a failed Pydantic validation retries once, carrying the error message back to the model — the Module 4 pattern.
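The retry-with-error pattern can be shown without any API call. The sketch below swaps Pydantic for hand-written validation so it runs on the standard library alone, and it replaces the real client with a scripted one; `parse_score`, `judge_with_retry`, and `scripted_ask` are hypothetical names, and backoff handling is left out.

```python
import json
from typing import Callable

DIMENSIONS = ("grounding", "coverage", "citations")


def parse_score(raw: str) -> dict:
    """Validate a judge reply; raise ValueError with a message the model can act on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for key in DIMENSIONS:
        value = data.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {value!r}")
    if not isinstance(data.get("rationale"), str):
        raise ValueError("rationale must be a string")
    return data


def judge_with_retry(ask: Callable[[str], str], prompt: str, retries: int = 1) -> dict:
    """Call the judge; on an invalid reply, retry with the error appended to the prompt."""
    message = prompt
    last_error = ""
    for _ in range(retries + 1):
        try:
            return parse_score(ask(message))
        except ValueError as exc:
            last_error = str(exc)
            message = f"{prompt}\n\nYour previous reply was rejected: {last_error}"
    raise ValueError(f"judge reply still invalid after {retries} retry attempt(s): {last_error}")


# Scripted client: the first reply breaks the 1-5 range, the second is valid.
replies = iter([
    '{"grounding": 7, "coverage": 3, "citations": 4, "rationale": "x"}',
    '{"grounding": 4, "coverage": 3, "citations": 4, "rationale": "Mostly supported."}',
])
seen_prompts: list[str] = []


def scripted_ask(message: str) -> str:
    seen_prompts.append(message)
    return next(replies)


score = judge_with_retry(scripted_ask, "Score this report.")
print(score["grounding"], len(seen_prompts))
```

The important property is that an out-of-range score never reaches the results file: it either gets corrected on the retry or the run fails loudly.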

Step 4 — Baseline run

evals/run_evals.py ties it together: for each question, run the full Scout graph against a fresh, isolated knowledge base (otherwise the Fact-checker starts retrieving evidence left over from the previous question), print cheap trajectory stats (nodes visited, worker turns, wall-clock), run the free checks, then pay for the judge:

# Deterministic checks first — they are free; the judge bills tokens.
check_results = checks.run_checks(report, sources, claims, entry["min_sources"])
flunked = checks.failed(check_results)

score = judge.judge_report(entry["question"], entry["expected_points"], report, sources)

Module 08 ships the tuned Writer prompt, so restore Module 7’s baseline prompt first, then run:

cd module-08
cp ../module-07/scout/agents/writer.py scout/agents/writer.py
uv run evals/run_evals.py --limit 5 --out evals/results/run-baseline-mine.json

--limit 5 evaluates the first five questions — deliberately one of each category, plus one. Expect several minutes: each question is a full multi-agent run of 10–20 LLM calls under the free tier’s 40 req/min (as of June 2026), plus a judge call. The -mine suffix is deliberate: the committed run-baseline.json / run-tuned.json are the reference 15-question runs my numbers below come from — results are versioned like code, so write yours next to them, not over them.

Step 5 — Tune one variable

Bring back the tuned prompt (git checkout -- scout/agents/writer.py). The Writer’s system prompt gains exactly one block over Module 7’s:

Citation discipline:
- Place each [n] marker on the exact statement it supports — never
  collect the citations at the end of a paragraph.
- The cited source must actually make the statement; never decorate a
  sentence with a number whose source says something else.
- Never cite a number that is not in the numbered source list.
- If no listed source supports a statement, drop it or mark it
  explicitly as unverified — an honest gap beats a decorated guess.
- List each cited source exactly once in References, as '[n] url'.

Nothing else changes. One variable, or the comparison proves nothing.

Step 6 — Re-run and prove it

uv run evals/run_evals.py --limit 5 --out evals/results/run-tuned-mine.json
uv run evals/compare_runs.py evals/results/run-baseline-mine.json evals/results/run-tuned-mine.json

Your command prints Comparing 'baseline-mine' -> 'tuned-mine' over 5-question means. The block below is instead the reference comparison — the committed run-baseline.json / run-tuned.json, full 15-question runs on both prompts:

Comparing 'baseline' -> 'tuned' (model: nvidia/nemotron-3-nano-30b-a3b)

Overall (gated)
  metric       baseline    tuned    delta
  grounding        4.00     4.20    +0.20
  coverage         2.60     2.73    +0.13
  citations        4.07     4.33    +0.26

  ... (comparative and factual tables omitted) ...

Category: multi-hop (diagnostic, small n)
  metric       baseline    tuned    delta
  grounding        4.33     3.00    -1.33
  coverage         2.33     2.33    +0.00
  citations        4.00     3.67    -0.33

  ... (recent table omitted) ...

No regression flagged.

Those are my real numbers (my own 5-question quick pass swung on one category, so I ran the whole set). citations rose +0.26 with grounding and coverage up too: the fix is real, and provably not a trade. Now read the multi-hop diagnostic like an engineer: a -1.33 grounding swing on a 3-question slice traces to one degenerate run (re-judged, that question scores 5/5 on grounding). That is why the REGRESSION gate reads the overall aggregates while the category tables are targeting diagnostics: they say “look at multi-hop next”, not “reject this change”.

If your own 5-question delta is flat or one category looks wild, run the full set on both prompts by omitting --limit. The baseline half needs Step 4’s cp again first; the tuned half needs Step 5’s git checkout. Fifteen multi-agent runs under 40 req/min is a long lunch, so start it and walk away.

Step 7 — Verify the module

Back at the repo root (cd ..):

uv run pytest module-08/tests/                       # offline: schemas, checks, judge contract
SCOUT_LIVE_TESTS=1 uv run pytest module-08/tests/    # + 1 judge call + 1 full eval run

The offline tests validate the golden-set schema, the checks against both fixtures, the judge’s structured-output contract (scripted client), and the comparison logic — zero API calls, so CI stays free. Modules 01–07’s suites stay green on the carried-over code: the repo’s cumulative rule.

Try it yourself (no solution provided):

  1. Make the golden set yours: add 5 questions from your own domain with honest expected_points, re-run the comparison — does the prompt fix hold, or did it overfit to the course’s questions? (While your questions are in place, test_golden_set_schema — exactly 15 questions — is red; expected. Restore the frozen file with git checkout -- evals/golden_set.json.)
  2. Go pairwise: write a judge_pairwise(question, report_a, report_b) variant and call it twice with the order swapped. If the winner flips, you have reproduced position bias on your own instrument — only count a win when both orders agree.

Exam corner

What the exam tests here. Per the official study guide, Domain 3 (13%) has five objectives, each covered in a section of this module:

| Objective | Where it’s covered |
| --- | --- |
| 3.1 Implement evaluation pipelines and task benchmarks | “Why agent evaluation…”, “Metrics that matter”, the lab |
| 3.2 Compare agent performance across tasks and datasets | Golden-set stratification + compare_runs |
| 3.3 Collect and integrate structured user feedback | “Golden sets”: the feedback → eval-case loop |
| 3.4 Tune model parameters (accuracy, latency-efficiency) | “The tuning loop”: the levers table |
| 3.5 Analyze evaluation results to guide targeted optimization | “The tuning loop”: scores by category |

Quiz — answers after question 5.

  1. A team runs two agents: a customer-support agent resolving tickets in chat, and a research agent producing sourced market briefs. They want to pick primary metrics per agent. Which pairing fits best?

    • A) Support: grounding; research: latency
    • B) Support: task success rate and latency; research: grounding and citation accuracy
    • C) Both: tokens per run — cost is what management reads
    • D) Both: an MMLU-style benchmark on the underlying models
  2. Your pairwise LLM-as-judge compares report A and report B and picks A. You re-run with the two reports in the opposite order, and it picks B. What’s happening, and what’s the correct response?

    • A) The reports are equal; record a tie and move on
    • B) The judge’s temperature is too low; raise it for more decisive verdicts
    • C) Position bias; evaluate both orders and only count a result when the verdicts agree
    • D) Self-preference bias; switch the judge to the same model the agent uses
  3. An agent returns a correct final answer, but the run made 12 tool calls where 4 would do, and the verification stage never executed. Which evaluation approach detects this where outcome-only evaluation would miss it?

    • A) Trajectory evaluation: inspect the path — node visits, tool-call counts, skipped stages
    • B) A larger golden set: more questions surface more failures
    • C) Pairwise judging against a reference report
    • D) Re-running the same question at temperature 0 to remove nondeterminism
  4. After switching every node of a multi-agent system to a much larger model, quality ticks up slightly but p95 latency triples and costs spike. Within Domain 3’s “tune model parameters,” what’s the best corrective?

    • A) Fine-tune the large model on the team’s data to make it faster
    • B) Keep the large model only where judgment matters; use a small, fast model for routing turns, and cap output/reasoning budgets per node
    • C) Remove the iteration cap so runs can finish in fewer, longer turns
    • D) Disable evaluation runs to recover the spent latency budget
  5. Users flag, through a feedback rubric, that reports on comparative questions are frequently off topic. What’s the correct way to integrate this feedback?

    • A) Immediately rewrite the Planner prompt — users have spoken
    • B) Add the flagged questions to the golden set with expected points, reproduce the low scores, fix one variable, and re-run to prove the gain
    • C) Lower the judge’s coverage threshold so the reports pass again
    • D) Ask users to rephrase comparative questions as factual ones

Answers.

  1. B. Metrics follow the use case: a support agent is judged on resolving the task fast (task success, latency); a research agent on output trustworthiness (grounding, citation accuracy). C measures cost, not quality; D benchmarks the model, which is the model-vs-agent confusion.
  2. C. A winner that flips with presentation order is position bias, the classic pairwise failure; the mitigation is order-swapping with agreement. D creates self-preference bias rather than fixing anything.
  3. A. Redundant calls and a skipped verification stage are path defects: only trajectory evaluation (node visits, tool-call counts) sees them, because the outcome is correct by assumption. B, C, and D all grade outcomes harder.
  4. B. Domain 3’s “tune model parameters” means inference-time choices and the accuracy vs. latency-efficiency trade-off: right-size the model per node, budget output and reasoning tokens. A is weight training, which is out of scope; C worsens worst-case latency; D is giving up measurement.
  5. B. Structured feedback’s job is to become evaluation cases: reproduce, fix one variable, re-run, prove. A is a blind fix you can’t verify; C is gaming the instrument; D blames the user.

Traps to avoid:

  • Model benchmark ≠ agent evaluation. MMLU scores and leaderboards grade a model’s knowledge on static data; agent evaluation grades a system on end-to-end tasks, trajectories included, in a live environment. The exam loves offering a model benchmark as a distractor for an agent-quality question.
  • “Tune model parameters” ≠ retraining. In Domain 3, tuning means inference parameters and accuracy/latency-efficiency trade-offs — prompts, temperature, budgets, model selection per node. Fine-tuning weights is a different discipline.
  • Judge scores ≠ ground truth. Without human-labeled calibration samples and bias mitigations, an LLM judge’s scores are precise-looking noise. Any answer that treats judge output as objective truth is bait.
  • A bare 👍/👎 is not structured feedback. Objectives 3.3 and 10.2 mean feedback a pipeline can act on — rubric dimensions, missing points, wrong citations — routed into the golden set. A thumbs-down says that something failed, never what.

Key takeaways

  • Agent evaluation is not model evaluation: it grades an end-to-end task in a live environment, trajectories included — a great model can still power a broken agent.
  • A golden set is your agent’s regression suite: stratified, versioned, expected points rather than answers — and grown from structured user feedback.
  • Deterministic checks run before the judge: free code catches broken citations and missing verdicts before any judge token is spent.
  • An LLM judge is an instrument, not a referee: anchored rubric, typed output, temperature 0, a stronger different model — calibrated against human labels.
  • Evaluate outcome and trajectory: a beautiful report can hide a skipped Fact-checker and twelve redundant tool calls.
  • Read scores by category — aggregates hide exactly the failures you need to find.
  • Tune one variable at a time and prove it: baseline, change, re-run, compare, flag regressions.

Keep going


Scout reads arbitrary web content — which means arbitrary web content can talk to Scout. Next module: guardrails against prompt injection, and a human approval gate on the research plan.


References