Deploying AI Agents: From Notebook to Production API (NCP-AAI Module 10)
This is Module 10 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.
The demo lands, and someone from the platform team says the words that
end every successful agent demo: “Great — can you give me an endpoint?”
You do the obvious thing: wrap graph.invoke() in a POST handler and
ship it. The first caller’s HTTP client gives up at thirty seconds; the
run needs three minutes. The second request reaches the plan-approval
interrupt and parks a server worker on a human who has no way to answer —
the request is the only channel, and it’s busy hanging. At five o’clock
you deploy a one-line fix; the restart kills three runs mid-flight,
tokens already paid.
None of this is bad luck. An agent is not a function — it’s a minutes-long process with a human in the middle. This module gives Scout the deployment shape that fact demands: an asynchronous job API, approval over HTTP, all of it in a Docker image you can run anywhere.
In this module
- You’ll learn:
  - Design an asynchronous job API for a long-running agent (submit, poll, approve) instead of a naive request-response endpoint.
  - Expose the Module 9 human-in-the-loop interrupt over HTTP: the plan waits; approval resumes the graph through the checkpointer.
  - Containerize the service with Docker: reproducible image, secrets injected at runtime, health checks.
  - Scale deliberately: workers vs. queues, latency vs. throughput, and finding the real bottleneck under load.
  - Optimize deployment costs: hosted endpoints vs. self-hosted GPU, the break-even logic, and baseline high availability.
- You’ll build: Scout behind a FastAPI service (POST /research, GET /status/{job_id}, plan approval over HTTP), all in a Docker image.
- Exam domains covered: D4 (Deployment and Scaling), 13% of the exam.
- Prerequisites: Modules 1–9 (Scout runs with guardrails and the plan-approval interrupt); NVIDIA and Tavily keys configured. New setup: Docker Desktop (or Engine) installed; docker --version should answer.
Where you are
- ✅ Modules 1–6 — first NIM call, ReAct, architecture, planning, memory, RAG
- ✅ Modules 7–8 — the supervisor team; the eval harness and LLM-as-judge
- ✅ Module 9 — guardrails + the human approval interrupt on the plan
- 👉 Module 10 — Deployment (you are here)
- ⬜ Modules 11–14 — observability, the NVIDIA stack, capstone, the exam
Scout before: a complete, safe, evaluated multi-agent system — that only runs when you type a command in a terminal. Scout after: a containerized service that accepts research jobs over HTTP, pauses for plan approval, resumes on a verdict, and whose capacity and costs you can reason about. The graph doesn’t change by a single line — that’s the discipline this module enforces.
From script to service: what changes when an agent goes online
A web framework’s mental model is the function call: request in, response out, milliseconds in between. Proxies, load balancers (the machines that spread incoming requests across servers), and client libraries are all tuned to that model, with idle timeouts commonly in the 30–60 second range. A Scout run takes two to five minutes and stops midway to wait for a human. Three properties of agent runs break request-response:
- Runs outlast requests. Minutes of work can’t live inside one HTTP request — something times out, and the work is orphaned.
- A human sits mid-run. Module 9’s interrupt pauses the graph until someone approves the plan — possibly hours later, so approval must arrive on its own request.
- Processes die. Deploys, crashes, out-of-memory (OOM) kills. If run state lives in the process, every restart burns every in-flight job and its tokens.
The answer is the job pattern: asynchronous execution — the server
accepts work and returns immediately; the work completes on its own
schedule, decoupled from any request. The client’s contract becomes:
submit (get a job_id back in milliseconds, with HTTP 202 — “accepted,
not done”), poll (ask for status), resume (deliver the verdict as a
separate call). Pending work waits in a job queue — here an
in-process task list, in heavier systems a dedicated broker — adding a
knob the synchronous model never had: how many runs execute
concurrently, however many arrive. In this lab the knob stays at its
implicit default: asyncio.to_thread runs each job on the event loop’s
default thread pool, and that pool’s size caps concurrency. A dedicated
queue is where you’d size it deliberately.
The build is cheap because of a decision you already made in
Module 5: the checkpointer persists the
full ScoutState after every super-step, keyed by thread_id. Set
thread_id = job_id and the server holds no run content — not the
plan, not the transcript, not the half-written report. That’s
statelessness: any process that can reach the persistence layer can
serve any request about any job — restart, replace, or multiply servers
without losing work. The checkpointer was sold to you as a memory
feature; it turns out to be the load-bearing wall of the deployment.
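To make the statelessness claim concrete, here is a small sketch in which a plain SQLite table stands in for the checkpointer. `StatelessServer` and the `checkpoints` table are invented for illustration and are not LangGraph's API; the point is only that a second server instance recovers a job from nothing but its id.

```python
import json
import os
import sqlite3
import tempfile
from typing import Optional


class StatelessServer:
    """Holds a connection to shared persistence and no run content."""

    def __init__(self, db_path: str) -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(thread_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self.conn.commit()

    def save(self, job_id: str, state: dict) -> None:
        # thread_id = job_id: the job id is the only key the server needs.
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
            (job_id, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, job_id: str) -> Optional[dict]:
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?", (job_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


db_path = os.path.join(tempfile.mkdtemp(), "checkpoints.sqlite")

server_a = StatelessServer(db_path)
server_a.save("job-123", {"plan": ["step 1", "step 2"], "report": None})
server_a.conn.close()  # simulate the first process dying

server_b = StatelessServer(db_path)  # a fresh process, same persistence layer
recovered = server_b.load("job-123")
```

`server_b` never saw the job being created, yet it can answer for it, which is the property that makes restarts and extra replicas safe.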
Designing the agent API: jobs, status, and human approval
Scout’s wire contract is three routes, frozen for the rest of the course — Modules 11 and 13 build against them unchanged:
| Route | Body | Returns |
|---|---|---|
| POST /research | {question} (+ optional Idempotency-Key header) | 202 {job_id} |
| GET /status/{job_id} | (none) | status, plan, report, error |
| POST /research/{job_id}/approve | {approved: bool, edited_plan?: ResearchPlan} | job view |
The implementation is FastAPI — the Python web framework the service is built on; uvicorn is the server process that runs the app.
Behind the routes sits a state machine — every job is in exactly one of six states, and the API’s real job is to enforce the legal transitions:
stateDiagram-v2
[*] --> pending: POST /research → 202 + job_id
pending --> awaiting_approval: Planner done — interrupt() fires
pending --> failed: planning error
awaiting_approval --> running: approve {approved true}
awaiting_approval --> failed: approve {approved false}
running --> done: cited report written
running --> partial: sources gathered, no full report (Module 13)
running --> failed: unrecoverable error
done --> [*]
partial --> [*]
failed --> [*]
The job lifecycle: pending covers planning; running is post-approval
research — clients tell “drafting” from “researching” by status alone.
The partial arrow is contract, not yet behavior: this module maps a
no-report run to failed; Module 13 first produces partial.
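One way to enforce the legal transitions is a lookup table transcribed from the diagram. The sketch below is illustrative: `TRANSITIONS`, `can_advance`, and `advance` are hypothetical names, and the lab may structure this differently.

```python
# Legal transitions, transcribed from the lifecycle diagram above.
TRANSITIONS = {
    "pending": {"awaiting_approval", "failed"},
    "awaiting_approval": {"running", "failed"},
    "running": {"done", "partial", "failed"},
    "done": set(),
    "partial": set(),
    "failed": set(),
}


class IllegalTransition(ValueError):
    """Raised when a status change is not an arrow in the diagram."""


def can_advance(current: str, target: str) -> bool:
    return target in TRANSITIONS[current]


def advance(current: str, target: str) -> str:
    """Return target if current -> target is legal, else raise."""
    if not can_advance(current, target):
        raise IllegalTransition(f"{current} -> {target} is not allowed")
    return target


def is_terminal(status: str) -> bool:
    """Terminal states have no outgoing arrows."""
    return not TRANSITIONS[status]
```

Keeping the table as data means the six states and their arrows live in one place, and a route handler only has to ask `can_advance` before changing a job's status.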
The branch that makes this an agent API rather than a generic task API
is awaiting_approval. When the Module 9 interrupt fires, the graph
checkpoints itself and stops; the service marks the job
awaiting_approval and exposes the drafted plan in the status payload.
The reviewer calls the approve route with the payload Module 9’s resume
already speaks: {approved: bool, edited_plan?: ResearchPlan}. Approval
resumes the graph from its checkpoint via Command(resume=...); an
edited plan replaces the draft; a rejection terminates the job. The API
adds transport, never semantics — if a deployment layer changes what the
agent does, it has overstepped.
Two contract details earn their place now and get hardened in Module 13.
First, partial: a run that gathered sources but couldn’t finish a
report deserves a status that says so — frozen into the vocabulary
today, set for the first time in Module 13; until then the lab returns
failed for that case. Second, the Idempotency-Key
header — idempotency is the property that performing the same
operation twice has the same effect as once. Clients retry and networks
duplicate; without the key, a retried POST /research launches a second
token-burning run — with it, the same key returns the same job_id and
the duplicate costs nothing.
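The idempotency rule fits in a few lines. The lab's `create_job` returns a `(job, created)` pair, as Step 2 shows; the body below is only a guess at one possible shape, with a hypothetical `Job` dataclass and a hypothetical `_BY_KEY` index.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:  # hypothetical shape, for illustration only
    job_id: str
    question: str
    status: str = "pending"


JOBS: dict[str, Job] = {}
_BY_KEY: dict[str, str] = {}  # Idempotency-Key -> job_id


def create_job(question: str, idempotency_key: Optional[str] = None) -> tuple[Job, bool]:
    """Return (job, created). A repeated key returns the original job with
    created=False, so the caller knows not to launch a second run."""
    if idempotency_key is not None and idempotency_key in _BY_KEY:
        return JOBS[_BY_KEY[idempotency_key]], False
    job = Job(job_id=f"job-{uuid.uuid4().hex[:12]}", question=question)
    JOBS[job.job_id] = job
    if idempotency_key is not None:
        _BY_KEY[idempotency_key] = job.job_id
    return job, True
```

A retried POST with the same key gets the first job back and spawns nothing; requests without a key always create a new job. With multiple threads or processes writing, the check-then-insert would need a lock or a unique constraint in shared storage.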
One alternative to polling deserves a mention, because objective 2.5
names it: streaming — pushing incremental results over a held-open
channel (Server-Sent Events or WebSocket) instead of waiting to be asked.
You built the in-graph version in Module 2
(graph.stream()); the service-level version pushes status changes and
report tokens. Better experience, heavier lift — held-open connections
complicate load balancing and timeouts. Polling is the contract;
streaming is the upgrade, left as a lab exercise.
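For a feel of what the service-level stream would carry, the sketch below encodes status changes as Server-Sent Events frames using only the standard library. `sse_frame` and `status_stream` are hypothetical helpers; wiring them into a web framework's streaming response is left to the lab exercise.

```python
import json
from typing import Iterable, Iterator

TERMINAL = {"done", "partial", "failed"}


def sse_frame(event: str, data: dict) -> str:
    """Encode one SSE frame: an event name, a JSON data line, and the
    blank line that terminates the frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def status_stream(statuses: Iterable[str]) -> Iterator[str]:
    """Yield a frame only when the status changes; stop at a terminal one."""
    last = None
    for status in statuses:
        if status != last:
            yield sse_frame("status", {"status": status})
            last = status
        if status in TERMINAL:
            return
```

Fed a sequence of repeated status reads, the generator emits one frame per change, which is the difference from polling: the client hears about transitions instead of asking for them.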
Containerization: Docker for agent services
Containerization packages the application with its entire runtime — interpreter, dependencies, system libraries — into one immutable image that runs identically on any machine with a container engine. “Works on my machine” is not a deployment strategy for a system whose behavior depends on pinned versions and prompts.
The lab’s Dockerfile makes five decisions that generalize to every
agent service you’ll ship:
- Pinned base image (python:3.12-slim): an implicit latest is a deployment that changes under your feet.
- Lockfile-driven installs (uv sync --frozen): the image contains exactly what uv.lock says, or the build fails.
- Non-root user: if the service is compromised (Module 9: Scout reads the open web), the blast radius is a deprivileged user in a sandbox.
- HEALTHCHECK probing a cheap /health route, with no LLM call. The engine learns whether the service is alive; a load balancer later reuses the probe to route around dead replicas.
- Secrets at runtime, never in the image. The NVIDIA key arrives via --env-file at start; anything baked into an image is readable by anyone who can pull it, forever.
One container on one host is still one process on one machine: no extra capacity, no zero-downtime updates, no recovery when the host dies. That’s the orchestration layer’s job, and objective 4.4 wants you to know what Kubernetes actually adds: replicas (N copies, scheduled across machines), load balancing (distributing incoming requests across replicas so no instance saturates while others idle), autoscaling (replica count follows load), rolling updates (replace replicas one at a time; zero downtime), and self-healing (failed health checks trigger replacement). What’s not on the list: nothing about Kubernetes makes your LLM endpoint answer faster or raises its rate limit. Hold that thought.
| | Single Docker host | Docker Compose | Kubernetes |
|---|---|---|---|
| When | One service, one machine — this lab | A few cooperating containers (API + Redis), one machine | Many replicas, many machines, uptime requirements |
| What you gain | Reproducible runtime, isolation, portability | Multi-container wiring in one file, one docker compose up | Replicas, load balancing, autoscaling, rolling updates, self-healing |
| What you pay | No high availability (HA), no scaling — the host is the ceiling | Still one host, no HA — wiring, not orchestration | A control plane (the cluster’s own management services) to run and learn; real ops investment |
The honest progression: stay left as long as you can, move right on evidence — Module 3’s discipline, one layer down the stack.
Scaling and sizing: workers, queues, and rate limits
Two words the exam will make you tell apart, because most wrong answers in Domain 4 confuse them. Latency is how long one request takes — you track p50 and p95 (the time under which 50% and 95% of requests finish). Throughput is how many requests the system completes per unit of time — for Scout, jobs per minute. They move independently: adding replicas can double throughput while each job’s latency stays exactly where it was. “Fast for one user, collapses under many” is a throughput problem; “every request is slow, even alone” is latency. Name which one you’re fixing before touching anything.
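The two measurements can be computed separately, which is a good way to keep them apart in your head. The sketch below uses the nearest-rank definition of a percentile (one of several in common use); `percentile` and `throughput_per_min` are hypothetical helper names.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of the samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def throughput_per_min(completed: int, wall_seconds: float) -> float:
    """Jobs completed per minute of wall-clock time."""
    return completed / wall_seconds * 60


# Latency describes individual jobs; throughput describes the whole system.
job_seconds = [128.6, 146.6, 126.6]
p50 = percentile(job_seconds, 50)
p95 = percentile(job_seconds, 95)
one_replica = throughput_per_min(3, 150.0)
two_replicas = throughput_per_min(6, 150.0)  # same per-job latency, twice the jobs
```

Doubling the number of jobs finished in the same wall time doubles throughput while `p50` and `p95` stay where they were, which is exactly the independence the paragraph describes.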
Three execution tiers, in increasing operational weight:
| | In-process background tasks | Worker pool (more uvicorn workers) | External queue (Redis/Celery-style) |
|---|---|---|---|
| What it is | The API process runs jobs on its own threads — this lab | N API processes behind one socket | Broker holds jobs; dedicated workers consume them |
| Survives a process death? | No — in-flight jobs lost (graph state survives in the checkpointer) | No — same, per worker | Yes — queued jobs persist, get re-delivered |
| Scales horizontally? | No — capacity is one process | One machine’s worth | Yes — add workers on any machine |
| Complexity | None — just code | One flag | A broker to run, monitor, version |
The lab stays in column one — then makes you measure why columns two and three wouldn’t help this service yet. Run the load script with five concurrent jobs and watch what saturates:
flowchart LR
C[clients] -->|POST /research<br/>GET /status| A[FastAPI service<br/>ms responses, idle CPU]
A -->|job runner<br/>worker threads| G[LangGraph graph<br/>supervisor + specialists]
G <--> K[(SQLite checkpointer<br/>thread_id = job_id)]
G -->|every LLM call| N[NIM endpoint<br/>build.nvidia.com]
N -. "40 req/min per account<br/>(as of June 2026)<br/>⬅ THE bottleneck" .-> G
Five concurrent runs, one rate-limited account: LLM calls queue at the endpoint while the web tier serves polls in milliseconds.
A Scout job spends its life making LLM calls — the Planner, supervisor
turns, the specialists, the Writer, Module 9’s rails — against one endpoint
rate-limited at 40 requests per minute per account (as of June 2026).
Five concurrent jobs already contend for that budget;
scout/llm.py backs off on 429s — short bursts stretch the run,
sustained saturation can still fail a job — while the API process naps
between polls. “Add more API workers,” the
exam’s favorite trap, adds capacity where spare capacity already
exists. When the bottleneck is the model endpoint, throughput comes from
levers that act on model calls — pacing work through a queue sized to the
rate limit, batching or caching, a second account or endpoint, or a
self-hosted NIM whose limits are your hardware’s.
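The ceiling the rate limit imposes is simple arithmetic. In the sketch below, the 40 requests per minute comes from the text, while the 25 LLM calls per job is a made-up figure for illustration; measure your own runs before trusting any number.

```python
def max_jobs_per_min(rate_limit_per_min: float, llm_calls_per_job: float) -> float:
    """Upper bound on job throughput when every job draws on one shared
    request budget. Web-tier capacity does not appear in the formula."""
    if llm_calls_per_job <= 0:
        raise ValueError("a job must make at least one LLM call")
    return rate_limit_per_min / llm_calls_per_job


def min_seconds_between_calls(rate_limit_per_min: float) -> float:
    """Pacing interval that keeps a single queue at or under the limit."""
    return 60.0 / rate_limit_per_min


RATE_LIMIT = 40          # requests per minute per account (from the text)
CALLS_PER_JOB = 25       # hypothetical; profile a real run to get yours

ceiling = max_jobs_per_min(RATE_LIMIT, CALLS_PER_JOB)
pacing = min_seconds_between_calls(RATE_LIMIT)
```

Under those assumptions the ceiling is 1.6 jobs per minute no matter how many API workers exist, and a pacing queue would release one model call every 1.5 seconds. The only ways to raise the ceiling are a larger request budget or fewer calls per job.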
Which is why profiling under load (objective 4.3) comes before scaling decisions: measure p50/p95 per phase, jobs per minute, error rate, and where time accumulates — sizing without a profile is guessing with a budget.
The economics of deployment: hosted endpoints vs. self-hosted NIM
Every LLM call Scout has made since Module 1 hit a hosted endpoint — build.nvidia.com runs the GPUs; you pay per use. The alternative is self-hosting the same model as a NIM container on a GPU you rent or own. Objective 4.5 wants the decision framework, not vendor loyalty:
| | Hosted endpoint (build.nvidia.com-style) | Self-hosted NIM on a rented GPU |
|---|---|---|
| Cost model | Variable — per request/token; zero at zero traffic | Fixed — an L4 GPU rented on NVIDIA Brev (NVIDIA’s GPU-rental service) runs ~$0.44–0.80/hour (as of June 2026): ~$320–580/month, traffic or not |
| Rate limits | The provider’s (40 req/min here, as of June 2026) | Your hardware’s — you size it |
| Data residency | Prompts and outputs transit the provider | Everything stays in your network |
| Latency control | None — shared infrastructure | Yours to tune (and Module 12 shows the NVIDIA tooling for it) |
| Ops burden | None — it’s an API | Yours: driver stacks, model updates, monitoring, capacity |
| HA story | Provider’s problem | ≥2 GPUs — the fixed cost doubles |
The break-even logic is arithmetic plus two vetoes. Arithmetic: at low or spiky volume, pay-per-use wins — a GPU billing ~$500/month to serve 200 requests a day is mostly paid idleness; at high sustained volume the fixed cost amortizes and self-hosting wins. The vetoes override the math in either direction: a data residency requirement forces self-host; a zero-ops-team reality forces hosted — each regardless of volume.
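The arithmetic half can be written out directly. The hourly rates below are the ones quoted in the table; the per-request hosted price is a placeholder invented for illustration, and the two vetoes are deliberately not modeled.

```python
HOURS_PER_MONTH = 730  # average month: 8,760 hours / 12


def monthly_gpu_cost(hourly_rate: float, gpus: int = 1) -> float:
    """Fixed monthly bill for always-on rented GPUs."""
    return hourly_rate * HOURS_PER_MONTH * gpus


def break_even_requests_per_month(
    hourly_rate: float, hosted_price_per_request: float, gpus: int = 1
) -> float:
    """Monthly volume above which the fixed GPU bill undercuts pay-per-use."""
    return monthly_gpu_cost(hourly_rate, gpus) / hosted_price_per_request


low = monthly_gpu_cost(0.44)    # cheapest quoted rate
high = monthly_gpu_cost(0.80)   # priciest quoted rate

# Hypothetical hosted price of $0.01 per request, purely for illustration.
break_even = break_even_requests_per_month(0.44, 0.01)
requests_at_200_per_day = 200 * 30
```

The quoted rates work out to roughly $321 and $584 a month, matching the table's range. With the placeholder price, break-even sits above 32,000 requests a month, far beyond the 6,000 that 200 requests a day produce; adding a second GPU for HA doubles the threshold.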
Whatever you choose, production adds a floor: high availability (HA)
— the property that the service survives the failure of any single component,
achieved with redundancy (at least two replicas of anything that can
die), health checks to detect the death, and load balancing plus client
retries to route around it. Your earlier work pays in: stateless API
replicas are trivially duplicable because run state lives in the
checkpointer, and the /health route exists because a load balancer
will need it. HA is a set of decisions: cheap when made early, expensive
when retrofitted late.
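The three ingredients (redundancy, health checks, retries) combine into a small routing loop. The sketch below is a toy, client-side version with hypothetical names; in practice a load balancer does this job, and `send` and `is_healthy` would be real HTTP calls.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(
    replicas: Sequence[str],
    send: Callable[[str], str],
    is_healthy: Callable[[str], bool],
) -> str:
    """Try each healthy replica in order; raise only if all of them fail."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        if not is_healthy(replica):
            continue  # the health check routes around a known-dead replica
        try:
            return send(replica)
        except ConnectionError as exc:
            last_error = exc  # the retry covers a replica that died just now
    raise RuntimeError("no replica could serve the request") from last_error


def demo_send(replica: str) -> str:
    """Pretend transport: api-1 is down, everything else answers."""
    if replica == "api-1":
        raise ConnectionError("api-1 is down")
    return f"served by {replica}"
```

With two replicas the request survives the loss of either one; with a single replica there is nothing to fail over to, which is why the floor is at least two of anything that can die.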
Last piece of objective 4.2’s deployment half: CI/CD — continuous integration/continuous deployment, the automated build-test-deploy pipeline. The pipeline that catches real regressions, in order: offline smoke tests (seconds, no API calls), image build (the artifact you’ll run), then the agent-specific gate — Module 8’s regression evals on the golden set, judge scores compared against the last accepted run. A prompt tweak that tanks grounding will pass your unit tests; only the eval gate catches it before users do. Deploy only on green. The other half of 4.2 — monitoring and governance once traffic flows — is Module 11’s territory.
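The eval gate itself is a comparison against the last accepted run. The sketch below shows one plausible form; the metric names, the scores, and the 0.02 tolerance are all invented, and Module 8's harness may expose its results differently.

```python
def eval_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return the metrics that regressed by more than `tolerance` against
    the last accepted run. An empty list means the gate is green."""
    regressions = []
    for metric, accepted in baseline.items():
        score = candidate.get(metric)
        # A missing metric counts as a regression: the gate cannot vouch for it.
        if score is None or score < accepted - tolerance:
            regressions.append(metric)
    return regressions


# Hypothetical judge scores on the golden set.
last_accepted = {"grounding": 0.90, "relevance": 0.85}
after_prompt_tweak = {"grounding": 0.70, "relevance": 0.86}
failed_metrics = eval_gate(last_accepted, after_prompt_tweak)
```

Here the prompt tweak improves relevance slightly and tanks grounding, so the gate returns `["grounding"]` and the deploy stops. The tolerance exists because judge scores are noisy; size it from run-to-run variance rather than picking a number.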
Hands-on lab: build it
The full lab lives in module-10/ of the labs repo; these are the load-bearing excerpts.
Objective: expose the Module 9 Scout as a containerized FastAPI service with the full job lifecycle, plan approval included.
Observable result: curl -X POST /research returns a job_id
immediately; polling shows the plan under awaiting_approval; one
approve call later, the status walks through running to done with
the cited report — and the same scenario works against docker run.
The graph: zero lines changed.
Step 1 — The service layer: a job is a thread
scout/api/jobs.py builds the graph once per process, on the Module 5
checkpointer, and pins the one equality everything depends on:
# module-10/scout/api/jobs.py (excerpt)
@lru_cache(maxsize=1)
def get_graph():
    """One compiled graph per process, on the module-05 SQLite checkpointer."""
    return build_graph(checkpointer=memory.get_checkpointer())


def _config(job: Job) -> dict:
    # thread_id == job_id: a job IS a graph thread. This single equality
    # keeps the server stateless: any process holding the checkpointer
    # can poll or resume any job.
    return {"configurable": {"thread_id": job.job_id, "user_id": API_USER_ID}}
API_USER_ID deserves a glance: every API client shares one anonymous
Module 5 memory namespace until authentication arrives — Module 13
territory. Job metadata (status, error) lives in an in-memory dict — an
honest shortcut; the “In production” box names the real fix.
Step 2 — POST /research: accept, don’t block
# module-10/scout/api/main.py (excerpt)
@app.post("/research", status_code=202, response_model=JobCreated)
async def submit_research(
    body: ResearchRequest,
    idempotency_key: str | None = Header(default=None, alias="Idempotency-Key"),
) -> JobCreated:
    """Accept the job and return immediately. 202 is the honest status
    code ("accepted, not done") for a run that takes minutes."""
    job, created = jobs.create_job(body.question, idempotency_key)
    if created:
        _spawn(jobs.run_job, job.job_id)  # asyncio.to_thread, off the event loop
    return JobCreated(job_id=job.job_id)
Phase one — submission to interrupt — runs in run_job:
# module-10/scout/api/jobs.py (excerpt)
def run_job(job_id: str) -> None:
    job = JOBS[job_id]
    try:
        result = get_graph().invoke(initial_state(job.question), _config(job))
    except Exception as exc:
        # Recorded, not swallowed: an unhandled crash in a background task
        # is a job that looks alive forever from the client's side.
        job.status, job.error = "failed", f"{type(exc).__name__}: {exc}"
        return
    if "__interrupt__" in result:
        job.status = "awaiting_approval"  # a human's turn
    else:
        _finish(job, result)
Step 3 — Status: registry + checkpointer, merged
GET /status/{job_id} reads metadata from the registry and content
(plan, report) from the graph state — a sync def, so the SQLite read
runs on FastAPI’s threadpool and never blocks the event loop. The report
is exposed only once the status says done.
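The merge can be pictured as a pure function over the two sources. This is a sketch under assumptions: `job_view` is a hypothetical name, the registry entry is modeled as a plain dict, and the graph state is assumed to expose `plan` and `report` keys.

```python
from typing import Any, Optional


def job_view(
    job_id: str,
    meta: dict[str, Any],
    state: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Merge registry metadata with checkpointed content into one payload."""
    state = state or {}  # no checkpoint yet: the run has not produced content
    status = meta["status"]
    return {
        "job_id": job_id,
        "status": status,
        "error": meta.get("error"),
        # The plan is visible as soon as the Planner has drafted it.
        "plan": state.get("plan"),
        # A half-written report stays hidden until the status says done.
        "report": state.get("report") if status == "done" else None,
    }
```

Gating the report on status keeps clients from reading a draft that the graph may still rewrite.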
Step 4 — Approval over HTTP
The approve endpoint enforces the state machine (404 unknown, 409 if there’s nothing to approve), then delivers the verdict with the exact Module 9 resume payload:
# module-10/scout/api/jobs.py (excerpt)
result = get_graph().invoke(
    Command(resume={"approved": approved, "edited_plan": edited_plan}),
    _config(job),
)
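The checks that run before that resume call can be sketched as two small functions. The names are hypothetical and the registry is reduced to a `job_id -> status` dict; the status codes and the outcome of each verdict follow the text and the lifecycle diagram.

```python
def approval_precheck(statuses: dict[str, str], job_id: str) -> int:
    """HTTP status the approve route should answer with before touching
    the graph: 404 unknown job, 409 nothing to approve, 200 go ahead."""
    if job_id not in statuses:
        return 404
    if statuses[job_id] != "awaiting_approval":
        return 409
    return 200


def status_after_verdict(approved: bool) -> str:
    """Approval moves the job to running; rejection terminates it."""
    return "running" if approved else "failed"


# Hypothetical registry snapshot.
registry = {"job-a": "awaiting_approval", "job-b": "running", "job-c": "done"}
```

A second approve call for the same job lands on the 409 branch, since the first one already moved it out of `awaiting_approval`; that keeps a double-click from resuming the graph twice.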
Run the lifecycle end to end:
cd module-10 && uv run uvicorn scout.api.main:app --port 8000
# in another terminal:
curl -s -X POST http://127.0.0.1:8000/research \
-H "Content-Type: application/json" -H "Idempotency-Key: demo-001" \
-d '{"question": "What is the Nemotron Coalition announced at GTC 2026?"}'
# {"job_id":"job-fa4e1401a262"} — poll it, approve it, poll it again:
curl -s http://127.0.0.1:8000/status/job-fa4e1401a262 # pending… then awaiting_approval + plan
curl -s -X POST http://127.0.0.1:8000/research/job-fa4e1401a262/approve \
-H "Content-Type: application/json" -d '{"approved": true}'
curl -s http://127.0.0.1:8000/status/job-fa4e1401a262 # running → done + report
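A script client would replace the manual curl polling with a loop. The helper below is a sketch with hypothetical names: `get_status` is any callable returning the current status string (for instance, a function that fetches the status route and reads the `status` field), and `sleep` and `clock` are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    targets: set[str],
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job reaches one of `targets`; raise TimeoutError if
    the deadline passes first."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in targets:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout}s")
        sleep(interval)


# Dry run against a scripted sequence of statuses instead of a live server.
scripted = iter(["pending", "pending", "awaiting_approval"])
naps: list[float] = []
reached = wait_for_status(lambda: next(scripted), {"awaiting_approval"}, sleep=naps.append)
```

A client would typically wait for `awaiting_approval` or `failed` first, call the approve route, then wait again for `done`, `partial`, or `failed`. Including the failure states in `targets` matters: otherwise a failed job is polled until the timeout.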
Step 5 — The Dockerfile
# module-10/Dockerfile (excerpt)
# Explicit base tag — an implicit "latest" changes under your feet.
# Stage 1 — builder: a C++ toolchain, because one transitive dependency
# (annoy, via nemoguardrails) ships no prebuilt wheel and compiles from
# source. The toolchain never reaches the runtime image.
FROM python:3.12-slim AS builder
COPY --from=ghcr.io/astral-sh/uv:0.5 /uv /uvx /bin/
RUN apt-get update \
&& apt-get install -y --no-install-recommends build-essential \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev # the lockfile IS the contract
# Stage 2 — runtime: same pinned base, no compilers, no uv. Only the
# ready-made virtualenv and the code cross the stage boundary.
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY module-10/scout/ scout/
# Non-root: a container that doesn't need root shouldn't have it.
RUN useradd --create-home --uid 1000 scout && chown -R scout:scout /app
USER scout
ENV PATH="/app/.venv/bin:$PATH"
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/health', timeout=2)"
CMD ["uvicorn", "scout.api.main:app", "--host", "0.0.0.0", "--port", "8000"]
Stop the Step 4 uvicorn server first (Ctrl-C) — both want port 8000:
docker build -t scout-api -f module-10/Dockerfile . # from the repo root
docker run --rm -p 8000:8000 --env-file .env scout-api
The keys arrive at runtime via --env-file: never COPYed, never
ENVed, never in a layer (the cached filesystem snapshots an image is
built from). .dockerignore keeps .env out of the build context (the
set of files sent to the builder, here the repo root) entirely. Because
the context is the repo root rather than module-10/, the ignore file
BuildKit actually reads is the Dockerfile-specific one,
Dockerfile.dockerignore, next to the Dockerfile (see the lab). Re-run
the Step 4 curl scenario against the container: same behavior,
reproducible anywhere Docker runs.
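Runtime injection has one failure mode worth guarding against: a container started without its env file runs happily until the first model call. A fail-fast check at startup could look like the sketch below; the variable names `NVIDIA_API_KEY` and `TAVILY_API_KEY` are assumptions, so check the lab's .env for the real ones.

```python
import os
from typing import Mapping

# Assumed variable names; the lab's .env is the source of truth.
REQUIRED_ENV = ("NVIDIA_API_KEY", "TAVILY_API_KEY")


def missing_secrets(env: Mapping[str, str] = os.environ) -> list[str]:
    """Names of required secrets that are absent or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


def require_secrets(env: Mapping[str, str] = os.environ) -> None:
    """Raise at startup, naming the missing variables but never their values."""
    missing = missing_secrets(env)
    if missing:
        raise RuntimeError(f"missing required environment variables: {', '.join(missing)}")
```

Calling `require_secrets()` when the app starts turns a forgotten `--env-file` into an immediate, readable error instead of a failed job minutes later.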
Step 6 — Load-test it, gently
With the API (or the container) still listening on :8000, from the repo
root:
uv run python module-10/scripts/load_test.py --jobs 3
job status plan research total
job-f1a5316c9d3d done 30.1s 98.4s 128.6s
job-7d01e702ab6a done 28.1s 118.5s 146.6s
job-274bfab9c292 done 36.2s 90.4s 126.6s
completed: 3/3 wall time: 146.7s
throughput: 1.23 jobs/min
GET /status latency: p50 7 ms p95 12 ms
(Your numbers will differ; the shape won’t.) Per-job time stretches as
concurrency grows — the jobs queue on the endpoint’s 40 req/min — while
GET /status stays at single-digit milliseconds. The bottleneck is the
model endpoint; no number of API workers would move that throughput line.
Verify — offline tests drive a scripted fake graph through the real routes; the live marker runs one real job end to end:
uv run pytest module-10/tests/ # offline, no API calls
SCOUT_LIVE_TESTS=1 uv run pytest module-10/tests/ # + 1 real research job
Try it yourself (no solution provided):
- Edit the plan over HTTP. While a job is awaiting_approval, GET its plan, change one step’s search_queries, POST it back as edited_plan with approved: true. Confirm via status that the graph executed your plan.
- Streaming status with SSE. Add GET /research/{job_id}/events pushing status changes as Server-Sent Events (StreamingResponse, media_type="text/event-stream"). Keep the polling route: SSE is the upgrade, polling is the contract.
Exam corner
What the exam tests here. Per the official blueprint, Deployment and Scaling carries 13% of the exam. The study guide’s objectives:
- 4.1: deploy and orchestrate multi-agent systems at production scale (the job API and lifecycle).
- 4.2: apply MLOps (ML operations) practices for CI/CD, monitoring, and governance. This module owns the CI/CD half (smoke tests, image build, eval gate); the monitoring half belongs to Module 11, and questions usually signal which half with “before release” vs. “in operation”.
- 4.3: profile performance and reliability under distributed system loads (p50/p95, jobs/min, finding the bottleneck).
- 4.4: scale deployments using containerization (Docker, Kubernetes) with load balancing.
- 4.5: optimize deployment costs while ensuring high availability.

This module also reinforces 2.5 (dynamic conversation flows with real-time streaming): the mechanics are Module 2’s graph.stream(); here you saw the service-level version in concept.
Quiz — answers after question 5.
1. A research agent takes 2–5 minutes per run and pauses mid-run for a human to approve its plan. Browser and script clients must trigger runs and deliver approvals. Which exposure pattern fits?
   - A) A synchronous POST with the server timeout raised to 10 minutes
   - B) An async job API: POST returns a job_id (202), clients poll a status route, approval is its own endpoint resuming the run from persisted state
   - C) An hourly batch: queue questions in a file, email reports
   - D) A WebSocket held open for the whole run, approval delivered over the same socket
2. A service meets its p95 latency target with 10 concurrent users. At 200, status requests time out; the single API container’s CPU is pegged while the LLM endpoint sits well under its rate limit. The right lever?
   - A) Switch to a larger, more capable model
   - B) Raise the model’s token budget per call
   - C) Add API replicas behind a load balancer
   - D) Tell clients to poll less often
3. A team on a single Docker host needs zero-downtime updates, automatic replacement of crashed containers, and scaling across machines, and keeps getting 429s from its hosted LLM endpoint. Which statement is correct?
   - A) Moving to Kubernetes addresses all four needs, including the 429s
   - B) Kubernetes adds rolling updates, self-healing, autoscaling, and load balancing, but the 429s persist: an external endpoint’s rate limit is independent of your container orchestration
   - C) Docker alone already load-balances across hosts
   - D) Kubernetes is required to run more than one container of an image
4. An internal agent serves ~200 requests/day in unpredictable bursts. No data-residency constraint, no ops team; the service must survive single-component failures. The most cost-sound setup?
   - A) Hosted model endpoints (pay-per-use) with at least two stateless API replicas behind a load balancer
   - B) One self-hosted GPU running a NIM, to control latency
   - C) Two self-hosted GPUs running NIMs, for model-side high availability
   - D) The largest hosted model available, for maximum reliability
5. Which CI/CD pipeline is correctly designed for an agent service?
   - A) Live end-to-end agent runs on every commit, deploy when green
   - B) Build the image, deploy it, then evaluate on production traffic
   - C) Deploy on merge; production monitoring catches regressions
   - D) Offline smoke tests, then image build, then a regression-eval gate on the golden set, then deploy
Answers.
- 1: B. Minutes-long work plus a mid-run human rules out anything that couples the run to a connection: A dies at every proxy and loses work on redeploy; D deadlocks when the approver isn’t at the socket; C ignores interactive approval. Submit / poll / resume is the 4.1 pattern.
- 2: C. The profile says the web tier is saturated and the model isn’t, so add capacity where the system lacks it: replicas + load balancing. A and B act on the model (not the bottleneck); D sheds load instead of serving it.
- 3: B. The orchestration layer buys replicas, load balancing, autoscaling, rolling updates, self-healing. An external endpoint’s rate limit survives any amount of Kubernetes; that lever is queues, pacing, batching, another endpoint, or self-hosting.
- 4: A. Low, spiky volume is the textbook pay-per-use case: a ~$320–580/month GPU (L4 on Brev, as of June 2026) idling on 200 requests a day fails the arithmetic; C doubles the failure. HA still applies to your tier: two stateless replicas are cheap because state lives in the persistence layer.
- 5: D. Cheap, deterministic checks first; build the artifact you’ll ship; gate on the regression evals that catch what unit tests can’t; deploy last. A burns tokens on every commit for flaky signal; B and C discover regressions in production, which is what the gate exists to prevent.
Traps to avoid:
- Latency vs. throughput. A faster model improves latency, not necessarily throughput; more replicas improve throughput, not latency. Scenarios quoting both “slow for everyone” and “collapses under load” test whether you treat them as one disease.
- “Add more workers.” More API workers help only where the system has idle capacity to use. When the bottleneck is the LLM endpoint’s rate limit, the levers are queues/pacing, batching, caching, or another endpoint — never the web tier.
- “Docker = scalable.” A container is packaging. Replication, load balancing, autoscaling, and rolling updates come from the orchestration layer — a single Docker host has none of them.
- “Cheapest” ≠ “cost-effective” (4.5). High availability applies to your stateless tier — two cheap replicas — while an idle self-hosted GPU fails the arithmetic. Rank the options on the scenario’s qualifier, not on the sticker price.
Key takeaways
- An agent behind an API is a job, not a function: submit returns a job_id (202), polling reads status, approval is its own request. Runs outlast connections, and humans sit mid-run.
- The checkpointer is the deployment backbone: thread_id = job_id makes the server stateless, restarts harmless, and any replica able to resume any job.
- The Module 9 interrupt crosses the API unchanged: awaiting_approval exposes the plan; {approved, edited_plan?} resumes the graph. The API adds transport, never semantics.
- Docker gives reproducibility and isolation, not scaling. Replicas, load balancing, autoscaling, and rolling updates come from the orchestration layer.
- The bottleneck of an agent service is almost always the LLM endpoint and its rate limit — profile first; API workers add no throughput there.
- Latency is how long one request takes (p50/p95); throughput is how many complete per minute. Different diseases, different cures.
- Hosted vs. self-hosted is an equation — variable cost vs. fixed, overridden by data residency and ops capacity — and high availability means ≥2 of anything that can die, with health checks and retries.
Keep going
Scout is online — but is it healthy? Next module: tracing every node, cost per request, and quality alerts when the eval score drifts.
References
- NCP-AAI certification page — the official blueprint; Deployment and Scaling is weighted at 13%.
- Background Tasks — FastAPI on running work after the response, and when to graduate to a real queue.
- LangGraph persistence — checkpointers, threads, resuming — the mechanics statelessness stands on.
- Multi-stage builds — Docker’s guide to lean images; the pattern the lab’s builder/runtime Dockerfile applies.
- Kubernetes Glossary — NVIDIA on container orchestration, scaling, and load balancing; an official study-guide reading for Domain 4.