Deploying AI Agents: From Notebook to Production API (NCP-AAI Module 10)

This is Module 10 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.

The demo lands, and someone from the platform team says the words that end every successful agent demo: “Great — can you give me an endpoint?” You do the obvious thing: wrap graph.invoke() in a POST handler and ship it. The first caller’s HTTP client gives up at thirty seconds; the run needs three minutes. The second request reaches the plan-approval interrupt and parks a server worker on a human who has no way to answer — the request is the only channel, and it’s busy hanging. At five o’clock you deploy a one-line fix; the restart kills three runs mid-flight, tokens already paid.

None of this is bad luck. An agent is not a function — it’s a minutes-long process with a human in the middle. This module gives Scout the deployment shape that fact demands: an asynchronous job API, approval over HTTP, all of it in a Docker image you can run anywhere.

In this module

You’ll learn to:
- Design an asynchronous job API for a long-running agent — submit, poll, approve — instead of a naive request-response endpoint.
- Expose the Module 9 human-in-the-loop interrupt over HTTP: the plan waits; approval resumes the graph through the checkpointer.
- Containerize the service with Docker: reproducible image, secrets injected at runtime, health checks.
- Scale deliberately: workers vs. queues, latency vs. throughput, and finding the real bottleneck under load.
- Optimize deployment costs: hosted endpoints vs. self-hosted GPU, the break-even logic, and baseline high availability.
You’ll build: Scout behind a FastAPI service — POST /research, GET /status/{job_id}, plan approval over HTTP — all in a Docker image.
Exam domains covered: D4 — Deployment and Scaling — 13% of the exam.
Prerequisites: Modules 1–9 (Scout runs with guardrails and the plan-approval interrupt); NVIDIA and Tavily keys configured. New setup: Docker Desktop (or Engine) installed — docker --version should answer.

Where you are

✅ Modules 1–6 — first NIM call, ReAct, architecture, planning, memory, RAG
✅ Modules 7–8 — the supervisor team; the eval harness and LLM-as-judge
✅ Module 9 — guardrails + the human approval interrupt on the plan
👉 Module 10 — Deployment (you are here)
⬜ Modules 11–14 — observability, the NVIDIA stack, capstone, the exam

Scout before: a complete, safe, evaluated multi-agent system — that only runs when you type a command in a terminal. Scout after: a containerized service that accepts research jobs over HTTP, pauses for plan approval, resumes on a verdict, and whose capacity and costs you can reason about. The graph doesn’t change by a single line — that’s the discipline this module enforces.

From script to service: what changes when an agent goes online

A web framework’s mental model is the function call: request in, response out, milliseconds in between. Proxies, load balancers (the machines that spread incoming requests across servers), and client libraries are all tuned to that model, with idle timeouts commonly in the 30–60 second range. A Scout run takes two to five minutes and stops midway to wait for a human. Three properties of agent runs break request-response:

Runs outlast requests. Minutes of work can’t live inside one HTTP request — something times out, and the work is orphaned.
A human sits mid-run. Module 9’s interrupt pauses the graph until someone approves the plan — possibly hours later, so approval must arrive on its own request.
Processes die. Deploys, crashes, out-of-memory (OOM) kills. If run state lives in the process, every restart burns every in-flight job and its tokens.

The answer is the job pattern: asynchronous execution — the server accepts work and returns immediately; the work completes on its own schedule, decoupled from any request. The client’s contract becomes: submit (get a job_id back in milliseconds, with HTTP 202 — “accepted, not done”), poll (ask for status), resume (deliver the verdict as a separate call). Pending work waits in a job queue — here an in-process task list, in heavier systems a dedicated broker — adding a knob the synchronous model never had: how many runs execute concurrently, however many arrive. In this lab the knob stays at its implicit default — the event loop’s worker thread pool caps concurrency; a dedicated queue is where you’d size it deliberately.
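Seen from the client, the submit/poll half of that contract reduces to a small loop. The following is a minimal sketch, not the lab's client: `poll_until_settled` and its parameters are hypothetical, `get_status` stands in for an HTTP GET on the status route, and the status names are the ones this module's job lifecycle defines later on.

```python
import time
from typing import Callable

# Statuses at which a polling client should stop and act: either a human
# has to decide, or the job has reached a terminal state.
SETTLED = {"awaiting_approval", "done", "partial", "failed"}


def poll_until_settled(
    get_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call get_status() until the job needs a human or has finished.

    get_status is injected so the loop can be exercised without a server.
    """
    deadline = clock() + timeout_s
    while True:
        view = get_status()
        if view["status"] in SETTLED:
            return view
        if clock() >= deadline:
            raise TimeoutError(f"job still {view['status']!r} after {timeout_s}s")
        sleep(interval_s)


# Usage with a scripted fake instead of a live server.
_script = iter([
    {"status": "pending"},
    {"status": "pending"},
    {"status": "awaiting_approval", "plan": ["step 1"]},
])
final = poll_until_settled(lambda: next(_script), sleep=lambda s: None)
print(final["status"])  # awaiting_approval
```

The timeout belongs to the client, not to any single request: each poll is a millisecond-scale call, and only the loop as a whole spans minutes.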

The build is cheap because of a decision you already made in Module 5: the checkpointer persists the full ScoutState after every super-step, keyed by thread_id. Set thread_id = job_id and the server holds no run content — not the plan, not the transcript, not the half-written report. That’s statelessness: any process that can reach the persistence layer can serve any request about any job — restart, replace, or multiply servers without losing work. The checkpointer was sold to you as a memory feature; it turns out to be the load-bearing wall of the deployment.

Designing the agent API: jobs, status, and human approval

Scout’s wire contract is three routes, frozen for the rest of the course — Modules 11 and 13 build against them unchanged:

Route	Body	Returns
`POST /research`	`{question}` (+ optional `Idempotency-Key` header)	`202 {job_id}`
`GET /status/{job_id}`	—	status, plan, report, error
`POST /research/{job_id}/approve`	`{approved: bool, edited_plan?: ResearchPlan}`	job view

The implementation uses FastAPI, a Python web framework, to define the routes; uvicorn is the ASGI server process that runs the app.

Behind the routes sits a state machine — every job is in exactly one of six states, and the API’s real job is to enforce the legal transitions:

stateDiagram-v2
    [*] --> pending: POST /research → 202 + job_id
    pending --> awaiting_approval: Planner done — interrupt() fires
    pending --> failed: planning error
    awaiting_approval --> running: approve {approved true}
    awaiting_approval --> failed: approve {approved false}
    running --> done: cited report written
    running --> partial: sources gathered, no full report (Module 13)
    running --> failed: unrecoverable error
    done --> [*]
    partial --> [*]
    failed --> [*]

The job lifecycle: pending covers planning; running is post-approval research — clients tell “drafting” from “researching” by status alone. The partial arrow is contract, not yet behavior: this module maps a no-report run to failed; Module 13 first produces partial.
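One way to make "enforce the legal transitions" concrete is a lookup table transcribed from the diagram above. This is a sketch under that assumption only; `TRANSITIONS`, `IllegalTransition`, and `advance` are hypothetical names, not the lab's code.

```python
# Legal transitions, copied edge by edge from the lifecycle diagram.
TRANSITIONS: dict[str, set[str]] = {
    "pending": {"awaiting_approval", "failed"},
    "awaiting_approval": {"running", "failed"},
    "running": {"done", "partial", "failed"},
    "done": set(),      # terminal
    "partial": set(),   # terminal
    "failed": set(),    # terminal
}


class IllegalTransition(Exception):
    """Raised when a caller asks for a move the diagram does not allow."""


def advance(current: str, target: str) -> str:
    """Return target if current -> target is a legal edge, else raise."""
    if target not in TRANSITIONS[current]:
        raise IllegalTransition(f"{current} -> {target}")
    return target


# Usage: the happy path through the lifecycle.
status = "pending"
status = advance(status, "awaiting_approval")
status = advance(status, "running")
status = advance(status, "done")
print(status)  # done
```

Keeping the table in one place means a route handler never decides legality on its own; it asks the table and maps a refusal to an HTTP error.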

The branch that makes this an agent API rather than a generic task API is awaiting_approval. When the Module 9 interrupt fires, the graph checkpoints itself and stops; the service marks the job awaiting_approval and exposes the drafted plan in the status payload. The reviewer calls the approve route with the payload Module 9’s resume already speaks: {approved: bool, edited_plan?: ResearchPlan}. Approval resumes the graph from its checkpoint via Command(resume=...); an edited plan replaces the draft; a rejection terminates the job. The API adds transport, never semantics — if a deployment layer changes what the agent does, it has overstepped.

Two contract details earn their place now and get hardened in Module 13. First, partial: a run that gathered sources but couldn’t finish a report deserves a status that says so — frozen into the vocabulary today, set for the first time in Module 13; until then the lab returns failed for that case. Second, the Idempotency-Key header — idempotency is the property that performing the same operation twice has the same effect as once. Clients retry and networks duplicate; without the key, a retried POST /research launches a second token-burning run — with it, the same key returns the same job_id and the duplicate costs nothing.

One alternative to polling deserves a mention, because objective 2.5 names it: streaming — pushing incremental results over a held-open channel (Server-Sent Events or WebSocket) instead of waiting to be asked. You built the in-graph version in Module 2 (graph.stream()); the service-level version pushes status changes and report tokens. Better experience, heavier lift — held-open connections complicate load balancing and timeouts. Polling is the contract; streaming is the upgrade, left as a lab exercise.
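For orientation, the wire format of Server-Sent Events is plain text: an optional `event:` line, a `data:` line, and a blank line that ends the frame. The sketch below shows only that formatting step with the standard library; the event type `status` and the payload shape are assumptions, and the HTTP plumbing is deliberately left out.

```python
import json
from typing import Iterable, Iterator


def sse_frames(events: Iterable[dict]) -> Iterator[str]:
    """Yield one SSE frame per event.

    json.dumps emits a single line by default, so each payload fits in
    one data: field; the trailing blank line terminates the frame.
    """
    for event in events:
        yield f"event: {event['type']}\ndata: {json.dumps(event['data'])}\n\n"


# Usage: two status changes rendered as frames.
frames = list(sse_frames([
    {"type": "status", "data": {"status": "pending"}},
    {"type": "status", "data": {"status": "awaiting_approval"}},
]))
print(frames[0], end="")
```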

Containerization: Docker for agent services

Containerization packages the application with its entire runtime — interpreter, dependencies, system libraries — into one immutable image that runs identically on any machine with a container engine. “Works on my machine” is not a deployment strategy for a system whose behavior depends on pinned versions and prompts.

The lab’s Dockerfile makes five decisions that generalize to every agent service you’ll ship:

Pinned base image (python:3.12-slim) — an implicit latest is a deployment that changes under your feet.
Lockfile-driven installs (uv sync --frozen) — the image contains exactly what uv.lock says, or the build fails.
Non-root user — if the service is compromised (Module 9: Scout reads the open web), the blast radius is a deprivileged user in a sandbox.
HEALTHCHECK probing a cheap /health route — no LLM call. The engine learns whether the service is alive; a load balancer later reuses the probe to route around dead replicas.
Secrets at runtime, never in the image. The NVIDIA key arrives via --env-file at start; anything baked into an image is readable by anyone who can pull it, forever.

One container on one host is still one process on one machine: no extra capacity, no zero-downtime updates, no recovery when the host dies. That’s the orchestration layer’s job, and objective 4.4 wants you to know what Kubernetes actually adds: replicas (N copies, scheduled across machines), load balancing (distributing incoming requests across replicas so no instance saturates while others idle), autoscaling (replica count follows load), rolling updates (replace replicas one at a time; zero downtime), and self-healing (failed health checks trigger replacement). What’s not on the list: Kubernetes does nothing to make your LLM endpoint answer faster or to raise its rate limit. Hold that thought.

	Single Docker host	Docker Compose	Kubernetes
When	One service, one machine — this lab	A few cooperating containers (API + Redis), one machine	Many replicas, many machines, uptime requirements
What you gain	Reproducible runtime, isolation, portability	Multi-container wiring in one file, one `docker compose up`	Replicas, load balancing, autoscaling, rolling updates, self-healing
What you pay	No high availability (HA), no scaling — the host is the ceiling	Still one host, no HA — wiring, not orchestration	A control plane (the cluster’s own management services) to run and learn; real ops investment

The honest progression: stay left as long as you can, move right on evidence — Module 3’s discipline, one layer down the stack.

Scaling and sizing: workers, queues, and rate limits

Two words the exam will make you tell apart, because most wrong answers in Domain 4 confuse them. Latency is how long one request takes — you track p50 and p95 (the time under which 50% and 95% of requests finish). Throughput is how many requests the system completes per unit of time — for Scout, jobs per minute. They move independently: adding replicas can double throughput while each job’s latency stays exactly where it was. “Fast for one user, collapses under many” is a throughput problem; “every request is slow, even alone” is latency. Name which one you’re fixing before touching anything.
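Both quantities are a few lines of arithmetic. The sketch below uses the nearest-rank definition of a percentile (one of several in common use) and made-up durations; the function names are hypothetical, and the two-replica line assumes nothing downstream is saturated.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def throughput_per_min(completed: int, wall_seconds: float) -> float:
    """Jobs completed per minute of wall-clock time."""
    return completed / wall_seconds * 60


# Illustrative per-job durations in seconds (not measured).
job_seconds = [120.0, 125.0, 130.0, 135.0, 140.0]
p50 = percentile(job_seconds, 50)
p95 = percentile(job_seconds, 95)

# Same per-job latency, twice the replicas: throughput doubles,
# p50 and p95 do not move.
one_replica = throughput_per_min(completed=5, wall_seconds=140.0)
two_replicas = throughput_per_min(completed=10, wall_seconds=140.0)
print(p50, p95, round(one_replica, 2), round(two_replicas, 2))
```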

Three execution tiers, in increasing operational weight:

	In-process background tasks	Worker pool (more uvicorn workers)	External queue (Redis/Celery-style)
What it is	The API process runs jobs on its own threads — this lab	N API processes behind one socket	Broker holds jobs; dedicated workers consume them
Survives a process death?	No — in-flight jobs lost (graph state survives in the checkpointer)	No — same, per worker	Yes — queued jobs persist, get re-delivered
Scales horizontally?	No — capacity is one process	One machine’s worth	Yes — add workers on any machine
Complexity	None — just code	One flag	A broker to run, monitor, version

The lab stays in column one — then makes you measure why columns two and three wouldn’t help this service yet. Run the load script with five concurrent jobs and watch what saturates:

flowchart LR
    C[clients] -->|POST /research<br/>GET /status| A[FastAPI service<br/>ms responses, idle CPU]
    A -->|job runner<br/>worker threads| G[LangGraph graph<br/>supervisor + specialists]
    G <--> K[(SQLite checkpointer<br/>thread_id = job_id)]
    G -->|every LLM call| N[NIM endpoint<br/>build.nvidia.com]
    N -. "40 req/min per account<br/>(as of June 2026)<br/>⬅ THE bottleneck" .-> G

Five concurrent runs, one rate-limited account: LLM calls queue at the endpoint while the web tier serves polls in milliseconds.

A Scout job spends its life making LLM calls — the Planner, supervisor turns, the specialists, the Writer, Module 9’s rails — against one endpoint rate-limited at 40 requests per minute per account (as of June 2026). Five concurrent jobs already contend for that budget; scout/llm.py backs off on 429s — short bursts stretch the run, sustained saturation can still fail a job — while the API process naps between polls. “Add more API workers,” the exam’s favorite trap, adds capacity where spare capacity already exists. When the bottleneck is the model endpoint, throughput comes from levers that act on model calls — pacing work through a queue sized to the rate limit, batching or caching, a second account or endpoint, or a self-hosted NIM whose limits are your hardware’s.
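"Pacing work through a queue sized to the rate limit" can be as simple as spacing calls evenly. The sketch below is a fixed-interval pacer, which is a simpler technique than a token bucket and is not what scout/llm.py does (that module backs off on 429s); `Pacer` and `FakeTime` are hypothetical names, and the clock is injected so the demo runs instantly.

```python
import threading
import time
from typing import Callable


class Pacer:
    """Space calls at least 60 / per_minute seconds apart.

    No bursts are allowed, so the call rate aims to stay at or below
    the per-minute budget.
    """

    def __init__(
        self,
        per_minute: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()
        self._lock = threading.Lock()

    def wait(self) -> float:
        """Block until the next free slot; return the seconds slept."""
        with self._lock:  # worker threads take slots one at a time
            now = self._clock()
            delay = max(0.0, self._next_slot - now)
            if delay > 0:
                self._sleep(delay)
            self._next_slot = max(now, self._next_slot) + self.interval
            return delay


class FakeTime:
    """Test double: sleeping advances the clock instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def clock(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeTime()
pacer = Pacer(per_minute=40, clock=fake.clock, sleep=fake.sleep)
for _ in range(40):
    pacer.wait()  # in real use: call this before each LLM request
print(fake.now)  # 58.5
```

At 40 requests per minute the interval is 1.5 seconds, so five concurrent jobs sharing one pacer slow down predictably instead of colliding with 429s. The trade-off is that an idle budget cannot be spent in a burst.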

Which is why profiling under load (objective 4.3) comes before scaling decisions: measure p50/p95 per phase, jobs per minute, error rate, and where time accumulates — sizing without a profile is guessing with a budget.

The economics of deployment: hosted endpoints vs. self-hosted NIM

Every LLM call Scout has made since Module 1 hit a hosted endpoint — build.nvidia.com runs the GPUs; you pay per use. The alternative is self-hosting the same model as a NIM container on a GPU you rent or own. Objective 4.5 wants the decision framework, not vendor loyalty:

	Hosted endpoint (build.nvidia.com-style)	Self-hosted NIM on a rented GPU
Cost model	Variable — per request/token; zero at zero traffic	Fixed — an L4 GPU rented on NVIDIA Brev (NVIDIA’s GPU-rental service) runs ~$0.44–0.80/hour (as of June 2026): ~$320–580/month, traffic or not
Rate limits	The provider’s (40 req/min here, as of June 2026)	Your hardware’s — you size it
Data residency	Prompts and outputs transit the provider	Everything stays in your network
Latency control	None — shared infrastructure	Yours to tune (and Module 12 shows the NVIDIA tooling for it)
Ops burden	None — it’s an API	Yours: driver stacks, model updates, monitoring, capacity
HA story	Provider’s problem	≥2 GPUs — the fixed cost doubles

The break-even logic is arithmetic plus two vetoes. Arithmetic: at low or spiky volume, pay-per-use wins — a GPU billing ~$500/month to serve 200 requests a day is mostly paid idleness; at high sustained volume the fixed cost amortizes and self-hosting wins. The vetoes override the math in either direction: a data residency requirement forces self-host; a zero-ops-team reality forces hosted — each regardless of volume.
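The arithmetic half can be written down directly. In the sketch below the GPU hourly rates are the ones quoted in the table above; the hosted price per request is a placeholder assumption, since no hosted price is quoted here, and the function names are hypothetical. The model ignores ops labor and the throughput ceiling of a single GPU.

```python
HOURS_PER_MONTH = 730  # average month length in hours


def gpu_monthly_usd(hourly_usd: float, gpus: int = 1) -> float:
    """Fixed cost: the GPU bills every hour, traffic or not."""
    return hourly_usd * HOURS_PER_MONTH * gpus


def break_even_requests_per_day(
    hourly_usd: float,
    hosted_usd_per_request: float,
    gpus: int = 1,
) -> float:
    """Daily volume at which hosted spend equals the fixed GPU bill."""
    days_per_month = HOURS_PER_MONTH / 24
    monthly_requests = gpu_monthly_usd(hourly_usd, gpus) / hosted_usd_per_request
    return monthly_requests / days_per_month


HOSTED_USD_PER_REQUEST = 0.01  # ASSUMPTION: placeholder, not a quoted price

for hourly in (0.44, 0.80):
    monthly = gpu_monthly_usd(hourly)
    daily = break_even_requests_per_day(hourly, HOSTED_USD_PER_REQUEST)
    print(f"${hourly:.2f}/h -> ${monthly:.0f}/month, break-even ~{daily:.0f} requests/day")
```

Under that placeholder price the break-even sits between roughly one and two thousand requests a day, far above 200; passing `gpus=2` for high availability doubles it. Substitute your own per-request cost before drawing conclusions.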

Whatever you choose, production adds a floor: high availability (HA), the property that the service survives the failure of any single component. It is achieved with redundancy (at least two replicas of anything that can die), health checks to detect the death, and load balancing plus client retries to route around it. Your earlier work pays off here: stateless API replicas are trivially duplicable because run state lives in the checkpointer, and the /health route exists because a load balancer will need it. HA decisions are cheap when made early and expensive when retrofitted late.

Last piece of objective 4.2’s deployment half: CI/CD — continuous integration/continuous deployment, the automated build-test-deploy pipeline. The pipeline that catches real regressions, in order: offline smoke tests (seconds, no API calls), image build (the artifact you’ll run), then the agent-specific gate — Module 8’s regression evals on the golden set, judge scores compared against the last accepted run. A prompt tweak that tanks grounding will pass your unit tests; only the eval gate catches it before users do. Deploy only on green. The other half of 4.2 — monitoring and governance once traffic flows — is Module 11’s territory.
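The eval gate itself is a comparison between two score sets. The sketch below assumes scores are floats keyed by metric name and that a fixed tolerance is acceptable; the metric names, the 0.05 tolerance, and `eval_gate` are illustrative assumptions, not Module 8's harness.

```python
def eval_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.05,
) -> list[str]:
    """Return the metrics that regressed; an empty list means green.

    A metric regresses if the candidate run is missing it or scores more
    than max_drop below the last accepted run.
    """
    regressions = []
    for metric, accepted in baseline.items():
        score = candidate.get(metric)
        if score is None or accepted - score > max_drop:
            regressions.append(metric)
    return regressions


# Usage: a prompt tweak that tanks grounding but nudges completeness up.
last_accepted = {"grounding": 0.90, "completeness": 0.85}
this_build = {"grounding": 0.70, "completeness": 0.86}
failed_metrics = eval_gate(last_accepted, this_build)
print(failed_metrics)  # ['grounding']
```

In a pipeline, a non-empty result would end the job with a non-zero exit code so the deploy step never runs.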

Hands-on lab: build it

The full lab lives in module-10/ of the labs repo; these are the load-bearing excerpts.

Objective: expose the Module 9 Scout as a containerized FastAPI service with the full job lifecycle, plan approval included.

Observable result: curl -X POST /research returns a job_id immediately; polling shows the plan under awaiting_approval; one approve call later, the status walks through running to done with the cited report — and the same scenario works against docker run. The graph: zero lines changed.

Step 1 — The service layer: a job is a thread

scout/api/jobs.py builds the graph once per process, on the Module 5 checkpointer, and pins the one equality everything depends on:

# module-10/scout/api/jobs.py (excerpt)
from functools import lru_cache

@lru_cache(maxsize=1)
def get_graph():
    """One compiled graph per process, on the module-05 SQLite checkpointer."""
    return build_graph(checkpointer=memory.get_checkpointer())

def _config(job: Job) -> dict:
    # thread_id == job_id: a job IS a graph thread. This single equality
    # keeps the server stateless — any process holding the checkpointer
    # can poll or resume any job.
    return {"configurable": {"thread_id": job.job_id, "user_id": API_USER_ID}}

API_USER_ID deserves a glance: every API client shares one anonymous Module 5 memory namespace until authentication arrives — Module 13 territory. Job metadata (status, error) lives in an in-memory dict — an honest shortcut; the “In production” box names the real fix.

Step 2 — `POST /research`: accept, don’t block

# module-10/scout/api/main.py (excerpt)
@app.post("/research", status_code=202, response_model=JobCreated)
async def submit_research(
    body: ResearchRequest,
    idempotency_key: str | None = Header(default=None, alias="Idempotency-Key"),
) -> JobCreated:
    """Accept the job and return immediately. 202 is the honest status
    code — "accepted, not done" — for a run that takes minutes."""
    job, created = jobs.create_job(body.question, idempotency_key)
    if created:
        _spawn(jobs.run_job, job.job_id)  # asyncio.to_thread, off the event loop
    return JobCreated(job_id=job.job_id)

Phase one — submission to interrupt — runs in run_job:

# module-10/scout/api/jobs.py (excerpt)
def run_job(job_id: str) -> None:
    job = JOBS[job_id]
    try:
        result = get_graph().invoke(initial_state(job.question), _config(job))
    except Exception as exc:
        # Recorded, not swallowed: an unhandled crash in a background task
        # is a job that looks alive forever from the client's side.
        job.status, job.error = "failed", f"{type(exc).__name__}: {exc}"
        return
    if "__interrupt__" in result:
        job.status = "awaiting_approval"   # a human's turn
    else:
        _finish(job, result)

Step 3 — Status: registry + checkpointer, merged

GET /status/{job_id} reads metadata from the registry and content (plan, report) from the graph state — a sync def, so the SQLite read runs on FastAPI’s threadpool and never blocks the event loop. The report is exposed only once the status says done.
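The merge can be sketched as a pure function over the two sources. Here the graph state is a plain dict; in the service it would presumably be read from the checkpointer through the compiled graph. `JobMeta`, `status_view`, and the state keys `plan` and `report` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobMeta:
    """Registry side: what the API process tracks about a job."""
    job_id: str
    status: str
    error: str | None = None


def status_view(meta: JobMeta, state: dict) -> dict:
    """Merge registry metadata with graph state into one response body.

    The report is withheld until the status is done, so clients never
    read a half-written draft.
    """
    return {
        "job_id": meta.job_id,
        "status": meta.status,
        "plan": state.get("plan"),
        "report": state.get("report") if meta.status == "done" else None,
        "error": meta.error,
    }


# Usage: the same state, viewed before and after the job finishes.
state = {"plan": ["search", "summarize"], "report": "draft text"}
running_view = status_view(JobMeta("job-123", "running"), state)
done_view = status_view(JobMeta("job-123", "done"), state)
print(running_view["report"], done_view["report"])  # None draft text
```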

Step 4 — Approval over HTTP

The approve endpoint enforces the state machine (404 unknown, 409 if there’s nothing to approve), then delivers the verdict with the exact Module 9 resume payload:

# module-10/scout/api/jobs.py (excerpt)
# Command is imported from langgraph.types.
result = get_graph().invoke(
    Command(resume={"approved": approved, "edited_plan": edited_plan}),
    _config(job),
)

Run the lifecycle end to end:

cd module-10 && uv run uvicorn scout.api.main:app --port 8000
# in another terminal:
curl -s -X POST http://127.0.0.1:8000/research \
  -H "Content-Type: application/json" -H "Idempotency-Key: demo-001" \
  -d '{"question": "What is the Nemotron Coalition announced at GTC 2026?"}'
# {"job_id":"job-fa4e1401a262"}  — poll it, approve it, poll it again:
curl -s http://127.0.0.1:8000/status/job-fa4e1401a262      # pending… then awaiting_approval + plan
curl -s -X POST http://127.0.0.1:8000/research/job-fa4e1401a262/approve \
  -H "Content-Type: application/json" -d '{"approved": true}'
curl -s http://127.0.0.1:8000/status/job-fa4e1401a262      # running → done + report

Step 5 — The Dockerfile

# module-10/Dockerfile (excerpt)
# Explicit base tag — an implicit "latest" changes under your feet.
# Stage 1 — builder: a C++ toolchain, because one transitive dependency
# (annoy, via nemoguardrails) ships no prebuilt wheel and compiles from
# source. The toolchain never reaches the runtime image.
FROM python:3.12-slim AS builder
COPY --from=ghcr.io/astral-sh/uv:0.5 /uv /uvx /bin/
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev               # the lockfile IS the contract

# Stage 2 — runtime: same pinned base, no compilers, no uv. Only the
# ready-made virtualenv and the code cross the stage boundary.
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY module-10/scout/ scout/
# Non-root: a container that doesn't need root shouldn't have it.
RUN useradd --create-home --uid 1000 scout && chown -R scout:scout /app
USER scout
ENV PATH="/app/.venv/bin:$PATH"
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/health', timeout=2)"
CMD ["uvicorn", "scout.api.main:app", "--host", "0.0.0.0", "--port", "8000"]

Stop the Step 4 uvicorn server first (Ctrl-C) — both want port 8000:

docker build -t scout-api -f module-10/Dockerfile .   # from the repo root
docker run --rm -p 8000:8000 --env-file .env scout-api

The keys arrive at runtime via --env-file — never COPYed, never ENVed, never in a layer (the cached filesystem snapshots an image is built from); .dockerignore keeps .env out of the build context (the set of files sent to the builder — here the repo root) entirely. Since the context is the repo root, the copy BuildKit actually reads is Dockerfile.dockerignore, next to the Dockerfile — see the lab. Re-run the Step 4 curl scenario against the container: same behavior, reproducible anywhere Docker runs.
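Because the keys exist only in the container's environment, a startup check can report a missing one before the first job fails on it. This is a hedged sketch: the variable names `NVIDIA_API_KEY` and `TAVILY_API_KEY` and both helpers are assumptions, so use whatever names your .env actually defines.

```python
import os
from collections.abc import Mapping

# ASSUMPTION: illustrative names; match them to your own .env file.
REQUIRED_SECRETS = ("NVIDIA_API_KEY", "TAVILY_API_KEY")


def missing_secrets(env: Mapping[str, str] | None = None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


def require_secrets(env: Mapping[str, str] | None = None) -> None:
    """Raise at startup, naming the missing variables but never their values."""
    missing = missing_secrets(env)
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )


# Usage with an explicit mapping instead of the real environment.
print(missing_secrets({"NVIDIA_API_KEY": "placeholder"}))  # ['TAVILY_API_KEY']
```

Calling `require_secrets()` once at application startup turns a forgotten `--env-file` into an immediate, readable error.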

Step 6 — Load-test it, gently

With the API (or the container) still listening on :8000, from the repo root:

uv run python module-10/scripts/load_test.py --jobs 3

job               status        plan  research   total
job-f1a5316c9d3d  done         30.1s     98.4s  128.6s
job-7d01e702ab6a  done         28.1s    118.5s  146.6s
job-274bfab9c292  done         36.2s     90.4s  126.6s

completed: 3/3   wall time: 146.7s
throughput: 1.23 jobs/min
GET /status latency: p50 7 ms   p95 12 ms

(Your numbers will differ; the shape won’t.) Per-job time stretches as concurrency grows, because the jobs queue on the endpoint’s 40 req/min, while GET /status stays in the low milliseconds (p50 7 ms, p95 12 ms in the run above). The bottleneck is the model endpoint; no number of API workers would move that throughput line.

Verify — offline tests drive a scripted fake graph through the real routes; the live marker runs one real job end to end:

uv run pytest module-10/tests/                       # offline, no API calls
SCOUT_LIVE_TESTS=1 uv run pytest module-10/tests/    # + 1 real research job

Try it yourself (no solution provided):

Edit the plan over HTTP. While a job is awaiting_approval, GET its plan, change one step’s search_queries, POST it back as edited_plan with approved: true. Confirm via status that the graph executed your plan.
Streaming status with SSE. Add GET /research/{job_id}/events pushing status changes as Server-Sent Events (StreamingResponse, media_type="text/event-stream"). Keep the polling route: SSE is the upgrade, polling is the contract.

In production

What separates this lab from what a platform team would sign off on: the job registry moves from an in-process dict to Redis or Postgres, and the checkpointer from local SQLite to Postgres — that pair makes horizontal scaling real rather than nominal, since metadata and state become visible across replicas. The API grows authentication and per-client rate limiting (your 40 req/min budget is one noisy client away from exhaustion), and the Idempotency-Key header gets honored for real — persisted, scoped, expiring; Module 13 hardens both that and partial results. Polling stops being the only window: production APIs stream progress over SSE or WebSockets — Module 2’s streaming lesson applied at the service tier. Deploys go blue-green: the new version takes traffic only after its health checks and eval gate pass; the old one stays warm for instant rollback. SLOs (service-level objectives) make the implicit explicit — “p95 time-to-plan under 60s; 99% of accepted jobs reach a terminal state” — because an unmeasured promise is a future incident. Managed platforms (LangGraph Platform among them) sell parts of this lifecycle off the shelf — the older LangServe path is in maintenance mode, not worth new investment — but what you built by hand is exactly what you’d be configuring, or debugging, in theirs.

Exam corner

What the exam tests here. Per the official blueprint, Deployment and Scaling carries 13% of the exam. The study guide’s objectives:
- 4.1: deploy and orchestrate multi-agent systems at production scale (the job API and lifecycle).
- 4.2: apply MLOps (ML operations) practices for CI/CD, monitoring, and governance. This module owns the CI/CD half (smoke tests, image build, eval gate); the monitoring half belongs to Module 11, and questions usually signal which half with “before release” vs. “in operation”.
- 4.3: profile performance and reliability under distributed system loads (p50/p95, jobs/min, finding the bottleneck).
- 4.4: scale deployments using containerization (Docker, Kubernetes) with load balancing.
- 4.5: optimize deployment costs while ensuring high availability.
This module also reinforces 2.5 (dynamic conversation flows with real-time streaming): the mechanics are Module 2’s graph.stream(); here you saw the service-level version in concept.

Quiz — answers after question 5.

1. A research agent takes 2–5 minutes per run and pauses mid-run for a human to approve its plan. Browser and script clients must trigger runs and deliver approvals. Which exposure pattern fits?
- A) A synchronous POST with the server timeout raised to 10 minutes
- B) An async job API: POST returns a job_id (202), clients poll a status route, approval is its own endpoint resuming the run from persisted state
- C) An hourly batch: queue questions in a file, email reports
- D) A WebSocket held open for the whole run, approval delivered over the same socket
2. A service meets its p95 latency target with 10 concurrent users. At 200, status requests time out; the single API container’s CPU is pegged while the LLM endpoint sits well under its rate limit. The right lever?
- A) Switch to a larger, more capable model
- B) Raise the model’s token budget per call
- C) Add API replicas behind a load balancer
- D) Tell clients to poll less often
3. A team on a single Docker host needs zero-downtime updates, automatic replacement of crashed containers, and scaling across machines, and it keeps getting 429s from its hosted LLM endpoint. Which statement is correct?
- A) Moving to Kubernetes addresses all four needs, including the 429s
- B) Kubernetes adds rolling updates, self-healing, autoscaling, and load balancing, but the 429s persist: an external endpoint’s rate limit is independent of your container orchestration
- C) Docker alone already load-balances across hosts
- D) Kubernetes is required to run more than one container of an image
4. An internal agent serves ~200 requests/day in unpredictable bursts. No data-residency constraint, no ops team; the service must survive single-component failures. The most cost-sound setup?
- A) Hosted model endpoints (pay-per-use) with at least two stateless API replicas behind a load balancer
- B) One self-hosted GPU running a NIM, to control latency
- C) Two self-hosted GPUs running NIMs, for model-side high availability
- D) The largest hosted model available, for maximum reliability
5. Which CI/CD pipeline is correctly designed for an agent service?
- A) Live end-to-end agent runs on every commit, deploy when green
- B) Build the image, deploy it, then evaluate on production traffic
- C) Deploy on merge; production monitoring catches regressions
- D) Offline smoke tests, then image build, then a regression-eval gate on the golden set, then deploy

Answers.
1. B. Minutes-long work plus a mid-run human rules out anything that couples the run to a connection: A dies at every proxy and loses work on redeploy; D deadlocks when the approver isn’t at the socket; C ignores interactive approval. Submit / poll / resume is the 4.1 pattern.
2. C. The profile says the web tier is saturated and the model isn’t, so add capacity where the system lacks it: replicas + load balancing. A and B act on the model (not the bottleneck); D sheds load instead of serving it.
3. B. The orchestration layer buys replicas, load balancing, autoscaling, rolling updates, self-healing. An external endpoint’s rate limit survives any amount of Kubernetes; that lever is queues, pacing, batching, another endpoint, or self-hosting.
4. A. Low, spiky volume is the textbook pay-per-use case: a ~$320–580/month GPU (L4 on Brev, as of June 2026) idling on 200 requests a day fails the arithmetic; C doubles the failure. HA still applies to your tier: two stateless replicas are cheap because state lives in the persistence layer.
5. D. Cheap, deterministic checks first; build the artifact you’ll ship; gate on the regression evals that catch what unit tests can’t; deploy last. A burns tokens on every commit for flaky signal; B and C discover regressions in production, which is what the gate exists to prevent.

Traps to avoid:

Latency vs. throughput. A faster model improves latency, not necessarily throughput; more replicas improve throughput, not latency. Scenarios quoting both “slow for everyone” and “collapses under load” test whether you treat them as one disease.
“Add more workers.” More API workers help only where the system has idle capacity to use. When the bottleneck is the LLM endpoint’s rate limit, the levers are queues/pacing, batching, caching, or another endpoint — never the web tier.
“Docker = scalable.” A container is packaging. Replication, load balancing, autoscaling, and rolling updates come from the orchestration layer — a single Docker host has none of them.
“Cheapest” ≠ “cost-effective” (4.5). High availability applies to your stateless tier — two cheap replicas — while an idle self-hosted GPU fails the arithmetic. Rank the options on the scenario’s qualifier, not on the sticker price.

Key takeaways

An agent behind an API is a job, not a function: submit returns a job_id (202), polling reads status, approval is its own request — runs outlast connections, and humans sit mid-run.
The checkpointer is the deployment backbone: thread_id = job_id makes the server stateless, restarts harmless, and any replica able to resume any job.
The Module 9 interrupt crosses the API unchanged: awaiting_approval exposes the plan; {approved, edited_plan?} resumes the graph. The API adds transport, never semantics.
Docker gives reproducibility and isolation — not scaling. Replicas, load balancing, autoscaling, and rolling updates come from the orchestration layer.
The bottleneck of an agent service is almost always the LLM endpoint and its rate limit — profile first; API workers add no throughput there.
Latency is how long one request takes (p50/p95); throughput is how many complete per minute. Different diseases, different cures.
Hosted vs. self-hosted is an equation — variable cost vs. fixed, overridden by data residency and ops capacity — and high availability means ≥2 of anything that can die, with health checks and retries.

Keep going

Scout is online — but is it healthy? Next module: tracing every node, cost per request, and quality alerts when the eval score drifts.

References

NCP-AAI certification page — the official blueprint; Deployment and Scaling is weighted at 13%.
Background Tasks — FastAPI on running work after the response, and when to graduate to a real queue.
LangGraph persistence — checkpointers, threads, resuming — the mechanics statelessness stands on.
Multi-stage builds — Docker’s guide to lean images; the pattern the lab’s builder/runtime Dockerfile applies.
Kubernetes Glossary — NVIDIA on container orchestration, scaling, and load balancing; an official study-guide reading for Domain 4.