Agent Memory: State, Persistence, and Long-Term Recall (NCP-AAI Module 5)

Module 5 of 14 22 min read D5 · 10% Lab code ↗

This is Module 5 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.

Four minutes into a research run, Scout is doing everything right: the plan is solid, two of five steps are done, results are piling up. Then the process dies — a 429 that exhausts its retries, a dropped connection, your own Ctrl+C. Everything lived in RAM. The plan, the critique, the tool results: gone. You rerun from zero and pay every token a second time. Multiply by every crash and every flaky network, and “stateless agent” starts reading as “token bonfire with extra steps.”

There’s a second, quieter pain: every session, you tell Scout the same thing — concise bullets, English, cite primary sources — and it forgets you completely by the next run.

By the end of this module, kill -9 costs you nothing — Scout resumes mid-run, exactly where it died — and it remembers your preferences from one session to the next.

In this module

  • You’ll learn:
    • Distinguish short-term, long-term, and episodic memory — and map each to a concrete storage mechanism (graph state, checkpointer, store).
    • Implement session persistence with a SQLite checkpointer: resume after a crash, inspect state history, time-travel to a past checkpoint.
    • Build a cross-thread long-term store (preferences, covered topics) that the Planner reads to personalize research plans.
    • Manage the context window — trimming, summarization, and the discipline of context engineering — and decide what to remember, where, and for how long.
  • You’ll build: A Scout that survives crashes (SQLite checkpointer) and remembers user preferences across sessions (long-term store read by the Planner).
  • Exam domains covered: D5 — Cognition, Planning, and Memory — 10% of the exam (this module covers the memory half; Module 4 covered planning and reasoning).
  • Prerequisites: Modules 1–4 (your LangGraph graph with the Planner node runs); NVIDIA API key configured, plus your Tavily key from Module 2 — every research run in the lab calls web_search.

Where you are

  • ✅ Module 1 — What Is Agentic AI? — vocabulary, landscape, first NIM call
  • ✅ Module 2 — Build Your First AI Agent — ReAct loop, tool calling, first graph
  • ✅ Module 3 — Agent Architecture — patterns, trade-offs, Scout’s design doc
  • ✅ Module 4 — Cognition — Planner node, reflection loop, reasoning budgets
  • 👉 Module 5 — Agent Memory (you are here)
  • ⬜ Modules 6–14 — RAG, multi-agent, evals, guardrails, deployment, the exam

Scout before: a single-graph agent with a Planner — and total amnesia: everything in RAM, a crash erases the run, no run knows any other existed. Scout after: every step of every run is persisted per thread, and a long-term memory of you — preferences, covered topics — survives across sessions and feeds the Planner.

Why your agent forgets everything: the memory stack

Start from the uncomfortable fact: the LLM is stateless. Every API call starts from zero — the “conversation” is an illusion your own code maintains by re-sending the full transcript with every call. You’ve done this since Module 2: messages[] is the memory, and your code is its custodian. So every kind of agent memory is engineering around the model, never inside it. The question is never “does my agent remember” — it’s “which component remembers this, and how long does it live?”

That question has a standard vocabulary, and the exam uses it precisely. Short-term memory is what the agent knows during one run: the graph state plus the message history inside the current context window — rich, exact, gone when the run ends. Long-term memory is whatever survives across sessions: anything deliberately written to durable storage. Within long-term memory, three flavors matter. Semantic memory stores facts — “this user wants concise English reports.” Episodic memory stores experiences — the trace of what the agent did and what happened, like “we researched the EU AI Act on June 10”; it’s the flavor behind objective 5.5, because an agent that adapts its behavior based on prior experiences needs a record of those experiences to consult. Procedural memory stores how to do things — for an LLM agent, the prompts and code themselves, versioned in git rather than in a database.

The “who stores what” map, which the exam tests in scenario form:

Memory typeScopeLifetimeStorage mechanism in ScoutExample
Short-term (working)One run / one threadThe run and its context windowScoutState: question, messages, plan, plan_iterationsThe tool results gathered so far in this research
Long-term — semanticCross-thread, per userUntil invalidatedStore, key preferences”Concise bullets, English, overview depth”
Long-term — episodicCross-thread, per userUntil purgedStore, key covered_topics”Researched the EU AI Act on 2026-06-10”
Long-term — proceduralCross-thread, all usersVersioned with the codeSystem and planner prompts, node code”Always cite sources as full URLs”

In Scout, this stack is three physical layers: state in RAM during a run, a checkpointer snapshotting that state to SQLite per thread, and a store holding cross-thread memory in the same SQLite file, namespaced per user — read by the Planner before it drafts a plan:

flowchart LR
    subgraph st ["Short-term: one run"]
        Q["ScoutState<br/>question · messages · plan · plan_iterations"]
    end
    subgraph ck ["Thread-scoped persistence"]
        DB[("SQLite checkpointer<br/>one snapshot per graph step,<br/>keyed by thread_id")]
    end
    subgraph lt ["Cross-thread: long-term memory"]
        KV[("Store<br/>namespace ('users', user_id)<br/>preferences · covered_topics")]
    end
    Q -->|"saved after every graph step"| DB
    DB -->|"resume / time travel"| Q
    PL["Planner node"] -->|reads| KV
    RUN["end of run"] -->|writes| KV

Scout’s memory stack: state (in-run) → checkpointer (per thread) → store (across threads). The Planner reads the store; the checkpointer works underneath the graph, no reads required. The rest of the module walks down this stack, one layer at a time.

Short-term memory and context hygiene

The context window is a finite, billed resource — and an agent loop is a machine for filling it. Every iteration appends an assistant turn and tool observations to messages[], and every iteration re-sends all of it: a 12-turn run doesn’t bill 12 turns, it bills roughly the sum of 1 through 12. Quality degrades too. Models attend best to the start and end of long prompts and lose material parked in the middle — the “lost in the middle” effect. I won’t pin a number on where degradation starts, but I’ve watched runs get worse as their transcripts got longer, and so will you.

So the transcript needs hygiene. Three standard strategies: trim (keep the system prompt, the question, and the last N messages — brutal and free); summarization — compressing older history into a rolling summary that replaces the messages it covers, one extra LLM call for the gist of everything without the tokens of everything; and selection — include only what’s relevant to the current step, the retrieval mindset Module 6 builds machinery for.

StrategyCostFidelityLatencyWhen to use
Full historyHighest — the whole transcript re-billed every turnPerfect, until the window overflowsGrows every turnShort runs; debugging, where you want everything
Trim (keep last N)LowestRecent detail exact; older context gone abruptlyNone addedLong tool loops where only recent context matters — Scout’s executor
Rolling summarizationOne extra LLM call per compressionGood gist, lossy on specificsOne call’s worthLong conversations where early decisions must stay in view

The name for this discipline is context engineering — treating what enters the model’s context as a deliberate design decision, with a budget, rather than letting history accumulate by default. In an agent, most of the prompt isn’t written by you — it’s accreted by the loop — so deciding what gets in is as much design as the system prompt itself. Scout’s lab applies the cheapest tool, a trim bound before every executor call, and leaves rolling summarization as your exercise.

One boundary to keep sharp, because the exam probes it: the context window is short-term memory’s ceiling, not a long-term mechanism. A million-token window doesn’t remember last week — it makes this run’s working memory bigger, at this run’s prices.

Persistence: checkpointers, threads, and time travel

Everything so far lives and dies with the process. The fix is the checkpointer — a persistence layer that saves a snapshot of the full graph state after every super-step (one round of graph execution: a node — or several running in parallel — finishes and its state update is applied), so a run can be resumed, inspected, or replayed instead of restarted. In LangGraph you attach one at compile time; from then on, every state transition is written through before the next node runs. Our storage is SQLite — one local file, zero infrastructure, the same mental model as the Postgres checkpointer you’d run in production.

Checkpoints are organized by thread — one persisted sequence of graph runs identified by a thread_id, passed at invocation through config["configurable"]. The thread is the unit of “same conversation”: invoke twice with the same thread_id and the second call continues from the first’s final state; a fresh thread_id is a blank slate. Note what this implies — identity lives in the config, not the state. ScoutState gains no fields in this module, none: persistence is infrastructure underneath the graph, not data inside it.

What a checkpointer buys you:

  • Crash recovery. The run that died at step 3 of 5 resumes at step 3 of 5; already-paid LLM calls are read back from disk, not re-billed. This is the heart of objective 5.4 — stateful orchestration: coordinating multi-step work whose state outlives any single process.
  • Multi-turn sessions. A follow-up can land on the same thread and see the full prior context — across process restarts, not just within one. (That’s what the machinery supports; Scout’s CLI uses the shared thread for crash recovery only and doesn’t take a second question on a finished thread — a conversational front-end would build on exactly this.)
  • Time travel — reading a thread’s checkpoint history and resuming from any past checkpoint, forking the run from that point. It turns “why did the run derail at step 4?” from archaeology into a replay.

The human-in-the-loop interrupts covered in Module 9 — pause the graph, let a human approve the plan, resume — are built directly on this checkpoint machinery.

Here’s the kill-and-resume mechanic you’ll run in the lab:

sequenceDiagram
    participant U as You
    participant G as Scout's graph
    participant C as SQLite checkpointer
    U->>G: run.py "question" --thread report-42
    G->>C: checkpoint (plan v1)
    G->>C: checkpoint (critique)
    G->>C: checkpoint (plan v2)
    Note over G: Ctrl+C — the process dies
    U->>G: run.py --thread report-42
    G->>C: load latest checkpoint
    C-->>G: plan v2, executor pending
    Note over G: resumes at the executor —<br/>3 planning calls NOT re-paid

Now, the distinction this module exists to teach. A checkpointer is thread-scoped: it persists one thread’s state, addressable only by that thread_id. Cross-session memory — “remember this user prefers bullet points, whatever thread they open tomorrow” — needs a different scope: the store, covered next. Side by side:

CheckpointerStore
ScopeOne thread (thread_id)Cross-thread (namespace, e.g. ("users", user_id))
What it savesThe full graph state, automatically, every super-stepKey-value documents you choose to write
Typical contentTranscript, plan, intermediate resultsPreferences, user facts, episodes
LifetimeThe session and its replay historyAcross sessions, until invalidated
Scout usageKill-and-resume, time travelPlanner personalization, covered topics

Long-term memory: profiles, facts, and the store

The store is the cross-thread half: a namespaced key-value memory that nodes and application code read and write deliberately, organized as (namespace, key) → document. The namespace tuple — ours is ("users", user_id) — is the isolation boundary: one user’s memories are unreachable from another’s namespace by construction.

What earns a place in Scout’s store? Two keys per user:

  • preferences — semantic memory: report style, depth, language. Written rarely, read by the Planner on every run.
  • covered_topics — light episodic memory: {topic, date} entries, one per completed run. The Planner reads it to avoid re-planning covered ground — “we researched X on June 10; extend it, don’t repeat it.” That’s objective 5.5 in working clothes: behavior adapted from recorded experience.

When to write is a design decision. Writing in the hot path — inside a node, during the run — guarantees the memory lands, but adds latency and a failure mode to every run; writing in the background keeps runs lean but can lose the write if you crash first. Scout writes covered_topics after the graph finishes: a lost topic costs one duplicated search someday; a fragile hot path costs every run.

One mention for completeness: vector memory — storing memories as embeddings (numeric vectors that encode meaning, so similar texts land near each other) so the agent retrieves them by semantic similarity instead of by exact key. It’s the right tool once memories number in the thousands and “which ones are relevant?” becomes a search problem. We build exactly that machinery in Module 6 — for documents first; pointing it at memories is the same trick.

Deciding what to remember

The lab gives you mechanisms; production gives you choices. A workable memory policy answers three questions per item. What: store conclusions and stable facts, not raw transcripts — checkpoints already keep the raw material, and stale detail misleads the Planner. Where: thread-scoped context goes to the checkpointer automatically; only what must cross sessions earns a store write. How long: every memory needs an expiry or an invalidation trigger, because a wrong remembered “fact” is worse than no memory at all. And the moment memories describe people, minimization applies — store the least you need, purge on request; the compliance side arrives with Module 9.

Hands-on lab: build it

Objective: give Scout session persistence (a SQLite checkpointer) and a long-term store of preferences and covered topics, read by the Planner. The full code lives in module-05/ of the labs repo.

Observable result: kill a run mid-flight and resume it for free; uv run python -m scout.memory --show (from module-05/) prints what Scout knows about you; a second research run’s plan acknowledges the topic the first one covered.

One new dependency — the only one this module (from the repo root):

uv add "langgraph-checkpoint-sqlite~=3.1"

Step 1 — Plug in the checkpointer

The new scout/memory.py is the whole memory layer. The checkpointer factory builds a SqliteSaver on a plain sqlite3 connection — not the from_conn_string context manager, which would close the saver when the with block exits; the CLI needs it alive for the entire run:

# module-05/scout/memory.py (excerpt)
from langgraph.checkpoint.serde.jsonplus import JsonPlusSerializer
from langgraph.checkpoint.sqlite import SqliteSaver

MODULE_DIR = Path(__file__).resolve().parents[1]
DB_PATH = MODULE_DIR / "scout_memory.db"  # gitignored: memories are personal

def get_checkpointer(db_path: Path | None = None) -> SqliteSaver:
    connection = sqlite3.connect(db_path or DB_PATH, check_same_thread=False)
    serde = JsonPlusSerializer(
        allowed_msgpack_modules=[("scout.planner", "ResearchPlan")]
    )
    return SqliteSaver(connection, serde=serde)

The serializer’s msgpack allowlist — msgpack being the compact binary format checkpoints are serialized in — names ResearchPlan explicitly: ScoutState.plan is a Pydantic model, and langgraph-checkpoint (4.x in our lockfile) warns on — and, with LANGGRAPH_STRICT_MSGPACK=true, blocks — deserializing custom types it was not told to trust.

graph.py changes in exactly two places — the signature and the compile call. Topology untouched:

from langgraph.checkpoint.base import BaseCheckpointSaver

def build_graph(checkpointer: BaseCheckpointSaver | None = None):
    builder = StateGraph(ScoutState)
    # ... nodes and edges exactly as in module 04 ...
    return builder.compile(checkpointer=checkpointer)

And run.py threads every invocation through a thread id:

graph = build_graph(checkpointer=memory.get_checkpointer())
run_config = {"configurable": {"thread_id": thread_id, "user_id": args.user}}
# ... the resume check (step 2) decides payload ...
for mode, chunk in graph.stream(payload, run_config, stream_mode=["updates", "custom"]):

That’s the entire integration: from here, LangGraph checkpoints every super-step without another line from you.

Step 2 — Kill it, resume it

Start a run with an explicit thread, and kill it once the executor starts searching:

cd module-05
uv run python -m scout.run "What is the Nemotron Coalition that NVIDIA announced at GTC 2026?" --thread nemotron-gtc26
[thread] nemotron-gtc26  [user] default

=== Plan v1 (draft) ===
Objective: Determine the purpose, composition, and announced initiatives of the Nemotron Coalition unveiled by NVIDIA at GTC 2026.
  1. Locate the official announcement of the Nemotron Coalition from GTC 2026.
     queries: NVIDIA GTC 2026 Nemotron Coalition announcement, ...
  ...

=== Critique ===
Plan critique (address every point in your revision):
- No step explicitly addresses the broader context or significance of the Coalition ...
- The plan omits any mechanism for evaluating the credibility of the announcement ...

=== Plan v2 (revised) ===
Objective: Document the purpose, composition, announced initiatives, and strategic significance of the Nemotron Coalition unveiled by NVIDIA at GTC 2026, confirming its official announcement and credibility.
  ...
^C
KeyboardInterrupt

Three planning calls, paid and checkpointed. Relaunch with the same thread and no question:

uv run python -m scout.run --thread nemotron-gtc26
[thread] nemotron-gtc26  [user] default
[resume] picking up at: agent — nothing already paid is re-paid
[agent] tool_call: web_search({"query": "NVIDIA Nemotron Coalition GTC 2026 press release"})
[tools] 5 results
[agent] tool_call: web_search({"query": "NVIDIA press release Nemotron Coalition GTC 2026 site:nvidia.com"})
[tools] 5 results
[agent] tool_call: web_search({"query": "Nemotron Coalition purpose statement NVIDIA site:nvidianews.nvidia.com"})
[tools] 5 results
**Nature of the coalition**
NVIDIA described the **Nemotron Coalition** as "a global collaboration between
open-model builders and AI developers ..." ... (cited answer)
[memory] recorded in covered_topics for user 'default'

The resume logic in run.py is four lines — and one subtlety worth memorizing:

snapshot = graph.get_state(run_config)
if snapshot.tasks:
    # Interrupted thread: resume from the last checkpoint. The input must
    # be None — passing the question again would APPEND to the saved state.
    payload = None

get_state(...) returns the tasks that were pending when the process died — snapshot.next names them, and checking snapshot.tasks also catches a kill that landed after a node’s result was saved but before its super-step committed (that saved result is replayed, not re-paid). Invoking with None means “continue from the checkpoint”; re-passing the question would append a duplicate to the persisted transcript, not replace it. One precision on the resume banner: “nothing already paid” means nothing checkpointed — the executor call that was mid-flight when you hit Ctrl+C was never saved, and restarts from zero.

Step 3 — Time travel

Every checkpoint of a thread is readable — the CLI below wraps graph.get_state_history(config):

uv run python -m scout.memory --history nemotron-gtc26
1f165325-...-800a  next=END     messages=11
1f165324-...-8009  next=agent   messages=10
1f165324-...-8008  next=tools   messages=9
...
1f165323-...-8003  next=agent   messages=4
1f165323-...-8002  next=planner messages=3
1f165323-...-8001  next=critic  messages=2
1f165322-...-8000  next=planner messages=2
1f165322-...-bfff  next=__start__  messages=0

To replay from any point, pass that checkpoint’s id alongside the thread id — LangGraph forks the thread from there (runnable in a REPL, uv run python from module-05/):

from scout import memory
from scout.graph import build_graph

graph = build_graph(checkpointer=memory.get_checkpointer())
config = {"configurable": {"thread_id": "nemotron-gtc26", "checkpoint_id": "<id>"}}
graph.invoke(None, config)   # replays from AFTER that checkpoint

This is your debugging superpower for the rest of the course: rewind to the checkpoint before the derailment and replay — no re-paying earlier steps, no praying the failure reproduces.

Step 4 — The long-term store

langgraph-checkpoint-sqlite 3.1 ships a ready-made SqliteStore too — but we build our own in ~30 lines, because thirty lines of SQLite teach the interface better than an import. It mirrors LangGraph’s BaseStore contract (put/get against a namespace tuple), so swapping in the shipped store — or a managed one — later touches no caller:

class ScoutStore:
    """Cross-thread key-value memory: (namespace tuple, key) -> JSON dict."""

    def __init__(self, db_path: Path | None = None) -> None:
        self._conn = sqlite3.connect(db_path or DB_PATH, check_same_thread=False)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS store ("
            " namespace TEXT NOT NULL, key TEXT NOT NULL, value TEXT NOT NULL,"
            " PRIMARY KEY (namespace, key))"
        )
        self._conn.commit()

    def put(self, namespace: tuple[str, ...], key: str, value: dict) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO store (namespace, key, value) VALUES (?, ?, ?)",
            ("/".join(namespace), key, json.dumps(value)),
        )
        self._conn.commit()

    def get(self, namespace: tuple[str, ...], key: str) -> dict | None:
        row = self._conn.execute(
            "SELECT value FROM store WHERE namespace = ? AND key = ?",
            ("/".join(namespace), key),
        ).fetchone()
        return json.loads(row[0]) if row else None

Same SQLite file as the checkpointer, different table, different scope. Inspect and edit it from the CLI:

uv run python -m scout.memory --show
uv run python -m scout.memory --set style "deep dive with code samples"

Step 5 — The Planner reads memory; the run writes it

The read side: planner.py opens every planning prompt with what the store knows — preferences, and recent covered topics with an instruction to build on them instead of repeating them:

def _memory_context(user_id: str) -> str:
    store = memory.get_store()
    preferences = memory.read_preferences(store, user_id)
    lines = ["User preferences (tailor the plan to these):"]
    lines += [f"- {field}: {value}" for field, value in preferences.items()]
    topics = memory.read_covered_topics(store, user_id)
    if topics:
        lines.append(
            "Topics already covered in past sessions. Do not re-plan them "
            "from scratch: if the question overlaps one, SAY SO in the "
            "plan's objective and focus the steps on what is new:"
        )
        lines += [f"- {entry['topic']} (covered {entry['date']})" for entry in topics[-5:]]
    return "\n".join(lines)

The user_id comes from config["configurable"] — the same channel as thread_id, and just as deliberately not a state field. To receive it, planner_node gains a second parameter that LangGraph injects into any node that declares it:

from langchain_core.runnables import RunnableConfig

def planner_node(state: "ScoutState", config: RunnableConfig) -> dict:
    user_id = config["configurable"].get("user_id", "default")

The parameter must be named config (typed RunnableConfig) for the injection to happen — and inside the function it shadows the imported config module, which is why planner.py pulls MAX_PLAN_ITERATIONS in as a bare name.

The write side is one line in run.py, after the graph completes — off the hot path, as argued above:

memory.record_covered_topic(memory.get_store(), args.user, final["question"])

Close the loop: run two neighboring questions on two different threads and watch the second plan acknowledge the first run’s ground. Two threads, one memory — the cross-thread scope doing its job.

Step 6 — Context hygiene: trim before every call

The last increment is ten lines in memory.py (KEEP_LAST_MESSAGES = 12), applied in one line in the executor: bound what the model sees without touching what the state keeps:

def trim_transcript(messages: list[dict], keep_last: int = KEEP_LAST_MESSAGES) -> list[dict]:
    """System prompt + question + the last keep_last messages. Never start
    the kept tail on a tool observation — its parent assistant turn must
    stay in view, or the API rejects the transcript."""
    if len(messages) <= keep_last + 2:
        return messages
    head, tail = messages[:2], messages[-keep_last:]
    while tail and tail[0]["role"] == "tool":
        tail = tail[1:]
    return head + tail

The guard at the end matters: a kept tail that opens on a tool observation with no parent tool_calls turn is a protocol violation the API rejects. Trimming is easy; trimming without breaking the tool-call pairing is the actual skill.

One trade-off to know about: the protected head is [system, question] only, and the plan hand-off message (“Follow this research plan…”) sits at index 3. With the course budget of at most 3 searches the plan stays in view — but that budget is a prompt instruction, not a hard limit. A run that ignores it and spends the real cap of MAX_ITERATIONS = 6 turns (some carrying several parallel tool calls) can already push the hand-off out of the window. Widen KEEP_LAST_MESSAGES (or protect the hand-off message) so the plan you paid three deliberation calls for doesn’t silently fall out.

Run the suite — and the cumulative rule, as always (from the repo root):

uv run pytest module-05/tests/
uv run pytest module-01/tests/ module-02/tests/ module-03/tests/ module-04/tests/ module-05/tests/

Try it yourself (no solution provided):

  1. Preferred sources. Add a preferred_sources preference (say, "arxiv.org, official docs") and make the Planner fold it into each step’s search_queries — one edit in _memory_context(), one in the planner prompt.
  2. Rolling summary. When the transcript exceeds N turns, compress the middle with one summarization call (“compress into a paragraph, keep all URLs”) instead of dropping it. Compare token spend and answer quality against plain trimming — fidelity vs. cost, measured.

Exam corner

What the exam tests here. Per the official study guide, Domain 5 (10%) expects you to: implement memory mechanisms for short- and long-term context retention (5.1 — this module’s taxonomy and its three storage layers); manage stateful orchestration to coordinate complex tasks and knowledge retention (5.4 — threads, crash recovery, time travel); and adapt reasoning strategies based on prior experiences (5.5 — episodic memory feeding the Planner; the reflection angle was Module 4’s). Note also objective 1.4 in Domain 1 — “manage short-term and long-term memory for context retention” — nearly word-for-word the same skill: memory questions can pay you twice across two domains.

Quiz — answers after question 5.

  1. A support agent persists conversations with a checkpointer. Users complain that preferences they state on Monday are gone when they open a new conversation on Tuesday. What’s missing?

    • A) A larger context window, so the preferences stay in the prompt
    • B) A cross-thread store keyed by user — checkpointers are thread-scoped, and a new conversation is a new thread
    • C) More frequent checkpoints, so the preference is captured sooner
    • D) A higher MAX_ITERATIONS, so the conversation can continue longer
  2. A document-processing agent runs 40-minute multi-step jobs. After a crash at step 7 of 9, the business requirement is “do not redo the first six steps.” Which mechanism delivers that?

    • A) Retry logic with exponential backoff around each API call
    • B) A longer system prompt instructing the model to be more careful
    • C) A checkpointer persisting graph state every super-step, with the job resumed on its existing thread
    • D) Lowering temperature so the run fails less often
  3. An agent should adapt its approach based on what happened in previous runs — for example, avoiding a data source that failed twice last week. Which memory type stores that signal?

    • A) Short-term memory — keep all past runs in the context window
    • B) Semantic memory — store the fact “source X is unreliable” as a preference
    • C) Episodic memory — a record of past runs and their outcomes, consulted before acting
    • D) Procedural memory — retrain the prompts after every run
  4. A long-running assistant conversation is degrading: answers are slower, costs climb every turn, and the model misses instructions given early on. Best remediation?

    • A) Increase max_tokens so the model has room for every message
    • B) Summarize older history into a rolling summary and keep recent turns verbatim
    • C) Clear the entire history every turn for a clean slate
    • D) Switch to a model with a bigger context window and keep everything
  5. A run produced a wrong report, and you suspect step 4 of the plan went sideways. You want to re-execute from just before step 4 — without paying for steps 1–3 again and without losing the original run. What do you use?

    • A) Re-run the whole job with the same seed and watch step 4 closely
    • B) The thread’s state history: fork from the checkpoint preceding step 4 and replay forward
    • C) Grep the application logs and reconstruct the state by hand
    • D) Delete the thread and start over with a more detailed plan

Answers. 1 — B. “New conversation” means new thread, and a checkpointer’s memory ends at the thread boundary. A keeps the preference only within one thread’s transcript anyway; C misunderstands the scope problem — frequency doesn’t cross threads; D is unrelated. 2 — C. “Don’t redo completed steps after a crash” is the checkpointer’s defining feature — persisted state, resumed on the same thread. A retries a call, not a job: it can’t recover work after the process dies. B and D reduce nothing about crash loss. 3 — C. “What happened in previous runs” is the definition of episodic memory. B has a kernel of truth — repeated episodes may later be distilled into a semantic fact — but the signal itself (“failed twice last week”) is a record of experiences. A doesn’t survive sessions; D confuses prompts with run history. 4 — B. Slower + costlier + “lost in the middle” is unmanaged transcript growth; rolling summarization keeps the gist and caps the size. A spends more on output, fixing nothing about the bloated input; C destroys the context the conversation needs; D pays more to delay the same degradation. 5 — B. This is time travel: get_state_history to find the checkpoint, fork from it, replay forward — the original thread stays intact. A re-pays everything and LLM runs aren’t reproducible by seed alone; C reconstructs state the checkpointer already has; D throws away the evidence.

Traps to avoid:

  • “The LLM remembers the conversation.” It doesn’t. The model is stateless; every appearance of memory is engineering around it — state, checkpointer, store. Questions that personify model memory test whether you know where memory actually lives.
  • Checkpointer ≡ long-term memory. The trap of this module. Thread-scoped vs. cross-thread is the distinction; if the scenario crosses a session or user boundary, a checkpointer alone is the wrong answer.
  • “More context is always better.” Context is a budget: cost and latency scale with it, and quality can drop as relevant material drowns in the middle. A bigger window is not a memory strategy.

Key takeaways

  • The LLM is stateless: every memory your agent has is a component you built — graph state, checkpointer, or store.
  • Short-term memory is the state and transcript of one run; long-term memory is whatever you deliberately persist across sessions — semantic (facts), episodic (experiences), procedural (prompts and code).
  • A checkpointer persists full graph state per thread: crash recovery, multi-turn sessions, time travel. Identity (thread_id, user_id) travels in config["configurable"], never in the state schema.
  • A store is the cross-thread half: namespaced key-value memory the Planner reads to personalize plans. Checkpointer vs. store — thread vs. cross-thread — is the distinction Domain 5 leans on hardest.
  • Context engineering treats the context window as a budget: trim or summarize by design, because “keep everything” degrades quality while costing the most.
  • What to remember, where, and for how long is a design decision — store conclusions, not transcripts; give every memory an expiry; minimize anything personal.

Keep going

Want the full NCP-AAI question bank (150+ exam-style questions) and the next module in your inbox? Subscribe here — it’s free, like everything in this series.

Scout remembers you now — next, we give it a library: ingesting sources into a vector store and answering with citations.

Lab code · Course index · ← Module 4 · Module 6 →

References

  • LangGraph: Persistence — the checkpointer, threads, get_state_history, and time travel, in the current 1.x docs.
  • LangGraph: Memory — short-term vs. long-term memory, the store and its namespaces, and the semantic/episodic/procedural framing used in this module.
  • What Is AI Agent Memory? — IBM’s overview of agent memory types; an official study-guide reading for Domain 5.
  • langgraph-checkpoint-sqlite — the SQLite checkpointer package pinned in the lab (~=3.1).
  • NCP-AAI certification page — the official blueprint; Cognition, Planning, and Memory is weighted at 10%.