Agent Memory: State, Persistence, and Long-Term Recall (NCP-AAI Module 5)
This is Module 5 of NCP-AAI Mastery, a free 14-module course that takes you from your first agent to NVIDIA-certified. Start at Module 1 or browse the full syllabus.
Four minutes into a research run, Scout is doing everything right: the plan is solid, two of five steps are done, results are piling up. Then the process dies — a 429 that exhausts its retries, a dropped connection, your own Ctrl+C. Everything lived in RAM. The plan, the critique, the tool results: gone. You rerun from zero and pay every token a second time. Multiply by every crash and every flaky network, and “stateless agent” starts reading as “token bonfire with extra steps.”
There’s a second, quieter pain: every session, you tell Scout the same thing — concise bullets, English, cite primary sources — and it forgets you completely by the next run.
By the end of this module, kill -9 costs you nothing — Scout resumes
mid-run, exactly where it died — and it remembers your preferences from one
session to the next.
In this module
- You’ll learn:
- Distinguish short-term, long-term, and episodic memory — and map each to a concrete storage mechanism (graph state, checkpointer, store).
- Implement session persistence with a SQLite checkpointer: resume after a crash, inspect state history, time-travel to a past checkpoint.
- Build a cross-thread long-term store (preferences, covered topics) that the Planner reads to personalize research plans.
- Manage the context window — trimming, summarization, and the discipline of context engineering — and decide what to remember, where, and for how long.
- You’ll build: A Scout that survives crashes (SQLite checkpointer) and remembers user preferences across sessions (long-term store read by the Planner).
- Exam domains covered: D5 — Cognition, Planning, and Memory — 10% of the exam (this module covers the memory half; Module 4 covered planning and reasoning).
- Prerequisites: Modules 1–4 (your LangGraph
graph with the Planner node runs); NVIDIA API key configured, plus your Tavily
key from Module 2 — every research run in the lab calls
web_search.
Where you are
- ✅ Module 1 — What Is Agentic AI? — vocabulary, landscape, first NIM call
- ✅ Module 2 — Build Your First AI Agent — ReAct loop, tool calling, first graph
- ✅ Module 3 — Agent Architecture — patterns, trade-offs, Scout’s design doc
- ✅ Module 4 — Cognition — Planner node, reflection loop, reasoning budgets
- 👉 Module 5 — Agent Memory (you are here)
- ⬜ Modules 6–14 — RAG, multi-agent, evals, guardrails, deployment, the exam
Scout before: a single-graph agent with a Planner — and total amnesia: everything in RAM, a crash erases the run, no run knows any other existed. Scout after: every step of every run is persisted per thread, and a long-term memory of you — preferences, covered topics — survives across sessions and feeds the Planner.
Why your agent forgets everything: the memory stack
Start from the uncomfortable fact: the LLM is stateless. Every API call
starts from zero — the “conversation” is an illusion your own code maintains
by re-sending the full transcript with every call. You’ve done this since
Module 2: messages[] is the memory, and your code is its custodian. So
every kind of agent memory is engineering around the model, never inside
it. The question is never “does my agent remember” — it’s “which component
remembers this, and how long does it live?”
That question has a standard vocabulary, and the exam uses it precisely. Short-term memory is what the agent knows during one run: the graph state plus the message history inside the current context window — rich, exact, gone when the run ends. Long-term memory is whatever survives across sessions: anything deliberately written to durable storage. Within long-term memory, three flavors matter. Semantic memory stores facts — “this user wants concise English reports.” Episodic memory stores experiences — the trace of what the agent did and what happened, like “we researched the EU AI Act on June 10”; it’s the flavor behind objective 5.5, because an agent that adapts its behavior based on prior experiences needs a record of those experiences to consult. Procedural memory stores how to do things — for an LLM agent, the prompts and code themselves, versioned in git rather than in a database.
The “who stores what” map, which the exam tests in scenario form:
| Memory type | Scope | Lifetime | Storage mechanism in Scout | Example |
|---|---|---|---|---|
| Short-term (working) | One run / one thread | The run and its context window | ScoutState: question, messages, plan, plan_iterations | The tool results gathered so far in this research |
| Long-term — semantic | Cross-thread, per user | Until invalidated | Store, key preferences | ”Concise bullets, English, overview depth” |
| Long-term — episodic | Cross-thread, per user | Until purged | Store, key covered_topics | ”Researched the EU AI Act on 2026-06-10” |
| Long-term — procedural | Cross-thread, all users | Versioned with the code | System and planner prompts, node code | ”Always cite sources as full URLs” |
In Scout, this stack is three physical layers: state in RAM during a run, a checkpointer snapshotting that state to SQLite per thread, and a store holding cross-thread memory in the same SQLite file, namespaced per user — read by the Planner before it drafts a plan:
flowchart LR
subgraph st ["Short-term: one run"]
Q["ScoutState<br/>question · messages · plan · plan_iterations"]
end
subgraph ck ["Thread-scoped persistence"]
DB[("SQLite checkpointer<br/>one snapshot per graph step,<br/>keyed by thread_id")]
end
subgraph lt ["Cross-thread: long-term memory"]
KV[("Store<br/>namespace ('users', user_id)<br/>preferences · covered_topics")]
end
Q -->|"saved after every graph step"| DB
DB -->|"resume / time travel"| Q
PL["Planner node"] -->|reads| KV
RUN["end of run"] -->|writes| KV
Scout’s memory stack: state (in-run) → checkpointer (per thread) → store (across threads). The Planner reads the store; the checkpointer works underneath the graph, no reads required. The rest of the module walks down this stack, one layer at a time.
Short-term memory and context hygiene
The context window is a finite, billed resource — and an agent loop is a
machine for filling it. Every iteration appends an assistant turn and tool
observations to messages[], and every iteration re-sends all of it: a
12-turn run doesn’t bill 12 turns, it bills roughly the sum of 1 through 12.
Quality degrades too. Models attend best to the start and end of long
prompts and lose material parked in the middle — the “lost in the middle”
effect. I won’t pin a number on where degradation starts, but I’ve watched
runs get worse as their transcripts got longer, and so will you.
So the transcript needs hygiene. Three standard strategies: trim (keep the system prompt, the question, and the last N messages — brutal and free); summarization — compressing older history into a rolling summary that replaces the messages it covers, one extra LLM call for the gist of everything without the tokens of everything; and selection — include only what’s relevant to the current step, the retrieval mindset Module 6 builds machinery for.
| Strategy | Cost | Fidelity | Latency | When to use |
|---|---|---|---|---|
| Full history | Highest — the whole transcript re-billed every turn | Perfect, until the window overflows | Grows every turn | Short runs; debugging, where you want everything |
| Trim (keep last N) | Lowest | Recent detail exact; older context gone abruptly | None added | Long tool loops where only recent context matters — Scout’s executor |
| Rolling summarization | One extra LLM call per compression | Good gist, lossy on specifics | One call’s worth | Long conversations where early decisions must stay in view |
The name for this discipline is context engineering — treating what enters the model’s context as a deliberate design decision, with a budget, rather than letting history accumulate by default. In an agent, most of the prompt isn’t written by you — it’s accreted by the loop — so deciding what gets in is as much design as the system prompt itself. Scout’s lab applies the cheapest tool, a trim bound before every executor call, and leaves rolling summarization as your exercise.
One boundary to keep sharp, because the exam probes it: the context window is short-term memory’s ceiling, not a long-term mechanism. A million-token window doesn’t remember last week — it makes this run’s working memory bigger, at this run’s prices.
Persistence: checkpointers, threads, and time travel
Everything so far lives and dies with the process. The fix is the checkpointer — a persistence layer that saves a snapshot of the full graph state after every super-step (one round of graph execution: a node — or several running in parallel — finishes and its state update is applied), so a run can be resumed, inspected, or replayed instead of restarted. In LangGraph you attach one at compile time; from then on, every state transition is written through before the next node runs. Our storage is SQLite — one local file, zero infrastructure, the same mental model as the Postgres checkpointer you’d run in production.
Checkpoints are organized by thread — one persisted sequence of graph
runs identified by a thread_id, passed at invocation through
config["configurable"]. The thread is the unit of “same conversation”:
invoke twice with the same thread_id and the second call continues from
the first’s final state; a fresh thread_id is a blank slate. Note what
this implies — identity lives in the config, not the state. ScoutState
gains no fields in this module, none: persistence is infrastructure
underneath the graph, not data inside it.
What a checkpointer buys you:
- Crash recovery. The run that died at step 3 of 5 resumes at step 3 of 5; already-paid LLM calls are read back from disk, not re-billed. This is the heart of objective 5.4 — stateful orchestration: coordinating multi-step work whose state outlives any single process.
- Multi-turn sessions. A follow-up can land on the same thread and see the full prior context — across process restarts, not just within one. (That’s what the machinery supports; Scout’s CLI uses the shared thread for crash recovery only and doesn’t take a second question on a finished thread — a conversational front-end would build on exactly this.)
- Time travel — reading a thread’s checkpoint history and resuming from any past checkpoint, forking the run from that point. It turns “why did the run derail at step 4?” from archaeology into a replay.
The human-in-the-loop interrupts covered in Module 9 — pause the graph, let a human approve the plan, resume — are built directly on this checkpoint machinery.
Here’s the kill-and-resume mechanic you’ll run in the lab:
sequenceDiagram
participant U as You
participant G as Scout's graph
participant C as SQLite checkpointer
U->>G: run.py "question" --thread report-42
G->>C: checkpoint (plan v1)
G->>C: checkpoint (critique)
G->>C: checkpoint (plan v2)
Note over G: Ctrl+C — the process dies
U->>G: run.py --thread report-42
G->>C: load latest checkpoint
C-->>G: plan v2, executor pending
Note over G: resumes at the executor —<br/>3 planning calls NOT re-paid
Now, the distinction this module exists to teach. A checkpointer is
thread-scoped: it persists one thread’s state, addressable only by that
thread_id. Cross-session memory — “remember this user prefers bullet
points, whatever thread they open tomorrow” — needs a different scope: the
store, covered next. Side by side:
| Checkpointer | Store | |
|---|---|---|
| Scope | One thread (thread_id) | Cross-thread (namespace, e.g. ("users", user_id)) |
| What it saves | The full graph state, automatically, every super-step | Key-value documents you choose to write |
| Typical content | Transcript, plan, intermediate results | Preferences, user facts, episodes |
| Lifetime | The session and its replay history | Across sessions, until invalidated |
| Scout usage | Kill-and-resume, time travel | Planner personalization, covered topics |
Long-term memory: profiles, facts, and the store
The store is the cross-thread half: a namespaced key-value memory that
nodes and application code read and write deliberately, organized as
(namespace, key) → document. The namespace tuple — ours is
("users", user_id) — is the isolation boundary: one user’s memories are
unreachable from another’s namespace by construction.
What earns a place in Scout’s store? Two keys per user:
preferences— semantic memory: report style, depth, language. Written rarely, read by the Planner on every run.covered_topics— light episodic memory:{topic, date}entries, one per completed run. The Planner reads it to avoid re-planning covered ground — “we researched X on June 10; extend it, don’t repeat it.” That’s objective 5.5 in working clothes: behavior adapted from recorded experience.
When to write is a design decision. Writing in the hot path — inside a
node, during the run — guarantees the memory lands, but adds latency and a
failure mode to every run; writing in the background keeps runs lean but can
lose the write if you crash first. Scout writes covered_topics after the
graph finishes: a lost topic costs one duplicated search someday; a fragile
hot path costs every run.
One mention for completeness: vector memory — storing memories as embeddings (numeric vectors that encode meaning, so similar texts land near each other) so the agent retrieves them by semantic similarity instead of by exact key. It’s the right tool once memories number in the thousands and “which ones are relevant?” becomes a search problem. We build exactly that machinery in Module 6 — for documents first; pointing it at memories is the same trick.
Deciding what to remember
The lab gives you mechanisms; production gives you choices. A workable memory policy answers three questions per item. What: store conclusions and stable facts, not raw transcripts — checkpoints already keep the raw material, and stale detail misleads the Planner. Where: thread-scoped context goes to the checkpointer automatically; only what must cross sessions earns a store write. How long: every memory needs an expiry or an invalidation trigger, because a wrong remembered “fact” is worse than no memory at all. And the moment memories describe people, minimization applies — store the least you need, purge on request; the compliance side arrives with Module 9.
Hands-on lab: build it
Objective: give Scout session persistence (a SQLite checkpointer) and a
long-term store of preferences and covered topics, read by the Planner. The
full code lives in
module-05/
of the labs repo.
Observable result: kill a run mid-flight and resume it for free;
uv run python -m scout.memory --show (from module-05/) prints what Scout
knows about you;
a second research run’s plan acknowledges the topic the first one covered.
One new dependency — the only one this module (from the repo root):
uv add "langgraph-checkpoint-sqlite~=3.1"
Step 1 — Plug in the checkpointer
The new scout/memory.py is the whole memory layer. The checkpointer
factory builds a SqliteSaver on a plain sqlite3 connection — not the
from_conn_string context manager, which would close the saver when the
with block exits; the CLI needs it alive for the entire run:
# module-05/scout/memory.py (excerpt)
from langgraph.checkpoint.serde.jsonplus import JsonPlusSerializer
from langgraph.checkpoint.sqlite import SqliteSaver
MODULE_DIR = Path(__file__).resolve().parents[1]
DB_PATH = MODULE_DIR / "scout_memory.db" # gitignored: memories are personal
def get_checkpointer(db_path: Path | None = None) -> SqliteSaver:
connection = sqlite3.connect(db_path or DB_PATH, check_same_thread=False)
serde = JsonPlusSerializer(
allowed_msgpack_modules=[("scout.planner", "ResearchPlan")]
)
return SqliteSaver(connection, serde=serde)
The serializer’s msgpack allowlist — msgpack being the compact binary format
checkpoints are serialized in — names ResearchPlan explicitly: ScoutState.plan
is a Pydantic model, and langgraph-checkpoint (4.x in our lockfile) warns
on — and, with LANGGRAPH_STRICT_MSGPACK=true, blocks — deserializing
custom types it was not told to trust.
graph.py changes in exactly two places — the signature and the compile
call. Topology untouched:
from langgraph.checkpoint.base import BaseCheckpointSaver
def build_graph(checkpointer: BaseCheckpointSaver | None = None):
builder = StateGraph(ScoutState)
# ... nodes and edges exactly as in module 04 ...
return builder.compile(checkpointer=checkpointer)
And run.py threads every invocation through a thread id:
graph = build_graph(checkpointer=memory.get_checkpointer())
run_config = {"configurable": {"thread_id": thread_id, "user_id": args.user}}
# ... the resume check (step 2) decides payload ...
for mode, chunk in graph.stream(payload, run_config, stream_mode=["updates", "custom"]):
That’s the entire integration: from here, LangGraph checkpoints every super-step without another line from you.
Step 2 — Kill it, resume it
Start a run with an explicit thread, and kill it once the executor starts searching:
cd module-05
uv run python -m scout.run "What is the Nemotron Coalition that NVIDIA announced at GTC 2026?" --thread nemotron-gtc26
[thread] nemotron-gtc26 [user] default
=== Plan v1 (draft) ===
Objective: Determine the purpose, composition, and announced initiatives of the Nemotron Coalition unveiled by NVIDIA at GTC 2026.
1. Locate the official announcement of the Nemotron Coalition from GTC 2026.
queries: NVIDIA GTC 2026 Nemotron Coalition announcement, ...
...
=== Critique ===
Plan critique (address every point in your revision):
- No step explicitly addresses the broader context or significance of the Coalition ...
- The plan omits any mechanism for evaluating the credibility of the announcement ...
=== Plan v2 (revised) ===
Objective: Document the purpose, composition, announced initiatives, and strategic significance of the Nemotron Coalition unveiled by NVIDIA at GTC 2026, confirming its official announcement and credibility.
...
^C
KeyboardInterrupt
Three planning calls, paid and checkpointed. Relaunch with the same thread and no question:
uv run python -m scout.run --thread nemotron-gtc26
[thread] nemotron-gtc26 [user] default
[resume] picking up at: agent — nothing already paid is re-paid
[agent] tool_call: web_search({"query": "NVIDIA Nemotron Coalition GTC 2026 press release"})
[tools] 5 results
[agent] tool_call: web_search({"query": "NVIDIA press release Nemotron Coalition GTC 2026 site:nvidia.com"})
[tools] 5 results
[agent] tool_call: web_search({"query": "Nemotron Coalition purpose statement NVIDIA site:nvidianews.nvidia.com"})
[tools] 5 results
**Nature of the coalition**
NVIDIA described the **Nemotron Coalition** as "a global collaboration between
open-model builders and AI developers ..." ... (cited answer)
[memory] recorded in covered_topics for user 'default'
The resume logic in run.py is four lines — and one subtlety worth
memorizing:
snapshot = graph.get_state(run_config)
if snapshot.tasks:
# Interrupted thread: resume from the last checkpoint. The input must
# be None — passing the question again would APPEND to the saved state.
payload = None
get_state(...) returns the tasks that were pending when the process died —
snapshot.next names them, and checking snapshot.tasks also catches a kill
that landed after a node’s result was saved but before its super-step
committed (that saved result is replayed, not re-paid). Invoking with None
means “continue from the checkpoint”; re-passing the question would append
a duplicate to the persisted transcript, not replace it. One precision on
the resume banner: “nothing already paid” means nothing checkpointed — the
executor call that was mid-flight when you hit Ctrl+C was never saved, and
restarts from zero.
Step 3 — Time travel
Every checkpoint of a thread is readable — the CLI below wraps
graph.get_state_history(config):
uv run python -m scout.memory --history nemotron-gtc26
1f165325-...-800a next=END messages=11
1f165324-...-8009 next=agent messages=10
1f165324-...-8008 next=tools messages=9
...
1f165323-...-8003 next=agent messages=4
1f165323-...-8002 next=planner messages=3
1f165323-...-8001 next=critic messages=2
1f165322-...-8000 next=planner messages=2
1f165322-...-bfff next=__start__ messages=0
To replay from any point, pass that checkpoint’s id alongside the thread id
— LangGraph forks the thread from there (runnable in a REPL, uv run python
from module-05/):
from scout import memory
from scout.graph import build_graph
graph = build_graph(checkpointer=memory.get_checkpointer())
config = {"configurable": {"thread_id": "nemotron-gtc26", "checkpoint_id": "<id>"}}
graph.invoke(None, config) # replays from AFTER that checkpoint
This is your debugging superpower for the rest of the course: rewind to the checkpoint before the derailment and replay — no re-paying earlier steps, no praying the failure reproduces.
Step 4 — The long-term store
langgraph-checkpoint-sqlite 3.1 ships a ready-made SqliteStore too — but
we build our own in ~30 lines, because thirty lines of SQLite teach the
interface better than an import. It mirrors LangGraph’s BaseStore contract
(put/get against a namespace tuple), so swapping in the shipped store —
or a managed one — later touches no caller:
class ScoutStore:
"""Cross-thread key-value memory: (namespace tuple, key) -> JSON dict."""
def __init__(self, db_path: Path | None = None) -> None:
self._conn = sqlite3.connect(db_path or DB_PATH, check_same_thread=False)
self._conn.execute(
"CREATE TABLE IF NOT EXISTS store ("
" namespace TEXT NOT NULL, key TEXT NOT NULL, value TEXT NOT NULL,"
" PRIMARY KEY (namespace, key))"
)
self._conn.commit()
def put(self, namespace: tuple[str, ...], key: str, value: dict) -> None:
self._conn.execute(
"INSERT OR REPLACE INTO store (namespace, key, value) VALUES (?, ?, ?)",
("/".join(namespace), key, json.dumps(value)),
)
self._conn.commit()
def get(self, namespace: tuple[str, ...], key: str) -> dict | None:
row = self._conn.execute(
"SELECT value FROM store WHERE namespace = ? AND key = ?",
("/".join(namespace), key),
).fetchone()
return json.loads(row[0]) if row else None
Same SQLite file as the checkpointer, different table, different scope. Inspect and edit it from the CLI:
uv run python -m scout.memory --show
uv run python -m scout.memory --set style "deep dive with code samples"
Step 5 — The Planner reads memory; the run writes it
The read side: planner.py opens every planning prompt with what the store
knows — preferences, and recent covered topics with an instruction to build
on them instead of repeating them:
def _memory_context(user_id: str) -> str:
store = memory.get_store()
preferences = memory.read_preferences(store, user_id)
lines = ["User preferences (tailor the plan to these):"]
lines += [f"- {field}: {value}" for field, value in preferences.items()]
topics = memory.read_covered_topics(store, user_id)
if topics:
lines.append(
"Topics already covered in past sessions. Do not re-plan them "
"from scratch: if the question overlaps one, SAY SO in the "
"plan's objective and focus the steps on what is new:"
)
lines += [f"- {entry['topic']} (covered {entry['date']})" for entry in topics[-5:]]
return "\n".join(lines)
The user_id comes from config["configurable"] — the same channel as
thread_id, and just as deliberately not a state field. To receive it,
planner_node gains a second parameter that LangGraph injects into any node
that declares it:
from langchain_core.runnables import RunnableConfig
def planner_node(state: "ScoutState", config: RunnableConfig) -> dict:
user_id = config["configurable"].get("user_id", "default")
The parameter must be named config (typed RunnableConfig) for the
injection to happen — and inside the function it shadows the imported
config module, which is why planner.py pulls MAX_PLAN_ITERATIONS in as
a bare name.
The write side is one line in run.py, after the graph completes — off the
hot path, as argued above:
memory.record_covered_topic(memory.get_store(), args.user, final["question"])
Close the loop: run two neighboring questions on two different threads and watch the second plan acknowledge the first run’s ground. Two threads, one memory — the cross-thread scope doing its job.
Step 6 — Context hygiene: trim before every call
The last increment is ten lines in memory.py (KEEP_LAST_MESSAGES = 12),
applied in one line in the executor: bound what the model sees without
touching what the state keeps:
def trim_transcript(messages: list[dict], keep_last: int = KEEP_LAST_MESSAGES) -> list[dict]:
"""System prompt + question + the last keep_last messages. Never start
the kept tail on a tool observation — its parent assistant turn must
stay in view, or the API rejects the transcript."""
if len(messages) <= keep_last + 2:
return messages
head, tail = messages[:2], messages[-keep_last:]
while tail and tail[0]["role"] == "tool":
tail = tail[1:]
return head + tail
The guard at the end matters: a kept tail that opens on a tool
observation with no parent tool_calls turn is a protocol violation the API
rejects. Trimming is easy; trimming without breaking the tool-call pairing
is the actual skill.
One trade-off to know about: the protected head is [system, question]
only, and the plan hand-off message (“Follow this research plan…”) sits at
index 3. With the course budget of at most 3 searches the plan stays in
view — but that budget is a prompt instruction, not a hard limit. A run that
ignores it and spends the real cap of MAX_ITERATIONS = 6 turns (some
carrying several parallel tool calls) can already push the hand-off out of
the window. Widen KEEP_LAST_MESSAGES (or protect the hand-off message) so
the plan you paid three deliberation calls for doesn’t silently fall out.
Run the suite — and the cumulative rule, as always (from the repo root):
uv run pytest module-05/tests/
uv run pytest module-01/tests/ module-02/tests/ module-03/tests/ module-04/tests/ module-05/tests/
Try it yourself (no solution provided):
- Preferred sources. Add a
preferred_sourcespreference (say,"arxiv.org, official docs") and make the Planner fold it into each step’ssearch_queries— one edit in_memory_context(), one in the planner prompt. - Rolling summary. When the transcript exceeds N turns, compress the middle with one summarization call (“compress into a paragraph, keep all URLs”) instead of dropping it. Compare token spend and answer quality against plain trimming — fidelity vs. cost, measured.
Exam corner
What the exam tests here. Per the official study guide, Domain 5 (10%) expects you to: implement memory mechanisms for short- and long-term context retention (5.1 — this module’s taxonomy and its three storage layers); manage stateful orchestration to coordinate complex tasks and knowledge retention (5.4 — threads, crash recovery, time travel); and adapt reasoning strategies based on prior experiences (5.5 — episodic memory feeding the Planner; the reflection angle was Module 4’s). Note also objective 1.4 in Domain 1 — “manage short-term and long-term memory for context retention” — nearly word-for-word the same skill: memory questions can pay you twice across two domains.
Quiz — answers after question 5.
-
A support agent persists conversations with a checkpointer. Users complain that preferences they state on Monday are gone when they open a new conversation on Tuesday. What’s missing?
- A) A larger context window, so the preferences stay in the prompt
- B) A cross-thread store keyed by user — checkpointers are thread-scoped, and a new conversation is a new thread
- C) More frequent checkpoints, so the preference is captured sooner
- D) A higher
MAX_ITERATIONS, so the conversation can continue longer
-
A document-processing agent runs 40-minute multi-step jobs. After a crash at step 7 of 9, the business requirement is “do not redo the first six steps.” Which mechanism delivers that?
- A) Retry logic with exponential backoff around each API call
- B) A longer system prompt instructing the model to be more careful
- C) A checkpointer persisting graph state every super-step, with the job resumed on its existing thread
- D) Lowering temperature so the run fails less often
-
An agent should adapt its approach based on what happened in previous runs — for example, avoiding a data source that failed twice last week. Which memory type stores that signal?
- A) Short-term memory — keep all past runs in the context window
- B) Semantic memory — store the fact “source X is unreliable” as a preference
- C) Episodic memory — a record of past runs and their outcomes, consulted before acting
- D) Procedural memory — retrain the prompts after every run
-
A long-running assistant conversation is degrading: answers are slower, costs climb every turn, and the model misses instructions given early on. Best remediation?
- A) Increase
max_tokensso the model has room for every message - B) Summarize older history into a rolling summary and keep recent turns verbatim
- C) Clear the entire history every turn for a clean slate
- D) Switch to a model with a bigger context window and keep everything
- A) Increase
-
A run produced a wrong report, and you suspect step 4 of the plan went sideways. You want to re-execute from just before step 4 — without paying for steps 1–3 again and without losing the original run. What do you use?
- A) Re-run the whole job with the same seed and watch step 4 closely
- B) The thread’s state history: fork from the checkpoint preceding step 4 and replay forward
- C) Grep the application logs and reconstruct the state by hand
- D) Delete the thread and start over with a more detailed plan
Answers.
1 — B. “New conversation” means new thread, and a checkpointer’s memory
ends at the thread boundary. A keeps the preference only within one thread’s
transcript anyway; C misunderstands the scope problem — frequency doesn’t
cross threads; D is unrelated.
2 — C. “Don’t redo completed steps after a crash” is the checkpointer’s
defining feature — persisted state, resumed on the same thread. A retries a
call, not a job: it can’t recover work after the process dies. B and D
reduce nothing about crash loss.
3 — C. “What happened in previous runs” is the definition of episodic
memory. B has a kernel of truth — repeated episodes may later be distilled
into a semantic fact — but the signal itself (“failed twice last week”) is a
record of experiences. A doesn’t survive sessions; D confuses prompts with
run history.
4 — B. Slower + costlier + “lost in the middle” is unmanaged transcript
growth; rolling summarization keeps the gist and caps the size. A spends
more on output, fixing nothing about the bloated input; C destroys the
context the conversation needs; D pays more to delay the same degradation.
5 — B. This is time travel: get_state_history to find the checkpoint,
fork from it, replay forward — the original thread stays intact. A re-pays
everything and LLM runs aren’t reproducible by seed alone; C reconstructs
state the checkpointer already has; D throws away the evidence.
Traps to avoid:
- “The LLM remembers the conversation.” It doesn’t. The model is stateless; every appearance of memory is engineering around it — state, checkpointer, store. Questions that personify model memory test whether you know where memory actually lives.
- Checkpointer ≡ long-term memory. The trap of this module. Thread-scoped vs. cross-thread is the distinction; if the scenario crosses a session or user boundary, a checkpointer alone is the wrong answer.
- “More context is always better.” Context is a budget: cost and latency scale with it, and quality can drop as relevant material drowns in the middle. A bigger window is not a memory strategy.
Key takeaways
- The LLM is stateless: every memory your agent has is a component you built — graph state, checkpointer, or store.
- Short-term memory is the state and transcript of one run; long-term memory is whatever you deliberately persist across sessions — semantic (facts), episodic (experiences), procedural (prompts and code).
- A checkpointer persists full graph state per thread: crash recovery,
multi-turn sessions, time travel. Identity (
thread_id,user_id) travels inconfig["configurable"], never in the state schema. - A store is the cross-thread half: namespaced key-value memory the Planner reads to personalize plans. Checkpointer vs. store — thread vs. cross-thread — is the distinction Domain 5 leans on hardest.
- Context engineering treats the context window as a budget: trim or summarize by design, because “keep everything” degrades quality while costing the most.
- What to remember, where, and for how long is a design decision — store conclusions, not transcripts; give every memory an expiry; minimize anything personal.
Keep going
Want the full NCP-AAI question bank (150+ exam-style questions) and the next module in your inbox? Subscribe here — it’s free, like everything in this series.
Scout remembers you now — next, we give it a library: ingesting sources into a vector store and answering with citations.
Lab code · Course index · ← Module 4 · Module 6 →
References
- LangGraph: Persistence —
the checkpointer, threads,
get_state_history, and time travel, in the current 1.x docs. - LangGraph: Memory — short-term vs. long-term memory, the store and its namespaces, and the semantic/episodic/procedural framing used in this module.
- What Is AI Agent Memory? — IBM’s overview of agent memory types; an official study-guide reading for Domain 5.
- langgraph-checkpoint-sqlite —
the SQLite checkpointer package pinned in the lab (
~=3.1). - NCP-AAI certification page — the official blueprint; Cognition, Planning, and Memory is weighted at 10%.