A diagnostic framework for memory systems — RAG, knowledge graphs, personal knowledge bases, conversational memory — that tests what the system knows about its own structure, not just whether it can retrieve memories.
"Multipass!" — Leeloo, The Fifth Element. The name is a nod to a joke from an earlier MemPalace issue thread (visual reference above), and also to what the framework actually does: multiple passes over every memory system under test, across multiple corpus shapes and multiple retrieval conditions (A / B / C), so brittle default behaviours that hide on any single pass become visible when the readings are compared side by side.
What this is · Status · Install · Corpora · Next steps · Adapters
See the nine-category menu for what each test measures and which to run for your setup.
Standard memory benchmarks (LongMemEval, LoCoMo, MINE, GraphRAG-Bench, BEAM) ask "can you find a memory?" That's necessary but not sufficient. A filing cabinet can find a memory. The question is what the structure of a memory system gives you beyond retrieval — and whether your specific build, under your specific harness, with your specific model, actually uses it.
SME defines a nine-category test menu. Categories 1–8 measure graph structure and offline retrieval. Category 9 (The Handshake) measures harness integration — whether the model actually reaches the memory when it runs in production. Cats 1–8 are where every published benchmark stops; Cat 9 is the gap every deployment engineer runs into.
Each category has a Cat N identifier (for code) and a descriptive
"palace-nod" name (The Lookup, The Stairway, The Blueprint, The
Handshake, and so on) so readable output doesn't require a lookup
table.
Beta-level instrumentation, actively evolving (v0.2.0). Ten adapters
(flat, mempalace, mempalace-daemon, familiar, rlm, longhand,
ladybugdb, full-context, karpathy-compiled, plus the SMEAdapter
ABC template), nine CLI commands (retrieve, analyze, cat8, cat2c,
cat4, cat5, check, cat9, compile-wiki), two evaluation corpora
(jp-realm-v0.1 and
good-dog-corpus), a
LongMemEval cross-validation harness,
Karpathy-baseline conditions D1/D2 (full-corpus-in-context and
LLM-compiled wiki), B-Cubed scoring for alias resolution (Cat 4a), and a
specification for the remaining categories. Diagnostic posture, not
benchmark — the defensible findings are before/after deltas under
identical conditions and within-system A/B/C/D ablations. See the
spec and the onboarding guide
for the full honest-limitations discussion.
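The B-Cubed scoring mentioned above for alias resolution (Cat 4a) is a standard clustering metric. As a rough illustration of the general technique, and not SME's actual scorer (see the spec for that), the sketch below computes B-Cubed precision and recall from two item-to-cluster mappings. The function name `b_cubed` and the dict-based input format are assumptions made for this example.

```python
from collections import defaultdict


def b_cubed(predicted: dict, gold: dict) -> tuple:
    """Return (precision, recall) for two item -> cluster-id mappings.

    Illustrative only: both mappings must label exactly the same items.
    """
    if not predicted or predicted.keys() != gold.keys():
        raise ValueError("predicted and gold must label the same non-empty item set")

    # Invert each mapping into cluster-id -> set of member items.
    pred_clusters = defaultdict(set)
    gold_clusters = defaultdict(set)
    for item, cluster_id in predicted.items():
        pred_clusters[cluster_id].add(item)
    for item, cluster_id in gold.items():
        gold_clusters[cluster_id].add(item)

    precision_sum = 0.0
    recall_sum = 0.0
    for item in predicted:
        pred_members = pred_clusters[predicted[item]]
        gold_members = gold_clusters[gold[item]]
        overlap = len(pred_members & gold_members)
        precision_sum += overlap / len(pred_members)  # purity of the predicted cluster
        recall_sum += overlap / len(gold_members)     # coverage of the gold cluster

    n = len(predicted)
    return precision_sum / n, recall_sum / n
```

A system that merges one alias into the wrong entity loses precision on the polluted cluster and recall on the entity that lost a member, so both numbers move when alias resolution goes wrong.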
```bash
pip install -e .
# Optional extras:
pip install -e ".[topology]"   # Ripser + python-louvain (for gap detection)
pip install -e ".[ladybugdb]"  # LadybugDB adapter
pip install -e ".[dev]"        # pytest, ruff
```

Installs as the Python package sme-eval with CLI entrypoint
sme-eval. The GitHub repo is multipass-structural-memory-eval;
the acronym SME (Structural Memory Evaluation) is used throughout
the documentation and code.
Quick start: run your first diagnostic in 5 minutes with the onboarding guide. Need the spec? Start at docs/sme_spec_v8.md.
SME ships two evaluation corpora and supports loading a third (LongMemEval) for cross-validation:
- `jp-realm-v0.1`: 30 questions against a personal knowledge palace (tech-domain, biographical). The original development corpus. Baseline readings for the `familiar`, `mempalace-daemon`, and `rlm` adapters live in `baselines/`.
- `good-dog-corpus`: 24 notes across 6 domains (veterinary research, municipal policy, breed standards, nutrition safety, behavioral research, community journalism). Non-technical, real-world, ontology-first. Designed to stress-test alias resolution, contradiction detection, and temporal supersession. Ships with a full ontology design narrative explaining every schema decision. See the good-dog-corpus README.
- LongMemEval: loader for the 500-question LongMemEval-cleaned dataset (Wu et al., ICLR 2025) with a primary-source-verified category mapping to SME. Used for cross-validation of SME's scoring against the field's most-cited benchmark.
The multi-corpus methodology is load-bearing: a single corpus shape gives misleading conclusions because brittle default behaviours hide on any single retrieval profile. See the onboarding guide for the full argument.
- `docs/ideas.md`: onboarding guide. Start here if you want to run SME against your own memory system. Covers the nine-category menu, how to write an adapter for your backend, how to write a corpus from your own content, how to run the implemented categories, and how to read what comes out the other end. This is also where the methodology framing lives: why A/B/C isolation matters, why multi-corpus testing is load-bearing, and why "the delta is the product, the levels are decoration."
- `docs/sme_spec_v8.md`: full specification. Precise category-by-category definitions, metric formulas, adapter interface contract, topology layer details, and the Cat 9 (The Handshake) harness-integration spec. Reference material; read the onboarding guide first if you want to get a test run going.
- `docs/cross_validation_2026.md`: current work. Cross-validation of SME categories against LongMemEval / MemoryBench, Karpathy-condition D baselines (full-corpus-in-context), and first readings from the live benchmark harness. Active development; this is where near-term SME findings land.
- `docs/industry_standards_integration.md`: integration audit. Survey of where SME rolls its own vs. where battle-tested standards exist (SHACL, PROV-O, OpenLineage, B-Cubed, Ripser). Constitutional principle: SME stays lightweight and locally runnable, with no server hosting required.
- `docs/ingestigation.md`: Cat 4 deep dive. Renames and re-scopes Category 4 with a primary-source-verified survey of existing tools (SHACL, W3C PROV-O, ProVe, Splink, OpenLineage, Great Expectations) and proposed sub-test additions.
SME ships adapters for several memory systems. Each adapter teaches
the framework to speak the wire protocol of a specific system so the
same eval questions can run across multiple backends. Adapters live in
sme/adapters/ and implement the SMEAdapter ABC.
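To make the adapter idea concrete, here is a hypothetical sketch of what a minimal adapter could look like. It is not the real `SMEAdapter` contract (the spec defines that): `SketchAdapter`, `SketchQueryResult`, and `InMemoryAdapter` are invented names, and only the method and field names `query()`, `get_graph_snapshot()`, `context_string`, `retrieved_entities`, and `error` are borrowed from this README.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class SketchQueryResult:
    # Field names echo those mentioned in this README; the real type may differ.
    context_string: str = ""
    retrieved_entities: list[str] = field(default_factory=list)
    error: str | None = None


class SketchAdapter(ABC):
    """Hypothetical stand-in for the SMEAdapter ABC."""

    @abstractmethod
    def query(self, question: str, limit: int = 5) -> SketchQueryResult:
        """Run one eval question against the backend."""

    @abstractmethod
    def get_graph_snapshot(self) -> tuple[list, list]:
        """Return (entities, edges) for the structural categories."""


class InMemoryAdapter(SketchAdapter):
    """Toy backend: word-overlap match over a dict of note-id -> text."""

    def __init__(self, notes: dict[str, str]) -> None:
        self.notes = notes

    def query(self, question: str, limit: int = 5) -> SketchQueryResult:
        wanted = {w.lower() for w in question.split()}
        hits = [
            note_id
            for note_id, text in self.notes.items()
            if wanted & {w.lower() for w in text.split()}
        ][:limit]
        if not hits:
            # Error code chosen for illustration only.
            return SketchQueryResult(error="NO_RESULTS")
        return SketchQueryResult(
            context_string="\n".join(self.notes[h] for h in hits),
            retrieved_entities=hits,
        )

    def get_graph_snapshot(self) -> tuple[list, list]:
        return [], []  # this toy backend has no graph
```

Because every backend answers through the same two calls, the same question file can be scored against each one without backend-specific scoring code.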
mempalace-daemon — by jphein
sme/adapters/mempalace_daemon.py talks to a running palace-daemon
(by jphein) over HTTP. No filesystem access, no ChromaDB import, no
shared-process constraint with the daemon. Use this adapter when
MemPalace is fronted by the daemon (the daemon is the single writer
to the palace); the existing mempalace adapter is still correct for
single-process upstream installs without the daemon.
Wired endpoints:
- `query()` → `GET /search?q=…&kind=…&limit=…` with `X-API-Key`. Default `kind="content"` excludes Stop-hook auto-save checkpoints; pass `--kind all` to disable. Daemon-side `warnings` (e.g. broken HNSW index) are surfaced into `QueryResult.error` as `WARN: …` so Cat 9 scoring can distinguish flagged retrieval from clean retrieval.
- `get_graph_snapshot()` → tries `GET /graph` first (palace-daemon ≥1.6.0); on 404, falls back to walking `mempalace_list_wings`, `mempalace_list_rooms` per wing, and `mempalace_list_tunnels` via `POST /mcp`. The MCP fallback is slower (~30s on a 151K-drawer palace) but works against any palace-daemon version.
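The try-then-fall-back behaviour of `get_graph_snapshot()` can be pictured with the sketch below. It uses injected callables in place of real HTTP calls so it runs anywhere; `fetch_graph`, `walk_via_mcp`, `NotFound`, and the `(entities, edges)` tuple shape are hypothetical names and assumptions, and the adapter's real code may be organised differently.

```python
from typing import Callable, List, Tuple

Snapshot = Tuple[List, List]  # (entities, edges); shape assumed for this sketch


class NotFound(Exception):
    """Stand-in for an HTTP 404 raised by the hypothetical fetch_graph callable."""


def graph_snapshot(
    fetch_graph: Callable[[], Snapshot],
    walk_via_mcp: Callable[[], Snapshot],
) -> Tuple[Snapshot, str]:
    """Try the fast path first; on a 404, fall back to the slower walk.

    Returns the snapshot plus a label saying which path produced it.
    """
    try:
        return fetch_graph(), "graph"
    except NotFound:
        # An older daemon has no graph endpoint, so walk the structure instead.
        return walk_via_mcp(), "mcp-fallback"
```

Only the "endpoint missing" case triggers the fallback; any other exception propagates, so a genuinely broken daemon is not silently papered over by the slow path.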
Auth resolution: explicit --api-url / --api-key flags →
~/.config/palace-daemon/env (PALACE_DAEMON_URL, PALACE_API_KEY)
→ process environment.
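The resolution order above (explicit flag, then env file, then process environment) can be sketched as follows. This is an illustration under assumptions, not the adapter's code: the helper names `parse_env_file` and `resolve` are hypothetical, and the env file is assumed to hold plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse KEY=VALUE lines; the file format is an assumption of this sketch."""
    values: Dict[str, str] = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def resolve(
    flag_value: Optional[str],
    key: str,
    env_file: Path,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first value found: flag, then env file, then environment."""
    if flag_value:
        return flag_value
    from_file = parse_env_file(env_file).get(key)
    if from_file:
        return from_file
    source = os.environ if environ is None else environ
    return source.get(key)
```

For example, `resolve(None, "PALACE_DAEMON_URL", Path.home() / ".config/palace-daemon/env")` would return the value from the env file when no `--api-url` flag was given, and only consult the process environment if the file lacks that key.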
Invocation:
```bash
# With explicit daemon URL
sme-eval retrieve --adapter mempalace-daemon \
  --api-url http://your-daemon:8085 \
  --questions corpus.yaml \
  --kind content \
  --json out.json
# Or, if ~/.config/palace-daemon/env is populated, no flags needed
sme-eval retrieve --adapter mempalace-daemon --questions corpus.yaml
```

The same --api-url / --api-key / --kind flags work on the
cat4, cat5, and check subcommands.
Why this matters: the engram-2 critique ("0.984 R@5 but 17% E2E
QA accuracy") is about the integration-under-production-model slice
that Cat 9 measures. Running SME's retrieve through the daemon
surfaces exactly the kind of gap that critique describes — the
adapter's WARN-soft-error treatment means the framework records
"retrieval ran but the daemon flagged it as degraded" as a first-
class signal, not as a hard failure that hides the issue.
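The WARN-soft-error idea can be sketched in a few lines. This is a hypothetical illustration, not the adapter's implementation: the function names, the `"; "` separator, and the three outcome labels are assumptions; only the `WARN:` prefix comes from the description above.

```python
from typing import List, Optional


def fold_warnings(error: Optional[str], warnings: List[str]) -> Optional[str]:
    """Merge daemon-side warnings into the error field using a WARN: prefix."""
    if error:
        return error  # a hard error takes precedence over soft warnings
    if warnings:
        return "WARN: " + "; ".join(warnings)
    return None


def classify(error: Optional[str]) -> str:
    """Label a query outcome so degraded retrieval stays visible."""
    if error is None:
        return "clean"
    if error.startswith("WARN:"):
        return "degraded"  # retrieval ran, but the backend flagged it
    return "failed"
```

Keeping "degraded" separate from "failed" means a run with a flagged index still produces a scored record, and the flag travels with it instead of being dropped or aborting the run.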
For users running upstream MemPalace without palace-daemon (the
default install pattern), the existing mempalace adapter is
correct — single process, no daemon, direct ChromaDB access is
fine. The daemon adapter is additive, for users who've adopted
palace-daemon's single-writer architecture.
familiar — by jphein
familiar.realm.watch
is a retrieval pipeline that wraps palace-daemon with reranking,
temporal decay, extractive compression, and grounding directives.
jphein built it; sme/adapters/familiar.py
lets SME measure its full end-to-end contribution on top of the raw
daemon. The sibling mempalace-daemon adapter measures palace alone —
running both on the same corpus shows what the pipeline layer adds.
Wired endpoints:
- `query()` → `POST /api/familiar/eval` with body `{query, limit, kind, mock}`. Familiar's eval endpoint already returns SME-shape `{answer, context_string, retrieved_entities, retrieved_edges, error, warnings, available_in_scope}` natively (it was designed against the SME contract), so the adapter is mostly deserialization with the same `WARN:` error-prefix translation as `mempalace-daemon`.
- `get_graph_snapshot()` → `GET /api/familiar/graph`. Familiar proxies palace-daemon's `/graph` with a 5-minute server-side cache; payload mapping reuses `sme/adapters/_graph_mapping.py` shared with `mempalace-daemon`.
- `get_harness_manifest()` → forward-compat for Cat 9. Returns `[ToolCall, MCPResource]` once `sme.harness` ships; `[]` until then.
Determinism: --mock (default) skips LLM inference so Cat 1
substring scoring is reproducible across runs. Use --no-mock to
include the model output in the per-question record (intended for
future Cat 9 work).
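Substring scoring is reproducible because it depends only on the retrieved text, not on model output. The sketch below shows one plausible form of such a score, stated as an assumption: the fraction of expected strings found in `context_string`, compared case-insensitively. The exact Cat 1 formula is defined in the spec and may differ; `substring_recall` is a hypothetical name.

```python
from typing import List


def substring_recall(context_string: str, expected: List[str]) -> float:
    """Fraction of expected strings that appear in the retrieved context.

    Case-insensitive matching is an assumption of this sketch.
    """
    if not expected:
        raise ValueError("expected must contain at least one string")
    haystack = context_string.casefold()
    hits = sum(1 for needle in expected if needle.casefold() in haystack)
    return hits / len(expected)
```

With the same context and the same expected strings, this returns the same number on every run, which is the property `--mock` is protecting.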
Invocation:
```bash
# Default: --mock for Cat 1 determinism
sme-eval retrieve --adapter familiar --api-url https://your-familiar-host --questions corpus.yaml --json familiar.json
# Compare against the same palace via the daemon adapter
sme-eval retrieve --adapter mempalace-daemon --api-url http://your-daemon:8085 --questions corpus.yaml --json daemon.json
# The score delta = what familiar's v0.2 pipeline is worth
```

The --api-url, --mock/--no-mock, and --familiar-timeout flags
work on cat4, cat5, check, and retrieve subcommands.
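Comparing the two runs comes down to per-question differences. The layout of the files written by `--json` is not described here, so the sketch below deliberately works on already-extracted mappings of question id to recall; the function names and that input shape are assumptions for illustration.

```python
from typing import Dict


def recall_deltas(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Per-question candidate-minus-baseline recall, over shared questions only."""
    shared = baseline.keys() & candidate.keys()
    return {qid: candidate[qid] - baseline[qid] for qid in sorted(shared)}


def mean_delta(deltas: Dict[str, float]) -> float:
    """Average delta; 0.0 when the two runs share no questions."""
    if not deltas:
        return 0.0
    return sum(deltas.values()) / len(deltas)
```

Restricting the comparison to questions present in both runs keeps the delta an identical-conditions reading: a question that only one run answered would otherwise shift the mean without saying anything about the pipeline layer.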
rlm — by jphein
sme/adapters/rlm_adapter.py treats RLM
(a fork of alexzhang13/rlm) as
the read-side orchestrator rather than a deterministic retrieval
pipeline. The LLM itself decides when to call mempalace_search,
with what queries, and how to compose results. familiar's pipeline
is the baseline this adapter is benchmarked against, not the thing
it replaces.
Design: RLM gets mempalace_search registered as a custom_tools
callable. The adapter wraps that callable to capture every search
result into a per-query buffer; after rlm.completion() returns, the
buffer's contents become context_string (in tool-call order) and
retrieved_entities (one Entity per drawer). Same scoring contract
as every other adapter.
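The capture step described above is an ordinary wrapper around a callable. The sketch below shows the pattern in isolation, with no RLM dependency: `capturing` and `build_context` are hypothetical names, and the result dicts with `id` and `text` keys are an assumed shape, not the real search payload.

```python
from typing import Callable, Dict, List, Tuple

SearchFn = Callable[[str], List[Dict]]


def capturing(search: SearchFn) -> Tuple[SearchFn, List[Dict]]:
    """Wrap a search callable so every result is also appended to a buffer."""
    buffer: List[Dict] = []

    def wrapped(query: str) -> List[Dict]:
        results = search(query)
        buffer.extend(results)  # append order preserves tool-call order
        return results          # the caller sees the results unchanged

    return wrapped, buffer


def build_context(buffer: List[Dict]) -> Tuple[str, List[str]]:
    """Turn the buffer into a context string plus one entity id per result."""
    context = "\n\n".join(item["text"] for item in buffer)
    entities = [item["id"] for item in buffer]
    return context, entities
```

A fresh wrapper per query gives a per-query buffer. If the orchestrator never calls the tool, the buffer stays empty and the context string is empty too, which is how zero-call questions show up in the scores.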
Endpoint override: RLM_BASE_URL / RLM_MODEL / RLM_API_KEY
env vars point the openai backend at any compatible endpoint --
local llama.cpp, hosted Llama 3.3 70B, anything OpenAI-shaped --
without touching the cloud-chat-assistant config-file fallback path.
First two live readings on jp-realm-v0.1 (30 questions):
| Run | Mean recall | Tool-call distribution |
|---|---|---|
| rlm + Qwen 2.5 7B Q5_K_M | 46.67% | 25/30 zero-call, 2/30 used tool |
| rlm + Llama 3.3 70B | 46.67% | 22/30 zero-call, 8/30 used tool |
| familiar v0.3.9 (deterministic) | 78.33% | n/a |
Both RLM runs land at the same aggregate recall despite a 4x difference in tool-invocation rate -- both are capped by the orchestrator's willingness to invoke the tool, not by retrieval quality. This is the data behind the 9a invocation-rate issue filed upstream. See the onboarding guide for the full discussion and the per-question deltas.
Invocation:
```bash
RLM_BASE_URL=https://your-endpoint RLM_MODEL=llama-3.3-70b RLM_API_KEY=... \
PALACE_DAEMON_URL=http://your-daemon:8085 PALACE_API_KEY=... \
sme-eval retrieve --adapter rlm \
  --questions sme/corpora/jp_realm_v0_1/questions.yaml \
  --json baselines/rlm_$(date +%Y%m%d).json
```

sme/adapters/longhand.py measures
Longhand (by Nate Nelson), a
persistent local memory server for Claude Code. Longhand reads the raw
session JSONL that Claude Code already writes
(~/.claude/projects/<project>/<session-id>.jsonl) and indexes it
locally into SQLite (verbatim source of truth) plus ChromaDB (vector
search) under ~/.longhand/ — no network, no API calls. Like MemPalace
it stores exact words rather than letting a model decide what matters,
which puts it in the same verbatim-first cohort.
Daemon-strict by design. The adapter shells out to the longhand
CLI (longhand search --json) rather than opening Longhand's ChromaDB
or SQLite directly — the same single-writer discipline as
mempalace-daemon's HTTP-only access. A second process holding handles
to a store Longhand assumes it owns is exactly the failure mode the
daemon adapters exist to avoid.
Shape:
- `query()` → `longhand search <q> --json --limit N` (optional `--project`). Tolerates either a bare-list or `{results: [...]}` JSON shape; CLI/timeout/parse failures come back as a `QueryResult.error` (`CLI_ERROR` / `TIMEOUT` / `BAD_JSON` / `NO_RESULTS`) so Cat 9 scoring can tell a failed call-through from an empty store.
- `get_graph_snapshot()` → `([], [])`. Longhand is a verbatim session archive, not a knowledge graph, so structural categories (Cat 4/5/8) are not meaningful against it; the retrieval categories (Cat 1/2c/3/6) and Cat 9 are.
- `ingest_corpus()` → not implemented. Longhand ingests Claude Code sessions through its own hooks, not arbitrary seeded corpora, so this is a diagnostic-only (Mode B) adapter, like `mempalace-daemon`.
- `get_harness_manifest()` → declares Longhand's MCP `search` tool as a Cat 9 surface; the probe exercises the same SQLite+Chroma read path the MCP tool uses.
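The tolerant parsing described for `query()` can be sketched as below. It covers only the JSON-shape handling, not the subprocess call, and it is an illustration under assumptions: `parse_search_output` is a hypothetical name, and treating a dict without a `results` list as `BAD_JSON` is a guess at the behaviour rather than something stated above.

```python
import json
from typing import List, Optional, Tuple


def parse_search_output(stdout: str) -> Tuple[List, Optional[str]]:
    """Return (results, error) from CLI output in either accepted JSON shape."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        return [], "BAD_JSON"
    if isinstance(payload, dict):
        payload = payload.get("results")  # unwrap the {results: [...]} shape
    if not isinstance(payload, list):
        return [], "BAD_JSON"             # assumption: unknown shapes are parse failures
    if not payload:
        return [], "NO_RESULTS"           # valid call-through, empty store
    return payload, None
```

Returning a distinct code for each failure is what lets a scorer separate "the call never reached the store" from "the store had nothing to say".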
Invocation:
```bash
# Resolves `longhand` on PATH by default
sme-eval retrieve --adapter longhand \
  --questions corpus.yaml \
  --json longhand.json
```

The adapter resolves the longhand binary on PATH. To point at a
specific binary or scope to a single project, construct
LonghandAdapter(bin_path=..., project=...) directly — the registry
exposes bin_path, home_dir, n_results, timeout_s, and project
as constructor kwargs.
sme/conditions/full_context.py concatenates every .md file under a
vault directory and returns that as the query's context_string. No
retrieval, no graph, no index. This is the deliberate-floor baseline
answering the question: at what corpus size does structured retrieval
start outperforming flat context-window retrieval?
Structural categories (Cat 4/5/8) are not meaningful here — there is no graph. Retrieval categories (Cat 1/2c/3/6) produce maximum-recall, maximum-token-cost readings since the entire corpus is in context.
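The concatenation itself is simple enough to show in full. The sketch below is a minimal stand-in for the idea, not the contents of `sme/conditions/full_context.py`: the function name, the sorted file order, the blank-line separator, and UTF-8 decoding are all assumptions.

```python
from pathlib import Path


def full_context(vault: Path) -> str:
    """Concatenate every .md file under the vault into one context string."""
    # Sorting makes the output deterministic across runs and filesystems.
    files = sorted(p for p in vault.rglob("*.md") if p.is_file())
    return "\n\n".join(p.read_text(encoding="utf-8") for p in files)
```

Because the returned string grows linearly with the vault, token cost is the axis on which this baseline eventually loses to structured retrieval.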
```bash
sme-eval retrieve --adapter full-context \
  --db /path/to/vault/ \
  --questions corpus.yaml \
  --json d1.json
```

sme/conditions/karpathy_compiled.py reads a pre-compiled wiki
produced by sme-eval compile-wiki — an LLM-condensed version of
the raw vault, modelled on Karpathy's personal LLM-Wiki
setup.
Trades one-time compilation cost for a denser, lower-noise context.
The interesting question D2 answers — and D1 cannot — is whether
LLM-compiled compression improves answer accuracy at the same context
budget.
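The compile step is cached by content hash, which is what keeps the compilation cost one-time. How SME derives its cache key is not stated here; the sketch below shows one common way to do it, hashing relative paths and file bytes with SHA-256. `vault_digest` is a hypothetical name and the whole scheme is an assumption.

```python
import hashlib
from pathlib import Path


def vault_digest(vault: Path) -> str:
    """Stable SHA-256 digest over the paths and contents of every .md file."""
    digest = hashlib.sha256()
    for path in sorted(p for p in vault.rglob("*.md") if p.is_file()):
        # Include the relative path so renames change the digest too.
        digest.update(path.relative_to(vault).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

A compile step keyed on such a digest can skip the LLM call whenever the vault is unchanged and recompile only when a note is added, edited, or renamed.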
```bash
# Compile the vault first (one-time, cached by content hash)
sme-eval compile-wiki --vault /path/to/vault/ --output /path/to/compiled/
# Then run retrieval against the compiled wiki
sme-eval retrieve --adapter karpathy-compiled \
  --db /path/to/compiled/ \
  --questions corpus.yaml \
  --json d2.json
```

MIT. See LICENSE.
