techempower-org/multipass-structural-memory-eval

multipass-structural-memory-eval

[Image: Multi Pass. Mem Palace. A mockup of Leeloo Dallas's MULTI PASS ID card from The Fifth Element, stamped MEM PALACE where the issuing authority would normally appear, with the caption FREE PASS UNLOCK YOUR CREATIVITY ELIMINATES THE NEED TO REMEMBER EVERYTHING YOU'VE EVER TYPED.]

A diagnostic framework for memory systems — RAG, knowledge graphs, personal knowledge bases, conversational memory — that tests what the system knows about its own structure, not just whether it can retrieve memories.

"Multipass!" — Leeloo, The Fifth Element. The name is a nod to a joke from an earlier MemPalace issue thread (visual reference above), and also to what the framework actually does: multiple passes over every memory system under test, across multiple corpus shapes and multiple retrieval conditions (A / B / C), so brittle default behaviours that hide on any single pass become visible when the readings are compared side by side.

Contents

What this is · Status · Install · Corpora · Next steps · Adapters

What this is

See the nine-category menu for what each test measures and which to run for your setup.

Standard memory benchmarks (LongMemEval, LoCoMo, MINE, GraphRAG-Bench, BEAM) ask "can you find a memory?" That's necessary but not sufficient. A filing cabinet can find a memory. The question is what the structure of a memory system gives you beyond retrieval — and whether your specific build, under your specific harness, with your specific model, actually uses it.

SME defines a nine-category test menu. Categories 1–8 measure graph structure and offline retrieval. Category 9 (The Handshake) measures harness integration — whether the model actually reaches the memory when it runs in production. Cats 1–8 are where every published benchmark stops; Cat 9 is the gap every deployment engineer runs into.

Each category has a Cat N identifier (for code) and a descriptive "palace-nod" name (The Lookup, The Stairway, The Blueprint, The Handshake, and so on) so readable output doesn't require a lookup table.

Status

Beta-level instrumentation, actively evolving (v0.2.0). Ten adapters (flat, mempalace, mempalace-daemon, familiar, rlm, longhand, ladybugdb, full-context, karpathy-compiled, plus the SMEAdapter ABC template), nine CLI commands (retrieve, analyze, cat8, cat2c, cat4, cat5, check, cat9, compile-wiki), two evaluation corpora (jp-realm-v0.1 and good-dog-corpus), a LongMemEval cross-validation harness, Karpathy-baseline conditions D1/D2 (full-corpus-in-context and LLM-compiled wiki), B-Cubed scoring for alias resolution (Cat 4a), and a specification for the remaining categories. Diagnostic posture, not benchmark — the defensible findings are before/after deltas under identical conditions and within-system A/B/C/D ablations. See the spec and the onboarding guide for the full honest-limitations discussion.
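For readers unfamiliar with B-Cubed, the following is a minimal sketch of the standard per-item precision/recall computation over two clusterings of the same aliases. It is not SME's Cat 4a implementation; the function name `b_cubed` and the alias data are assumptions made for illustration.

```python
from __future__ import annotations

from collections import defaultdict


def b_cubed(predicted: dict[str, str], gold: dict[str, str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for two clusterings of the same items.

    Each dict maps an item (for example an alias) to a cluster label.
    """
    if not predicted or set(predicted) != set(gold):
        raise ValueError("predicted and gold must cover the same non-empty item set")

    pred_clusters: dict[str, set[str]] = defaultdict(set)
    gold_clusters: dict[str, set[str]] = defaultdict(set)
    for item, label in predicted.items():
        pred_clusters[label].add(item)
    for item, label in gold.items():
        gold_clusters[label].add(item)

    precision = recall = 0.0
    for item in predicted:
        same_pred = pred_clusters[predicted[item]]
        same_gold = gold_clusters[gold[item]]
        overlap = len(same_pred & same_gold)
        precision += overlap / len(same_pred)  # share of my predicted cluster that is correct
        recall += overlap / len(same_gold)     # share of my true cluster that was recovered

    n = len(predicted)
    precision, recall = precision / n, recall / n
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative data only: three aliases of one breed plus one unrelated name.
gold = {"GSD": "shepherd", "German Shepherd": "shepherd", "Alsatian": "shepherd", "Lab": "labrador"}
predicted = {"GSD": "a", "German Shepherd": "a", "Alsatian": "b", "Lab": "b"}
precision, recall, f1 = b_cubed(predicted, gold)
```

In this example the system wrongly merges "Alsatian" with "Lab", which lowers precision for those two items, and splits the shepherd cluster, which lowers recall for all three of its aliases.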

Install

pip install -e .
# Optional extras:
pip install -e ".[topology]"   # Ripser + python-louvain (for gap detection)
pip install -e ".[ladybugdb]"  # LadybugDB adapter
pip install -e ".[dev]"        # pytest, ruff

Installs as the Python package sme-eval with CLI entrypoint sme-eval. The GitHub repo is multipass-structural-memory-eval; the acronym SME (Structural Memory Evaluation) is used throughout the documentation and code.

Quick start: run your first diagnostic in 5 minutes with the onboarding guide. Need the spec? Start at docs/sme_spec_v8.md.

Corpora

SME ships two evaluation corpora and supports loading a third (LongMemEval) for cross-validation:

  • jp-realm-v0.1 — 30 questions against a personal knowledge palace (tech-domain, biographical). The original development corpus. Baseline readings for familiar, mempalace-daemon, and rlm adapters live in baselines/.

  • good-dog-corpus — 24 notes across 6 domains (veterinary research, municipal policy, breed standards, nutrition safety, behavioral research, community journalism). Non-technical, real-world, ontology-first. Designed to stress-test alias resolution, contradiction detection, and temporal supersession. Ships with a full ontology design narrative explaining every schema decision. See the good-dog-corpus README.

  • LongMemEval — loader for the 500-question LongMemEval-cleaned dataset (Wu et al., ICLR 2025) with a primary-source-verified category mapping to SME. Used for cross-validation of SME's scoring against the field's most-cited benchmark.

The multi-corpus methodology is load-bearing: a single corpus shape gives misleading conclusions because brittle default behaviours hide on any single retrieval profile. See the onboarding guide for the full argument.

Next steps

  • docs/ideas.md — onboarding guide. Start here if you want to run SME against your own memory system. Covers the nine-category menu, how to write an adapter for your backend, how to write a corpus from your own content, how to run the implemented categories, and how to read what comes out the other end. This is also where the methodology framing lives — why A/B/C isolation matters, why multi-corpus testing is load-bearing, and why "the delta is the product, the levels are decoration."

  • docs/sme_spec_v8.md — full specification. Precise category-by-category definitions, metric formulas, adapter interface contract, topology layer details, and the Cat 9 (The Handshake) harness-integration spec. Reference material — read the onboarding guide first if you want to get a test run going.

  • docs/cross_validation_2026.md — current work. Cross-validation of SME categories against LongMemEval / MemoryBench, Karpathy-condition D baselines (full-corpus-in-context), and first readings from the live benchmark harness. Active development; this is where near-term SME findings land.

  • docs/industry_standards_integration.md — integration audit. Survey of where SME rolls its own vs. where battle-tested standards exist (SHACL, PROV-O, OpenLineage, B-Cubed, Ripser). Constitutional principle: SME stays lightweight and locally runnable — no server hosting required.

  • docs/ingestigation.md — Cat 4 deep dive. Renames and re-scopes Category 4 with a primary-source-verified survey of existing tools (SHACL, W3C PROV-O, ProVe, Splink, OpenLineage, Great Expectations) and proposed sub-test additions.

Adapters

SME ships adapters for several memory systems. Each adapter teaches the framework to speak the wire protocol of a specific system so the same eval questions can run across multiple backends. Adapters live in sme/adapters/ and implement the SMEAdapter ABC.
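To give a feel for what an adapter does, here is a minimal sketch of the pattern. It does not import or reproduce the real SMEAdapter ABC: `AdapterSketch`, `KeywordAdapter`, and the reduced `QueryResult` below are hypothetical stand-ins, and the real interface contract is the one in the spec.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class QueryResult:
    """Hypothetical, reduced stand-in for SME's result record."""
    context_string: str = ""
    retrieved_entities: list[str] = field(default_factory=list)
    error: str | None = None


class AdapterSketch(ABC):
    """Hypothetical stand-in for the SMEAdapter ABC, not the real class."""

    @abstractmethod
    def query(self, question: str, limit: int = 5) -> QueryResult:
        """Answer one eval question from the backend."""

    def get_graph_snapshot(self) -> tuple[list, list]:
        # Backends without a graph can return empty entity and edge lists.
        return [], []


class KeywordAdapter(AdapterSketch):
    """Toy backend: ranks in-memory notes by shared lowercase words."""

    def __init__(self, notes: dict[str, str]) -> None:
        self.notes = notes

    def query(self, question: str, limit: int = 5) -> QueryResult:
        terms = set(question.lower().split())
        scored = [
            (len(terms & set(text.lower().split())), note_id)
            for note_id, text in self.notes.items()
        ]
        # Keep notes with at least one shared word, best match first.
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        hits = [note_id for score, note_id in ranked if score > 0][:limit]
        if not hits:
            return QueryResult(error="NO_RESULTS")
        context = "\n".join(self.notes[note_id] for note_id in hits)
        return QueryResult(context_string=context, retrieved_entities=hits)


adapter = KeywordAdapter({
    "n1": "Leeloo carries a multipass",
    "n2": "Korben drives a taxi",
})
result = adapter.query("who carries the multipass")
```

The point of the shape is that every backend returns the same record, so the same scoring code can run across all of them.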

mempalace-daemon — by jphein

sme/adapters/mempalace_daemon.py talks to a running palace-daemon (by jphein) over HTTP. No filesystem access, no ChromaDB import, no shared-process constraint with the daemon. Use this adapter when MemPalace is fronted by the daemon (the daemon is the single writer to the palace). The existing mempalace adapter is still correct for single-process upstream installs without the daemon.

Wired endpoints:

  • query() → GET /search?q=…&kind=…&limit=… with an X-API-Key header. Default kind="content" excludes Stop-hook auto-save checkpoints; pass --kind all to disable that filter. Daemon-side warnings (e.g. broken HNSW index) are surfaced into QueryResult.error as WARN: … so Cat 9 scoring can distinguish flagged retrieval from clean retrieval.
  • get_graph_snapshot() → tries GET /graph first (palace-daemon ≥1.6.0); on 404, falls back to walking mempalace_list_wings, mempalace_list_rooms per wing, and mempalace_list_tunnels via POST /mcp. The MCP fallback is slower (~30s on a 151K-drawer palace) but works against any palace-daemon version.
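The two behaviours in the list above can be sketched as pure helpers. This is an illustration under stated assumptions, not the adapter's code: the helper names, the `NotFound` exception, the default `limit=5`, and the idea that warnings arrive as a list of strings are all hypothetical.

```python
from __future__ import annotations

from typing import Callable
from urllib.parse import urlencode


class NotFound(Exception):
    """Hypothetical signal that the daemon answered 404."""


def build_search_url(base_url: str, question: str, kind: str = "content", limit: int = 5) -> str:
    """Build the GET /search URL with the q, kind, and limit parameters."""
    query = urlencode({"q": question, "kind": kind, "limit": limit})
    return f"{base_url.rstrip('/')}/search?{query}"


def warnings_to_error(warnings: list[str]) -> str | None:
    """Fold daemon-side warnings into a soft error string, or None if clean."""
    if not warnings:
        return None
    return "WARN: " + "; ".join(warnings)


def snapshot_with_fallback(
    fetch_graph: Callable[[], dict],
    walk_mcp: Callable[[], dict],
) -> dict:
    """Try the fast /graph path first; on 404, use the slower MCP walk."""
    try:
        return fetch_graph()
    except NotFound:
        return walk_mcp()


url = build_search_url("http://your-daemon:8085/", "who is leeloo")
soft_error = warnings_to_error(["broken HNSW index"])
```

Keeping the warning as a prefixed string rather than raising means a degraded retrieval still produces a scorable record.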

Auth resolution: explicit --api-url / --api-key flags → ~/.config/palace-daemon/env (PALACE_DAEMON_URL, PALACE_API_KEY) → process environment.
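A minimal sketch of that precedence order, assuming the env file holds simple KEY=VALUE lines (the file format is not stated above, and `parse_env_file` and `resolve_setting` are hypothetical names):

```python
from __future__ import annotations

from typing import Mapping


def parse_env_file(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments (assumed format)."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def resolve_setting(
    name: str,
    flag_value: str | None,
    file_values: Mapping[str, str],
    environ: Mapping[str, str],
) -> str | None:
    """Return the first non-empty value: flag, then env file, then process env."""
    for candidate in (flag_value, file_values.get(name), environ.get(name)):
        if candidate:
            return candidate
    return None


file_values = parse_env_file(
    '# daemon\nPALACE_DAEMON_URL=http://your-daemon:8085\nPALACE_API_KEY="from-file"\n'
)
process_env = {"PALACE_API_KEY": "from-env"}
api_key = resolve_setting("PALACE_API_KEY", None, file_values, process_env)
```

Here `api_key` comes from the file because no flag was given; passing a flag value would override both other sources.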

Invocation:

# With explicit daemon URL
sme-eval retrieve --adapter mempalace-daemon \
    --api-url http://your-daemon:8085 \
    --questions corpus.yaml \
    --kind content \
    --json out.json

# Or, if ~/.config/palace-daemon/env is populated, no flags needed
sme-eval retrieve --adapter mempalace-daemon --questions corpus.yaml

The same --api-url / --api-key / --kind flags work on the cat4, cat5, and check subcommands.

Why this matters: the engram-2 critique ("0.984 R@5 but 17% E2E QA accuracy") is about the integration-under-production-model slice that Cat 9 measures. Running SME's retrieve through the daemon surfaces exactly the kind of gap that critique describes: the adapter's WARN soft-error treatment means the framework records "retrieval ran but the daemon flagged it as degraded" as a first-class signal, not as a hard failure that hides the issue.

Why the existing adapter still has a use

For users running upstream MemPalace without palace-daemon (the default install pattern), the existing mempalace adapter is correct — single process, no daemon, direct ChromaDB access is fine. The daemon adapter is additive, for users who've adopted palace-daemon's single-writer architecture.

familiar — by jphein

familiar.realm.watch is a retrieval pipeline that wraps palace-daemon with reranking, temporal decay, extractive compression, and grounding directives. jphein built it; sme/adapters/familiar.py lets SME measure its full end-to-end contribution on top of the raw daemon. The sibling mempalace-daemon adapter measures palace alone — running both on the same corpus shows what the pipeline layer adds.

Wired endpoints:

  • query() → POST /api/familiar/eval with body {query, limit, kind, mock}. Familiar's eval endpoint already returns SME-shape {answer, context_string, retrieved_entities, retrieved_edges, error, warnings, available_in_scope} natively (it was designed against the SME contract), so the adapter is mostly deserialization with the same WARN: error-prefix translation as mempalace-daemon.
  • get_graph_snapshot() → GET /api/familiar/graph. Familiar proxies palace-daemon's /graph with a 5-minute server-side cache; payload mapping reuses sme/adapters/_graph_mapping.py shared with mempalace-daemon.
  • get_harness_manifest() → forward-compat for Cat 9. Returns [ToolCall, MCPResource] once sme.harness ships; [] until then.

Determinism: --mock (default) skips LLM inference so Cat 1 substring scoring is reproducible across runs. Use --no-mock to include the model output in the per-question record (intended for future Cat 9 work).

Invocation:

# Default: --mock for Cat 1 determinism
sme-eval retrieve --adapter familiar \
    --api-url https://your-familiar-host \
    --questions corpus.yaml \
    --json familiar.json

# Compare against the same palace via the daemon adapter
sme-eval retrieve --adapter mempalace-daemon \
    --api-url http://your-daemon:8085 \
    --questions corpus.yaml \
    --json daemon.json

# The score delta = what familiar's v0.2 pipeline is worth

The --api-url, --mock/--no-mock, and --familiar-timeout flags work on the cat4, cat5, check, and retrieve subcommands.
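Once both runs exist, the comparison is a per-question subtraction. The sketch below assumes you have already extracted a recall value per question ID from each output file; the layout of SME's JSON output is not described here, so the input dicts and the function name `per_question_delta` are hypothetical.

```python
def per_question_delta(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate minus baseline for every question present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    return {qid: candidate[qid] - baseline[qid] for qid in shared}


# Illustrative numbers only, not measured results.
daemon_recall = {"q1": 0.5, "q2": 1.0, "q3": 0.0}
familiar_recall = {"q1": 1.0, "q2": 1.0, "q3": 0.5}

deltas = per_question_delta(daemon_recall, familiar_recall)
mean_delta = sum(deltas.values()) / len(deltas)
```

Reading the per-question deltas alongside the mean shows whether the pipeline helps everywhere or only on a few questions.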

rlm — by jphein

sme/adapters/rlm_adapter.py treats RLM (a fork of alexzhang13/rlm) as the read-side orchestrator rather than a deterministic retrieval pipeline. The LLM itself decides when to call mempalace_search, with what queries, and how to compose results. familiar's pipeline is the baseline this adapter is benchmarked against, not the thing it replaces.

Design: RLM gets mempalace_search registered as a custom_tools callable. The adapter wraps that callable to capture every search result into a per-query buffer; after rlm.completion() returns, the buffer's contents become context_string (in tool-call order) and retrieved_entities (one Entity per drawer). Same scoring contract as every other adapter.
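The capture pattern described above can be sketched without RLM itself. This is an illustration only: `capture_searches`, `summarize`, the fake search function, and the drawer shape `{"id", "text"}` are assumptions, and the real adapter builds Entity objects rather than plain IDs.

```python
from __future__ import annotations

from typing import Callable

Drawer = dict  # hypothetical shape: {"id": str, "text": str}


def capture_searches(
    search: Callable[[str], list[Drawer]],
) -> tuple[Callable[[str], list[Drawer]], list[Drawer]]:
    """Wrap a search callable so every result is also appended to a buffer."""
    buffer: list[Drawer] = []

    def wrapped(query: str) -> list[Drawer]:
        results = search(query)
        buffer.extend(results)  # preserves tool-call order
        return results

    return wrapped, buffer


def summarize(buffer: list[Drawer]) -> tuple[str, list[str]]:
    """Turn the captured drawers into a context string and one ID per drawer."""
    context_string = "\n\n".join(drawer["text"] for drawer in buffer)
    retrieved_entities = [drawer["id"] for drawer in buffer]
    return context_string, retrieved_entities


def fake_search(query: str) -> list[Drawer]:
    # Stand-in for mempalace_search; returns one drawer per query.
    return [{"id": f"{query}-1", "text": f"about {query}"}]


tool, buffer = capture_searches(fake_search)
tool("leeloo")   # in a real run the LLM decides whether and when to call this
tool("korben")
context_string, retrieved_entities = summarize(buffer)
```

If the model never calls the tool, the buffer stays empty and the question is scored against an empty context.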

Endpoint override: RLM_BASE_URL / RLM_MODEL / RLM_API_KEY env vars point the openai backend at any compatible endpoint (local llama.cpp, hosted Llama 3.3 70B, anything OpenAI-shaped) without touching the cloud-chat-assistant config-file fallback path.

First two live readings on jp-realm-v0.1 (30 questions):

| Run | Mean recall | Tool-call distribution |
| --- | --- | --- |
| rlm + Qwen 2.5 7B Q5_K_M | 46.67% | 25/30 zero-call, 2/30 used tool |
| rlm + Llama 3.3 70B | 46.67% | 22/30 zero-call, 8/30 used tool |
| familiar v0.3.9 (deterministic) | 78.33% | n/a |

Both RLM runs land at the same aggregate recall despite a 4x difference in tool-invocation rate (2/30 vs. 8/30): recall is capped by the orchestrator's willingness to invoke the tool, not by retrieval quality. This is the data behind the 9a invocation-rate issue filed upstream. See the onboarding guide for the full discussion and the per-question deltas.

Invocation:

RLM_BASE_URL=https://your-endpoint RLM_MODEL=llama-3.3-70b RLM_API_KEY=... \
    PALACE_DAEMON_URL=http://your-daemon:8085 PALACE_API_KEY=... \
    sme-eval retrieve --adapter rlm \
    --questions sme/corpora/jp_realm_v0_1/questions.yaml \
    --json baselines/rlm_$(date +%Y%m%d).json

longhand — verbatim-first cohort

sme/adapters/longhand.py measures Longhand (by Nate Nelson), a persistent local memory server for Claude Code. Longhand reads the raw session JSONL that Claude Code already writes (~/.claude/projects/<project>/<session-id>.jsonl) and indexes it locally into SQLite (verbatim source of truth) plus ChromaDB (vector search) under ~/.longhand/ — no network, no API calls. Like MemPalace it stores exact words rather than letting a model decide what matters, which puts it in the same verbatim-first cohort.

Daemon-strict by design. The adapter shells out to the longhand CLI (longhand search --json) rather than opening Longhand's ChromaDB or SQLite directly — the same single-writer discipline as mempalace-daemon's HTTP-only access. A second process holding handles to a store Longhand assumes it owns is exactly the failure mode the daemon adapters exist to avoid.

Shape:

  • query() → longhand search <q> --json --limit N (optional --project). Tolerates either a bare-list or {results: [...]} JSON shape; CLI/timeout/parse failures come back as a QueryResult.error (CLI_ERROR / TIMEOUT / BAD_JSON / NO_RESULTS) so Cat 9 scoring can tell a failed call-through from an empty store.
  • get_graph_snapshot() → ([], []). Longhand is a verbatim session archive, not a knowledge graph, so structural categories (Cat 4/5/8) are not meaningful against it; the retrieval categories (Cat 1/2c/3/6) and Cat 9 are.
  • ingest_corpus() → not implemented. Longhand ingests Claude Code sessions through its own hooks, not arbitrary seeded corpora — this is a diagnostic-only (Mode B) adapter, like mempalace-daemon.
  • get_harness_manifest() → declares Longhand's MCP search tool as a Cat 9 surface; the probe exercises the same SQLite+Chroma read path the MCP tool uses.
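The output handling in the query() bullet above can be sketched as a pure function over the CLI's exit status and stdout. The error codes are the ones named above; which condition maps to which code in the real adapter is an assumption, as is the function name `classify_cli_output`.

```python
from __future__ import annotations

import json


def classify_cli_output(
    returncode: int,
    stdout: str,
    timed_out: bool = False,
) -> tuple[list, str | None]:
    """Return (results, error) for one `longhand search --json` call."""
    if timed_out:
        return [], "TIMEOUT"
    if returncode != 0:
        return [], "CLI_ERROR"
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        return [], "BAD_JSON"
    # Accept either a bare list or an object with a "results" list.
    results = payload.get("results") if isinstance(payload, dict) else payload
    if not isinstance(results, list):
        return [], "BAD_JSON"
    if not results:
        return [], "NO_RESULTS"
    return results, None


bare = classify_cli_output(0, '[{"text": "hello"}]')
wrapped = classify_cli_output(0, '{"results": [{"text": "hello"}]}')
empty = classify_cli_output(0, "[]")
```

Separating NO_RESULTS from CLI_ERROR is what lets a scorer tell an empty store apart from a call that never reached the store.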

Invocation:

# Resolves `longhand` on PATH by default
sme-eval retrieve --adapter longhand \
    --questions corpus.yaml \
    --json longhand.json

The adapter resolves the longhand binary on PATH. To point at a specific binary or scope to a single project, construct LonghandAdapter(bin_path=..., project=...) directly — the registry exposes bin_path, home_dir, n_results, timeout_s, and project as constructor kwargs.

full-context — Karpathy Condition D1

sme/conditions/full_context.py concatenates every .md file under a vault directory and returns that as the query's context_string. No retrieval, no graph, no index. This is the deliberate-floor baseline answering the question: at what corpus size does structured retrieval start outperforming flat context-window retrieval?

Structural categories (Cat 4/5/8) are not meaningful here — there is no graph. Retrieval categories (Cat 1/2c/3/6) produce maximum-recall, maximum-token-cost readings since the entire corpus is in context.

sme-eval retrieve --adapter full-context \
    --db /path/to/vault/ \
    --questions corpus.yaml \
    --json d1.json
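The concatenation step behind this condition is small enough to sketch. The version below is an illustration, not the code in sme/conditions/full_context.py: the sorted traversal order and the per-file header line are assumptions, and `build_full_context` is a hypothetical name.

```python
from pathlib import Path


def build_full_context(vault: Path) -> str:
    """Concatenate every .md file under `vault` into one context string."""
    parts = []
    for path in sorted(vault.rglob("*.md")):  # sorted for run-to-run stability
        relative = path.relative_to(vault).as_posix()
        body = path.read_text(encoding="utf-8")
        parts.append(f"# {relative}\n{body}")
    return "\n\n".join(parts)
```

Because the same string is returned for every question, token cost scales with corpus size rather than with the query.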

karpathy-compiled — Karpathy Condition D2

sme/conditions/karpathy_compiled.py reads a pre-compiled wiki produced by sme-eval compile-wiki — an LLM-condensed version of the raw vault, modelled on Karpathy's personal LLM-Wiki setup. Trades one-time compilation cost for a denser, lower-noise context. The interesting question D2 answers — and D1 cannot — is whether LLM-compiled compression improves answer accuracy at the same context budget.

# Compile the vault first (one-time, cached by content hash)
sme-eval compile-wiki --vault /path/to/vault/ --output /path/to/compiled/

# Then run retrieval against the compiled wiki
sme-eval retrieve --adapter karpathy-compiled \
    --db /path/to/compiled/ \
    --questions corpus.yaml \
    --json d2.json
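Caching by content hash can be pictured as follows. This is a sketch of the general technique, not the compile-wiki implementation: the choice of SHA-256, the inclusion of relative paths in the digest, and the name `vault_digest` are assumptions.

```python
import hashlib
from pathlib import Path


def vault_digest(vault: Path) -> str:
    """Hash the relative path and bytes of every .md file under `vault`."""
    hasher = hashlib.sha256()
    for path in sorted(vault.rglob("*.md")):  # stable order gives a stable digest
        hasher.update(path.relative_to(vault).as_posix().encode("utf-8"))
        hasher.update(b"\0")  # separator so path and content cannot run together
        hasher.update(path.read_bytes())
        hasher.update(b"\0")
    return hasher.hexdigest()
```

A compile step could store its output under this digest and skip recompilation whenever the digest of the current vault matches an existing entry.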

License

MIT. See LICENSE.

About

A diagnostic framework for memory systems — RAG, knowledge graphs, personal knowledge bases, conversational memory. Nine-category test menu from factual retrieval through ontology coherence to harness integration. Multi-corpus, multi-condition, beta-level instrumentation.
