Skip to content

fix: call quarantine_stale_hnsw() in make_client(); lower threshold to 5min#1173

Merged
igorls merged 5 commits into
MemPalace:developfrom
techempower-org:fix/quarantine-on-make-client
Apr 26, 2026
Merged

fix: call quarantine_stale_hnsw() in make_client(); lower threshold to 5min#1173
igorls merged 5 commits into
MemPalace:developfrom
techempower-org:fix/quarantine-on-make-client

Conversation

@jphein

@jphein jphein commented Apr 24, 2026

Copy link
Copy Markdown
Collaborator

Summary

make_client() called _fix_blob_seq_ids but never quarantine_stale_hnsw, so every fresh process (stop hook, precompact hook, CLI commands) opened a drifted palace and segfaulted in chromadb_rust_bindings.abi3.so before any write-path protection could fire.

Relationship to #1062: #1062 wires the quarantine call at MCP server startup, which covers long-lived server processes. This fix adds it to make_client() itself — the call site that all callers (server, hooks, CLI, tests) share — so every fresh PersistentClient open is protected regardless of entry point. The two fixes are complementary.

Two changes in this PR:

  1. Call quarantine_stale_hnsw() in make_client() before PersistentClient is created. The quarantine rename happens before ChromaDB's Rust bindings ever touch the corrupted segment.

  2. Lower stale_seconds default from 3600 → 300. A segment with 0.96h (3460s) drift caused production segfaults before the 1h threshold fired. ChromaDB's HNSW flush cadence means legitimate drift is on the order of seconds to low minutes; 5 minutes gives headroom without admitting clearly corrupt segments.

Context

Both this PR and #1171 (cross-process write lock) address the same root failure: multiple mcp_server.py processes writing to a shared palace concurrently corrupt the HNSW segment. #1171 prevents future corruption at write time; this PR ensures any residual drift from prior concurrent writes is quarantined before it causes a segfault on the next open.

Test plan

  • 130 tests pass (tests/test_mcp_server.py, tests/test_hooks_cli.py) against upstream/develop base
  • Verified stop hook no longer segfaults after quarantine clears drifted segments

Generated with Claude Code

Copilot AI review requested due to automatic review settings April 24, 2026 16:08
@igorls igorls added bug Something isn't working storage labels Apr 24, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Ensures stale/corrupted ChromaDB HNSW segments are quarantined before opening a palace to prevent fresh-process SIGSEGVs in the Rust bindings, and tightens the stale-segment detection threshold.

Changes:

  • Lower quarantine_stale_hnsw() default stale_seconds from 3600s to 300s.
  • Call quarantine_stale_hnsw() from ChromaBackend.make_client() before constructing chromadb.PersistentClient.
  • Update quarantine_stale_hnsw() docstring to reflect the new default threshold rationale.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 491 to 493
_fix_blob_seq_ids(palace_path)
quarantine_stale_hnsw(palace_path)
return chromadb.PersistentClient(path=palace_path)

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_client() now quarantines stale HNSW segments before constructing PersistentClient, but the primary code path used by many CLI/tools (ChromaBackend().get_collection(...)_client() at chroma.py:469+) still constructs chromadb.PersistentClient(...) without calling quarantine_stale_hnsw. This means fresh-process opens through get_collection (e.g. mempalace/cli.py, mempalace/dedup.py, mempalace/repair.py, mempalace/palace.py) can still hit the same segfault-on-open failure mode. Consider centralizing client construction so both _client() and make_client() share the same preflight (either call quarantine_stale_hnsw in _client() as well, or have _client() delegate to make_client()).

Copilot uses AI. Check for mistakes.
Comment thread mempalace/backends/chroma.py Outdated
Comment on lines 492 to 493
quarantine_stale_hnsw(palace_path)
return chromadb.PersistentClient(path=palace_path)

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default stale_seconds threshold is changed to 300s and make_client() now calls quarantine_stale_hnsw(), but there’s no test coverage that locks in this new default or verifies that make_client() performs the quarantine preflight before instantiating PersistentClient. Adding a unit test (e.g., via monkeypatching quarantine_stale_hnsw / chromadb.PersistentClient) would help prevent regressions, especially given the safety implications (segfault avoidance).

Copilot uses AI. Check for mistakes.
jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 24, 2026
@jphein jphein force-pushed the fix/quarantine-on-make-client branch from cd76f5f to 378269b Compare April 24, 2026 21:24
jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 24, 2026
… state

Three stale sections updated:

- Fork change queue: row 8 (.blob_seq_ids_migrated marker) struck
  through → FILED as MemPalace#1177. Two new rows added for segfault fixes
  discovered today (MemPalace#1171 concurrent-write lock, MemPalace#1173 quarantine in
  make_client) that weren't in the queue because the bugs surfaced
  today, not during the original 2026-04-21 triage.

- Open upstream PRs: was showing 3 of 10 PRs. Now shows all 10 with
  current CI/review state. All rebased onto current upstream/develop
  and MERGEABLE as of today.

- Merged since v3.3.1: added v3.3.3 release (2026-04-24) with its
  constituent merges — MemPalace#942, MemPalace#833, MemPalace#1097, MemPalace#1145, MemPalace#1147, MemPalace#1148/1150/1157
  entity-detection overhaul (via @igorls's MemPalace#1175 stacked-PR rescue),
  MemPalace#1166 palace-path security, #340/MemPalace#1093 install regression, plus MemPalace#851
  from the 2026-04-22 batch.
jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 24, 2026
…ostgres

Three things merged into one README pass:

1. Badge: link version-3.3.4 to jphein/mempalace/releases (the v3.3.4
   tag we just pushed) and add an upstream-3.3.3 secondary badge so
   readers can tell fork vs upstream version at a glance. Was sitting
   uncommitted from earlier today.

2. Multi-client coordination section: replaced the three-fix v3.3.4
   summary with a four-fix one. Added @felipetruman's MemPalace#976 num_threads
   pin (cherry-picked at 552a0d7) as fix #1 — the actual root-cause
   fix. Reframed our MemPalace#1171/MemPalace#1173/MemPalace#1177 as defense-in-depth around
   symptoms. Walked back palace-daemon from "primary concurrency story
   in progress" to "deferred pending observation" — with MemPalace#976's fix in
   place, the daemon's same-machine value drops; multi-machine and
   Windows remain its differentiators but neither is current pain.

3. Postgres + pgvector: walked back from "parallel track" framing to
   "long-term option, no immediate move" for the same reason. Migration
   cost stays real, current pain is mitigated, decision deferred until
   v3.3.4 stack is observed in production or TS rewrite ships.

Removed two stale paragraphs that were left over from the previous
"daemon as primary" framing.
jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 25, 2026
…HNSW fixes

Bring in 29 commits from upstream/develop since the last merge (2026-04-23):

Major absorbed changes:
- MemPalace#976 (Felipe Truman): HNSW graph corruption fix, mine_global_lock for
  fan-out, MAX_PRECOMPACT_BLOCK_ATTEMPTS=2 for /compact deadlocks. Closes
  MemPalace#974/MemPalace#965/MemPalace#955; likely resolves MemPalace#1172 too.
- MemPalace#1179 (Igor): CLI mempalace search routes through _hybrid_rank,
  legacy-metric warning + _warn_if_legacy_metric, invariant tests on
  hnsw:space=cosine across all 5 collection-creation paths.
- MemPalace#1180/MemPalace#1183/MemPalace#1184: cross-wing topic tunnels, init mine UX, --auto-mine.
- MemPalace#1185 (perf/batched-upsert-gpu): batched ChromaDB upserts, GPU device
  detection via mempalace/embedding.py.
- MemPalace#1182: graceful Ctrl-C during mempalace mine.
- MemPalace#1168: tunnel permissions security fix.

Conflict resolutions (8 files):
- searcher.py: kept fork's CLI delegation through search_memories (cleaner
  single-source-of-truth path); upstream's inline-retrieval branch dropped.
  Merged upstream's print formatting (cosine= + bm25=) with fork's
  matched_via reporting from sqlite BM25 fallback.
- backends/chroma.py: kept fork's _BLOB_FIX_MARKER + palace_path arg to
  ChromaCollection (MemPalace#1171 write lock); merged upstream's **ef_kwargs
  (embedding_function support from MemPalace#1185). Removed duplicate
  _pin_hnsw_threads (fork already cherry-picked Felipe's earlier).
- mcp_server.py: kept fork's palace_path arg + upstream's clearer comment
  on hnsw:num_threads=1 rationale.
- miner.py: took upstream's serial mine() flow (mine_global_lock + Ctrl-C),
  RESTORED fork's strict detect_room — substring matching from upstream
  breaks fork-only test_detect_room_priority1_no_substring_match. Added
  `import re` for word-boundary regex matching. Fork-ahead concurrent
  mining (workers=, ThreadPoolExecutor from 5cd14bd) is dropped — daemon
  migration deprioritizes local concurrent mining; can re-port if needed.
- CHANGELOG.md: combined fork's segfault-trio narrative with upstream's
  v3.3.4 release notes.
- HOOKS_TUTORIAL.md: took MEMPALACE_PYTHON env var name (fork README was
  stale; hooks already use this name per fork-ahead #19).
- test_convo_miner_unit.py: took both contextlib + pytest imports.
- test_searcher.py: imported upstream's 3 new TestSearchCLI tests but
  skipped them with TODOs — they assume upstream's inline-retrieval CLI
  with simpler mocks. Rewrite needed for fork's delegated search_memories
  path (sqlite BM25 fallback + scope counting).

Test result: 1334 passed, 3 skipped (the upstream tests above), 3 warnings.

Implications for open fork PRs:
- MemPalace#1171 (cross-process write lock at adapter) becomes structurally
  redundant given MemPalace#976's mine_global_lock + daemon-strict serialization.
  Slated for close with thank-you to Felipe.
- MemPalace#1173/MemPalace#1177 still relevant (quarantine threshold + blob_seq marker).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 26, 2026
Symptom on the canonical disks daemon: drift quarantines firing every
10–30 minutes throughout the day under steady write load. Logs show
.drift-* directories accumulating despite the daemon being the only
writer (no Syncthing replication of palace data).

Root cause is a false-positive thrash in the quarantine heuristic:

- chroma.sqlite3 mtime bumps on every write (millisecond cadence).
- HNSW segment files (data_level0.bin) only flush to disk on
  chromadb's internal cadence, which can lag minutes behind sqlite
  under continuous write load.

Once the gap exceeds the 300s threshold, quarantine_stale_hnsw renames
a perfectly valid HNSW segment, chromadb rebuilds it from scratch, and
the cycle repeats as soon as the next batch of writes lands. The 300s
threshold (lowered from 3600s in PR MemPalace#1173 after a 0.96h-drift production
segfault) is correct for the cross-machine-replication failure mode it
was designed for, but wrong for a daemon-strict deployment whose only
"drift" source is its own benign flush lag.

Fix: gate the proactive quarantine check to the first ``make_client()``
invocation per palace per process (``ChromaBackend._quarantined_paths``
set). Real cold-start drift (replication, partial restore, crashed-mid-
write) still gets caught — that's exactly when a fresh daemon process
opens the palace. Real runtime drift on observed HNSW errors still gets
caught via palace-daemon's ``_auto_repair`` which calls
``quarantine_stale_hnsw`` directly, bypassing this gate.

Two new tests in test_backends.py verify single-fire-per-palace and
per-palace independence. Conftest clears the gate between tests.

Suite 1362/1362 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@bensig bensig left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve. Two changes, both small, both defensive.

Threshold change: 1h → 5min

The body cites "0.96h drift was observed causing segfaults in production" — that's the right reason to tighten. ChromaDB's HNSW flush cadence is seconds-to-minutes under normal load, so the 1h threshold was leaving an hour-wide window where stale segments remained loadable and would crash on access. 5min is still well above the legitimate-flush window.

One minor risk to flag: slow-disk setups (network FS, btrfs with delayed allocation, encrypted filesystems with sync overhead) could see legitimate flushes take longer than 5min on huge palaces. If anyone reports false-positive quarantines after this lands, the threshold becomes the obvious knob. Worth keeping it as a kwarg so users can override per-palace if needed (which it already is — `quarantine_stale_hnsw(palace_path, stale_seconds=...)`).

Auto-call in `make_client()`

Defense-in-depth for the same failure mode that #976 addresses but at a different layer. Every `PersistentClient` creation now runs the quarantine check first. Cost is a few `os.stat` calls per palace open — negligible on local disks; might be measurable on slow network mounts but still well under any user-perceptible threshold.

Tests

Ran `pytest tests/ -k 'chroma or quarantine or blob or hnsw or repair'` on this branch — 53 passed, no regressions.

Ship it.

jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 26, 2026
…ush-lag

Previous: quarantine fired whenever sqlite_mtime - hnsw_mtime exceeded
the (lowered, in MemPalace#1173) 300s threshold. ChromaDB 1.5.x flushes HNSW
asynchronously and a clean shutdown does not force-flush, so the on-
disk HNSW is *always* meaningfully older than chroma.sqlite3 — that's
the steady state, not corruption. Quarantine renamed valid HNSW
segments on every cold-start, chromadb created empty replacements,
vector recall went to 0/N until rebuild.

Confirmed in production on the disks daemon journal, 2026-04-26
06:56:45: three of three HNSW segments quarantined on cold-start with
538-557s mtime gaps (post-clean-shutdown flush lag), leaving a
151,478-drawer palace with vector_ranked=0. Drift directories at
*.drift-20260426-065645/ each contained a complete 253MB data_level0.bin
plus 18MB index_metadata.pickle — clearly healthy indexes, renamed by
the false-positive heuristic.

Fix: two-stage gate.

  1. mtime gate (existing) — gap > stale_seconds is necessary.
  2. integrity gate (new) — sniff index_metadata.pickle for chromadb's
     expected protocol/terminator bytes (PROTO 0x80 head, STOP 0x2e
     tail) and a non-trivial size, WITHOUT deserializing the file.
     Healthy segment with mtime drift → keep in place; truncated /
     zero-filled / partial-flush → quarantine.

Format-sniff is deliberately non-deserializing — pickle deserialization
can execute arbitrary code, and the PROTO+STOP byte presence + size
floor is sufficient to distinguish a complete chromadb write from
truncation, zero-fill, or a partial flush during process kill. Real
load failures (the rare case where the bytes look right but chromadb
fails to load) still surface to palace-daemon's _auto_repair, which
calls quarantine_stale_hnsw directly on observed HNSW errors and
bypasses this gate.

The cold-start gate from 70c4bc6 (row 24) remains as a perf optimization
— even with the integrity check, repeating the sniff on every reconnect
is unnecessary work — but its load-bearing role is now covered by this
deeper fix.

4 new tests in test_backends.py:

  - test_quarantine_stale_hnsw_renames_corrupt_segment (drift + bad meta)
  - test_quarantine_stale_hnsw_leaves_healthy_segment_with_drift_alone
    (drift + valid meta — the production case at 06:24)
  - test_quarantine_stale_hnsw_leaves_segment_without_metadata_alone
    (fresh / never-flushed, no meta file)
  - test_quarantine_stale_hnsw_renames_truncated_metadata (under-floor
    size, partial-flush shape)

Existing test_quarantine_stale_hnsw_renames_drifted_segment renamed
to renames_corrupt_segment with explicit corrupt meta_bytes — the old
"renames any drift" contract is gone.

Suite 1366/1366 pass.

Coordinated cross-repo with palace-daemon's auto-repair-on-startup
workaround (separate agent's commit ed3a892). With this fork-side fix
the auto-repair becomes belt-and-suspenders; the structural cause
of empty-HNSW-on-restart is addressed at the quarantine layer.

CLAUDE.md row 26 + README fork-change-queue row + test count
1363→1366.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 26, 2026
Symptom on the canonical disks daemon: drift quarantines firing every
10–30 minutes throughout the day under steady write load. Logs show
.drift-* directories accumulating despite the daemon being the only
writer (no Syncthing replication of palace data).

Root cause is a false-positive thrash in the quarantine heuristic:

- chroma.sqlite3 mtime bumps on every write (millisecond cadence).
- HNSW segment files (data_level0.bin) only flush to disk on
  chromadb's internal cadence, which can lag minutes behind sqlite
  under continuous write load.

Once the gap exceeds the 300s threshold, quarantine_stale_hnsw renames
a perfectly valid HNSW segment, chromadb rebuilds it from scratch, and
the cycle repeats as soon as the next batch of writes lands. The 300s
threshold (lowered from 3600s in PR MemPalace#1173 after a 0.96h-drift production
segfault) is correct for the cross-machine-replication failure mode it
was designed for, but wrong for a daemon-strict deployment whose only
"drift" source is its own benign flush lag.

Fix: gate the proactive quarantine check to the first ``make_client()``
invocation per palace per process (``ChromaBackend._quarantined_paths``
set). Real cold-start drift (replication, partial restore, crashed-mid-
write) still gets caught — that's exactly when a fresh daemon process
opens the palace. Real runtime drift on observed HNSW errors still gets
caught via palace-daemon's ``_auto_repair`` which calls
``quarantine_stale_hnsw`` directly, bypassing this gate.

Two new tests in test_backends.py verify single-fire-per-palace and
per-palace independence. Conftest clears the gate between tests.

Suite 1362/1362 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 26, 2026
…ush-lag

Previous: quarantine fired whenever sqlite_mtime - hnsw_mtime exceeded
the (lowered, in MemPalace#1173) 300s threshold. ChromaDB 1.5.x flushes HNSW
asynchronously and a clean shutdown does not force-flush, so the on-
disk HNSW is *always* meaningfully older than chroma.sqlite3 — that's
the steady state, not corruption. Quarantine renamed valid HNSW
segments on every cold-start, chromadb created empty replacements,
vector recall went to 0/N until rebuild.

Confirmed in production on the disks daemon journal, 2026-04-26
06:56:45: three of three HNSW segments quarantined on cold-start with
538-557s mtime gaps (post-clean-shutdown flush lag), leaving a
151,478-drawer palace with vector_ranked=0. Drift directories at
*.drift-20260426-065645/ each contained a complete 253MB data_level0.bin
plus 18MB index_metadata.pickle — clearly healthy indexes, renamed by
the false-positive heuristic.

Fix: two-stage gate.

  1. mtime gate (existing) — gap > stale_seconds is necessary.
  2. integrity gate (new) — sniff index_metadata.pickle for chromadb's
     expected protocol/terminator bytes (PROTO 0x80 head, STOP 0x2e
     tail) and a non-trivial size, WITHOUT deserializing the file.
     Healthy segment with mtime drift → keep in place; truncated /
     zero-filled / partial-flush → quarantine.

Format-sniff is deliberately non-deserializing — pickle deserialization
can execute arbitrary code, and the PROTO+STOP byte presence + size
floor is sufficient to distinguish a complete chromadb write from
truncation, zero-fill, or a partial flush during process kill. Real
load failures (the rare case where the bytes look right but chromadb
fails to load) still surface to palace-daemon's _auto_repair, which
calls quarantine_stale_hnsw directly on observed HNSW errors and
bypasses this gate.

The cold-start gate from 70c4bc6 (row 24) remains as a perf optimization
— even with the integrity check, repeating the sniff on every reconnect
is unnecessary work — but its load-bearing role is now covered by this
deeper fix.

4 new tests in test_backends.py:

  - test_quarantine_stale_hnsw_renames_corrupt_segment (drift + bad meta)
  - test_quarantine_stale_hnsw_leaves_healthy_segment_with_drift_alone
    (drift + valid meta — the production case at 06:24)
  - test_quarantine_stale_hnsw_leaves_segment_without_metadata_alone
    (fresh / never-flushed, no meta file)
  - test_quarantine_stale_hnsw_renames_truncated_metadata (under-floor
    size, partial-flush shape)

Existing test_quarantine_stale_hnsw_renames_drifted_segment renamed
to renames_corrupt_segment with explicit corrupt meta_bytes — the old
"renames any drift" contract is gone.

Suite 1366/1366 pass.

Coordinated cross-repo with palace-daemon's auto-repair-on-startup
workaround (separate agent's commit ed3a892). With this fork-side fix
the auto-repair becomes belt-and-suspenders; the structural cause
of empty-HNSW-on-restart is addressed at the quarantine layer.

CLAUDE.md row 26 + README fork-change-queue row + test count
1363→1366.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jphein jphein requested a review from igorls as a code owner April 26, 2026 15:07
@jphein

jphein commented Apr 26, 2026

Copy link
Copy Markdown
Collaborator Author

Heads-up — pushed two follow-on commits to this PR after a production incident on 2026-04-26 surfaced false-positive shapes that the original single-commit version would have regressed for high-write-rate users.

  • 164e4ea — cold-start gate (ChromaBackend._quarantined_paths) so the proactive check fires once per palace per process, not on every reconnect.
  • 0f5d8e1 — integrity sniff-test (_segment_appears_healthy) so a healthy segment with mtime drift (the chromadb-1.5.x async-flush steady state — clean shutdown does not force-flush) is no longer destructively renamed.

PR description rewritten to layer all three. Tests still green (34 passed).

The motivating production case: 06:56:45 PDT cold-start quarantined three healthy 253MB HNSW segments because sqlite was 538–557s newer; chromadb created empty replacements; vector recall went to 0/151,478 until a 15-min mempalace repair --mode rebuild. Without the integrity gate, the threshold-only heuristic can't distinguish flush-lag from corruption.

jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 26, 2026
…#1173 status

CLAUDE.md row 28 documents the canonical-source pipeline (5a01aec)
that landed earlier today. PR table updated to reflect bensig's
2026-04-26 approvals on four of our open PRs:

  - MemPalace#1173 (HNSW quarantine wire): approved on the original 1-commit
    shape, then force-pushed two safety commits (cold-start gate +
    integrity sniff-test) after a production cold-start incident
    destroyed three healthy 253MB segments. Now mergeable=CONFLICTING
    against develop (which moved with MemPalace#1210, MemPalace#1205, MemPalace#976) — needs
    rebase + re-review.
  - MemPalace#1177, MemPalace#1198, MemPalace#1201: approved, mergeable, awaiting merge.

YAML manifest gets a new entry for the doc pipeline itself; FORK_CHANGELOG
regenerated to match. promises.md (in claude-config, not this repo)
got a long log entry covering today's full output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jphein and others added 3 commits April 26, 2026 09:39
…o 5min

make_client() called _fix_blob_seq_ids but skipped quarantine_stale_hnsw,
so every fresh process (stop hook, precompact hook, CLI) opened a drifted
palace and segfaulted in chromadb_rust_bindings before any write-path
protection could fire.

MemPalace#1062 wires the quarantine call at MCP server startup (covers long-lived
server processes). This fix adds it to make_client() itself — the call
site that all callers (server, hooks, CLI, tests) pass through — so every
fresh PersistentClient open is protected regardless of entry point.

Also lowers stale_seconds default from 3600 to 300: a 0.96h-drifted
segment caused production segfaults before the 1h threshold fired.
ChromaDB's HNSW flush cadence means legitimate drift is seconds to low
minutes; 5min gives headroom without admitting clearly corrupt segments.
Symptom on the canonical disks daemon: drift quarantines firing every
10–30 minutes throughout the day under steady write load. Logs show
.drift-* directories accumulating despite the daemon being the only
writer (no Syncthing replication of palace data).

Root cause is a false-positive thrash in the quarantine heuristic:

- chroma.sqlite3 mtime bumps on every write (millisecond cadence).
- HNSW segment files (data_level0.bin) only flush to disk on
  chromadb's internal cadence, which can lag minutes behind sqlite
  under continuous write load.

Once the gap exceeds the 300s threshold, quarantine_stale_hnsw renames
a perfectly valid HNSW segment, chromadb rebuilds it from scratch, and
the cycle repeats as soon as the next batch of writes lands. The 300s
threshold (lowered from 3600s in PR MemPalace#1173 after a 0.96h-drift production
segfault) is correct for the cross-machine-replication failure mode it
was designed for, but wrong for a daemon-strict deployment whose only
"drift" source is its own benign flush lag.

Fix: gate the proactive quarantine check to the first ``make_client()``
invocation per palace per process (``ChromaBackend._quarantined_paths``
set). Real cold-start drift (replication, partial restore, crashed-mid-
write) still gets caught — that's exactly when a fresh daemon process
opens the palace. Real runtime drift on observed HNSW errors still gets
caught via palace-daemon's ``_auto_repair`` which calls
``quarantine_stale_hnsw`` directly, bypassing this gate.

Two new tests in test_backends.py verify single-fire-per-palace and
per-palace independence. Conftest clears the gate between tests.

Suite 1362/1362 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ush-lag

Previous: quarantine fired whenever sqlite_mtime - hnsw_mtime exceeded
the (lowered, in MemPalace#1173) 300s threshold. ChromaDB 1.5.x flushes HNSW
asynchronously and a clean shutdown does not force-flush, so the on-
disk HNSW is *always* meaningfully older than chroma.sqlite3 — that's
the steady state, not corruption. Quarantine renamed valid HNSW
segments on every cold-start, chromadb created empty replacements,
vector recall went to 0/N until rebuild.

Confirmed in production on the disks daemon journal, 2026-04-26
06:56:45: three of three HNSW segments quarantined on cold-start with
538-557s mtime gaps (post-clean-shutdown flush lag), leaving a
151,478-drawer palace with vector_ranked=0. Drift directories at
*.drift-20260426-065645/ each contained a complete 253MB data_level0.bin
plus 18MB index_metadata.pickle — clearly healthy indexes, renamed by
the false-positive heuristic.

Fix: two-stage gate.

  1. mtime gate (existing) — gap > stale_seconds is necessary.
  2. integrity gate (new) — sniff index_metadata.pickle for chromadb's
     expected protocol/terminator bytes (PROTO 0x80 head, STOP 0x2e
     tail) and a non-trivial size, WITHOUT deserializing the file.
     Healthy segment with mtime drift → keep in place; truncated /
     zero-filled / partial-flush → quarantine.

Format-sniff is deliberately non-deserializing — pickle deserialization
can execute arbitrary code, and the PROTO+STOP byte presence + size
floor is sufficient to distinguish a complete chromadb write from
truncation, zero-fill, or a partial flush during process kill. Real
load failures (the rare case where the bytes look right but chromadb
fails to load) still surface to palace-daemon's _auto_repair, which
calls quarantine_stale_hnsw directly on observed HNSW errors and
bypasses this gate.

The cold-start gate from 70c4bc6 (row 24) remains as a perf optimization
— even with the integrity check, repeating the sniff on every reconnect
is unnecessary work — but its load-bearing role is now covered by this
deeper fix.

4 new tests in test_backends.py:

  - test_quarantine_stale_hnsw_renames_corrupt_segment (drift + bad meta)
  - test_quarantine_stale_hnsw_leaves_healthy_segment_with_drift_alone
    (drift + valid meta — the production case at 06:24)
  - test_quarantine_stale_hnsw_leaves_segment_without_metadata_alone
    (fresh / never-flushed, no meta file)
  - test_quarantine_stale_hnsw_renames_truncated_metadata (under-floor
    size, partial-flush shape)

Existing test_quarantine_stale_hnsw_renames_drifted_segment renamed
to renames_corrupt_segment with explicit corrupt meta_bytes — the old
"renames any drift" contract is gone.

Suite 1366/1366 pass.

Coordinated cross-repo with palace-daemon's auto-repair-on-startup
workaround (separate agent's commit ed3a892). With this fork-side fix
the auto-repair becomes belt-and-suspenders; the structural cause
of empty-HNSW-on-restart is addressed at the quarantine layer.

CLAUDE.md row 26 + README fork-change-queue row + test count
1363→1366.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jphein

jphein commented Apr 26, 2026

Copy link
Copy Markdown
Collaborator Author

Quick update for visibility — rebased against current develop (which moved with #1210 / #1205 / #976 since your approval). Branch is now MERGEABLE again.

Three commits, layered:

  1. db80f6e — wire quarantine_stale_hnsw() into make_client() + threshold 3600→300s (the original change you approved)
  2. e5e7a57 — cold-start gate: ChromaBackend._quarantined_paths: set[str] so the proactive check fires once per palace per process, not on every reconnect (catches the steady-write-load thrash where a daemon's own writes bump sqlite mtime past the threshold while HNSW flushes batch behind)
  3. 74ff5e6 — integrity sniff-test on the chromadb segment metadata file (PROTO/STOP byte presence + size floor, no deserialization). Production case 2026-04-26 06:56:45 had three healthy 253MB segments quarantined on cold start by mtime alone — the integrity gate distinguishes async-flush lag from corruption.

Without commits 2 and 3 the original threshold lowering would regress on any high-write-rate deployment. Re-review welcome; tests still 34/34 on the PR branch.

@igorls

igorls commented Apr 26, 2026

Copy link
Copy Markdown
Member

Thanks for chasing this down — the two-stage check (mtime gate + index_metadata.pickle sniff) is the right shape, and the production evidence on the 151K-drawer palace is convincing. A few notes before merge:

Blocking — unrelated test deletions

The diff removes the three _pin_hnsw_threads tests from tests/test_backends.py, but _pin_hnsw_threads is still live code on develop:

  • defined at mempalace/backends/chroma.py:133
  • called from mempalace/backends/chroma.py:604 (every get_collection)
  • called from mempalace/mcp_server.py:231 and :237
  • still imported at tests/test_backends.py:19

That import is now unused, which will fail ruff check under the CI-pinned 0.4.x with F401, and coverage drops on a function that addresses a different bug (#974/#965 HNSW race retrofit) than this PR. Could you either restore the three tests or split their removal into a separate PR? The deletion isn't motivated by the PR description.

Thread-safety nit

ChromaBackend._quarantined_paths is a class-level set mutated without a lock. In a multi-threaded server two make_client() calls for the same palace can both pass the membership check and both invoke quarantine_stale_hnsw. The function is idempotent (mtime check + timestamped rename), so the race is benign in practice — worth either a threading.Lock or a one-line note in the attribute docstring that idempotency is the safety property.

Pickle protocol assumption

_segment_appears_healthy requires head[0] == 0x80, i.e. pickle protocol ≥ 2. That matches what chromadb writes today, so this is fine — just worth calling out in the docstring so a future chromadb change to protocol 0/1 doesn't silently start quarantining healthy segments.

Scope-of-state nit

_quarantined_paths is class-level, which is correct for "per-process" but couples test isolation to the conftest.py reset. A module-level _QUARANTINED: set[str] = set() matches the actual scope and removes the foot-gun if anyone later treats ChromaBackend as having instance-scoped caching.

The core change is sound; the blocker above is the only thing that needs to move before merge.

Restore three `_pin_hnsw_threads` tests that the previous integrity-gate
commit deleted. The function is still live code on develop (defined at
chroma.py:207, called from chroma.py:705 + mcp_server.py), so the
deletion left the import unused (ruff F401) and dropped coverage on a
function unrelated to this PR's scope. Restored verbatim from main.

Plus three nits @igorls flagged:

- **Thread-safety doc**: `_quarantined_paths` mutation is lock-free;
  documented that idempotency of `quarantine_stale_hnsw` is the safety
  property (concurrent same-palace calls produce a benign redundant
  rename attempt that no-ops, no need for a lock).
- **Pickle protocol assumption**: `_segment_appears_healthy` requires
  PROTO ≥ 2 (`0x80`). Documented; matches what chromadb writes today,
  and a future protocol-0/1 emission would conservatively quarantine
  + lazy-rebuild rather than mis-classify as healthy.
- Class-level vs module-level scope: keeping class-level — the
  conftest reset is the controlled case, and module-level wouldn't
  remove the foot-gun, just relocate it. Conftest reset documented in
  the existing comment is the right pattern for test isolation.

Style nit (`Path(marker).touch()` vs `open(marker, "a").close()`)
deferred — that pattern lives in MemPalace#1177's territory, not MemPalace#1173's.

37/37 tests pass on the PR branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jphein

jphein commented Apr 26, 2026

Copy link
Copy Markdown
Collaborator Author

@igorls — thanks, all addressed in `942aae3`:

Blocker: restored the three `_pin_hnsw_threads` tests verbatim from develop. Confirmed locally that the import was indeed unused on the prior diff state; with the tests back, F401 clears and coverage on the retrofit (the #974/#965 ParallelFor mitigation) is whole again. The deletion was scope-creep that crept in during the cherry-pick onto this branch — agree it doesn't belong in this PR.

Thread-safety nit: documented the idempotency-as-safety-property in the `_quarantined_paths` block comment. Concurrent same-palace `make_client` calls produce a benign redundant rename attempt that no-ops on the second pass (mtime check finds the segment already renamed). Adding a lock would cost without correctness gain.

Pickle protocol nit: documented the PROTO ≥ 2 assumption (`0x80`) in `_segment_appears_healthy`'s docstring. Matches what chromadb writes today; a future protocol-0/1 emission would conservatively trigger quarantine + lazy rebuild rather than silently mis-classify, which is the right failure mode.

Class-level vs module-level scope: kept class-level. The conftest reset is the controlled isolation pattern, and module-level wouldn't remove the foot-gun (anyone treating `ChromaBackend` as instance-scoped would also miss a module-level cache). The existing comment block makes the per-process scope explicit.

Style nit (`Path(marker).touch()`): that's #1177's territory rather than this PR's. Folded into the next #1177 push since you asked for marker tests there too.

37/37 backend tests pass on the rebased branch. Ready for re-review.

@igorls igorls merged commit 4aa93e8 into MemPalace:develop Apr 26, 2026
6 checks passed
jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 26, 2026
Brings in upstream's corpus-origin + privacy-warning track (PRs MemPalace#1211
MemPalace#1221 MemPalace#1223 MemPalace#1224 MemPalace#1225) plus the canonical merged versions of our four
PRs that landed today (21:22-21:41 UTC):

  MemPalace#1173  quarantine_stale_hnsw on make_client + cold-start gate +
         integrity sniff-test (PROTO/STOP byte check, no deserialization)
  MemPalace#1177  .blob_seq_ids_migrated marker guard, closes MemPalace#1090
  MemPalace#1198  _tokenize None-document guard in BM25 reranker
  MemPalace#1201  palace_graph.build_graph skips None metadata

Conflict resolution:

* mempalace/backends/chroma.py — took upstream as base (it has the
  igorls-review pickle-protocol docstring, thread-safety paragraph, and
  Path(marker).touch() style nit), then re-applied MemPalace#1094's _coerce_none_metas
  in query()/get() since MemPalace#1094 is still open and not yet in develop.

* mempalace/mcp_server.py — took upstream's clean form. Dropped the
  fork-only `palace_path=` kwarg from four ChromaCollection() call sites:
  the kwarg was load-bearing for MemPalace#1171's per-collection write lock, but
  MemPalace#1171 closed in favor of MemPalace#976's mine_global_lock + daemon-strict, so
  the kwarg has no remaining consumer. ChromaCollection.__init__ in
  upstream/develop is back to (self, collection); calling it with
  palace_path= raised TypeError → silently swallowed by the broad except
  in _get_collection() → returned None → tool_status() returned _no_palace().
  41 mcp_server tests went from failing-with-KeyError to passing.

* mempalace/cli.py — dropped fork-only `workers=args.workers` from the
  cmd_mine -> miner.mine() call. Pre-existing fork-side bug: the
  `--workers` argparse arg landed in 5cd14bd but miner.mine() never
  accepted a workers param, so production `mempalace mine` TypeError'd
  on every invocation. Removed the broken plumbing; tests/test_cli.py
  updated to match.

* CHANGELOG.md — took upstream verbatim. Fork-specific changelog lives
  in FORK_CHANGELOG.md (canonical: docs/fork-changes.yaml).

* CLAUDE.md — kept ours. Fork's CLAUDE.md is operational; upstream's
  added a "Design Principles / Contributing" charter, which lives in
  README.md on the fork.

* tests/test_backends.py — took upstream's ruff-formatted line widths.

docs/fork-changes.yaml flips the two MemPalace#1173 entries (hnsw-integrity-gate,
hnsw-cold-start-gate) and the MemPalace#1201 entry (palace-graph-none-guard) from
OPEN to MERGED 2026-04-26. MemPalace#1173 MemPalace#1177 MemPalace#1198 MemPalace#1201 added to the
merged_upstream archive at the bottom. FORK_CHANGELOG.md regenerated.

scripts/check-docs.sh: 4/4 clean.
Test suite: 1460/1460.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 27, 2026
Upstream merged MemPalace#1173, MemPalace#1177, MemPalace#1198, MemPalace#1201 into develop on 2026-04-26
(picked up by the 2026-04-27 develop sync). Strike out their fork-ahead
rows in CLAUDE.md and add to the "Merged into upstream" section. Update
PR table accordingly: 18 merged (was 14), 7 open (was 8). Bump version
note: upstream shipped v3.3.3 on 2026-04-24 (carries our MemPalace#659/MemPalace#1021);
v3.3.4 prep branch in flight.

README:
- "tracks upstream/develop through the 2026-04-27 sync" (was 2026-04-26)
- 17 fork-ahead changes (was 22)
- 1,510 tests pass on main (was 1,395 — upstream brought new suites)
- "Open upstream PRs" 7 entries (was 11)
- Drop merged rows from "Fork-ahead — open or pending" table; keep
  PreCompact recovery-marker row (still fork-only)

scripts/check-docs.sh: clean (90 PR refs match upstream state, 13 fork
hashes resolve, FORK_CHANGELOG renders clean from YAML).

docs/fork-changes.yaml: no edit needed — already had merged_date on the
4 entries from the 2026-04-26 commit `5fd15db`.
lealvona pushed a commit to lealvona/mempalace that referenced this pull request Apr 29, 2026
Symptom on the canonical disks daemon: drift quarantines firing every
10–30 minutes throughout the day under steady write load. Logs show
.drift-* directories accumulating despite the daemon being the only
writer (no Syncthing replication of palace data).

Root cause is a false-positive thrash in the quarantine heuristic:

- chroma.sqlite3 mtime bumps on every write (millisecond cadence).
- HNSW segment files (data_level0.bin) only flush to disk on
  chromadb's internal cadence, which can lag minutes behind sqlite
  under continuous write load.

Once the gap exceeds the 300s threshold, quarantine_stale_hnsw renames
a perfectly valid HNSW segment, chromadb rebuilds it from scratch, and
the cycle repeats as soon as the next batch of writes lands. The 300s
threshold (lowered from 3600s in PR MemPalace#1173 after a 0.96h-drift production
segfault) is correct for the cross-machine-replication failure mode it
was designed for, but wrong for a daemon-strict deployment whose only
"drift" source is its own benign flush lag.

Fix: gate the proactive quarantine check to the first ``make_client()``
invocation per palace per process (``ChromaBackend._quarantined_paths``
set). Real cold-start drift (replication, partial restore, crashed-mid-
write) still gets caught — that's exactly when a fresh daemon process
opens the palace. Real runtime drift on observed HNSW errors still gets
caught via palace-daemon's ``_auto_repair`` which calls
``quarantine_stale_hnsw`` directly, bypassing this gate.

Two new tests in test_backends.py verify single-fire-per-palace and
per-palace independence. Conftest clears the gate between tests.

Suite 1362/1362 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lealvona pushed a commit to lealvona/mempalace that referenced this pull request Apr 29, 2026
…ush-lag

Previous: quarantine fired whenever sqlite_mtime - hnsw_mtime exceeded
the (lowered, in MemPalace#1173) 300s threshold. ChromaDB 1.5.x flushes HNSW
asynchronously and a clean shutdown does not force-flush, so the on-
disk HNSW is *always* meaningfully older than chroma.sqlite3 — that's
the steady state, not corruption. Quarantine renamed valid HNSW
segments on every cold-start, chromadb created empty replacements,
vector recall went to 0/N until rebuild.

Confirmed in production on the disks daemon journal, 2026-04-26
06:56:45: three of three HNSW segments quarantined on cold-start with
538-557s mtime gaps (post-clean-shutdown flush lag), leaving a
151,478-drawer palace with vector_ranked=0. Drift directories at
*.drift-20260426-065645/ each contained a complete 253MB data_level0.bin
plus 18MB index_metadata.pickle — clearly healthy indexes, renamed by
the false-positive heuristic.

Fix: two-stage gate.

  1. mtime gate (existing) — gap > stale_seconds is necessary.
  2. integrity gate (new) — sniff index_metadata.pickle for chromadb's
     expected protocol/terminator bytes (PROTO 0x80 head, STOP 0x2e
     tail) and a non-trivial size, WITHOUT deserializing the file.
     Healthy segment with mtime drift → keep in place; truncated /
     zero-filled / partial-flush → quarantine.

Format-sniff is deliberately non-deserializing — pickle deserialization
can execute arbitrary code, and the PROTO+STOP byte presence + size
floor is sufficient to distinguish a complete chromadb write from
truncation, zero-fill, or a partial flush during process kill. Real
load failures (the rare case where the bytes look right but chromadb
fails to load) still surface to palace-daemon's _auto_repair, which
calls quarantine_stale_hnsw directly on observed HNSW errors and
bypasses this gate.

The cold-start gate from 70c4bc6 (row 24) remains as a perf optimization
— even with the integrity check, repeating the sniff on every reconnect
is unnecessary work — but its load-bearing role is now covered by this
deeper fix.

4 new tests in test_backends.py:

  - test_quarantine_stale_hnsw_renames_corrupt_segment (drift + bad meta)
  - test_quarantine_stale_hnsw_leaves_healthy_segment_with_drift_alone
    (drift + valid meta — the production case at 06:24)
  - test_quarantine_stale_hnsw_leaves_segment_without_metadata_alone
    (fresh / never-flushed, no meta file)
  - test_quarantine_stale_hnsw_renames_truncated_metadata (under-floor
    size, partial-flush shape)

Existing test_quarantine_stale_hnsw_renames_drifted_segment renamed
to renames_corrupt_segment with explicit corrupt meta_bytes — the old
"renames any drift" contract is gone.

Suite 1366/1366 pass.

Coordinated cross-repo with palace-daemon's auto-repair-on-startup
workaround (separate agent's commit ed3a892). With this fork-side fix
the auto-repair becomes belt-and-suspenders; the structural cause
of empty-HNSW-on-restart is addressed at the quarantine layer.

CLAUDE.md row 26 + README fork-change-queue row + test count
1363→1366.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lealvona pushed a commit to lealvona/mempalace that referenced this pull request Apr 29, 2026
Restore three `_pin_hnsw_threads` tests that the previous integrity-gate
commit deleted. The function is still live code on develop (defined at
chroma.py:207, called from chroma.py:705 + mcp_server.py), so the
deletion left the import unused (ruff F401) and dropped coverage on a
function unrelated to this PR's scope. Restored verbatim from main.

Plus three nits @igorls flagged:

- **Thread-safety doc**: `_quarantined_paths` mutation is lock-free;
  documented that idempotency of `quarantine_stale_hnsw` is the safety
  property (concurrent same-palace calls produce a benign redundant
  rename attempt that no-ops, no need for a lock).
- **Pickle protocol assumption**: `_segment_appears_healthy` requires
  PROTO ≥ 2 (`0x80`). Documented; matches what chromadb writes today,
  and a future protocol-0/1 emission would conservatively quarantine
  + lazy-rebuild rather than mis-classify as healthy.
- Class-level vs module-level scope: keeping class-level — the
  conftest reset is the controlled case, and module-level wouldn't
  remove the foot-gun, just relocate it. Conftest reset documented in
  the existing comment is the right pattern for test isolation.

Style nit (`Path(marker).touch()` vs `open(marker, "a").close()`)
deferred — that pattern lives in MemPalace#1177's territory, not MemPalace#1173's.

37/37 tests pass on the PR branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RasimAbiyev pushed a commit to RasimAbiyev/mempalace that referenced this pull request May 3, 2026
…event SIGSEGV on stale HNSW (MemPalace#1121, MemPalace#1132, MemPalace#1263)

PR MemPalace#1173 wired quarantine_stale_hnsw into the static make_client() helper
but not into the instance _client() method. As a result every non-MCP
entry point (CLI mining, search, repair, status) — which all use
get_collection / _get_or_create_collection / _client() — skipped the
cold-start quarantine pass and could SIGSEGV on a stale HNSW segment
left over from a partial flush, replicated palace, or crashed-mid-write.

Refactor: extract the (_fix_blob_seq_ids + gated quarantine_stale_hnsw)
pre-open pass into a single private static helper
ChromaBackend._prepare_palace_for_open(). Both make_client() and
_client() now route through it, so the _quarantined_paths once-per-
palace-per-process gate is preserved (no runtime thrash on hot paths)
and behaviour stays identical — the fix is purely about extending the
existing protection to the path that was missing it.

Tests:

- test_client_quarantines_corrupt_segment_on_first_open mirrors the
  existing make_client test and verifies _client() actually renames a
  corrupt segment on first open.
- test_client_quarantines_only_on_first_call_per_palace verifies the
  cache gate prevents re-running quarantine across repeated _client()
  calls — important because _client() is hit on every backend op.

Closes MemPalace#1121. Closes MemPalace#1132. Closes MemPalace#1263.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
messelink added a commit to messelink/mempalace that referenced this pull request May 18, 2026
…itecture

Empirically demonstrated 2026-05-18 against a ~100K-drawer palace: the
plugin's SSH-proxy pattern spawns a fresh mempalace.mcp_server process
on the central host per remote session, which combined with the
quarantine_stale_hnsw-on-make_client() behavior (PR MemPalace#1173) silently
invalidates other live MCP processes on every Claude Code cold-start
anywhere in the fleet. Subsequent writes from those stale processes
destroy on-disk HNSW segments (data_level0.bin zeroed,
index_metadata.pickle deleted) within seconds.

Adding SSH on top of a per-session-spawn MCP model doesn't help — it
gives more clients a way to add more concurrent MCP processes to the
race. The plugin architecture needs to be replaced with a single-
process gateway (palace-daemon by rboarescu is one such), not patched.

Filed upstream as MemPalace#1533 (post-swap stale destruction) and MemPalace#1545
(rebuild-time auto-quarantine destroying the recovery tool itself).
This README warning surfaces the architectural limitation before any
user reaches the install steps below.

PR MemPalace#1190 stays open as documentation of what doesn't work and why.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rglaubitz pushed a commit to rglaubitz/mempalace that referenced this pull request May 26, 2026
…event SIGSEGV on stale HNSW (MemPalace#1121, MemPalace#1132, MemPalace#1263)

PR MemPalace#1173 wired quarantine_stale_hnsw into the static make_client() helper
but not into the instance _client() method. As a result every non-MCP
entry point (CLI mining, search, repair, status) — which all use
get_collection / _get_or_create_collection / _client() — skipped the
cold-start quarantine pass and could SIGSEGV on a stale HNSW segment
left over from a partial flush, replicated palace, or crashed-mid-write.

Refactor: extract the (_fix_blob_seq_ids + gated quarantine_stale_hnsw)
pre-open pass into a single private static helper
ChromaBackend._prepare_palace_for_open(). Both make_client() and
_client() now route through it, so the _quarantined_paths once-per-
palace-per-process gate is preserved (no runtime thrash on hot paths)
and behaviour stays identical — the fix is purely about extending the
existing protection to the path that was missing it.

Tests:

- test_client_quarantines_corrupt_segment_on_first_open mirrors the
  existing make_client test and verifies _client() actually renames a
  corrupt segment on first open.
- test_client_quarantines_only_on_first_call_per_palace verifies the
  cache gate prevents re-running quarantine across repeated _client()
  calls — important because _client() is hit on every backend op.

Closes MemPalace#1121. Closes MemPalace#1132. Closes MemPalace#1263.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rglaubitz added a commit to rglaubitz/mempalace that referenced this pull request May 26, 2026
* fix(mcp_server): pass embedding_function= on collection reopen (MemPalace#1299)

`mcp_server._get_collection` bypassed `ChromaBackend.get_collection`
and called `client.get_collection` / `client.create_collection` without
`embedding_function=`. ChromaDB 1.x does not persist the EF identity
with the collection, so the MCP server's reopen silently bound
chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop
hook ingest path bound `mempalace.embedding.get_embedding_function()`.

On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple
Silicon, per MemPalace#1299) the default EF's lazy ONNX provider selection could
SIGSEGV the host process on first `col.add()`, killing the MCP stdio
server and leaving every subsequent tool call returning
`Connection closed` until Claude Code was relaunched. Reads worked
because `col.get(ids=...)` and metadata fetches don't invoke the EF;
the auto-ingest path worked because mining routes through the backend
abstraction. Diary writes were the consistent failure surface.

Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`)
and pass it into both reopen branches. Falls back to the chromadb default
only if `mempalace.embedding.get_embedding_function` itself raises.

Regression test patches the chromadb client class to capture
`embedding_function=` on every `get_collection` / `create_collection`
call from `_get_collection(create=True)` and `_get_collection()`, and
fails if any call omits it.

Follow-up to MemPalace#1262 / MemPalace#1289 (which fixed the metadata-mismatch SIGSEGV
path); this addresses the EF-mismatch SIGSEGV path on the same surface.

* fix(mcp_server): address copilot review on MemPalace#1303

- Resolve the EF inside the two reopen branches that actually call
  `client.get_collection` / `client.create_collection`, so warm-cache
  reads stay zero-cost (no `MempalaceConfig()` / `_resolve_providers`
  on every tool call).
- Reuse `ChromaBackend._resolve_embedding_function()` instead of
  duplicating its try/except + log message + None-fallback.
- Reword the inline + CHANGELOG explanation to clarify that ChromaDB 1.x
  persists the EF *identity* (its `name()`) but not the *instance/
  configuration* — `mempalace.embedding` documents this and spoofs
  `name()` to `"default"` precisely so the identity check passes; the
  bug was the *provider list* (lazy ONNX selection) silently differing.

* fix(hooks): quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json (MemPalace#1076) (MemPalace#1077)

Shell splits hook command on whitespace after variable expansion, breaking
paths with spaces (e.g. C:\Users\Richard M on Windows). Wrapping the path
in double quotes preserves the token boundary.

Fixes the reported Stop/PreCompact pair in .claude-plugin/hooks/hooks.json
and applies the same fix to .codex-plugin/hooks.json (SessionStart/Stop/
PreCompact), which carries the identical bug.

* fix(cli): write compress output to mempalace_closets so palace can read them (MemPalace#1244)

`cmd_compress` was writing AAAK-compressed drawers to a `mempalace_compressed`
collection, but every read path (`palace.get_closets_collection`,
`searcher.py`, `repair.py`) reads from `mempalace_closets`. Result: for
non-mined palaces (or any palace where the user ran `mempalace compress`
expecting to backfill the closet/index layer), the compressed output was
silently invisible — written to a collection nothing else opens.

Fix the writer rather than renaming the readers: "closets" is the
user-visible feature name baked into the public API
(`get_closets_collection`), the searcher hybrid path, repair/HNSW
diagnostics, and docs. Renaming the readers would churn 15+ call sites
and the README for no benefit. The compressed AAAK strings are exactly
what closets are conceptually — compact pointers scanned by an LLM to
locate the right drawer — so they belong in `mempalace_closets`.

Tests:
- Update `test_cmd_compress_stores_results` to assert the collection
  name passed to `get_or_create_collection` is `mempalace_closets`.
- Add `test_cmd_compress_output_readable_via_get_closets_collection`:
  end-to-end with a real ChromaBackend, seed a drawer, run cmd_compress,
  then read back via the same `get_closets_collection` helper that
  palace.py / searcher use. Regression test for the wrong-collection
  bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: ruff format cli.py (MemPalace#1244)

CI requires ruff format --check on the whole touched file. Pre-existing drift, no logic change.

* style: ruff format tests/test_cli.py (PR MemPalace#1319)

* fix(mcp): case-insensitive agent name in diary_write/diary_read (MemPalace#1243)

`tool_diary_write` stored the `agent` metadata verbatim after `sanitize_name`
(which preserves case), while `tool_diary_read` filtered by exact match —
so writing as "Claude" and reading as "claude" silently returned zero rows.

Both endpoints now lowercase `agent_name` immediately after sanitization.
The default per-agent wing slug is also stable across casings since it's
derived from the same normalized form.

Behavior change: entries written prior to this fix under mixed-case agent
names will not match the new lowercase filter; documented under v3.3.5
in CHANGELOG with a `mempalace repair` pointer.

Adds a regression test (`test_diary_read_case_insensitive_agent`) and
updates the existing `test_diary_write_and_read` to assert the new
lowercase agent identity.

Closes MemPalace#1243

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: ruff format tests/test_mcp_server.py (PR MemPalace#1323)

* fix(backends/chroma): wire quarantine_stale_hnsw into _client() to prevent SIGSEGV on stale HNSW (MemPalace#1121, MemPalace#1132, MemPalace#1263)

PR MemPalace#1173 wired quarantine_stale_hnsw into the static make_client() helper
but not into the instance _client() method. As a result every non-MCP
entry point (CLI mining, search, repair, status) — which all use
get_collection / _get_or_create_collection / _client() — skipped the
cold-start quarantine pass and could SIGSEGV on a stale HNSW segment
left over from a partial flush, replicated palace, or crashed-mid-write.

Refactor: extract the (_fix_blob_seq_ids + gated quarantine_stale_hnsw)
pre-open pass into a single private static helper
ChromaBackend._prepare_palace_for_open(). Both make_client() and
_client() now route through it, so the _quarantined_paths once-per-
palace-per-process gate is preserved (no runtime thrash on hot paths)
and behaviour stays identical — the fix is purely about extending the
existing protection to the path that was missing it.

Tests:

- test_client_quarantines_corrupt_segment_on_first_open mirrors the
  existing make_client test and verifies _client() actually renames a
  corrupt segment on first open.
- test_client_quarantines_only_on_first_call_per_palace verifies the
  cache gate prevents re-running quarantine across repeated _client()
  calls — important because _client() is hit on every backend op.

Closes MemPalace#1121. Closes MemPalace#1132. Closes MemPalace#1263.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: ruff format touched files (PR MemPalace#1322)

CI requires whole-file format on touched files; pre-existing drift only.

* style: ruff format tests/test_backends.py (PR MemPalace#1322)

* fix(mcp): omit palace_path from tool_status responses (+ docs)

The MCP `mempalace_status` tool was returning the server's absolute
`_config.palace_path` to any connected client on both the main
(ChromaDB-backed) path and the sqlite fallback path that runs when
HNSW divergence is detected (MemPalace#1222). On a single-user local deployment
this is self-disclosure, but in nested-agent or multi-server MCP
topologies the client is a separate trust domain and the absolute
path has no documented client-side use.

Clients that legitimately need the palace path continue to have three
documented channels: the `MEMPALACE_PALACE_PATH` env var (primary) or
its legacy `MEMPAL_PALACE_PATH` alias, the `~/.mempalace/config.json`
file, and the `--palace` CLI flag on most subcommands.

Also corrects stale docs that claimed `mempalace_reconnect` returned a
`palace_path` field; the code returns `{success, message, drawers,
vector_disabled[, vector_disabled_reason]}` on success, plus a no-palace
shape and an exception shape.

- mempalace/mcp_server.py: drop palace_path from tool_status() and
  _tool_status_via_sqlite() result dicts
- website/reference/mcp-tools.md: update documented return shapes for
  mempalace_status (fix) and mempalace_reconnect (stale-docs correction)

Authored-by: Aaron Salsitz (ICCI LLC, @icciaaron). Claude Code was used
as an authoring and review-orchestration tool, with human-in-the-loop
oversight at every step: Aaron wrote the prompts, reviewed each draft,
called for three independent review passes (drafting / post-rebase
technical / CISA-aligned disclosure-leak), and verified the final patch
behavior before commit.

* fix(mcp): basename source_file in tool_get_drawer responses

The MCP `mempalace_get_drawer` tool returned the entire raw drawer
metadata blob to any connected client, and the `source_file` field
in that blob is the absolute filesystem path written by the miners
(`miner.py`, `convo_miner.py` — `source_file = str(filepath)`). On
a single-user local deployment this is self-disclosure, but in
nested-agent or multi-server MCP topologies the client is a separate
trust domain and the host's directory layout has no documented
client-side use.

Mirror the mitigation that `searcher.search_memories()` already applies
on its own return path: reduce `source_file` to its basename via
`Path(source_file).name` before handing the metadata to the client.
Citations still work — the directory layout does not leak.

Companion to #1 (omit palace_path from tool_status). Same threat class,
different surface:

- mempalace_status — palace dir path     → fixed in #1
- mempalace_get_drawer — per-drawer source_file path → this PR

Other read tools were audited and do not leak host paths:
- mempalace_search    — already basenames source_file
- mempalace_list_drawers — returns wing/room/preview only
- mempalace_diary_read   — date/timestamp/topic/content only
- mempalace_reconnect    — success/message/drawers only
- mempalace_kg_*         — entity/predicate strings, counts
- mempalace_check_duplicate — wing/room/preview only

Changes:
- mempalace/mcp_server.py: tool_get_drawer() now basenames metadata.source_file
- tests/test_mcp_server.py: regression test asserting the absolute path
  and its parent directory do not appear anywhere in the response
- website/reference/mcp-tools.md: clarify the documented return shape

* feat(searcher): add candidate_strategy="union" for vector∪BM25 reranking pool

Default search behavior is unchanged. Opt-in candidate_strategy="union"
also pulls top-K BM25-only candidates from sqlite FTS5 and merges them
into the rerank pool, catching docs with strong BM25 signal that the
vector index didn't surface in the over-fetch window.

Motivation
----------
The current hybrid path gathers candidates from the vector index only
(n_results * 3 over-fetch), then BM25-reranks within them. When the
query embeds close to the wrong content semantically, the right doc
never enters the rerank pool — *no matter how wide the over-fetch*.
Tested on a ~6K-document mixed corpus (knowledge prose + short structured
records): at *30x* over-fetch (~5% of the corpus) the target doc still
didn't surface for narrative-shaped queries targeting terminology guides.
Wider over-fetch isn't the answer; widening the pool's *source* is.

Concrete failure mode: a narrative-shaped query embeds close to records
sharing the same operational vocabulary (other narrative entries in the
corpus). A terminology / style guide is BM25-strong for the query
(rare keywords the guide repeats) but vector-distant. Vector-only
candidates don't include it; BM25 never gets to rerank it. The hybrid
path produces 0.00 recall on a probe that pure BM25 alone scores 1.00 —
the hybrid is worse than its component on the same input.

Behavior change
---------------
* New parameter ``candidate_strategy: str = "vector"`` on ``search_memories``.
  - ``"vector"`` (default): historical behavior, no change.
  - ``"union"``: also fetch top ``n_results * 3`` candidates via the
    existing ``_bm25_only_via_sqlite`` helper, dedupe by source_file,
    merge into the rerank pool. BM25-only candidates carry
    ``distance=None`` so they're scored on BM25 contribution alone
    (vec_sim coerces to 0).
* ``_hybrid_rank`` now handles ``distance=None`` explicitly, scoring
  such candidates as vector-unknown (vec_sim=0) rather than treating
  it as max-distance via shim.
* New strategies register via ``_CANDIDATE_MERGERS``; dispatch is in
  ``_apply_candidate_strategy`` so ``search_memories`` stays under the
  C901 complexity ceiling.

Bench numbers (~6K-doc internal mixed corpus, recall@10, 5 probes spanning
policy-exception lookup, temporal-decay, style retrieval, set-difference,
and pattern-recognition):

                              baseline ("vector")   "union"
  policy-exception probe        0.00                  0.50    +0.50
  temporal-decay probe          0.17                  0.50    +0.33
  style-retrieval probe         0.00                  1.00    +1.00 (PASSES)
  set-difference probe          0.00–0.06             0.06–0.09  ~
  pattern-recog probe           0.64 (stable)         0.50–0.71  variance, typ. +0.07
  macro recall                  0.16–0.17             0.51–0.56  +0.34 to +0.40

The pattern-recog variance points at a related issue worth a separate PR:
``_hybrid_rank`` computes BM25 IDF over the candidate set. Adding new
candidates re-normalizes BM25 for *existing* candidates non-monotonically.
Stable corpus-wide BM25 would remove this. Out of scope here.

Tests
-----
``tests/test_hybrid_candidate_union.py`` (6 tests, all pass):
- default behavior unchanged (explicit ``"vector"`` matches default)
- ``"union"`` surfaces a BM25-strong vector-distant doc
- ``"union"`` doesn't drop docs ``"vector"`` would have found
- empty-palace handling
- invalid ``candidate_strategy`` raises
- ``_hybrid_rank`` tolerates ``distance=None``

Existing ``test_hybrid_search.py`` (5) and ``test_searcher.py`` (27) pass.

Performance note
----------------
Each ``"union"`` query adds one sqlite open + FTS5 MATCH + metadata
fetch (via the existing ``_bm25_only_via_sqlite`` helper, which already
runs as the ``vector_disabled`` fallback path so the code is
well-trodden). Per-query overhead is small but unmeasured at corpus
scale. Default stays ``"vector"`` until a maintainer characterizes the
cost.

* fix(searcher): address Copilot review on MemPalace#1306

- Dedup union candidates by (full_path, chunk_index), not basename —
  two files sharing a basename in different dirs no longer collide,
  and a vector hit on chunk N of a file no longer blocks BM25 from
  contributing chunk M of the same file.
- Validate candidate_strategy at the top of search_memories so invalid
  values fail consistently, not only when the call routes through the
  vector path.
- Trim hits back to n_results after the union+rerank pool grows;
  preserves the existing search_memories size contract that the MCP
  limit parameter is built on.
- Skip BM25-only injection when max_distance > 0.0; BM25-only
  candidates carry distance=None and would silently bypass the
  caller's strict vector-distance threshold.

Adds 4 tests covering: validation under vector_disabled, n_results
trim, max_distance honoring, and basename-collision dedup.

* fix(mcp): forward valid_to and source params in kg_add/kg_invalidate (MemPalace#1314)

`tool_kg_add` previously accepted only `valid_from` and `source_closet`,
silently dropping `valid_to`, `source_file`, and `source_drawer_id` at
the MCP boundary. Backfilling already-ended historical facts therefore
collapsed to "still current," and adapter provenance never reached
the SQLite layer even though `KnowledgeGraph.add_triple` already
supported every column.

`tool_kg_invalidate` returned the literal string `"today"` whenever the
caller omitted `ended`, hiding the actual stamped date from anyone trying
to verify what got persisted.

Changes:
- Extend `tool_kg_add` signature + MCP input_schema with `valid_to`,
  `source_file`, `source_drawer_id`; forward all of them to
  `_kg.add_triple` and to the WAL log.
- Resolve `ended` to `date.today().isoformat()` in `tool_kg_invalidate`
  before logging / returning, so the response always reports the actual
  date stored in `valid_to`.
- Add regression tests for valid_to round-trip, source_file /
  source_drawer_id provenance, and the resolved-ended-date contract.
- Leave TODO(MemPalace#1283) markers so the open ISO-8601 validation PR can drop
  `validate_iso_date` over `valid_from` / `valid_to` / `ended` cleanly.

The underlying `KnowledgeGraph.add_triple` already accepted these
kwargs (RFC 002 §5.5) — only the MCP edge needed wiring up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: ruff format test_mcp_server.py (PR MemPalace#1320)

* docs(mcp): drop §5.5 from kg_add docstring/schema

The repo's anti-jargon meta-test bans §N markers outside the
sources/backends allowlist. mcp_server.py isn't allowlisted, so the
"RFC 002 §5.5" references added in this PR turned the test red.
Trim to "RFC 002" — section number isn't load-bearing for the description.

* fix(cli): honor --palace flag in cmd_init (MemPalace#1313)

cmd_init was instantiating MempalaceConfig() unconditionally, ignoring
args.palace and always writing the palace under ~/.mempalace. Mirror
the env-var pattern used by mcp_server.py (and consistent with how
cmd_mine / cmd_status / cmd_search resolve --palace) so every
downstream read of cfg.palace_path inside cmd_init — Pass 0,
cfg.init(), and the post-init mine — routes to the user-specified
location.

Adds tests/test_cli.py::test_cmd_init_honors_palace_flag covering the
regression: asserts Pass 0 receives the --palace value (not
~/.mempalace) and that MEMPALACE_PALACE_PATH is set in os.environ.

Closes MemPalace#1313.

* test(cli): prime monkeypatch undo so palace env doesn't leak

monkeypatch.delenv(name, raising=False) on a missing key registers no
undo entry, so the env var cmd_init writes leaked into test_config_from_file
on Python 3.13 / Windows / macOS.

Prime the slot with setenv before delenv so teardown rolls back the write.

* feat(mcp): persist drawer metadata

* fix: address metadata passthrough review blockers

* Close metadata passthrough review findings

F03 now logs corrupted drawer metadata decode failures through the mempalace_mcp logger, truncating the stored value to value_head so operators get an identifier without leaking the full payload.

F15 is Case B: ChromaBackend._resolve_embedding_function() can return None because it catches any exception from get_embedding_function() and returns None after logging; ChromaBackend.get_collection() mirrors the same conditional kwargs behavior and does not repair that case.

For the MCP _get_collection call site, treat a None EF as a fail-closed condition and never reopen a collection without embedding_function=, so the MemPalace#1299 default-EF regression path cannot silently reappear there.

Validation: uv run pytest -q; uv run ruff check .

---------

Co-authored-by: Igor Lins e Silva <4753812+igorls@users.noreply.github.com>
Co-authored-by: Mikhail Valentsev <michael@valentsev.ru>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: icciAaron <asalsitz@icci.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working storage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants