fix(mcp_server): pass embedding_function= on collection reopen (#1299)#1303
Conversation
`mcp_server._get_collection` bypassed `ChromaBackend.get_collection` and called `client.get_collection` / `client.create_collection` without `embedding_function=`. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop hook ingest path bound `mempalace.embedding.get_embedding_function()`. On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon, per #1299) the default EF's lazy ONNX provider selection could SIGSEGV the host process on first `col.add()`, killing the MCP stdio server and leaving every subsequent tool call returning `Connection closed` until Claude Code was relaunched. Reads worked because `col.get(ids=...)` and metadata fetches don't invoke the EF; the auto-ingest path worked because mining routes through the backend abstraction. Diary writes were the consistent failure surface. Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`) and pass it into both reopen branches. Falls back to the chromadb default only if `mempalace.embedding.get_embedding_function` itself raises. Regression test patches the chromadb client class to capture `embedding_function=` on every `get_collection` / `create_collection` call from `_get_collection(create=True)` and `_get_collection()`, and fails if any call omits it. Follow-up to #1262 / #1289 (which fixed the metadata-mismatch SIGSEGV path); this addresses the EF-mismatch SIGSEGV path on the same surface.
There was a problem hiding this comment.
Pull request overview
Fixes an MCP-server-specific ChromaDB collection reopen path to consistently bind MemPalace’s embedding function (instead of falling back to ChromaDB’s default EF), preventing EF-mismatch-induced crashes and aligning MCP write behavior with the miner/backend path.
Changes:
- Update
mcp_server._get_collectionto resolve an embedding function and passembedding_function=intoclient.get_collectionandclient.create_collection. - Add a regression test ensuring
embedding_function=is passed on bothcreate=Trueand cache-miss reopen (create=False) paths. - Add a changelog entry documenting the crash class and the fix.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
mempalace/mcp_server.py |
Resolves EF and threads it through collection reopen/create to avoid ChromaDB default EF binding. |
tests/test_mcp_server.py |
Adds regression coverage asserting EF is passed to ChromaDB client calls in both reopen branches. |
CHANGELOG.md |
Documents the MCP diary-write crash class and the EF-passing fix. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| ef = get_embedding_function() | ||
| except Exception: | ||
| logger.exception("Failed to build embedding function; using chromadb default") | ||
| ef = None | ||
| ef_kwargs = {"embedding_function": ef} if ef is not None else {} |
There was a problem hiding this comment.
Good catch — addressed in cd98d66. EF resolution is now scoped to the two branches that actually call client.get_collection / client.create_collection, so the warm-cache path (where the function returns _collection_cache immediately) skips it entirely. Did not plumb device=_config.embedding_device through this PR — MempalaceConfig is instantiated lazily inside get_embedding_function and _EF_CACHE keys on the resolved providers tuple, so the cold-path cost is one MempalaceConfig() + _resolve_providers per cache key. Worth a follow-up if profiling shows it, but felt out of scope for the SIGSEGV fix.
| # ChromaDB 1.x does not persist the embedding function with the | ||
| # collection, so a reader/writer that omits ``embedding_function=`` | ||
| # silently gets the chromadb-built-in default. On bleeding-edge | ||
| # interpreters (#1299: python 3.14 + chromadb 1.5.x on Apple Silicon) | ||
| # the default's lazy ONNX provider selection can SIGSEGV the host | ||
| # process on first ``col.add()``. The miner / Stop hook ingest path | ||
| # avoids this because it routes through ``ChromaBackend.get_collection`` | ||
| # which resolves the EF via ``mempalace.embedding.get_embedding_function``. | ||
| # The MCP server bypassed that abstraction; mirror its behaviour so |
There was a problem hiding this comment.
Fair point — reworded in cd98d66. The new comment now says ChromaDB 1.x persists the EF identity (its name()) but not the instance/configuration, and explicitly walks through why the identity check still passes (both mempalace.embedding._MempalaceONNX and chromadb's built-in DefaultEmbeddingFunction report name()=="default") while the provider list silently differs — which is the actual SIGSEGV trigger. That should line up with the embedding.py docstring instead of contradicting it.
|
|
||
| ### Bug Fixes | ||
|
|
||
| - **MCP server `tool_diary_write` SIGSEGV when EF default differs.** `mcp_server._get_collection` bypassed `ChromaBackend.get_collection` and called `client.get_collection` / `client.create_collection` without `embedding_function=`. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in `DefaultEmbeddingFunction` to the collection while the miner / Stop hook ingest path bound `mempalace.embedding.get_embedding_function()`. On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon) the default's lazy ONNX provider selection could SIGSEGV the host process on first `col.add()`, killing the MCP stdio server and leaving every subsequent tool call returning `Connection closed` until Claude Code was relaunched. `_get_collection` now resolves the EF via `mempalace.embedding.get_embedding_function` and passes it into both reopen branches, matching the miner/backend path. (#1299, follow-up to #1262 / #1289) |
There was a problem hiding this comment.
Reworded in cd98d66 to match the codebase explanation: "ChromaDB 1.x persists the EF identity (its name()) with the collection but not the EF instance/configuration" — and the entry now spells out that the identity check passes (both EFs report name()=="default") while the provider list differs, which is what embedding.py is documenting and what triggered the SIGSEGV.
| # which resolves the EF via ``mempalace.embedding.get_embedding_function``. | ||
| # The MCP server bypassed that abstraction; mirror its behaviour so | ||
| # ``tool_diary_write`` / ``tool_add_drawer`` get the same EF as mining. | ||
| try: | ||
| ef = get_embedding_function() | ||
| except Exception: | ||
| logger.exception("Failed to build embedding function; using chromadb default") | ||
| ef = None |
There was a problem hiding this comment.
Done in cd98d66 — both branches now call ChromaBackend._resolve_embedding_function() directly. Dropped the from .embedding import get_embedding_function import since the helper handles it. Calling a leading-underscore static method from outside the class is mildly smelly, but felt better than promoting it to public surface in this PR; happy to do that as a follow-up if you'd prefer.
- Resolve the EF inside the two reopen branches that actually call `client.get_collection` / `client.create_collection`, so warm-cache reads stay zero-cost (no `MempalaceConfig()` / `_resolve_providers` on every tool call). - Reuse `ChromaBackend._resolve_embedding_function()` instead of duplicating its try/except + log message + None-fallback. - Reword the inline + CHANGELOG explanation to clarify that ChromaDB 1.x persists the EF *identity* (its `name()`) but not the *instance/ configuration* — `mempalace.embedding` documents this and spoofs `name()` to `"default"` precisely so the identity check passes; the bug was the *provider list* (lazy ONNX selection) silently differing.
…Palace#1289/MemPalace#1303 After the 2026-05-03 develop sync to commit 1888b67: - Row 15 (_get_client get-then-create guard) clears: MemPalace#1262 fixed ChromaBackend; MemPalace#1289 fixed mcp_server's parallel call site; MemPalace#1303 plumbed embedding_function= through the reopen path. - Trailing tracking paragraph updated (sync target, count of merged/ open/closed PRs, list of features brought in). - Upstream-PRs table: MemPalace#1262 → merged, MemPalace#1286 flagged CONFLICTING with rebase queued, MemPalace#1289 + MemPalace#1303 added as merged. Note: fork-only _get_session_recovery_collection still uses the older get_or_create_collection pattern; theoretical SIGSEGV exposure on legacy recovery collections only. Tracked as follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…attern to recovery collection `_get_session_recovery_collection` (fork-only, introduced for the checkpoint collection split in row 23) used the older `get_or_create_collection` pattern that MemPalace#1262/MemPalace#1289/MemPalace#1303 just fixed in `_get_collection`. Same theoretical SIGSEGV class on chromadb 1.5.x when stored metadata diverges from the call site, plus the `embedding_function=` omission MemPalace#1303 caught. Both branches now resolve EF via `ChromaBackend._resolve_embedding_function()` and the create-path uses try `get_collection` / except `NotFoundError` → `create_collection`. Recovery collection's metadata is intentionally lighter than the main collection's (no `_HNSW_BLOAT_GUARD`); preserved that asymmetry — recovery is small enough that segment bloat isn't a concern. Existing `test_session_recovery.py` covers happy paths; 83/83 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A small improvement to `_get_collection`: log the exception on failure (instead of returning `None` silently) and retry once after clearing the cached client and collection. Behaviour on the happy path is unchanged. ## Why The current shape (`try: ... except Exception: return None`) has two costs in practice on a long-running deployment: 1. Invisible incidents. When a transient ChromaDB error poisons the cached `_client_cache` / `_collection_cache`, every subsequent call returns `None` but the operator never sees the underlying exception in the log. Debugging has to start from the symptoms — 404s, "no palace" responses — without the actual stack. 2. No self-healing. Once the cache is stale, the only way back to a working state is a process restart. A single retry that clears the caches first resolves the most common shape (a transient error left bad state behind) without operator intervention. ## What changes - One `logger.exception(...)` per failed attempt, including the palace path so multi-palace deployments can disambiguate. - One retry after the first failure, with `_client_cache`, `_collection_cache`, `_metadata_cache`, `_metadata_cache_time` reset before the second attempt so the retry rebuilds from scratch rather than reusing a poisoned handle. - Happy path (first attempt succeeds) is unchanged: same return, same caching, no extra log line. ## Why this is still useful after MemPalace#1289 / MemPalace#1303 MemPalace#1289 split `get_or_create_collection` into `get_collection` + `create_collection` (the metadata-mismatch SIGSEGV path); MemPalace#1303 plumbed `embedding_function=` through both reopen branches. Those fix specific *crash* shapes. This PR addresses the *aftermath* — when an error of any class poisons the client/collection cache and every subsequent call returns `None` invisibly. Retry-once self-heals; logging makes the underlying exception visible. The retry loop sits *above* the get-then-create logic, so retries benefit from MemPalace#1289's correct call shape and MemPalace#1303's EF binding. ## Rebased on top of MemPalace#1289 + MemPalace#1303 This PR was filed before MemPalace#1289/MemPalace#1303 merged. The original two commits (2038269 + 757eb7a) wrapped the older `get_or_create` body. Rebased onto the new `try get_collection / except NotFoundError → create_collection` body — the retry/log envelope is unchanged, only the inner body picks up the merged fixes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Catches up on a month of upstream work. Highlights pulled in: - MemPalace#1306 searcher: hybrid candidate union (vector ∪ BM25 reranking pool) - MemPalace#1325 mcp security: omit absolute paths from tool_get_drawer / tool_status - MemPalace#1322 chroma: wire quarantine_stale_hnsw to prevent SIGSEGV on stale HNSW - MemPalace#1320 mcp: forward valid_to / source params in kg_add / kg_invalidate - MemPalace#1321 cli: honor --palace flag in cmd_init - MemPalace#1314 kg temporal params fix - MemPalace#1244 cli: cmd_compress writes to mempalace_closets so palace can read - MemPalace#1243 mcp: case-insensitive agent name in diary_write/diary_read - MemPalace#1303 mcp_server: pass embedding_function= on collection reopen - MemPalace#1076/MemPalace#1077 hooks: quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json - Various ruff format passes on touched files Conflict resolution (CHANGELOG.md only — code files all auto-merged): - 3.3.5 unreleased section header from upstream kept above 3.3.4 - 3.3.4 section: kept our 2026-04-30 release date; merged upstream's new MemPalace#1299 SIGSEGV-on-default-EF entry in alongside our existing topic-tunnels (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and auto-ingest (MemPalace#1230/MemPalace#1231) entries. Kept our richer topic-tunnels detail (upstream's version was a strict subset). xdev patches preserved (still on this branch, untouched by merge): - 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings - 3fad61d fix(config): allow leading dash in wing names Not pushed to origin — run tests locally and decide when to push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Resolve the EF inside the two reopen branches that actually call `client.get_collection` / `client.create_collection`, so warm-cache reads stay zero-cost (no `MempalaceConfig()` / `_resolve_providers` on every tool call). - Reuse `ChromaBackend._resolve_embedding_function()` instead of duplicating its try/except + log message + None-fallback. - Reword the inline + CHANGELOG explanation to clarify that ChromaDB 1.x persists the EF *identity* (its `name()`) but not the *instance/ configuration* — `mempalace.embedding` documents this and spoofs `name()` to `"default"` precisely so the identity check passes; the bug was the *provider list* (lazy ONNX selection) silently differing.
* fix(mcp_server): pass embedding_function= on collection reopen (MemPalace#1299) `mcp_server._get_collection` bypassed `ChromaBackend.get_collection` and called `client.get_collection` / `client.create_collection` without `embedding_function=`. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop hook ingest path bound `mempalace.embedding.get_embedding_function()`. On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon, per MemPalace#1299) the default EF's lazy ONNX provider selection could SIGSEGV the host process on first `col.add()`, killing the MCP stdio server and leaving every subsequent tool call returning `Connection closed` until Claude Code was relaunched. Reads worked because `col.get(ids=...)` and metadata fetches don't invoke the EF; the auto-ingest path worked because mining routes through the backend abstraction. Diary writes were the consistent failure surface. Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`) and pass it into both reopen branches. Falls back to the chromadb default only if `mempalace.embedding.get_embedding_function` itself raises. Regression test patches the chromadb client class to capture `embedding_function=` on every `get_collection` / `create_collection` call from `_get_collection(create=True)` and `_get_collection()`, and fails if any call omits it. Follow-up to MemPalace#1262 / MemPalace#1289 (which fixed the metadata-mismatch SIGSEGV path); this addresses the EF-mismatch SIGSEGV path on the same surface. * fix(mcp_server): address copilot review on MemPalace#1303 - Resolve the EF inside the two reopen branches that actually call `client.get_collection` / `client.create_collection`, so warm-cache reads stay zero-cost (no `MempalaceConfig()` / `_resolve_providers` on every tool call). - Reuse `ChromaBackend._resolve_embedding_function()` instead of duplicating its try/except + log message + None-fallback. - Reword the inline + CHANGELOG explanation to clarify that ChromaDB 1.x persists the EF *identity* (its `name()`) but not the *instance/ configuration* — `mempalace.embedding` documents this and spoofs `name()` to `"default"` precisely so the identity check passes; the bug was the *provider list* (lazy ONNX selection) silently differing. * fix(hooks): quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json (MemPalace#1076) (MemPalace#1077) Shell splits hook command on whitespace after variable expansion, breaking paths with spaces (e.g. C:\Users\Richard M on Windows). Wrapping the path in double quotes preserves the token boundary. Fixes the reported Stop/PreCompact pair in .claude-plugin/hooks/hooks.json and applies the same fix to .codex-plugin/hooks.json (SessionStart/Stop/ PreCompact), which carries the identical bug. * fix(cli): write compress output to mempalace_closets so palace can read them (MemPalace#1244) `cmd_compress` was writing AAAK-compressed drawers to a `mempalace_compressed` collection, but every read path (`palace.get_closets_collection`, `searcher.py`, `repair.py`) reads from `mempalace_closets`. Result: for non-mined palaces (or any palace where the user ran `mempalace compress` expecting to backfill the closet/index layer), the compressed output was silently invisible — written to a collection nothing else opens. Fix the writer rather than renaming the readers: "closets" is the user-visible feature name baked into the public API (`get_closets_collection`), the searcher hybrid path, repair/HNSW diagnostics, and docs. Renaming the readers would churn 15+ call sites and the README for no benefit. The compressed AAAK strings are exactly what closets are conceptually — compact pointers scanned by an LLM to locate the right drawer — so they belong in `mempalace_closets`. Tests: - Update `test_cmd_compress_stores_results` to assert the collection name passed to `get_or_create_collection` is `mempalace_closets`. - Add `test_cmd_compress_output_readable_via_get_closets_collection`: end-to-end with a real ChromaBackend, seed a drawer, run cmd_compress, then read back via the same `get_closets_collection` helper that palace.py / searcher use. Regression test for the wrong-collection bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style: ruff format cli.py (MemPalace#1244) CI requires ruff format --check on the whole touched file. Pre-existing drift, no logic change. * style: ruff format tests/test_cli.py (PR MemPalace#1319) * fix(mcp): case-insensitive agent name in diary_write/diary_read (MemPalace#1243) `tool_diary_write` stored the `agent` metadata verbatim after `sanitize_name` (which preserves case), while `tool_diary_read` filtered by exact match — so writing as "Claude" and reading as "claude" silently returned zero rows. Both endpoints now lowercase `agent_name` immediately after sanitization. The default per-agent wing slug is also stable across casings since it's derived from the same normalized form. Behavior change: entries written prior to this fix under mixed-case agent names will not match the new lowercase filter; documented under v3.3.5 in CHANGELOG with a `mempalace repair` pointer. Adds a regression test (`test_diary_read_case_insensitive_agent`) and updates the existing `test_diary_write_and_read` to assert the new lowercase agent identity. Closes MemPalace#1243 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style: ruff format tests/test_mcp_server.py (PR MemPalace#1323) * fix(backends/chroma): wire quarantine_stale_hnsw into _client() to prevent SIGSEGV on stale HNSW (MemPalace#1121, MemPalace#1132, MemPalace#1263) PR MemPalace#1173 wired quarantine_stale_hnsw into the static make_client() helper but not into the instance _client() method. As a result every non-MCP entry point (CLI mining, search, repair, status) — which all use get_collection / _get_or_create_collection / _client() — skipped the cold-start quarantine pass and could SIGSEGV on a stale HNSW segment left over from a partial flush, replicated palace, or crashed-mid-write. Refactor: extract the (_fix_blob_seq_ids + gated quarantine_stale_hnsw) pre-open pass into a single private static helper ChromaBackend._prepare_palace_for_open(). Both make_client() and _client() now route through it, so the _quarantined_paths once-per- palace-per-process gate is preserved (no runtime thrash on hot paths) and behaviour stays identical — the fix is purely about extending the existing protection to the path that was missing it. Tests: - test_client_quarantines_corrupt_segment_on_first_open mirrors the existing make_client test and verifies _client() actually renames a corrupt segment on first open. - test_client_quarantines_only_on_first_call_per_palace verifies the cache gate prevents re-running quarantine across repeated _client() calls — important because _client() is hit on every backend op. Closes MemPalace#1121. Closes MemPalace#1132. Closes MemPalace#1263. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style: ruff format touched files (PR MemPalace#1322) CI requires whole-file format on touched files; pre-existing drift only. * style: ruff format tests/test_backends.py (PR MemPalace#1322) * fix(mcp): omit palace_path from tool_status responses (+ docs) The MCP `mempalace_status` tool was returning the server's absolute `_config.palace_path` to any connected client on both the main (ChromaDB-backed) path and the sqlite fallback path that runs when HNSW divergence is detected (MemPalace#1222). On a single-user local deployment this is self-disclosure, but in nested-agent or multi-server MCP topologies the client is a separate trust domain and the absolute path has no documented client-side use. Clients that legitimately need the palace path continue to have three documented channels: the `MEMPALACE_PALACE_PATH` env var (primary) or its legacy `MEMPAL_PALACE_PATH` alias, the `~/.mempalace/config.json` file, and the `--palace` CLI flag on most subcommands. Also corrects stale docs that claimed `mempalace_reconnect` returned a `palace_path` field; the code returns `{success, message, drawers, vector_disabled[, vector_disabled_reason]}` on success, plus a no-palace shape and an exception shape. - mempalace/mcp_server.py: drop palace_path from tool_status() and _tool_status_via_sqlite() result dicts - website/reference/mcp-tools.md: update documented return shapes for mempalace_status (fix) and mempalace_reconnect (stale-docs correction) Authored-by: Aaron Salsitz (ICCI LLC, @icciaaron). Claude Code was used as an authoring and review-orchestration tool, with human-in-the-loop oversight at every step: Aaron wrote the prompts, reviewed each draft, called for three independent review passes (drafting / post-rebase technical / CISA-aligned disclosure-leak), and verified the final patch behavior before commit. * fix(mcp): basename source_file in tool_get_drawer responses The MCP `mempalace_get_drawer` tool returned the entire raw drawer metadata blob to any connected client, and the `source_file` field in that blob is the absolute filesystem path written by the miners (`miner.py`, `convo_miner.py` — `source_file = str(filepath)`). On a single-user local deployment this is self-disclosure, but in nested-agent or multi-server MCP topologies the client is a separate trust domain and the host's directory layout has no documented client-side use. Mirror the mitigation that `searcher.search_memories()` already applies on its own return path: reduce `source_file` to its basename via `Path(source_file).name` before handing the metadata to the client. Citations still work — the directory layout does not leak. Companion to #1 (omit palace_path from tool_status). Same threat class, different surface: - mempalace_status — palace dir path → fixed in #1 - mempalace_get_drawer — per-drawer source_file path → this PR Other read tools were audited and do not leak host paths: - mempalace_search — already basenames source_file - mempalace_list_drawers — returns wing/room/preview only - mempalace_diary_read — date/timestamp/topic/content only - mempalace_reconnect — success/message/drawers only - mempalace_kg_* — entity/predicate strings, counts - mempalace_check_duplicate — wing/room/preview only Changes: - mempalace/mcp_server.py: tool_get_drawer() now basenames metadata.source_file - tests/test_mcp_server.py: regression test asserting the absolute path and its parent directory do not appear anywhere in the response - website/reference/mcp-tools.md: clarify the documented return shape * feat(searcher): add candidate_strategy="union" for vector∪BM25 reranking pool Default search behavior is unchanged. Opt-in candidate_strategy="union" also pulls top-K BM25-only candidates from sqlite FTS5 and merges them into the rerank pool, catching docs with strong BM25 signal that the vector index didn't surface in the over-fetch window. Motivation ---------- The current hybrid path gathers candidates from the vector index only (n_results * 3 over-fetch), then BM25-reranks within them. When the query embeds close to the wrong content semantically, the right doc never enters the rerank pool — *no matter how wide the over-fetch*. Tested on a ~6K-document mixed corpus (knowledge prose + short structured records): at *30x* over-fetch (~5% of the corpus) the target doc still didn't surface for narrative-shaped queries targeting terminology guides. Wider over-fetch isn't the answer; widening the pool's *source* is. Concrete failure mode: a narrative-shaped query embeds close to records sharing the same operational vocabulary (other narrative entries in the corpus). A terminology / style guide is BM25-strong for the query (rare keywords the guide repeats) but vector-distant. Vector-only candidates don't include it; BM25 never gets to rerank it. The hybrid path produces 0.00 recall on a probe that pure BM25 alone scores 1.00 — the hybrid is worse than its component on the same input. Behavior change --------------- * New parameter ``candidate_strategy: str = "vector"`` on ``search_memories``. - ``"vector"`` (default): historical behavior, no change. - ``"union"``: also fetch top ``n_results * 3`` candidates via the existing ``_bm25_only_via_sqlite`` helper, dedupe by source_file, merge into the rerank pool. BM25-only candidates carry ``distance=None`` so they're scored on BM25 contribution alone (vec_sim coerces to 0). * ``_hybrid_rank`` now handles ``distance=None`` explicitly, scoring such candidates as vector-unknown (vec_sim=0) rather than treating it as max-distance via shim. * New strategies register via ``_CANDIDATE_MERGERS``; dispatch is in ``_apply_candidate_strategy`` so ``search_memories`` stays under the C901 complexity ceiling. Bench numbers (~6K-doc internal mixed corpus, recall@10, 5 probes spanning policy-exception lookup, temporal-decay, style retrieval, set-difference, and pattern-recognition): baseline ("vector") "union" policy-exception probe 0.00 0.50 +0.50 temporal-decay probe 0.17 0.50 +0.33 style-retrieval probe 0.00 1.00 +1.00 (PASSES) set-difference probe 0.00–0.06 0.06–0.09 ~ pattern-recog probe 0.64 (stable) 0.50–0.71 variance, typ. +0.07 macro recall 0.16–0.17 0.51–0.56 +0.34 to +0.40 The pattern-recog variance points at a related issue worth a separate PR: ``_hybrid_rank`` computes BM25 IDF over the candidate set. Adding new candidates re-normalizes BM25 for *existing* candidates non-monotonically. Stable corpus-wide BM25 would remove this. Out of scope here. Tests ----- ``tests/test_hybrid_candidate_union.py`` (6 tests, all pass): - default behavior unchanged (explicit ``"vector"`` matches default) - ``"union"`` surfaces a BM25-strong vector-distant doc - ``"union"`` doesn't drop docs ``"vector"`` would have found - empty-palace handling - invalid ``candidate_strategy`` raises - ``_hybrid_rank`` tolerates ``distance=None`` Existing ``test_hybrid_search.py`` (5) and ``test_searcher.py`` (27) pass. Performance note ---------------- Each ``"union"`` query adds one sqlite open + FTS5 MATCH + metadata fetch (via the existing ``_bm25_only_via_sqlite`` helper, which already runs as the ``vector_disabled`` fallback path so the code is well-trodden). Per-query overhead is small but unmeasured at corpus scale. Default stays ``"vector"`` until a maintainer characterizes the cost. * fix(searcher): address Copilot review on MemPalace#1306 - Dedup union candidates by (full_path, chunk_index), not basename — two files sharing a basename in different dirs no longer collide, and a vector hit on chunk N of a file no longer blocks BM25 from contributing chunk M of the same file. - Validate candidate_strategy at the top of search_memories so invalid values fail consistently, not only when the call routes through the vector path. - Trim hits back to n_results after the union+rerank pool grows; preserves the existing search_memories size contract that the MCP limit parameter is built on. - Skip BM25-only injection when max_distance > 0.0; BM25-only candidates carry distance=None and would silently bypass the caller's strict vector-distance threshold. Adds 4 tests covering: validation under vector_disabled, n_results trim, max_distance honoring, and basename-collision dedup. * fix(mcp): forward valid_to and source params in kg_add/kg_invalidate (MemPalace#1314) `tool_kg_add` previously accepted only `valid_from` and `source_closet`, silently dropping `valid_to`, `source_file`, and `source_drawer_id` at the MCP boundary. Backfilling already-ended historical facts therefore collapsed to "still current," and adapter provenance never reached the SQLite layer even though `KnowledgeGraph.add_triple` already supported every column. `tool_kg_invalidate` returned the literal string `"today"` whenever the caller omitted `ended`, hiding the actual stamped date from anyone trying to verify what got persisted. Changes: - Extend `tool_kg_add` signature + MCP input_schema with `valid_to`, `source_file`, `source_drawer_id`; forward all of them to `_kg.add_triple` and to the WAL log. - Resolve `ended` to `date.today().isoformat()` in `tool_kg_invalidate` before logging / returning, so the response always reports the actual date stored in `valid_to`. - Add regression tests for valid_to round-trip, source_file / source_drawer_id provenance, and the resolved-ended-date contract. - Leave TODO(MemPalace#1283) markers so the open ISO-8601 validation PR can drop `validate_iso_date` over `valid_from` / `valid_to` / `ended` cleanly. The underlying `KnowledgeGraph.add_triple` already accepted these kwargs (RFC 002 §5.5) — only the MCP edge needed wiring up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style: ruff format test_mcp_server.py (PR MemPalace#1320) * docs(mcp): drop §5.5 from kg_add docstring/schema The repo's anti-jargon meta-test bans §N markers outside the sources/backends allowlist. mcp_server.py isn't allowlisted, so the "RFC 002 §5.5" references added in this PR turned the test red. Trim to "RFC 002" — section number isn't load-bearing for the description. * fix(cli): honor --palace flag in cmd_init (MemPalace#1313) cmd_init was instantiating MempalaceConfig() unconditionally, ignoring args.palace and always writing the palace under ~/.mempalace. Mirror the env-var pattern used by mcp_server.py (and consistent with how cmd_mine / cmd_status / cmd_search resolve --palace) so every downstream read of cfg.palace_path inside cmd_init — Pass 0, cfg.init(), and the post-init mine — routes to the user-specified location. Adds tests/test_cli.py::test_cmd_init_honors_palace_flag covering the regression: asserts Pass 0 receives the --palace value (not ~/.mempalace) and that MEMPALACE_PALACE_PATH is set in os.environ. Closes MemPalace#1313. * test(cli): prime monkeypatch undo so palace env doesn't leak monkeypatch.delenv(name, raising=False) on a missing key registers no undo entry, so the env var cmd_init writes leaked into test_config_from_file on Python 3.13 / Windows / macOS. Prime the slot with setenv before delenv so teardown rolls back the write. * feat(mcp): persist drawer metadata * fix: address metadata passthrough review blockers * Close metadata passthrough review findings F03 now logs corrupted drawer metadata decode failures through the mempalace_mcp logger, truncating the stored value to value_head so operators get an identifier without leaking the full payload. F15 is Case B: ChromaBackend._resolve_embedding_function() can return None because it catches any exception from get_embedding_function() and returns None after logging; ChromaBackend.get_collection() mirrors the same conditional kwargs behavior and does not repair that case. For the MCP _get_collection call site, treat a None EF as a fail-closed condition and never reopen a collection without embedding_function=, so the MemPalace#1299 default-EF regression path cannot silently reappear there. Validation: uv run pytest -q; uv run ruff check . --------- Co-authored-by: Igor Lins e Silva <4753812+igorls@users.noreply.github.com> Co-authored-by: Mikhail Valentsev <michael@valentsev.ru> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: icciAaron <asalsitz@icci.com>
Summary
mcp_server._get_collectionbypassedChromaBackend.get_collectionand calledclient.get_collection/client.create_collectionwithoutembedding_function=. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-inDefaultEmbeddingFunctionwhile the miner / Stop hook ingest path boundmempalace.embedding.get_embedding_function().col.add(), killing the MCP stdio server and leaving every subsequent tool call returningConnection closeduntil Claude Code was relaunched.ChromaBackend._resolve_embedding_function) and pass it into both reopen branches.Why diary writes specifically
tool_status,tool_diary_read,tool_list_wings, etc.) succeed becausecol.get(ids=...)/ metadata fetches /col.count()never invoke the EF.ChromaBackend.get_collection(with proper EF) — only the in-process MCP write path hit the EF-less collection handle.tool_diary_write(andtool_add_drawer) triggercol.add(documents=[...]), which is when ChromaDB lazy-loads the bound EF for the first time. On the reporter's environment that load segfaults the host process.Follow-up to
_get_collection(create=True)(splitget_or_create_collectioninto try-get_collection/ except-create_collection).Closes #1299.
Test plan
TestCacheInvalidation::test_get_collection_passes_embedding_functionpatches the chromadb client class to captureembedding_function=on everyget_collection/create_collectioncall from both reopen branches; fails if any call omits it. Verified to fail ondevelopHEAD without this PR's mcp_server.py change.tests/test_mcp_server.pysuite passes (66/66).pytest tests/ --ignore=tests/benchmarkspasses (1470 passed, 1 skipped).ruff check+ruff format --checkclean.