Skip to content

fix(mcp_server): pass embedding_function= on collection reopen (#1299)#1303

Merged
igorls merged 2 commits into
developfrom
fix/mcp-server-missing-embedding-function
May 1, 2026
Merged

fix(mcp_server): pass embedding_function= on collection reopen (#1299)#1303
igorls merged 2 commits into
developfrom
fix/mcp-server-missing-embedding-function

Conversation

@igorls

@igorls igorls commented May 1, 2026

Copy link
Copy Markdown
Member

Summary

  • mcp_server._get_collection bypassed ChromaBackend.get_collection and called client.get_collection / client.create_collection without embedding_function=. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in DefaultEmbeddingFunction while the miner / Stop hook ingest path bound mempalace.embedding.get_embedding_function().
  • On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon, per MCP server SIGSEGV on tool_diary_write in 3.3.4 — Connection closed; server unrecoverable in-session #1299) the default EF's lazy ONNX provider selection could SIGSEGV the host process on first col.add(), killing the MCP stdio server and leaving every subsequent tool call returning Connection closed until Claude Code was relaunched.
  • Resolve the EF up front (matching ChromaBackend._resolve_embedding_function) and pass it into both reopen branches.

Why diary writes specifically

  • Reads (tool_status, tool_diary_read, tool_list_wings, etc.) succeed because col.get(ids=...) / metadata fetches / col.count() never invoke the EF.
  • The Stop hook auto-mining keeps working because mining routes through ChromaBackend.get_collection (with proper EF) — only the in-process MCP write path hit the EF-less collection handle.
  • tool_diary_write (and tool_add_drawer) trigger col.add(documents=[...]), which is when ChromaDB lazy-loads the bound EF for the first time. On the reporter's environment that load segfaults the host process.

Follow-up to

Closes #1299.

Test plan

  • New regression test TestCacheInvalidation::test_get_collection_passes_embedding_function patches the chromadb client class to capture embedding_function= on every get_collection / create_collection call from both reopen branches; fails if any call omits it. Verified to fail on develop HEAD without this PR's mcp_server.py change.
  • Full tests/test_mcp_server.py suite passes (66/66).
  • Full pytest tests/ --ignore=tests/benchmarks passes (1470 passed, 1 skipped).
  • ruff check + ruff format --check clean.
  • Reporter (@bespined) to confirm the fix on python 3.14 + chromadb 1.5.x + Apple Silicon once a 3.3.5 build is available.

`mcp_server._get_collection` bypassed `ChromaBackend.get_collection`
and called `client.get_collection` / `client.create_collection` without
`embedding_function=`. ChromaDB 1.x does not persist the EF identity
with the collection, so the MCP server's reopen silently bound
chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop
hook ingest path bound `mempalace.embedding.get_embedding_function()`.

On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple
Silicon, per #1299) the default EF's lazy ONNX provider selection could
SIGSEGV the host process on first `col.add()`, killing the MCP stdio
server and leaving every subsequent tool call returning
`Connection closed` until Claude Code was relaunched. Reads worked
because `col.get(ids=...)` and metadata fetches don't invoke the EF;
the auto-ingest path worked because mining routes through the backend
abstraction. Diary writes were the consistent failure surface.

Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`)
and pass it into both reopen branches. Falls back to the chromadb default
only if `mempalace.embedding.get_embedding_function` itself raises.

Regression test patches the chromadb client class to capture
`embedding_function=` on every `get_collection` / `create_collection`
call from `_get_collection(create=True)` and `_get_collection()`, and
fails if any call omits it.

Follow-up to #1262 / #1289 (which fixed the metadata-mismatch SIGSEGV
path); this addresses the EF-mismatch SIGSEGV path on the same surface.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an MCP-server-specific ChromaDB collection reopen path to consistently bind MemPalace’s embedding function (instead of falling back to ChromaDB’s default EF), preventing EF-mismatch-induced crashes and aligning MCP write behavior with the miner/backend path.

Changes:

  • Update mcp_server._get_collection to resolve an embedding function and pass embedding_function= into client.get_collection and client.create_collection.
  • Add a regression test ensuring embedding_function= is passed on both create=True and cache-miss reopen (create=False) paths.
  • Add a changelog entry documenting the crash class and the fix.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
mempalace/mcp_server.py Resolves EF and threads it through collection reopen/create to avoid ChromaDB default EF binding.
tests/test_mcp_server.py Adds regression coverage asserting EF is passed to ChromaDB client calls in both reopen branches.
CHANGELOG.md Documents the MCP diary-write crash class and the EF-passing fix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/mcp_server.py Outdated
Comment on lines +292 to +297
try:
ef = get_embedding_function()
except Exception:
logger.exception("Failed to build embedding function; using chromadb default")
ef = None
ef_kwargs = {"embedding_function": ef} if ef is not None else {}

@igorls igorls May 1, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — addressed in cd98d66. EF resolution is now scoped to the two branches that actually call client.get_collection / client.create_collection, so the warm-cache path (where the function returns _collection_cache immediately) skips it entirely. Did not plumb device=_config.embedding_device through this PR — MempalaceConfig is instantiated lazily inside get_embedding_function and _EF_CACHE keys on the resolved providers tuple, so the cold-path cost is one MempalaceConfig() + _resolve_providers per cache key. Worth a follow-up if profiling shows it, but felt out of scope for the SIGSEGV fix.

Comment thread mempalace/mcp_server.py Outdated
Comment on lines +282 to +290
# ChromaDB 1.x does not persist the embedding function with the
# collection, so a reader/writer that omits ``embedding_function=``
# silently gets the chromadb-built-in default. On bleeding-edge
# interpreters (#1299: python 3.14 + chromadb 1.5.x on Apple Silicon)
# the default's lazy ONNX provider selection can SIGSEGV the host
# process on first ``col.add()``. The miner / Stop hook ingest path
# avoids this because it routes through ``ChromaBackend.get_collection``
# which resolves the EF via ``mempalace.embedding.get_embedding_function``.
# The MCP server bypassed that abstraction; mirror its behaviour so

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point — reworded in cd98d66. The new comment now says ChromaDB 1.x persists the EF identity (its name()) but not the instance/configuration, and explicitly walks through why the identity check still passes (both mempalace.embedding._MempalaceONNX and chromadb's built-in DefaultEmbeddingFunction report name()=="default") while the provider list silently differs — which is the actual SIGSEGV trigger. That should line up with the embedding.py docstring instead of contradicting it.

Comment thread CHANGELOG.md Outdated

### Bug Fixes

- **MCP server `tool_diary_write` SIGSEGV when EF default differs.** `mcp_server._get_collection` bypassed `ChromaBackend.get_collection` and called `client.get_collection` / `client.create_collection` without `embedding_function=`. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in `DefaultEmbeddingFunction` to the collection while the miner / Stop hook ingest path bound `mempalace.embedding.get_embedding_function()`. On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon) the default's lazy ONNX provider selection could SIGSEGV the host process on first `col.add()`, killing the MCP stdio server and leaving every subsequent tool call returning `Connection closed` until Claude Code was relaunched. `_get_collection` now resolves the EF via `mempalace.embedding.get_embedding_function` and passes it into both reopen branches, matching the miner/backend path. (#1299, follow-up to #1262 / #1289)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded in cd98d66 to match the codebase explanation: "ChromaDB 1.x persists the EF identity (its name()) with the collection but not the EF instance/configuration" — and the entry now spells out that the identity check passes (both EFs report name()=="default") while the provider list differs, which is what embedding.py is documenting and what triggered the SIGSEGV.

Comment thread mempalace/mcp_server.py Outdated
Comment on lines +289 to +296
# which resolves the EF via ``mempalace.embedding.get_embedding_function``.
# The MCP server bypassed that abstraction; mirror its behaviour so
# ``tool_diary_write`` / ``tool_add_drawer`` get the same EF as mining.
try:
ef = get_embedding_function()
except Exception:
logger.exception("Failed to build embedding function; using chromadb default")
ef = None

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in cd98d66 — both branches now call ChromaBackend._resolve_embedding_function() directly. Dropped the from .embedding import get_embedding_function import since the helper handles it. Calling a leading-underscore static method from outside the class is mildly smelly, but felt better than promoting it to public surface in this PR; happy to do that as a follow-up if you'd prefer.

- Resolve the EF inside the two reopen branches that actually call
  `client.get_collection` / `client.create_collection`, so warm-cache
  reads stay zero-cost (no `MempalaceConfig()` / `_resolve_providers`
  on every tool call).
- Reuse `ChromaBackend._resolve_embedding_function()` instead of
  duplicating its try/except + log message + None-fallback.
- Reword the inline + CHANGELOG explanation to clarify that ChromaDB 1.x
  persists the EF *identity* (its `name()`) but not the *instance/
  configuration* — `mempalace.embedding` documents this and spoofs
  `name()` to `"default"` precisely so the identity check passes; the
  bug was the *provider list* (lazy ONNX selection) silently differing.
@igorls igorls merged commit 5ddaf7a into develop May 1, 2026
6 checks passed
jphein added a commit to techempower-org/mempalace that referenced this pull request May 3, 2026
…Palace#1289/MemPalace#1303

After the 2026-05-03 develop sync to commit 1888b67:
- Row 15 (_get_client get-then-create guard) clears: MemPalace#1262 fixed
  ChromaBackend; MemPalace#1289 fixed mcp_server's parallel call site; MemPalace#1303
  plumbed embedding_function= through the reopen path.
- Trailing tracking paragraph updated (sync target, count of merged/
  open/closed PRs, list of features brought in).
- Upstream-PRs table: MemPalace#1262 → merged, MemPalace#1286 flagged CONFLICTING with
  rebase queued, MemPalace#1289 + MemPalace#1303 added as merged.

Note: fork-only _get_session_recovery_collection still uses the older
get_or_create_collection pattern; theoretical SIGSEGV exposure on
legacy recovery collections only. Tracked as follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jphein added a commit to techempower-org/mempalace that referenced this pull request May 3, 2026
…attern to recovery collection

`_get_session_recovery_collection` (fork-only, introduced for the
checkpoint collection split in row 23) used the older
`get_or_create_collection` pattern that MemPalace#1262/MemPalace#1289/MemPalace#1303 just
fixed in `_get_collection`. Same theoretical SIGSEGV class on
chromadb 1.5.x when stored metadata diverges from the call site,
plus the `embedding_function=` omission MemPalace#1303 caught.

Both branches now resolve EF via `ChromaBackend._resolve_embedding_function()`
and the create-path uses try `get_collection` / except
`NotFoundError` → `create_collection`. Recovery collection's
metadata is intentionally lighter than the main collection's
(no `_HNSW_BLOAT_GUARD`); preserved that asymmetry — recovery is
small enough that segment bloat isn't a concern.

Existing `test_session_recovery.py` covers happy paths; 83/83 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jphein added a commit to techempower-org/mempalace that referenced this pull request May 3, 2026
A small improvement to `_get_collection`: log the exception on failure
(instead of returning `None` silently) and retry once after clearing
the cached client and collection. Behaviour on the happy path is
unchanged.

## Why

The current shape (`try: ... except Exception: return None`) has two
costs in practice on a long-running deployment:

1. Invisible incidents. When a transient ChromaDB error poisons the
   cached `_client_cache` / `_collection_cache`, every subsequent
   call returns `None` but the operator never sees the underlying
   exception in the log. Debugging has to start from the symptoms —
   404s, "no palace" responses — without the actual stack.
2. No self-healing. Once the cache is stale, the only way back to a
   working state is a process restart. A single retry that clears
   the caches first resolves the most common shape (a transient
   error left bad state behind) without operator intervention.

## What changes

- One `logger.exception(...)` per failed attempt, including the
  palace path so multi-palace deployments can disambiguate.
- One retry after the first failure, with `_client_cache`,
  `_collection_cache`, `_metadata_cache`, `_metadata_cache_time`
  reset before the second attempt so the retry rebuilds from
  scratch rather than reusing a poisoned handle.
- Happy path (first attempt succeeds) is unchanged: same return,
  same caching, no extra log line.

## Why this is still useful after MemPalace#1289 / MemPalace#1303

MemPalace#1289 split `get_or_create_collection` into `get_collection` +
`create_collection` (the metadata-mismatch SIGSEGV path); MemPalace#1303
plumbed `embedding_function=` through both reopen branches. Those
fix specific *crash* shapes. This PR addresses the *aftermath* —
when an error of any class poisons the client/collection cache and
every subsequent call returns `None` invisibly. Retry-once
self-heals; logging makes the underlying exception visible. The
retry loop sits *above* the get-then-create logic, so retries
benefit from MemPalace#1289's correct call shape and MemPalace#1303's EF binding.

## Rebased on top of MemPalace#1289 + MemPalace#1303

This PR was filed before MemPalace#1289/MemPalace#1303 merged. The original two
commits (2038269 + 757eb7a) wrapped the older `get_or_create`
body. Rebased onto the new `try get_collection / except
NotFoundError → create_collection` body — the retry/log envelope
is unchanged, only the inner body picks up the merged fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xcarbo added a commit to xcarbo/mempalace that referenced this pull request May 4, 2026
Catches up on a month of upstream work. Highlights pulled in:

- MemPalace#1306 searcher: hybrid candidate union (vector ∪ BM25 reranking pool)
- MemPalace#1325 mcp security: omit absolute paths from tool_get_drawer / tool_status
- MemPalace#1322 chroma: wire quarantine_stale_hnsw to prevent SIGSEGV on stale HNSW
- MemPalace#1320 mcp: forward valid_to / source params in kg_add / kg_invalidate
- MemPalace#1321 cli: honor --palace flag in cmd_init
- MemPalace#1314 kg temporal params fix
- MemPalace#1244 cli: cmd_compress writes to mempalace_closets so palace can read
- MemPalace#1243 mcp: case-insensitive agent name in diary_write/diary_read
- MemPalace#1303 mcp_server: pass embedding_function= on collection reopen
- MemPalace#1076/MemPalace#1077 hooks: quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json
- Various ruff format passes on touched files

Conflict resolution (CHANGELOG.md only — code files all auto-merged):

- 3.3.5 unreleased section header from upstream kept above 3.3.4
- 3.3.4 section: kept our 2026-04-30 release date; merged upstream's new
  MemPalace#1299 SIGSEGV-on-default-EF entry in alongside our existing topic-tunnels
  (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and
  auto-ingest (MemPalace#1230/MemPalace#1231) entries. Kept our richer topic-tunnels detail
  (upstream's version was a strict subset).

xdev patches preserved (still on this branch, untouched by merge):
- 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings
- 3fad61d fix(config): allow leading dash in wing names

Not pushed to origin — run tests locally and decide when to push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rglaubitz pushed a commit to rglaubitz/mempalace that referenced this pull request May 26, 2026
- Resolve the EF inside the two reopen branches that actually call
  `client.get_collection` / `client.create_collection`, so warm-cache
  reads stay zero-cost (no `MempalaceConfig()` / `_resolve_providers`
  on every tool call).
- Reuse `ChromaBackend._resolve_embedding_function()` instead of
  duplicating its try/except + log message + None-fallback.
- Reword the inline + CHANGELOG explanation to clarify that ChromaDB 1.x
  persists the EF *identity* (its `name()`) but not the *instance/
  configuration* — `mempalace.embedding` documents this and spoofs
  `name()` to `"default"` precisely so the identity check passes; the
  bug was the *provider list* (lazy ONNX selection) silently differing.
rglaubitz added a commit to rglaubitz/mempalace that referenced this pull request May 26, 2026
* fix(mcp_server): pass embedding_function= on collection reopen (MemPalace#1299)

`mcp_server._get_collection` bypassed `ChromaBackend.get_collection`
and called `client.get_collection` / `client.create_collection` without
`embedding_function=`. ChromaDB 1.x does not persist the EF identity
with the collection, so the MCP server's reopen silently bound
chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop
hook ingest path bound `mempalace.embedding.get_embedding_function()`.

On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple
Silicon, per MemPalace#1299) the default EF's lazy ONNX provider selection could
SIGSEGV the host process on first `col.add()`, killing the MCP stdio
server and leaving every subsequent tool call returning
`Connection closed` until Claude Code was relaunched. Reads worked
because `col.get(ids=...)` and metadata fetches don't invoke the EF;
the auto-ingest path worked because mining routes through the backend
abstraction. Diary writes were the consistent failure surface.

Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`)
and pass it into both reopen branches. Falls back to the chromadb default
only if `mempalace.embedding.get_embedding_function` itself raises.

Regression test patches the chromadb client class to capture
`embedding_function=` on every `get_collection` / `create_collection`
call from `_get_collection(create=True)` and `_get_collection()`, and
fails if any call omits it.

Follow-up to MemPalace#1262 / MemPalace#1289 (which fixed the metadata-mismatch SIGSEGV
path); this addresses the EF-mismatch SIGSEGV path on the same surface.

* fix(mcp_server): address copilot review on MemPalace#1303

- Resolve the EF inside the two reopen branches that actually call
  `client.get_collection` / `client.create_collection`, so warm-cache
  reads stay zero-cost (no `MempalaceConfig()` / `_resolve_providers`
  on every tool call).
- Reuse `ChromaBackend._resolve_embedding_function()` instead of
  duplicating its try/except + log message + None-fallback.
- Reword the inline + CHANGELOG explanation to clarify that ChromaDB 1.x
  persists the EF *identity* (its `name()`) but not the *instance/
  configuration* — `mempalace.embedding` documents this and spoofs
  `name()` to `"default"` precisely so the identity check passes; the
  bug was the *provider list* (lazy ONNX selection) silently differing.

* fix(hooks): quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json (MemPalace#1076) (MemPalace#1077)

Shell splits hook command on whitespace after variable expansion, breaking
paths with spaces (e.g. C:\Users\Richard M on Windows). Wrapping the path
in double quotes preserves the token boundary.

Fixes the reported Stop/PreCompact pair in .claude-plugin/hooks/hooks.json
and applies the same fix to .codex-plugin/hooks.json (SessionStart/Stop/
PreCompact), which carries the identical bug.

* fix(cli): write compress output to mempalace_closets so palace can read them (MemPalace#1244)

`cmd_compress` was writing AAAK-compressed drawers to a `mempalace_compressed`
collection, but every read path (`palace.get_closets_collection`,
`searcher.py`, `repair.py`) reads from `mempalace_closets`. Result: for
non-mined palaces (or any palace where the user ran `mempalace compress`
expecting to backfill the closet/index layer), the compressed output was
silently invisible — written to a collection nothing else opens.

Fix the writer rather than renaming the readers: "closets" is the
user-visible feature name baked into the public API
(`get_closets_collection`), the searcher hybrid path, repair/HNSW
diagnostics, and docs. Renaming the readers would churn 15+ call sites
and the README for no benefit. The compressed AAAK strings are exactly
what closets are conceptually — compact pointers scanned by an LLM to
locate the right drawer — so they belong in `mempalace_closets`.

Tests:
- Update `test_cmd_compress_stores_results` to assert the collection
  name passed to `get_or_create_collection` is `mempalace_closets`.
- Add `test_cmd_compress_output_readable_via_get_closets_collection`:
  end-to-end with a real ChromaBackend, seed a drawer, run cmd_compress,
  then read back via the same `get_closets_collection` helper that
  palace.py / searcher use. Regression test for the wrong-collection
  bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: ruff format cli.py (MemPalace#1244)

CI requires ruff format --check on the whole touched file. Pre-existing drift, no logic change.

* style: ruff format tests/test_cli.py (PR MemPalace#1319)

* fix(mcp): case-insensitive agent name in diary_write/diary_read (MemPalace#1243)

`tool_diary_write` stored the `agent` metadata verbatim after `sanitize_name`
(which preserves case), while `tool_diary_read` filtered by exact match —
so writing as "Claude" and reading as "claude" silently returned zero rows.

Both endpoints now lowercase `agent_name` immediately after sanitization.
The default per-agent wing slug is also stable across casings since it's
derived from the same normalized form.

Behavior change: entries written prior to this fix under mixed-case agent
names will not match the new lowercase filter; documented under v3.3.5
in CHANGELOG with a `mempalace repair` pointer.

Adds a regression test (`test_diary_read_case_insensitive_agent`) and
updates the existing `test_diary_write_and_read` to assert the new
lowercase agent identity.

Closes MemPalace#1243

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: ruff format tests/test_mcp_server.py (PR MemPalace#1323)

* fix(backends/chroma): wire quarantine_stale_hnsw into _client() to prevent SIGSEGV on stale HNSW (MemPalace#1121, MemPalace#1132, MemPalace#1263)

PR MemPalace#1173 wired quarantine_stale_hnsw into the static make_client() helper
but not into the instance _client() method. As a result every non-MCP
entry point (CLI mining, search, repair, status) — which all use
get_collection / _get_or_create_collection / _client() — skipped the
cold-start quarantine pass and could SIGSEGV on a stale HNSW segment
left over from a partial flush, replicated palace, or crashed-mid-write.

Refactor: extract the (_fix_blob_seq_ids + gated quarantine_stale_hnsw)
pre-open pass into a single private static helper
ChromaBackend._prepare_palace_for_open(). Both make_client() and
_client() now route through it, so the _quarantined_paths once-per-
palace-per-process gate is preserved (no runtime thrash on hot paths)
and behaviour stays identical — the fix is purely about extending the
existing protection to the path that was missing it.

Tests:

- test_client_quarantines_corrupt_segment_on_first_open mirrors the
  existing make_client test and verifies _client() actually renames a
  corrupt segment on first open.
- test_client_quarantines_only_on_first_call_per_palace verifies the
  cache gate prevents re-running quarantine across repeated _client()
  calls — important because _client() is hit on every backend op.

Closes MemPalace#1121. Closes MemPalace#1132. Closes MemPalace#1263.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: ruff format touched files (PR MemPalace#1322)

CI requires whole-file format on touched files; pre-existing drift only.

* style: ruff format tests/test_backends.py (PR MemPalace#1322)

* fix(mcp): omit palace_path from tool_status responses (+ docs)

The MCP `mempalace_status` tool was returning the server's absolute
`_config.palace_path` to any connected client on both the main
(ChromaDB-backed) path and the sqlite fallback path that runs when
HNSW divergence is detected (MemPalace#1222). On a single-user local deployment
this is self-disclosure, but in nested-agent or multi-server MCP
topologies the client is a separate trust domain and the absolute
path has no documented client-side use.

Clients that legitimately need the palace path continue to have three
documented channels: the `MEMPALACE_PALACE_PATH` env var (primary) or
its legacy `MEMPAL_PALACE_PATH` alias, the `~/.mempalace/config.json`
file, and the `--palace` CLI flag on most subcommands.

Also corrects stale docs that claimed `mempalace_reconnect` returned a
`palace_path` field; the code returns `{success, message, drawers,
vector_disabled[, vector_disabled_reason]}` on success, plus a no-palace
shape and an exception shape.

- mempalace/mcp_server.py: drop palace_path from tool_status() and
  _tool_status_via_sqlite() result dicts
- website/reference/mcp-tools.md: update documented return shapes for
  mempalace_status (fix) and mempalace_reconnect (stale-docs correction)

Authored-by: Aaron Salsitz (ICCI LLC, @icciaaron). Claude Code was used
as an authoring and review-orchestration tool, with human-in-the-loop
oversight at every step: Aaron wrote the prompts, reviewed each draft,
called for three independent review passes (drafting / post-rebase
technical / CISA-aligned disclosure-leak), and verified the final patch
behavior before commit.

* fix(mcp): basename source_file in tool_get_drawer responses

The MCP `mempalace_get_drawer` tool returned the entire raw drawer
metadata blob to any connected client, and the `source_file` field
in that blob is the absolute filesystem path written by the miners
(`miner.py`, `convo_miner.py` — `source_file = str(filepath)`). On
a single-user local deployment this is self-disclosure, but in
nested-agent or multi-server MCP topologies the client is a separate
trust domain and the host's directory layout has no documented
client-side use.

Mirror the mitigation that `searcher.search_memories()` already applies
on its own return path: reduce `source_file` to its basename via
`Path(source_file).name` before handing the metadata to the client.
Citations still work — the directory layout does not leak.

Companion to #1 (omit palace_path from tool_status). Same threat class,
different surface:

- mempalace_status — palace dir path     → fixed in #1
- mempalace_get_drawer — per-drawer source_file path → this PR

Other read tools were audited and do not leak host paths:
- mempalace_search    — already basenames source_file
- mempalace_list_drawers — returns wing/room/preview only
- mempalace_diary_read   — date/timestamp/topic/content only
- mempalace_reconnect    — success/message/drawers only
- mempalace_kg_*         — entity/predicate strings, counts
- mempalace_check_duplicate — wing/room/preview only

Changes:
- mempalace/mcp_server.py: tool_get_drawer() now basenames metadata.source_file
- tests/test_mcp_server.py: regression test asserting the absolute path
  and its parent directory do not appear anywhere in the response
- website/reference/mcp-tools.md: clarify the documented return shape

* feat(searcher): add candidate_strategy="union" for vector∪BM25 reranking pool

Default search behavior is unchanged. Opt-in candidate_strategy="union"
also pulls top-K BM25-only candidates from sqlite FTS5 and merges them
into the rerank pool, catching docs with strong BM25 signal that the
vector index didn't surface in the over-fetch window.

Motivation
----------
The current hybrid path gathers candidates from the vector index only
(n_results * 3 over-fetch), then BM25-reranks within them. When the
query embeds close to the wrong content semantically, the right doc
never enters the rerank pool — *no matter how wide the over-fetch*.
Tested on a ~6K-document mixed corpus (knowledge prose + short structured
records): at *30x* over-fetch (~5% of the corpus) the target doc still
didn't surface for narrative-shaped queries targeting terminology guides.
Wider over-fetch isn't the answer; widening the pool's *source* is.

Concrete failure mode: a narrative-shaped query embeds close to records
sharing the same operational vocabulary (other narrative entries in the
corpus). A terminology / style guide is BM25-strong for the query
(rare keywords the guide repeats) but vector-distant. Vector-only
candidates don't include it; BM25 never gets to rerank it. The hybrid
path produces 0.00 recall on a probe that pure BM25 alone scores 1.00 —
the hybrid is worse than its component on the same input.

Behavior change
---------------
* New parameter ``candidate_strategy: str = "vector"`` on ``search_memories``.
  - ``"vector"`` (default): historical behavior, no change.
  - ``"union"``: also fetch top ``n_results * 3`` candidates via the
    existing ``_bm25_only_via_sqlite`` helper, dedupe by source_file,
    merge into the rerank pool. BM25-only candidates carry
    ``distance=None`` so they're scored on BM25 contribution alone
    (vec_sim coerces to 0).
* ``_hybrid_rank`` now handles ``distance=None`` explicitly, scoring
  such candidates as vector-unknown (vec_sim=0) rather than treating
  it as max-distance via shim.
* New strategies register via ``_CANDIDATE_MERGERS``; dispatch is in
  ``_apply_candidate_strategy`` so ``search_memories`` stays under the
  C901 complexity ceiling.

Bench numbers (~6K-doc internal mixed corpus, recall@10, 5 probes spanning
policy-exception lookup, temporal-decay, style retrieval, set-difference,
and pattern-recognition):

                              baseline ("vector")   "union"
  policy-exception probe        0.00                  0.50    +0.50
  temporal-decay probe          0.17                  0.50    +0.33
  style-retrieval probe         0.00                  1.00    +1.00 (PASSES)
  set-difference probe          0.00–0.06             0.06–0.09  ~
  pattern-recog probe           0.64 (stable)         0.50–0.71  variance, typ. +0.07
  macro recall                  0.16–0.17             0.51–0.56  +0.34 to +0.40

The pattern-recog variance points at a related issue worth a separate PR:
``_hybrid_rank`` computes BM25 IDF over the candidate set. Adding new
candidates re-normalizes BM25 for *existing* candidates non-monotonically.
Stable corpus-wide BM25 would remove this. Out of scope here.

Tests
-----
``tests/test_hybrid_candidate_union.py`` (6 tests, all pass):
- default behavior unchanged (explicit ``"vector"`` matches default)
- ``"union"`` surfaces a BM25-strong vector-distant doc
- ``"union"`` doesn't drop docs ``"vector"`` would have found
- empty-palace handling
- invalid ``candidate_strategy`` raises
- ``_hybrid_rank`` tolerates ``distance=None``

Existing ``test_hybrid_search.py`` (5) and ``test_searcher.py`` (27) pass.

Performance note
----------------
Each ``"union"`` query adds one sqlite open + FTS5 MATCH + metadata
fetch (via the existing ``_bm25_only_via_sqlite`` helper, which already
runs as the ``vector_disabled`` fallback path so the code is
well-trodden). Per-query overhead is small but unmeasured at corpus
scale. Default stays ``"vector"`` until a maintainer characterizes the
cost.

* fix(searcher): address Copilot review on MemPalace#1306

- Dedup union candidates by (full_path, chunk_index), not basename —
  two files sharing a basename in different dirs no longer collide,
  and a vector hit on chunk N of a file no longer blocks BM25 from
  contributing chunk M of the same file.
- Validate candidate_strategy at the top of search_memories so invalid
  values fail consistently, not only when the call routes through the
  vector path.
- Trim hits back to n_results after the union+rerank pool grows;
  preserves the existing search_memories size contract that the MCP
  limit parameter is built on.
- Skip BM25-only injection when max_distance > 0.0; BM25-only
  candidates carry distance=None and would silently bypass the
  caller's strict vector-distance threshold.

Adds 4 tests covering: validation under vector_disabled, n_results
trim, max_distance honoring, and basename-collision dedup.

* fix(mcp): forward valid_to and source params in kg_add/kg_invalidate (MemPalace#1314)

`tool_kg_add` previously accepted only `valid_from` and `source_closet`,
silently dropping `valid_to`, `source_file`, and `source_drawer_id` at
the MCP boundary. Backfilling already-ended historical facts therefore
collapsed to "still current," and adapter provenance never reached
the SQLite layer even though `KnowledgeGraph.add_triple` already
supported every column.

`tool_kg_invalidate` returned the literal string `"today"` whenever the
caller omitted `ended`, hiding the actual stamped date from anyone trying
to verify what got persisted.

Changes:
- Extend `tool_kg_add` signature + MCP input_schema with `valid_to`,
  `source_file`, `source_drawer_id`; forward all of them to
  `_kg.add_triple` and to the WAL log.
- Resolve `ended` to `date.today().isoformat()` in `tool_kg_invalidate`
  before logging / returning, so the response always reports the actual
  date stored in `valid_to`.
- Add regression tests for valid_to round-trip, source_file /
  source_drawer_id provenance, and the resolved-ended-date contract.
- Leave TODO(MemPalace#1283) markers so the open ISO-8601 validation PR can drop
  `validate_iso_date` over `valid_from` / `valid_to` / `ended` cleanly.

The underlying `KnowledgeGraph.add_triple` already accepted these
kwargs (RFC 002 §5.5) — only the MCP edge needed wiring up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: ruff format test_mcp_server.py (PR MemPalace#1320)

* docs(mcp): drop §5.5 from kg_add docstring/schema

The repo's anti-jargon meta-test bans §N markers outside the
sources/backends allowlist. mcp_server.py isn't allowlisted, so the
"RFC 002 §5.5" references added in this PR turned the test red.
Trim to "RFC 002" — section number isn't load-bearing for the description.

* fix(cli): honor --palace flag in cmd_init (MemPalace#1313)

cmd_init was instantiating MempalaceConfig() unconditionally, ignoring
args.palace and always writing the palace under ~/.mempalace. Mirror
the env-var pattern used by mcp_server.py (and consistent with how
cmd_mine / cmd_status / cmd_search resolve --palace) so every
downstream read of cfg.palace_path inside cmd_init — Pass 0,
cfg.init(), and the post-init mine — routes to the user-specified
location.

Adds tests/test_cli.py::test_cmd_init_honors_palace_flag covering the
regression: asserts Pass 0 receives the --palace value (not
~/.mempalace) and that MEMPALACE_PALACE_PATH is set in os.environ.

Closes MemPalace#1313.

* test(cli): prime monkeypatch undo so palace env doesn't leak

monkeypatch.delenv(name, raising=False) on a missing key registers no
undo entry, so the env var cmd_init writes leaked into test_config_from_file
on Python 3.13 / Windows / macOS.

Prime the slot with setenv before delenv so teardown rolls back the write.

* feat(mcp): persist drawer metadata

* fix: address metadata passthrough review blockers

* Close metadata passthrough review findings

F03 now logs corrupted drawer metadata decode failures through the mempalace_mcp logger, truncating the stored value to value_head so operators get an identifier without leaking the full payload.

F15 is Case B: ChromaBackend._resolve_embedding_function() can return None because it catches any exception from get_embedding_function() and returns None after logging; ChromaBackend.get_collection() mirrors the same conditional kwargs behavior and does not repair that case.

For the MCP _get_collection call site, treat a None EF as a fail-closed condition and never reopen a collection without embedding_function=, so the MemPalace#1299 default-EF regression path cannot silently reappear there.

Validation: uv run pytest -q; uv run ruff check .

---------

Co-authored-by: Igor Lins e Silva <4753812+igorls@users.noreply.github.com>
Co-authored-by: Mikhail Valentsev <michael@valentsev.ru>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: icciAaron <asalsitz@icci.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MCP server SIGSEGV on tool_diary_write in 3.3.4 — Connection closed; server unrecoverable in-session

2 participants