feat(searcher): candidate_strategy="union" — BM25 candidates joined with vector pool before hybrid rerank by igorls · Pull Request #1306 · MemPalace/mempalace

igorls · 2026-05-02T03:50:44Z

Summary

Adds an opt-in candidate_strategy="union" to search_memories that pulls top-K BM25-only candidates from sqlite FTS5 and merges them into the hybrid rerank pool. Default behavior ("vector") is unchanged.

The current hybrid path gathers candidates from the vector index only (n_results * 3 over-fetch), then BM25-reranks within them. When the query embeds close to the wrong content semantically, the right doc never enters the rerank pool — no matter how wide the over-fetch. Validated against a ~6K-document mixed corpus (knowledge prose + short structured records): at 30x over-fetch (~5% of the corpus), the target doc still didn't surface for narrative-shaped queries targeting terminology guides. Wider over-fetch isn't the answer; widening the pool's source is.

The concrete failure mode: a narrative-shaped query embeds close to records sharing the same operational vocabulary. A terminology / style guide is BM25-strong for the query but vector-distant. Vector-only candidates don't include it; BM25 never gets to rerank it. The hybrid path produces 0.00 recall on a probe that pure BM25 alone scores 1.00 — the hybrid is worse than its component on the same input.

What changed

New parameter candidate_strategy: str = "vector" on search_memories.
- "vector" (default): historical behavior, no change.
- "union": also fetch top n_results * 3 candidates via the existing _bm25_only_via_sqlite helper, dedupe by source_file, merge into the rerank pool. BM25-only candidates carry distance=None so they're scored on BM25 contribution alone.
_hybrid_rank now handles distance=None explicitly (vec_sim=0) instead of relying on a sentinel value.
New strategies register via _CANDIDATE_MERGERS; dispatch is in _apply_candidate_strategy so search_memories stays under the project's C901 complexity ceiling.

Bench numbers

~6K-doc internal mixed corpus, recall@10, 5 probes spanning policy-exception lookup, temporal-decay, style retrieval, set-difference, and pattern-recognition:

probe	baseline (`"vector"`)	`"union"`	delta
policy-exception	0.00	0.50	+0.50
temporal-decay	0.17	0.50	+0.33
style-retrieval	0.00	1.00	+1.00 (PASSES)
set-difference	0.00–0.06	0.06–0.09	~
pattern-recog	0.64 (stable)	0.50–0.71 (typ. 0.71)	typ. +0.07, occ. −0.14
macro recall	0.16–0.17	0.51–0.56	+0.34 to +0.40

The pattern-recog variance points at a related issue worth its own PR: _hybrid_rank computes BM25 IDF over the candidate set, so adding new candidates re-normalizes BM25 for existing candidates non-monotonically. Stable corpus-wide BM25 would remove this. Out of scope here.

Tests

tests/test_hybrid_candidate_union.py — 6 new tests, all passing:

default behavior unchanged (explicit "vector" matches default)
"union" surfaces a BM25-strong vector-distant doc
"union" doesn't drop docs "vector" would have found
empty-palace handling
invalid candidate_strategy raises ValueError
_hybrid_rank tolerates distance=None

Existing test_hybrid_search.py (5) and test_searcher.py (27) pass — no regressions.

Performance note

Each "union" query adds one sqlite open + FTS5 MATCH + metadata fetch (via the existing _bm25_only_via_sqlite helper, which already runs as the vector_disabled fallback path so the code is well-trodden). Per-query overhead is small but unmeasured at corpus scale. Default stays "vector" until a maintainer characterizes the cost.

Caveats / followups not in this PR

Performance impact on large palaces — needs measurement before flipping the default.
The local-BM25 IDF instability called out in the bench numbers above (_hybrid_rank re-normalizes existing candidates' BM25 when new ones are added) is its own architectural concern. Worth fixing in a separate PR.
MCP surface — exposing candidate_strategy to MCP callers is straightforward (one extra arg) but I haven't done it here. Easy to add once the Python API lands.

Test plan

pytest tests/test_hybrid_candidate_union.py — 6/6
pytest tests/test_hybrid_search.py — 5/5 no regression
pytest tests/test_searcher.py — 27/27 no regression
ruff check clean
ruff format clean

…ing pool Default search behavior is unchanged. Opt-in candidate_strategy="union" also pulls top-K BM25-only candidates from sqlite FTS5 and merges them into the rerank pool, catching docs with strong BM25 signal that the vector index didn't surface in the over-fetch window. Motivation ---------- The current hybrid path gathers candidates from the vector index only (n_results * 3 over-fetch), then BM25-reranks within them. When the query embeds close to the wrong content semantically, the right doc never enters the rerank pool — *no matter how wide the over-fetch*. Tested on a ~6K-document mixed corpus (knowledge prose + short structured records): at *30x* over-fetch (~5% of the corpus) the target doc still didn't surface for narrative-shaped queries targeting terminology guides. Wider over-fetch isn't the answer; widening the pool's *source* is. Concrete failure mode: a narrative-shaped query embeds close to records sharing the same operational vocabulary (other narrative entries in the corpus). A terminology / style guide is BM25-strong for the query (rare keywords the guide repeats) but vector-distant. Vector-only candidates don't include it; BM25 never gets to rerank it. The hybrid path produces 0.00 recall on a probe that pure BM25 alone scores 1.00 — the hybrid is worse than its component on the same input. Behavior change --------------- * New parameter ``candidate_strategy: str = "vector"`` on ``search_memories``. - ``"vector"`` (default): historical behavior, no change. - ``"union"``: also fetch top ``n_results * 3`` candidates via the existing ``_bm25_only_via_sqlite`` helper, dedupe by source_file, merge into the rerank pool. BM25-only candidates carry ``distance=None`` so they're scored on BM25 contribution alone (vec_sim coerces to 0). * ``_hybrid_rank`` now handles ``distance=None`` explicitly, scoring such candidates as vector-unknown (vec_sim=0) rather than treating it as max-distance via shim. * New strategies register via ``_CANDIDATE_MERGERS``; dispatch is in ``_apply_candidate_strategy`` so ``search_memories`` stays under the C901 complexity ceiling. Bench numbers (~6K-doc internal mixed corpus, recall@10, 5 probes spanning policy-exception lookup, temporal-decay, style retrieval, set-difference, and pattern-recognition): baseline ("vector") "union" policy-exception probe 0.00 0.50 +0.50 temporal-decay probe 0.17 0.50 +0.33 style-retrieval probe 0.00 1.00 +1.00 (PASSES) set-difference probe 0.00–0.06 0.06–0.09 ~ pattern-recog probe 0.64 (stable) 0.50–0.71 variance, typ. +0.07 macro recall 0.16–0.17 0.51–0.56 +0.34 to +0.40 The pattern-recog variance points at a related issue worth a separate PR: ``_hybrid_rank`` computes BM25 IDF over the candidate set. Adding new candidates re-normalizes BM25 for *existing* candidates non-monotonically. Stable corpus-wide BM25 would remove this. Out of scope here. Tests ----- ``tests/test_hybrid_candidate_union.py`` (6 tests, all pass): - default behavior unchanged (explicit ``"vector"`` matches default) - ``"union"`` surfaces a BM25-strong vector-distant doc - ``"union"`` doesn't drop docs ``"vector"`` would have found - empty-palace handling - invalid ``candidate_strategy`` raises - ``_hybrid_rank`` tolerates ``distance=None`` Existing ``test_hybrid_search.py`` (5) and ``test_searcher.py`` (27) pass. Performance note ---------------- Each ``"union"`` query adds one sqlite open + FTS5 MATCH + metadata fetch (via the existing ``_bm25_only_via_sqlite`` helper, which already runs as the ``vector_disabled`` fallback path so the code is well-trodden). Per-query overhead is small but unmeasured at corpus scale. Default stays ``"vector"`` until a maintainer characterizes the cost.

Copilot

Pull request overview

This PR adds an opt-in candidate_strategy="union" mode to search_memories, aiming to improve hybrid retrieval by merging BM25/FTS5 candidates into the rerank pool instead of relying on vector candidates alone. It extends the searcher’s hybrid-ranking path and adds targeted tests for the new strategy and distance=None handling.

Changes:

Add candidate_strategy to search_memories with "vector" and "union" dispatch paths.
Allow _hybrid_rank to handle BM25-only candidates by treating distance=None as no vector signal.
Add a new test module covering default behavior, union retrieval, invalid strategies, empty palaces, and missing-distance reranking.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File	Description
`mempalace/searcher.py`	Adds union-candidate merging, strategy dispatch, API docs, and `distance=None` handling in hybrid reranking.
`tests/test_hybrid_candidate_union.py`	Adds regression/integration tests for the new union candidate strategy and BM25-only hybrid reranking behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    seen_sources = {h.get("source_file") for h in hits}
+    for bh in bm25_extra:
+        key = bh.get("source_file")
+        if not key or key == "?" or key in seen_sources:
+            continue


+    seen_sources = {h.get("source_file") for h in hits}
+    for bh in bm25_extra:
+        key = bh.get("source_file")
+        if not key or key == "?" or key in seen_sources:
+            continue


    n_results: int = 5,
    max_distance: float = 0.0,
    vector_disabled: bool = False,
+    candidate_strategy: str = "vector",


+        # The invariant is: union should not lose anything vector found.
+        missing = vec_ids - union_ids
+        assert not missing, f"union dropped docs that vector found: {missing}"
+


+        get_collection(palace, create=True)  # create empty collection
+        result = search_memories("anything", palace, n_results=5, candidate_strategy="union")
+        assert result.get("results", []) == []
+


+    # Candidate strategy hook: optionally widen the rerank pool's *source*
+    # before ranking. Default ("vector") is a no-op; "union" merges top-K
+    # BM25 candidates from sqlite. See `_apply_candidate_strategy`.
+    _apply_candidate_strategy(candidate_strategy, hits, query, palace_path, wing, room, n_results)


+    # before ranking. Default ("vector") is a no-op; "union" merges top-K
+    # BM25 candidates from sqlite. See `_apply_candidate_strategy`.
+    _apply_candidate_strategy(candidate_strategy, hits, query, palace_path, wing, room, n_results)
+


- Dedup union candidates by (full_path, chunk_index), not basename — two files sharing a basename in different dirs no longer collide, and a vector hit on chunk N of a file no longer blocks BM25 from contributing chunk M of the same file. - Validate candidate_strategy at the top of search_memories so invalid values fail consistently, not only when the call routes through the vector path. - Trim hits back to n_results after the union+rerank pool grows; preserves the existing search_memories size contract that the MCP limit parameter is built on. - Skip BM25-only injection when max_distance > 0.0; BM25-only candidates carry distance=None and would silently bypass the caller's strict vector-distance threshold. Adds 4 tests covering: validation under vector_disabled, n_results trim, max_distance honoring, and basename-collision dedup.

MemPalace#1306 (hybrid candidate union, merged 2026-05-02) added a second BM25 scoring site inside _bm25_only_via_sqlite that this PR did not cover -- the original wiring landed before MemPalace#1306 existed. Without the plumbing, two paths silently bypass locale stop_words even when the palace has MEMPALACE_LANG set: - vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without filtering. - candidate_strategy="union": BM25 candidates merged into the rerank pool come from a tokenizer that ignored the configured locale, so the merged-in entries fight the lang-aware _hybrid_rank rerank. Resolution moves once: stop_words = _resolve_stop_words(lang) is hoisted before the vector_disabled branch and threaded through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index is built with a trigram tokenizer that already mismatches our \\w{2,} regex, so dropping further tokens there changes selection semantics in ways outside this PR's scope. Three tests pin the propagation: a sqlite-backed test for _bm25_only_via_sqlite -> _bm25_scores, a unit spy on _apply_candidate_strategy -> merger, and an end-to-end on search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.

Catches up on a month of upstream work. Highlights pulled in: - MemPalace#1306 searcher: hybrid candidate union (vector ∪ BM25 reranking pool) - MemPalace#1325 mcp security: omit absolute paths from tool_get_drawer / tool_status - MemPalace#1322 chroma: wire quarantine_stale_hnsw to prevent SIGSEGV on stale HNSW - MemPalace#1320 mcp: forward valid_to / source params in kg_add / kg_invalidate - MemPalace#1321 cli: honor --palace flag in cmd_init - MemPalace#1314 kg temporal params fix - MemPalace#1244 cli: cmd_compress writes to mempalace_closets so palace can read - MemPalace#1243 mcp: case-insensitive agent name in diary_write/diary_read - MemPalace#1303 mcp_server: pass embedding_function= on collection reopen - MemPalace#1076/MemPalace#1077 hooks: quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json - Various ruff format passes on touched files Conflict resolution (CHANGELOG.md only — code files all auto-merged): - 3.3.5 unreleased section header from upstream kept above 3.3.4 - 3.3.4 section: kept our 2026-04-30 release date; merged upstream's new MemPalace#1299 SIGSEGV-on-default-EF entry in alongside our existing topic-tunnels (MemPalace#1194/MemPalace#1195/MemPalace#1197), HNSW-bloat (MemPalace#1191), max_seq_id (MemPalace#1135), and auto-ingest (MemPalace#1230/MemPalace#1231) entries. Kept our richer topic-tunnels detail (upstream's version was a strict subset). xdev patches preserved (still on this branch, untouched by merge): - 6ef44cb fix(hooks): route CC transcripts via convo_miner with cwd-based wings - 3fad61d fix(config): allow leading dash in wing names Not pushed to origin — run tests locally and decide when to push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MemPalace#1306 (hybrid candidate union, merged 2026-05-02) added a second BM25 scoring site inside _bm25_only_via_sqlite that this PR did not cover -- the original wiring landed before MemPalace#1306 existed. Without the plumbing, two paths silently bypass locale stop_words even when the palace has MEMPALACE_LANG set: - vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without filtering. - candidate_strategy="union": BM25 candidates merged into the rerank pool come from a tokenizer that ignored the configured locale, so the merged-in entries fight the lang-aware _hybrid_rank rerank. Resolution moves once: stop_words = _resolve_stop_words(lang) is hoisted before the vector_disabled branch and threaded through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index is built with a trigram tokenizer that already mismatches our \\w{2,} regex, so dropping further tokens there changes selection semantics in ways outside this PR's scope. Three tests pin the propagation: a sqlite-backed test for _bm25_only_via_sqlite -> _bm25_scores, a unit spy on _apply_candidate_strategy -> merger, and an end-to-end on search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.

scoring site inside _bm25_only_via_sqlite that this PR did not cover -- the original wiring landed before MemPalace#1306 existed. Without the plumbing, two paths silently bypass locale stop_words even when the palace has MEMPALACE_LANG set: - vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without filtering. - candidate_strategy="union": BM25 candidates merged into the rerank pool come from a tokenizer that ignored the configured locale, so the merged-in entries fight the lang-aware _hybrid_rank rerank. Resolution moves once: stop_words = _resolve_stop_words(lang) is hoisted before the vector_disabled branch and threaded through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index is built with a trigram tokenizer that already mismatches our \\w{2,} regex, so dropping further tokens there changes selection semantics in ways outside this PR's scope. Three tests pin the propagation: a sqlite-backed test for _bm25_only_via_sqlite -> _bm25_scores, a unit spy on _apply_candidate_strategy -> merger, and an end-to-end on search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.

…(Phases 3+4) Adds the 'hybrid' candidate_strategy that fuses three retrieval modes in a single search_memories call: 1. vector — existing HNSW candidates (already happens) 2. BM25 — postgres tsvector candidates via _bm25_only_via_postgres 3. graph — AGE knowledge-graph expansion through two channels: a. vector-seeded: top-5 vector drawers → entities they reference → other drawers about the same entities b. NER: cheap capitalized-multi-word regex on query → entity match in AGE → connected drawers All three sources feed the existing hybrid reranker (MemPalace#1306 architecture). BM25-only and graph-only candidates carry distance=None so the reranker scores them on their other signals; graph candidates get a small closet_boost (0.05) to reflect their structural-relevance signal. New helpers in searcher.py: - _bm25_only_via_postgres (Phase 2 — already shipped) - _graph_expand_from_seeds (Phase 3) — vector-seeded entity expansion - _graph_expand_from_entities (Phase 3) — NER-driven entity expansion - _ner_from_query (Phase 3) — capitalized-multi-word + known-entity regex - _merge_hybrid_candidates (Phase 4) — orchestrates all three mempalace_search MCP tool gains candidate_strategy and include_trace parameters. Schema updated so unknown kwargs aren't stripped by the whitelist filter. include_trace surfaces per-source matched_via counts for debugging. Verified end-to-end against live postgres palace: hybrid /search ran in 809ms returning 15 candidates reranked to 5; all top hits relevant to the test query. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

scoring site inside _bm25_only_via_sqlite that this PR did not cover -- the original wiring landed before MemPalace#1306 existed. Without the plumbing, two paths silently bypass locale stop_words even when the palace has MEMPALACE_LANG set: - vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without filtering. - candidate_strategy="union": BM25 candidates merged into the rerank pool come from a tokenizer that ignored the configured locale, so the merged-in entries fight the lang-aware _hybrid_rank rerank. Resolution moves once: stop_words = _resolve_stop_words(lang) is hoisted before the vector_disabled branch and threaded through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index is built with a trigram tokenizer that already mismatches our \\w{2,} regex, so dropping further tokens there changes selection semantics in ways outside this PR's scope. Three tests pin the propagation: a sqlite-backed test for _bm25_only_via_sqlite -> _bm25_scores, a unit spy on _apply_candidate_strategy -> merger, and an end-to-end on search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.

- Dedup union candidates by (full_path, chunk_index), not basename — two files sharing a basename in different dirs no longer collide, and a vector hit on chunk N of a file no longer blocks BM25 from contributing chunk M of the same file. - Validate candidate_strategy at the top of search_memories so invalid values fail consistently, not only when the call routes through the vector path. - Trim hits back to n_results after the union+rerank pool grows; preserves the existing search_memories size contract that the MCP limit parameter is built on. - Skip BM25-only injection when max_distance > 0.0; BM25-only candidates carry distance=None and would silently bypass the caller's strict vector-distance threshold. Adds 4 tests covering: validation under vector_disabled, n_results trim, max_distance honoring, and basename-collision dedup.

@icciaaron

* fix(mcp_server): pass embedding_function= on collection reopen (MemPalace#1299) `mcp_server._get_collection` bypassed `ChromaBackend.get_collection` and called `client.get_collection` / `client.create_collection` without `embedding_function=`. ChromaDB 1.x does not persist the EF identity with the collection, so the MCP server's reopen silently bound chromadb's built-in `DefaultEmbeddingFunction` while the miner / Stop hook ingest path bound `mempalace.embedding.get_embedding_function()`. On bleeding-edge interpreters (python 3.14 + chromadb 1.5.x on Apple Silicon, per MemPalace#1299) the default EF's lazy ONNX provider selection could SIGSEGV the host process on first `col.add()`, killing the MCP stdio server and leaving every subsequent tool call returning `Connection closed` until Claude Code was relaunched. Reads worked because `col.get(ids=...)` and metadata fetches don't invoke the EF; the auto-ingest path worked because mining routes through the backend abstraction. Diary writes were the consistent failure surface. Resolve the EF up front (matching `ChromaBackend._resolve_embedding_function`) and pass it into both reopen branches. Falls back to the chromadb default only if `mempalace.embedding.get_embedding_function` itself raises. Regression test patches the chromadb client class to capture `embedding_function=` on every `get_collection` / `create_collection` call from `_get_collection(create=True)` and `_get_collection()`, and fails if any call omits it. Follow-up to MemPalace#1262 / MemPalace#1289 (which fixed the metadata-mismatch SIGSEGV path); this addresses the EF-mismatch SIGSEGV path on the same surface. * fix(mcp_server): address copilot review on MemPalace#1303 - Resolve the EF inside the two reopen branches that actually call `client.get_collection` / `client.create_collection`, so warm-cache reads stay zero-cost (no `MempalaceConfig()` / `_resolve_providers` on every tool call). - Reuse `ChromaBackend._resolve_embedding_function()` instead of duplicating its try/except + log message + None-fallback. - Reword the inline + CHANGELOG explanation to clarify that ChromaDB 1.x persists the EF *identity* (its `name()`) but not the *instance/ configuration* — `mempalace.embedding` documents this and spoofs `name()` to `"default"` precisely so the identity check passes; the bug was the *provider list* (lazy ONNX selection) silently differing. * fix(hooks): quote CLAUDE_PLUGIN_ROOT / CODEX_PLUGIN_ROOT in hooks.json (MemPalace#1076) (MemPalace#1077) Shell splits hook command on whitespace after variable expansion, breaking paths with spaces (e.g. C:\Users\Richard M on Windows). Wrapping the path in double quotes preserves the token boundary. Fixes the reported Stop/PreCompact pair in .claude-plugin/hooks/hooks.json and applies the same fix to .codex-plugin/hooks.json (SessionStart/Stop/ PreCompact), which carries the identical bug. * fix(cli): write compress output to mempalace_closets so palace can read them (MemPalace#1244) `cmd_compress` was writing AAAK-compressed drawers to a `mempalace_compressed` collection, but every read path (`palace.get_closets_collection`, `searcher.py`, `repair.py`) reads from `mempalace_closets`. Result: for non-mined palaces (or any palace where the user ran `mempalace compress` expecting to backfill the closet/index layer), the compressed output was silently invisible — written to a collection nothing else opens. Fix the writer rather than renaming the readers: "closets" is the user-visible feature name baked into the public API (`get_closets_collection`), the searcher hybrid path, repair/HNSW diagnostics, and docs. Renaming the readers would churn 15+ call sites and the README for no benefit. The compressed AAAK strings are exactly what closets are conceptually — compact pointers scanned by an LLM to locate the right drawer — so they belong in `mempalace_closets`. Tests: - Update `test_cmd_compress_stores_results` to assert the collection name passed to `get_or_create_collection` is `mempalace_closets`. - Add `test_cmd_compress_output_readable_via_get_closets_collection`: end-to-end with a real ChromaBackend, seed a drawer, run cmd_compress, then read back via the same `get_closets_collection` helper that palace.py / searcher use. Regression test for the wrong-collection bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style: ruff format cli.py (MemPalace#1244) CI requires ruff format --check on the whole touched file. Pre-existing drift, no logic change. * style: ruff format tests/test_cli.py (PR MemPalace#1319) * fix(mcp): case-insensitive agent name in diary_write/diary_read (MemPalace#1243) `tool_diary_write` stored the `agent` metadata verbatim after `sanitize_name` (which preserves case), while `tool_diary_read` filtered by exact match — so writing as "Claude" and reading as "claude" silently returned zero rows. Both endpoints now lowercase `agent_name` immediately after sanitization. The default per-agent wing slug is also stable across casings since it's derived from the same normalized form. Behavior change: entries written prior to this fix under mixed-case agent names will not match the new lowercase filter; documented under v3.3.5 in CHANGELOG with a `mempalace repair` pointer. Adds a regression test (`test_diary_read_case_insensitive_agent`) and updates the existing `test_diary_write_and_read` to assert the new lowercase agent identity. Closes MemPalace#1243 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style: ruff format tests/test_mcp_server.py (PR MemPalace#1323) * fix(backends/chroma): wire quarantine_stale_hnsw into _client() to prevent SIGSEGV on stale HNSW (MemPalace#1121, MemPalace#1132, MemPalace#1263) PR MemPalace#1173 wired quarantine_stale_hnsw into the static make_client() helper but not into the instance _client() method. As a result every non-MCP entry point (CLI mining, search, repair, status) — which all use get_collection / _get_or_create_collection / _client() — skipped the cold-start quarantine pass and could SIGSEGV on a stale HNSW segment left over from a partial flush, replicated palace, or crashed-mid-write. Refactor: extract the (_fix_blob_seq_ids + gated quarantine_stale_hnsw) pre-open pass into a single private static helper ChromaBackend._prepare_palace_for_open(). Both make_client() and _client() now route through it, so the _quarantined_paths once-per- palace-per-process gate is preserved (no runtime thrash on hot paths) and behaviour stays identical — the fix is purely about extending the existing protection to the path that was missing it. Tests: - test_client_quarantines_corrupt_segment_on_first_open mirrors the existing make_client test and verifies _client() actually renames a corrupt segment on first open. - test_client_quarantines_only_on_first_call_per_palace verifies the cache gate prevents re-running quarantine across repeated _client() calls — important because _client() is hit on every backend op. Closes MemPalace#1121. Closes MemPalace#1132. Closes MemPalace#1263. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style: ruff format touched files (PR MemPalace#1322) CI requires whole-file format on touched files; pre-existing drift only. * style: ruff format tests/test_backends.py (PR MemPalace#1322) * fix(mcp): omit palace_path from tool_status responses (+ docs) The MCP `mempalace_status` tool was returning the server's absolute `_config.palace_path` to any connected client on both the main (ChromaDB-backed) path and the sqlite fallback path that runs when HNSW divergence is detected (MemPalace#1222). On a single-user local deployment this is self-disclosure, but in nested-agent or multi-server MCP topologies the client is a separate trust domain and the absolute path has no documented client-side use. Clients that legitimately need the palace path continue to have three documented channels: the `MEMPALACE_PALACE_PATH` env var (primary) or its legacy `MEMPAL_PALACE_PATH` alias, the `~/.mempalace/config.json` file, and the `--palace` CLI flag on most subcommands. Also corrects stale docs that claimed `mempalace_reconnect` returned a `palace_path` field; the code returns `{success, message, drawers, vector_disabled[, vector_disabled_reason]}` on success, plus a no-palace shape and an exception shape. - mempalace/mcp_server.py: drop palace_path from tool_status() and _tool_status_via_sqlite() result dicts - website/reference/mcp-tools.md: update documented return shapes for mempalace_status (fix) and mempalace_reconnect (stale-docs correction) Authored-by: Aaron Salsitz (ICCI LLC, @icciaaron). Claude Code was used as an authoring and review-orchestration tool, with human-in-the-loop oversight at every step: Aaron wrote the prompts, reviewed each draft, called for three independent review passes (drafting / post-rebase technical / CISA-aligned disclosure-leak), and verified the final patch behavior before commit. * fix(mcp): basename source_file in tool_get_drawer responses The MCP `mempalace_get_drawer` tool returned the entire raw drawer metadata blob to any connected client, and the `source_file` field in that blob is the absolute filesystem path written by the miners (`miner.py`, `convo_miner.py` — `source_file = str(filepath)`). On a single-user local deployment this is self-disclosure, but in nested-agent or multi-server MCP topologies the client is a separate trust domain and the host's directory layout has no documented client-side use. Mirror the mitigation that `searcher.search_memories()` already applies on its own return path: reduce `source_file` to its basename via `Path(source_file).name` before handing the metadata to the client. Citations still work — the directory layout does not leak. Companion to #1 (omit palace_path from tool_status). Same threat class, different surface: - mempalace_status — palace dir path → fixed in #1 - mempalace_get_drawer — per-drawer source_file path → this PR Other read tools were audited and do not leak host paths: - mempalace_search — already basenames source_file - mempalace_list_drawers — returns wing/room/preview only - mempalace_diary_read — date/timestamp/topic/content only - mempalace_reconnect — success/message/drawers only - mempalace_kg_* — entity/predicate strings, counts - mempalace_check_duplicate — wing/room/preview only Changes: - mempalace/mcp_server.py: tool_get_drawer() now basenames metadata.source_file - tests/test_mcp_server.py: regression test asserting the absolute path and its parent directory do not appear anywhere in the response - website/reference/mcp-tools.md: clarify the documented return shape * feat(searcher): add candidate_strategy="union" for vector∪BM25 reranking pool Default search behavior is unchanged. Opt-in candidate_strategy="union" also pulls top-K BM25-only candidates from sqlite FTS5 and merges them into the rerank pool, catching docs with strong BM25 signal that the vector index didn't surface in the over-fetch window. Motivation ---------- The current hybrid path gathers candidates from the vector index only (n_results * 3 over-fetch), then BM25-reranks within them. When the query embeds close to the wrong content semantically, the right doc never enters the rerank pool — *no matter how wide the over-fetch*. Tested on a ~6K-document mixed corpus (knowledge prose + short structured records): at *30x* over-fetch (~5% of the corpus) the target doc still didn't surface for narrative-shaped queries targeting terminology guides. Wider over-fetch isn't the answer; widening the pool's *source* is. Concrete failure mode: a narrative-shaped query embeds close to records sharing the same operational vocabulary (other narrative entries in the corpus). A terminology / style guide is BM25-strong for the query (rare keywords the guide repeats) but vector-distant. Vector-only candidates don't include it; BM25 never gets to rerank it. The hybrid path produces 0.00 recall on a probe that pure BM25 alone scores 1.00 — the hybrid is worse than its component on the same input. Behavior change --------------- * New parameter ``candidate_strategy: str = "vector"`` on ``search_memories``. - ``"vector"`` (default): historical behavior, no change. - ``"union"``: also fetch top ``n_results * 3`` candidates via the existing ``_bm25_only_via_sqlite`` helper, dedupe by source_file, merge into the rerank pool. BM25-only candidates carry ``distance=None`` so they're scored on BM25 contribution alone (vec_sim coerces to 0). * ``_hybrid_rank`` now handles ``distance=None`` explicitly, scoring such candidates as vector-unknown (vec_sim=0) rather than treating it as max-distance via shim. * New strategies register via ``_CANDIDATE_MERGERS``; dispatch is in ``_apply_candidate_strategy`` so ``search_memories`` stays under the C901 complexity ceiling. Bench numbers (~6K-doc internal mixed corpus, recall@10, 5 probes spanning policy-exception lookup, temporal-decay, style retrieval, set-difference, and pattern-recognition): baseline ("vector") "union" policy-exception probe 0.00 0.50 +0.50 temporal-decay probe 0.17 0.50 +0.33 style-retrieval probe 0.00 1.00 +1.00 (PASSES) set-difference probe 0.00–0.06 0.06–0.09 ~ pattern-recog probe 0.64 (stable) 0.50–0.71 variance, typ. +0.07 macro recall 0.16–0.17 0.51–0.56 +0.34 to +0.40 The pattern-recog variance points at a related issue worth a separate PR: ``_hybrid_rank`` computes BM25 IDF over the candidate set. Adding new candidates re-normalizes BM25 for *existing* candidates non-monotonically. Stable corpus-wide BM25 would remove this. Out of scope here. Tests ----- ``tests/test_hybrid_candidate_union.py`` (6 tests, all pass): - default behavior unchanged (explicit ``"vector"`` matches default) - ``"union"`` surfaces a BM25-strong vector-distant doc - ``"union"`` doesn't drop docs ``"vector"`` would have found - empty-palace handling - invalid ``candidate_strategy`` raises - ``_hybrid_rank`` tolerates ``distance=None`` Existing ``test_hybrid_search.py`` (5) and ``test_searcher.py`` (27) pass. Performance note ---------------- Each ``"union"`` query adds one sqlite open + FTS5 MATCH + metadata fetch (via the existing ``_bm25_only_via_sqlite`` helper, which already runs as the ``vector_disabled`` fallback path so the code is well-trodden). Per-query overhead is small but unmeasured at corpus scale. Default stays ``"vector"`` until a maintainer characterizes the cost. * fix(searcher): address Copilot review on MemPalace#1306 - Dedup union candidates by (full_path, chunk_index), not basename — two files sharing a basename in different dirs no longer collide, and a vector hit on chunk N of a file no longer blocks BM25 from contributing chunk M of the same file. - Validate candidate_strategy at the top of search_memories so invalid values fail consistently, not only when the call routes through the vector path. - Trim hits back to n_results after the union+rerank pool grows; preserves the existing search_memories size contract that the MCP limit parameter is built on. - Skip BM25-only injection when max_distance > 0.0; BM25-only candidates carry distance=None and would silently bypass the caller's strict vector-distance threshold. Adds 4 tests covering: validation under vector_disabled, n_results trim, max_distance honoring, and basename-collision dedup. * fix(mcp): forward valid_to and source params in kg_add/kg_invalidate (MemPalace#1314) `tool_kg_add` previously accepted only `valid_from` and `source_closet`, silently dropping `valid_to`, `source_file`, and `source_drawer_id` at the MCP boundary. Backfilling already-ended historical facts therefore collapsed to "still current," and adapter provenance never reached the SQLite layer even though `KnowledgeGraph.add_triple` already supported every column. `tool_kg_invalidate` returned the literal string `"today"` whenever the caller omitted `ended`, hiding the actual stamped date from anyone trying to verify what got persisted. Changes: - Extend `tool_kg_add` signature + MCP input_schema with `valid_to`, `source_file`, `source_drawer_id`; forward all of them to `_kg.add_triple` and to the WAL log. - Resolve `ended` to `date.today().isoformat()` in `tool_kg_invalidate` before logging / returning, so the response always reports the actual date stored in `valid_to`. - Add regression tests for valid_to round-trip, source_file / source_drawer_id provenance, and the resolved-ended-date contract. - Leave TODO(MemPalace#1283) markers so the open ISO-8601 validation PR can drop `validate_iso_date` over `valid_from` / `valid_to` / `ended` cleanly. The underlying `KnowledgeGraph.add_triple` already accepted these kwargs (RFC 002 §5.5) — only the MCP edge needed wiring up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style: ruff format test_mcp_server.py (PR MemPalace#1320) * docs(mcp): drop §5.5 from kg_add docstring/schema The repo's anti-jargon meta-test bans §N markers outside the sources/backends allowlist. mcp_server.py isn't allowlisted, so the "RFC 002 §5.5" references added in this PR turned the test red. Trim to "RFC 002" — section number isn't load-bearing for the description. * fix(cli): honor --palace flag in cmd_init (MemPalace#1313) cmd_init was instantiating MempalaceConfig() unconditionally, ignoring args.palace and always writing the palace under ~/.mempalace. Mirror the env-var pattern used by mcp_server.py (and consistent with how cmd_mine / cmd_status / cmd_search resolve --palace) so every downstream read of cfg.palace_path inside cmd_init — Pass 0, cfg.init(), and the post-init mine — routes to the user-specified location. Adds tests/test_cli.py::test_cmd_init_honors_palace_flag covering the regression: asserts Pass 0 receives the --palace value (not ~/.mempalace) and that MEMPALACE_PALACE_PATH is set in os.environ. Closes MemPalace#1313. * test(cli): prime monkeypatch undo so palace env doesn't leak monkeypatch.delenv(name, raising=False) on a missing key registers no undo entry, so the env var cmd_init writes leaked into test_config_from_file on Python 3.13 / Windows / macOS. Prime the slot with setenv before delenv so teardown rolls back the write. * feat(mcp): persist drawer metadata * fix: address metadata passthrough review blockers * Close metadata passthrough review findings F03 now logs corrupted drawer metadata decode failures through the mempalace_mcp logger, truncating the stored value to value_head so operators get an identifier without leaking the full payload. F15 is Case B: ChromaBackend._resolve_embedding_function() can return None because it catches any exception from get_embedding_function() and returns None after logging; ChromaBackend.get_collection() mirrors the same conditional kwargs behavior and does not repair that case. For the MCP _get_collection call site, treat a None EF as a fail-closed condition and never reopen a collection without embedding_function=, so the MemPalace#1299 default-EF regression path cannot silently reappear there. Validation: uv run pytest -q; uv run ruff check . --------- Co-authored-by: Igor Lins e Silva <4753812+igorls@users.noreply.github.com> Co-authored-by: Mikhail Valentsev <michael@valentsev.ru> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: icciAaron <asalsitz@icci.com>

scoring site inside _bm25_only_via_sqlite that this PR did not cover -- the original wiring landed before MemPalace#1306 existed. Without the plumbing, two paths silently bypass locale stop_words even when the palace has MEMPALACE_LANG set: - vector_disabled=True (MemPalace#1222 fallback): BM25-only scoring runs without filtering. - candidate_strategy="union": BM25 candidates merged into the rerank pool come from a tokenizer that ignored the configured locale, so the merged-in entries fight the lang-aware _hybrid_rank rerank. Resolution moves once: stop_words = _resolve_stop_words(lang) is hoisted before the vector_disabled branch and threaded through _apply_candidate_strategy, _merge_bm25_union_candidates, and _bm25_only_via_sqlite into _bm25_scores. The FTS5 candidate-selection _tokenize at the top of _bm25_only_via_sqlite is left untouched -- chromadb's FTS5 index is built with a trigram tokenizer that already mismatches our \\w{2,} regex, so dropping further tokens there changes selection semantics in ways outside this PR's scope. Three tests pin the propagation: a sqlite-backed test for _bm25_only_via_sqlite -> _bm25_scores, a unit spy on _apply_candidate_strategy -> merger, and an end-to-end on search_memories(vector_disabled=True) -> _bm25_only_via_sqlite.

Copilot AI review requested due to automatic review settings May 2, 2026 03:50

igorls requested review from bensig and milla-jovovich as code owners May 2, 2026 03:50

Copilot started reviewing on behalf of igorls May 2, 2026 03:51 View session

Copilot AI reviewed May 2, 2026

View reviewed changes

igorls added area/search Search and retrieval enhancement New feature or request labels May 2, 2026

igorls merged commit 2ad379b into develop May 3, 2026
6 checks passed

igorls deleted the feat/hybrid-candidate-union branch May 3, 2026 06:40

mvalentsev mentioned this pull request May 3, 2026

feat(searcher): wire i18n stop words into BM25 tokenizer (#973) #977

Open

jphein mentioned this pull request May 3, 2026

feat(searcher): warnings + sqlite BM25 top-up when vector underdelivers #1005

Open

lgertel mentioned this pull request May 5, 2026

merge: sync upstream/develop (25 commits) into fork develop durante-tech/dos-mempalace#3

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(searcher): candidate_strategy="union" — BM25 candidates joined with vector pool before hybrid rerank#1306

feat(searcher): candidate_strategy="union" — BM25 candidates joined with vector pool before hybrid rerank#1306
igorls merged 2 commits into
developfrom
feat/hybrid-candidate-union

igorls commented May 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

igorls commented May 2, 2026

Summary

What changed

Bench numbers

Tests

Performance note

Caveats / followups not in this PR

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants