fix(search): aggressive over-fetch for kind=content/checkpoint

jphein · claude · jphein · commit f9f5cc464174 · 2026-04-25T12:09:54.000-07:00
Companion to 398f42f's filter-planner workaround.

After moving the kind= exclusion to the post-filter, search results were dominated by CHECKPOINT diary entries because the post-filter doesn't reach far enough into the ranking: top-10 hits for typical content queries on the canonical 151K palace were ALL checkpoints (sims 0.30-0.44), because checkpoints are short word-dense user-prompt snippets that embed strongly to many queries and out-rank the longer substantive content drawers.

Without over-fetch, post-filter empties the candidate set entirely: limit=5 → vector returns 5 → 5 checkpoints → post-filter drops 5 → "vector ranked 0" warning even though content drawers exist further down the ranking.

Fix: pull max(n_results*20, 100) candidates when kind != "all", so the post-filter has somewhere to find substantive content. The 3x over-fetch for kind="all" stays: no post-filter runs, no need to over-pull.

Trade-off: kind=content vector queries fetch ~100 candidates typically. Negligible cost given chromadb HNSW is fast on top-N, and this is the difference between "kind=content returns useful content" vs "kind=content returns empty."

Same root cause as the engram-2 critique: vector recall is high (content drawers ARE in the index, findable by query) but checkpoint shape dominates ranking. Over-fetch is the surgical fix; structural fix is to stop indexing checkpoints as searchable drawers (separate session-recovery table). That is captured in roadmap, not in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
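The failure mode and the fix described above can be reproduced on a toy ranking. This is a minimal sketch, not the code in searcher.py: `pull_size` copies the expression from the diff, while `post_filter`, the `is_checkpoint` field, and the synthetic ranking are hypothetical stand-ins for `_apply_kind_text_filter` and real palace data, neither of which is shown in this commit.

```python
def pull_size(n_results: int, kind: str = "all") -> int:
    """Candidate count to request; same expression as in the diff."""
    return max(n_results * 20, 100) if kind != "all" else n_results * 3


def post_filter(candidates: list[dict], kind: str, limit: int) -> list[dict]:
    """Hypothetical stand-in for _apply_kind_text_filter (assumed behavior)."""
    if kind == "all":
        kept = candidates
    elif kind == "checkpoint":
        kept = [c for c in candidates if c["is_checkpoint"]]
    else:  # kind == "content": drop checkpoints
        kept = [c for c in candidates if not c["is_checkpoint"]]
    return kept[:limit]


# Synthetic ranking: 40 checkpoints out-rank 10 content drawers.
ranking = [{"id": f"cp{i}", "is_checkpoint": True} for i in range(40)]
ranking += [{"id": f"doc{i}", "is_checkpoint": False} for i in range(10)]

limit = 5

# Old behavior: 3x over-fetch sees only checkpoints, so the filter empties it.
narrow = post_filter(ranking[: limit * 3], "content", limit)

# New behavior: the wider pull reaches the content drawers further down.
wide = post_filter(ranking[: pull_size(limit, "content")], "content", limit)

print(len(narrow), [c["id"] for c in wide])
```

With limit=5 the old 15-candidate window contains only checkpoints, while the 100-candidate pull covers the whole toy ranking and leaves five content drawers after filtering. Note that the 100 acts as a floor: a limit of 10 would request 200 candidates.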
diff --git a/mempalace/searcher.py b/mempalace/searcher.py
@@ -658,10 +658,24 @@ def search_memories(
     # and closet-first routing hides drawers that direct search would find.
     warnings: list[str] = []
     drawer_results: dict = {"documents": [[]], "metadatas": [[]], "distances": [[]]}
+    # Over-fetch for re-ranking + post-filter survival.
+    #
+    # When kind != "all", _apply_kind_text_filter drops checkpoints
+    # (or non-checkpoints, for kind="checkpoint") from the candidate
+    # pool. On a checkpoint-heavy palace, top-N vector hits are
+    # dominated by CHECKPOINT diary entries (short, word-dense, embed
+    # strongly) — observed 2026-04-25 on the 151K-drawer canonical
+    # palace where top-10 hits were all checkpoints for typical
+    # content queries. Without aggressive over-fetch the post-filter
+    # empties the result set even when substantive content drawers
+    # exist further down the ranking. Pull 20× the requested limit
+    # (at least 100) when filtering applies; keep the cheaper 3×
+    # over-fetch for kind="all" where no post-filter runs.
+    pull_size = max(n_results * 20, 100) if kind != "all" else n_results * 3
     try:
         dkwargs = {
             "query_texts": [query],
-            "n_results": n_results * 3,  # over-fetch for re-ranking
+            "n_results": pull_size,
             "include": ["documents", "metadatas", "distances"],
         }
         if where: