fix(mcp): retry tool_search once on Chroma "Error finding id" transient (#1315) by JPdeB61 · Pull Request #1396 · MemPalace/mempalace

JPdeB61 · 2026-05-07T00:29:10Z

Problem

After a bulk CLI mine, ChromaDB's HNSW segment metadata can be unflushed for ~30-60s. Wing-scoped MCP search
hits Internal error: Error finding id during that window. The existing inode/mtime cache invalidation in
mcp_server._get_client doesn't fix it — tool_search doesn't go through that cache. It routes via
search_memories → palace.get_collection → _DEFAULT_BACKEND._client(palace_path), which has its own
per-palace _clients + _freshness cache that needs to be invalidated separately.
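To make the two-layer problem concrete, here is a minimal sketch with hypothetical stand-in classes. Neither class mirrors mempalace's real code; they only illustrate why invalidating one cache does nothing for a client held by the other.

```python
from typing import Callable, Dict, Optional


class McpLocalCache:
    """Hypothetical stand-in for the cache that mcp_server._get_client invalidates."""

    def __init__(self) -> None:
        self.client: Optional[object] = None

    def invalidate(self) -> None:
        self.client = None


class BackendClientCache:
    """Hypothetical stand-in for the backend's separate per-palace client cache."""

    def __init__(self) -> None:
        self.clients: Dict[str, object] = {}

    def client(self, palace_path: str, factory: Callable[[], object]) -> object:
        # Reuse the cached client if one exists, even if it has gone stale.
        if palace_path not in self.clients:
            self.clients[palace_path] = factory()
        return self.clients[palace_path]

    def invalidate(self, palace_path: str) -> None:
        self.clients.pop(palace_path, None)


mcp_cache = McpLocalCache()
backend_cache = BackendClientCache()

stale = backend_cache.client("/palace", object)

# Invalidating only the MCP-local cache leaves the backend's client untouched.
mcp_cache.invalidate()
still_stale = backend_cache.client("/palace", object)

# Only dropping the backend entry yields a freshly constructed client.
backend_cache.invalidate("/palace")
fresh = backend_cache.client("/palace", object)
```

In this toy model `still_stale` is the same object as `stale`, which is the situation tool_search is in during the flush window.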

Closely related to / partial fix for #1315 (and also resembles #1082).

Reproducer

- Environment: mempalace 3.3.x, chromadb 1.5.x, Windows 11, Python 3.13.
- Long-running MCP server connected (e.g. via Claude Code).
- From a separate shell: python -m mempalace mine <docs-dir> --wing <name> (large enough to take meaningful time; in my case ~1,184 drawers in one batch).
- Immediately call mempalace_search via MCP with wing="<name>".

{"error": "Search error: Error executing plan: Internal error: Error finding id"}

Persists ~30-60s, self-heals. During the window:

$ python -m mempalace repair-status
[drawers]
  sqlite count:   1,408
  hnsw count:     (no flushed metadata yet)
  status:         UNKNOWN

CLI search (fresh process) on the same palace works, because its freshly loaded Chroma client opens the segments after the flush has caught up.

Change

tool_search now wraps its search_memories call with a single retry that:

- detects "Internal error" / "Error finding id" in the result dict via _is_transient_index_error(),
- drops the MCP-local cache and _DEFAULT_BACKEND._clients / _freshness for the palace via
_force_chroma_cache_reset(),
- sleeps 2s,
- retries once,
- tags successful retries with "index_recovered": true so callers can observe when it kicked in.

Non-transient errors (e.g. validation failures) bypass the retry path entirely.
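The steps above can be sketched roughly as follows. This is not the code from the PR: the function names, the injected `search` and `reset_caches` callables, and the rule "tag the retry only when it has no error key" are assumptions made for illustration. The real helpers are _is_transient_index_error() and _force_chroma_cache_reset(), whose bodies are not shown in this description.

```python
import time
from typing import Any, Callable, Dict

# Substrings named in the PR description as marking the transient failure.
TRANSIENT_MARKERS = ("Internal error", "Error finding id")


def is_transient_index_error(result: Any) -> bool:
    """Return True if a result dict carries one of the transient markers."""
    error = result.get("error") if isinstance(result, dict) else None
    return isinstance(error, str) and any(m in error for m in TRANSIENT_MARKERS)


def search_with_single_retry(
    search: Callable[[], Dict[str, Any]],
    reset_caches: Callable[[], None],
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Run search once; on a transient index error, reset caches and retry once."""
    result = search()
    if not is_transient_index_error(result):
        # Success or a non-transient error: no retry.
        return result

    reset_caches()   # assumed to drop every cached Chroma client for the palace
    sleep(delay_s)   # give the HNSW flush a moment to catch up
    retry = search()
    if isinstance(retry, dict) and "error" not in retry:
        retry["index_recovered"] = True  # lets callers observe the recovery
    return retry
```

Passing `sleep` as a parameter is a choice made for this sketch so a test can run without actually waiting two seconds; a persistent transient simply returns the second error, so there is no loop.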

Tests

Three unit tests added to TestSearchTool:

- test_search_retries_once_on_hnsw_flush_transient: first call returns the transient, second succeeds; asserts the retry ran and index_recovered is set.
- test_search_does_not_retry_on_non_transient_error: unrelated errors propagate without retry.
- test_search_returns_second_error_if_retry_also_fails: a persistent transient surfaces the second error rather than looping.

All 12 tests in TestSearchTool pass locally.

What this does NOT fix

- tool_check_duplicate and other index-touching tools still error in the same window. They'd need the same
wrapper or, better, a shared helper.
- The underlying flush window itself. Options 1-3 from #1315 (auto-flush at end of mine, fail-fast detection, SQLite-only fallback) are still on the table for a complete fix. This is option 4: a low-risk recovery wrapper that sits alongside them, not a replacement.

Refs #1315
Refs #1082

---
Happy to take this in a different direction (e.g. lift the helper into a shared decorator covering all
index-touching tools) if you'd prefer.
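For discussion, one possible shape for such a shared decorator is sketched below. Everything here is hypothetical: the decorator name, its parameters, the placeholder `_reset_caches`, and the placeholder tool body are not taken from the codebase. It assumes each tool returns a dict and reports failures under an "error" key, as tool_search does.

```python
import functools
import time
from typing import Any, Callable, Dict

ToolFn = Callable[..., Dict[str, Any]]

_TRANSIENT_MARKERS = ("Internal error", "Error finding id")


def _looks_transient(result: Any) -> bool:
    """True if the tool result carries one of the transient index markers."""
    error = result.get("error") if isinstance(result, dict) else None
    return isinstance(error, str) and any(m in error for m in _TRANSIENT_MARKERS)


def retry_on_transient_index_error(
    reset_caches: Callable[[], None],
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[ToolFn], ToolFn]:
    """Build a decorator that retries a dict-returning tool once."""

    def decorator(tool: ToolFn) -> ToolFn:
        @functools.wraps(tool)
        def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
            result = tool(*args, **kwargs)
            if not _looks_transient(result):
                return result
            reset_caches()
            sleep(delay_s)
            retry = tool(*args, **kwargs)
            if isinstance(retry, dict) and "error" not in retry:
                retry["index_recovered"] = True
            return retry

        return wrapper

    return decorator


def _reset_caches() -> None:
    """Placeholder: a real version would drop the cached Chroma clients."""


@retry_on_transient_index_error(_reset_caches)
def tool_check_duplicate(content: str) -> Dict[str, Any]:
    # Placeholder body for illustration only; the real tool queries the index.
    return {"is_duplicate": False}
```

The trade-off is that a decorator retries the whole tool call, so it only suits tools that are safe to run twice; read-only tools such as search and duplicate checks fit, while tools that write would need more care.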

---

…nt (MemPalace#1315)

After a bulk CLI mine, ChromaDB's HNSW segment metadata can be unflushed for ~30-60s. Wing-scoped MCP search hits "Internal error: Error finding id" during that window, and the existing inode/mtime cache invalidation isn't enough: tool_search routes via search_memories -> palace.get_collection -> _DEFAULT_BACKEND._client, which has its own per-palace cache.

This wraps tool_search with a single retry that drops both the MCP-local cache and _DEFAULT_BACKEND._clients/_freshness for the palace, sleeps 2s, retries once, and tags successful retries with index_recovered=True.

Does not address tool_check_duplicate or other index-touching tools, nor the underlying flush window; options 1-3 from MemPalace#1315 (auto-flush after mine, fail-fast detection, SQLite-only fallback) are still on the table for a complete fix.

Refs MemPalace#1315

JPdeB61 · 2026-05-07T00:31:27Z

Opened #1396 with a partial fix — option 4 (retry-once with forced cache invalidation in tool_search)
alongside the three approaches in the OP. Doesn't replace options 1-3 (it only covers tool_search, not e.g.
tool_check_duplicate, and doesn't address the underlying flush window) but resolves the immediate
user-visible failure for the most common path.

Reproducer + diagnostics + cache-layer notes are in the PR description.

Bumps version 3.3.4 → 3.3.5 across pyproject.toml, version.py, plugin manifests, README badge, and uv.lock.

Flips CHANGELOG.md from ``[3.3.5] — unreleased`` to ``[3.3.5] — 2026-05-09`` and adds entries for the four PRs that landed after the bug-fix block was authored:

- Bug Fixes: MemPalace#1396 (tool_search retry on transient HNSW flush)
- Documentation: MemPalace#1385 (CONTRIBUTING git-identity guidance, closes MemPalace#1317)
- Internal: MemPalace#1431 (test multiprocessing fork → spawn)
- Internal: MemPalace#1430 (test sqlite connection lifecycle via contextlib.closing)

The four open issues remaining on the v3.3.5 milestone (MemPalace#1266, MemPalace#1253, MemPalace#1092, MemPalace#1082) have been moved to v3.4; they form the concurrent-writer / HNSW corruption cluster that needs deeper work than this cycle could absorb.

JPdeB61 requested review from bensig, igorls and milla-jovovich as code owners May 7, 2026 00:29

igorls added the bug (Something isn't working), area/mcp (MCP server and tools), and storage labels May 8, 2026

style: ruff format tests/test_mcp_server.py for ruff <0.5

b2ce45d

igorls merged commit d9e60d8 into MemPalace:develop May 9, 2026
4 of 6 checks passed

This was referenced May 9, 2026

chore(tests): wrap sqlite3 connections in contextlib.closing #1430

Merged

fix(tests): use spawn instead of fork for lock-test subprocesses #1431

Merged

chore(release): 3.3.5 #1432

Merged

chore(release): 3.3.5 #1434

Merged

jphein mentioned this pull request May 11, 2026

chore: sync upstream/develop through v3.3.5 techempower-org/mempalace#18

Merged

4 tasks

dergachoff mentioned this pull request May 11, 2026

bug: 3.3.5 repair re-quarantines rebuilt HNSW #1451

Closed

StarshipSuperjam mentioned this pull request May 13, 2026

Expose hnsw:sync_threshold via --hnsw-sync-threshold N on init + repair (recurring 'never flushed metadata' for low-volume MCP workloads) #1489

Open

meretrout mentioned this pull request May 15, 2026

MCP tool_search returns "Error finding id" when wing-scoped to a convos-mined wing (CLI works, unscoped MCP works) #1082

Open

trek-e mentioned this pull request May 22, 2026

repair tool cannot recover palace with combined HNSW dimensionality + FTS5 inverted-index corruption (313k drawers) #1586

Open

igorls merged 2 commits into MemPalace:develop from JPdeB61:fix/1315-mcp-search-retry-on-hnsw-flush

2 participants
