fix(mcp): retry tool_search once on Chroma "Error finding id" transient (#1315) #1396

Merged
igorls merged 2 commits into MemPalace:develop from JPdeB61:fix/1315-mcp-search-retry-on-hnsw-flush
May 9, 2026
JPdeB61 commented May 7, 2026

Contributor

Problem

After a bulk CLI mine, ChromaDB's HNSW segment metadata can be unflushed for ~30-60s. Wing-scoped MCP search
hits Internal error: Error finding id during that window. The existing inode/mtime cache invalidation in
mcp_server._get_client doesn't fix it, because tool_search doesn't go through that cache. It routes via
search_memories -> palace.get_collection -> _DEFAULT_BACKEND._client(palace_path), which has its own
per-palace _clients + _freshness cache that needs to be invalidated separately.
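
To illustrate why both layers have to be dropped, here is a minimal sketch of two independent per-palace caches. BackendCache, reset_palace_caches, and keying the dicts by palace_path are assumptions made for illustration; only the attribute names _clients and _freshness come from the description above, and the real data structures may differ.

```python
from typing import Any, Dict


class BackendCache:
    """Hypothetical stand-in for the backend's per-palace caches."""

    def __init__(self) -> None:
        # Assumption: both dicts are keyed by palace path.
        self._clients: Dict[str, Any] = {}
        self._freshness: Dict[str, Any] = {}


def reset_palace_caches(
    mcp_cache: Dict[str, Any], backend: BackendCache, palace_path: str
) -> None:
    """Drop every cached entry for one palace from both cache layers.

    Clearing only mcp_cache would leave the backend's client in place,
    so the next search would reuse the stale client.
    """
    mcp_cache.pop(palace_path, None)
    backend._clients.pop(palace_path, None)
    backend._freshness.pop(palace_path, None)
```

Entries for other palaces are left untouched, so a reset for one palace does not force every other palace to reopen its client.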

Closely related to / partial fix for #1315 (and also resembles #1082).

Reproducer

  • Environment: mempalace 3.3.x, chromadb 1.5.x, Windows 11, Python 3.13
  • Long-running MCP server connected (e.g. via Claude Code).
  • From a separate shell: python -m mempalace mine <docs-dir> --wing <name> — large enough to take meaningful
    time (in my case ~1,184 drawers in one batch).
  • Immediately call mempalace_search via MCP with wing="<name>".
{"error": "Search error: Error executing plan: Internal error: Error finding id"}

Persists ~30-60s, self-heals. During the window:

$ python -m mempalace repair-status
[drawers]
  sqlite count:   1,408
  hnsw count:     (no flushed metadata yet)
  status:         UNKNOWN

CLI search (fresh process) on the same palace works — its freshly-loaded Chroma client opens segments after the
 flush has caught up.

Change

tool_search now wraps its search_memories call with a single retry that:

- detects "Internal error" / "Error finding id" in the result dict via _is_transient_index_error(),
- drops the MCP-local cache and _DEFAULT_BACKEND._clients / _freshness for the palace via
_force_chroma_cache_reset(),
- sleeps 2s,
- retries once,
- tags successful retries with "index_recovered": true so callers can observe when it kicked in.

Non-transient errors (e.g. validation failures) bypass the retry path entirely.
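
The flow above can be sketched roughly as follows. This is not the merged code: search_with_retry, its callable parameters, and the injectable sleep are hypothetical names chosen so the sketch runs standalone, and treating a match on either marker string as transient is an assumption.

```python
import time
from typing import Any, Callable, Dict

# Marker strings taken from the PR description; matching on either one
# is an assumption of this sketch.
TRANSIENT_MARKERS = ("Internal error", "Error finding id")


def is_transient_index_error(result: Any) -> bool:
    """Return True if result is an error dict carrying a transient marker."""
    if not isinstance(result, dict):
        return False
    message = str(result.get("error", ""))
    return any(marker in message for marker in TRANSIENT_MARKERS)


def search_with_retry(
    search: Callable[[], Dict[str, Any]],
    reset_cache: Callable[[], None],
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Run search once; on a transient index error, reset caches and retry once."""
    result = search()
    if not is_transient_index_error(result):
        # Success or a non-transient error: return unchanged, no retry.
        return result
    reset_cache()
    sleep(delay_s)
    retry = search()
    if isinstance(retry, dict) and "error" not in retry:
        # Tag only successful retries so callers can observe the recovery.
        retry["index_recovered"] = True
    return retry
```

Passing sleep as a parameter keeps the 2s pause out of unit tests; a persistent transient returns the second error as-is because there is no loop.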

Tests

Three unit tests added to TestSearchTool:

- test_search_retries_once_on_hnsw_flush_transient — first call returns the transient, second succeeds, asserts
 retry ran and index_recovered is set.
- test_search_does_not_retry_on_non_transient_error — unrelated errors propagate without retry.
- test_search_returns_second_error_if_retry_also_fails — persistent transient surfaces the second error rather
than looping.

All 12 tests in TestSearchTool pass locally.

What this does NOT fix

- tool_check_duplicate and other index-touching tools still error in the same window. They'd need the same
wrapper or, better, a shared helper.
- The underlying flush window itself. Options 1-3 from #1315 (auto-flush at end of mine, fail-fast detection,
SQLite-only fallback) are still on the table for a complete fix. This is option 4 — a low-risk recovery wrapper
 alongside, not a replacement.
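
One possible shape for the shared helper mentioned in the first bullet is a decorator that any index-touching tool could opt into. This is a sketch only: retry_once_on_transient and its parameters are hypothetical, and it assumes each tool returns a dict with an "error" key on failure, as tool_search does in the reproducer.

```python
import functools
import time
from typing import Any, Callable, Dict, Tuple


def retry_once_on_transient(
    reset_cache: Callable[[], None],
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    markers: Tuple[str, ...] = ("Internal error", "Error finding id"),
) -> Callable[[Callable[..., Dict[str, Any]]], Callable[..., Dict[str, Any]]]:
    """Build a decorator that retries a tool once on a transient index error."""

    def is_transient(result: Any) -> bool:
        if not isinstance(result, dict):
            return False
        message = str(result.get("error", ""))
        return any(marker in message for marker in markers)

    def decorator(tool: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
        @functools.wraps(tool)
        def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
            result = tool(*args, **kwargs)
            if not is_transient(result):
                return result
            reset_cache()
            sleep(delay_s)
            retry = tool(*args, **kwargs)
            if isinstance(retry, dict) and "error" not in retry:
                retry["index_recovered"] = True
            return retry

        return wrapper

    return decorator
```

A tool such as tool_check_duplicate would then be wrapped with @retry_once_on_transient(reset_cache=...), where the reset callable does whatever cache invalidation that tool's client path needs.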

Refs #1315
Refs #1082

---
Happy to take this in a different direction (e.g. lift the helper into a shared decorator covering all
index-touching tools) if you'd prefer.

---

…nt (MemPalace#1315)

After a bulk CLI mine, ChromaDB's HNSW segment metadata can be unflushed
  for ~30-60s. Wing-scoped MCP search hits "Internal error: Error finding id"
  during that window, and the existing inode/mtime cache invalidation isn't
  enough — tool_search routes via search_memories -> palace.get_collection
  -> _DEFAULT_BACKEND._client, which has its own per-palace cache.

  This wraps tool_search with a single retry that drops both the MCP-local
  cache and _DEFAULT_BACKEND._clients/_freshness for the palace, sleeps 2s,
  retries once, and tags successful retries with index_recovered=True.

  Does not address tool_check_duplicate or other index-touching tools, nor
  the underlying flush window — options 1-3 from MemPalace#1315 (auto-flush after
  mine, fail-fast detection, SQLite-only fallback) are still on the table
  for a complete fix.

  Refs MemPalace#1315

JPdeB61 commented May 7, 2026

Contributor Author

Opened #1396 with a partial fix — option 4 (retry-once with forced cache invalidation in tool_search)
alongside the three approaches in the OP. Doesn't replace options 1-3 (it only covers tool_search, not e.g.
tool_check_duplicate, and doesn't address the underlying flush window) but resolves the immediate
user-visible failure for the most common path.

Reproducer + diagnostics + cache-layer notes are in the PR description.

igorls added the bug, area/mcp, and storage labels May 8, 2026
igorls merged commit d9e60d8 into MemPalace:develop May 9, 2026
4 of 6 checks passed
arnoldwender pushed a commit to arnoldwender/mempalace that referenced this pull request May 10, 2026
Bumps version 3.3.4 → 3.3.5 across pyproject.toml, version.py, plugin
manifests, README badge, and uv.lock. Flips CHANGELOG.md from
``[3.3.5] — unreleased`` to ``[3.3.5] — 2026-05-09`` and adds entries
for the four PRs that landed after the bug-fix block was authored:

- Bug Fixes: MemPalace#1396 (tool_search retry on transient HNSW flush)
- Documentation: MemPalace#1385 (CONTRIBUTING git-identity guidance, closes MemPalace#1317)
- Internal: MemPalace#1431 (test multiprocessing fork → spawn)
- Internal: MemPalace#1430 (test sqlite connection lifecycle via contextlib.closing)

The four open issues remaining on the v3.3.5 milestone (MemPalace#1266, MemPalace#1253,
MemPalace#1092, MemPalace#1082) have been moved to v3.4 — they form the concurrent-writer
/ HNSW corruption cluster that needs deeper work than this cycle could
absorb.