fix(mcp): retry tool_search once on Chroma "Error finding id" transient (#1315) #1396
Merged
igorls merged 2 commits on May 9, 2026
…nt (MemPalace#1315)

After a bulk CLI mine, ChromaDB's HNSW segment metadata can be unflushed for ~30-60s. Wing-scoped MCP search hits "Internal error: Error finding id" during that window, and the existing inode/mtime cache invalidation isn't enough: tool_search routes via search_memories -> palace.get_collection -> _DEFAULT_BACKEND._client, which has its own per-palace cache.

This wraps tool_search with a single retry that drops both the MCP-local cache and _DEFAULT_BACKEND._clients/_freshness for the palace, sleeps 2s, retries once, and tags successful retries with index_recovered=True.

Does not address tool_check_duplicate or other index-touching tools, nor the underlying flush window; options 1-3 from MemPalace#1315 (auto-flush after mine, fail-fast detection, SQLite-only fallback) are still on the table for a complete fix.

Refs MemPalace#1315
Opened #1396 with a partial fix: option 4 (retry-once with forced cache invalidation). Reproducer, diagnostics, and cache-layer notes are in the PR description.
arnoldwender pushed a commit to arnoldwender/mempalace that referenced this pull request on May 10, 2026
Bumps version 3.3.4 → 3.3.5 across pyproject.toml, version.py, plugin manifests, README badge, and uv.lock. Flips CHANGELOG.md from ``[3.3.5] — unreleased`` to ``[3.3.5] — 2026-05-09`` and adds entries for the four PRs that landed after the bug-fix block was authored:
- Bug Fixes: MemPalace#1396 (tool_search retry on transient HNSW flush)
- Documentation: MemPalace#1385 (CONTRIBUTING git-identity guidance, closes MemPalace#1317)
- Internal: MemPalace#1431 (test multiprocessing fork → spawn)
- Internal: MemPalace#1430 (test sqlite connection lifecycle via contextlib.closing)

The four open issues remaining on the v3.3.5 milestone (MemPalace#1266, MemPalace#1253, MemPalace#1092, MemPalace#1082) have been moved to v3.4; they form the concurrent-writer / HNSW corruption cluster that needs deeper work than this cycle could absorb.
Problem
After a bulk CLI mine, ChromaDB's HNSW segment metadata can be unflushed for ~30-60s. Wing-scoped MCP search hits "Internal error: Error finding id" during that window. The existing inode/mtime cache invalidation in mcp_server._get_client doesn't fix it: tool_search doesn't go through that cache. It routes via search_memories → palace.get_collection → _DEFAULT_BACKEND._client(palace_path), which has its own per-palace _clients + _freshness cache that needs to be invalidated separately.

Closely related to / partial fix for #1315 (and also resembles #1082).
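To illustrate why invalidating one cache layer is not enough, here is a toy model of two independent per-path client caches. It is not MemPalace code: the CachedClients class and both instances are hypothetical stand-ins for the MCP-local cache and the backend's per-palace cache, and the "clients" are plain strings.

```python
class CachedClients:
    """Toy per-path client cache (hypothetical; stands in for either layer)."""

    def __init__(self, name):
        self.name = name
        self._clients = {}
        self.opens = 0  # how many times a client was actually (re)opened

    def get(self, palace_path):
        # Open a new client only when nothing is cached for this path.
        if palace_path not in self._clients:
            self.opens += 1
            self._clients[palace_path] = f"{self.name}-client-{self.opens}"
        return self._clients[palace_path]

    def invalidate(self, palace_path):
        self._clients.pop(palace_path, None)


mcp_cache = CachedClients("mcp")          # stand-in for the MCP-local cache
backend_cache = CachedClients("backend")  # stand-in for the backend cache

path = "/tmp/palace"  # illustrative path only
mcp_cache.get(path)
stale = backend_cache.get(path)

# Invalidating only the MCP-local layer leaves the backend client untouched.
mcp_cache.invalidate(path)
still_stale = backend_cache.get(path) is stale

# Only dropping the backend entry forces a fresh client on the search path.
backend_cache.invalidate(path)
fresh = backend_cache.get(path)
```

In this model the search path only ever reads backend_cache, so clearing mcp_cache has no effect on what it returns, which mirrors the behaviour described above.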
Reproducer
1. python -m mempalace mine <docs-dir> --wing <name>, large enough to take meaningful time (in my case ~1,184 drawers in one batch).
2. mempalace_search via MCP with wing="<name>".

    {"error": "Search error: Error executing plan: Internal error: Error finding id"}

Persists ~30-60s, self-heals. During the window:

    $ python -m mempalace repair-status
    [drawers]
    sqlite count: 1,408
    hnsw count: (no flushed metadata yet)
    status: UNKNOWN

CLI search (fresh process) on the same palace works: its freshly loaded Chroma client opens segments after the flush has caught up.

Change
tool_search now wraps its search_memories call with a single retry that:
- detects "Internal error" / "Error finding id" in the result dict via _is_transient_index_error(),
- drops the MCP-local cache and _DEFAULT_BACKEND._clients / _freshness for the palace via _force_chroma_cache_reset(),
- sleeps 2s,
- retries once,
- tags successful retries with "index_recovered": true so callers can observe when it kicked in.

Non-transient errors (e.g. validation failures) bypass the retry path entirely.

Tests
Three unit tests added to TestSearchTool:
- test_search_retries_once_on_hnsw_flush_transient: first call returns the transient, second succeeds, asserts retry ran and index_recovered is set.
- test_search_does_not_retry_on_non_transient_error: unrelated errors propagate without retry.
- test_search_returns_second_error_if_retry_also_fails: persistent transient surfaces the second error rather than looping.

All 12 tests in TestSearchTool pass locally.

What this does NOT fix
- tool_check_duplicate and other index-touching tools still error in the same window. They'd need the same wrapper or, better, a shared helper.
- The underlying flush window itself. Options 1-3 from #1315 (auto-flush at end of mine, fail-fast detection, SQLite-only fallback) are still on the table for a complete fix. This is option 4: a low-risk recovery wrapper alongside, not a replacement.

Refs #1315
Refs #1082

---
Happy to take this in a different direction (e.g. lift the helper into a shared decorator covering all index-touching tools) if you'd prefer.
---
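For reference, the retry-once behaviour described under Change, lifted into the shared decorator mentioned above, might look roughly like the sketch below. This is not the merged code. The names retry_once_on_transient, is_transient_index_error, fake_search, and fake_reset are hypothetical; the cache reset is passed in as a callback rather than reaching into _DEFAULT_BACKEND; and the rule that only a retry without an "error" key gets tagged is an assumption about what "successful" means.

```python
import functools
import time

# Substrings treated as transient, taken from the PR description.
TRANSIENT_MARKERS = ("Internal error", "Error finding id")


def is_transient_index_error(result):
    """True when a result dict carries one of the transient error markers."""
    if not isinstance(result, dict):
        return False
    error = result.get("error")
    return isinstance(error, str) and any(m in error for m in TRANSIENT_MARKERS)


def retry_once_on_transient(reset_caches, delay=2.0, sleep=time.sleep):
    """Hypothetical decorator: reset caches, wait, and retry exactly once."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if not is_transient_index_error(result):
                return result  # success or non-transient error: no retry
            reset_caches()
            sleep(delay)
            retry = func(*args, **kwargs)
            # Assumption: "successful" means the retry result has no error key.
            if isinstance(retry, dict) and "error" not in retry:
                retry = {**retry, "index_recovered": True}
            return retry  # a second failure is returned as-is, never looped
        return wrapper
    return decorator


# Usage with stand-ins: the first call fails transiently, the second succeeds.
calls = {"search": 0, "reset": 0}


def fake_reset():
    calls["reset"] += 1


@retry_once_on_transient(fake_reset, sleep=lambda _seconds: None)
def fake_search(query):
    calls["search"] += 1
    if calls["search"] == 1:
        return {"error": "Search error: Error executing plan: "
                         "Internal error: Error finding id"}
    return {"results": [query]}


recovered = fake_search("hello")
```

Injecting sleep and reset_caches keeps the wrapper testable without a real 2s delay or a real Chroma client. Matching on the bare substring "Internal error" is broad, so a shared helper might want a narrower marker list per tool.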