fix: use configured collection in repair recovery paths#1312
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens MemPalace’s ChromaDB recovery/repair and SQLite fallback paths by consistently using the configured drawers collection name (instead of assuming mempalace_drawers) and by adding additional safety checks around stale/damaged vector state.
Changes:
- Default drawers collection resolution now uses
MempalaceConfig().collection_nameand threads that value through repair/status/BM25 fallback/reconnect flows. - SQLite BM25 fallback is scoped to the configured collection and applies wing/room scoping before candidate limiting.
- Recovery paths are hardened: reconnect clears shared caches, drawer writes do a post-write readback check, and CLI repair restores from backup + exits non-zero on rebuild failure.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
mempalace/palace.py |
Makes get_collection() default to the configured collection name. |
mempalace/searcher.py |
Adds collection scoping to BM25 SQLite fallback and threads collection_name through search_memories. |
mempalace/repair.py |
Adds _get_collection_name() and uses it across scan/prune/status/rebuild + SQLite count safety checks. |
mempalace/mcp_server.py |
Uses configured collection for capacity probes/search/status; adds post-write validation and more robust reconnect resets. |
mempalace/cli.py |
Repair command now uses configured collection and restores backup + exits non-zero on rebuild failure. |
mempalace/backends/chroma.py |
Adds invalid/truncated HNSW metadata quarantine and updates capacity status behavior for missing metadata. |
tests/test_searcher.py |
Adds regression test ensuring vector path uses explicit collection_name. |
tests/test_repair.py |
Adds coverage for configured-collection behavior in truncation safety, rebuild, and status. |
tests/test_mcp_server.py |
Adds tests for post-write readback failure and reconnect cache-reset behavior. |
tests/test_hnsw_capacity.py |
Expands BM25 fallback and capacity-status coverage including configured collection scoping. |
tests/test_cli.py |
Adds tests for configured collection usage and backup-restore-on-failure behavior. |
tests/test_backends.py |
Adds test coverage for invalid HNSW metadata quarantine and configured collection defaulting. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
19a8977 to
34282d0
Compare
|
Addressed both config lookup comments in
MCP still threads |
Five small hardening fixes for the from-sqlite rebuild path, all from mjc's review on MemPalace#1310: - repair.py: drawers collection name now resolves from MempalaceConfig().collection_name via _drawers_collection_name() (closets stays fixed by design — AAAK index references drawer IDs by string). Lines up with the broader configured-collection work in MemPalace#1312 so that PR can rebase cleanly on top. - repair.py: create_collection() moved inside the try block in _rebuild_one_collection so a Chroma "Collection already exists" failure surfaces as RebuildPartialError with archive_path, not an unstructured exception that strands the user without recovery instructions. - repair.py: rebuild_from_sqlite wraps backend lifetime in try/finally with backend.close() so PersistentClient handles to dest_palace are released on every exit path. The from-sqlite path post-dates MemPalace#1285's lifecycle hardening of the legacy rebuild, so this needed its own cleanup. - cli.py: cmd_repair (from-sqlite mode) now exits non-zero when rebuild_from_sqlite returns {} (validation refusal sentinel), so unattended scripts/CI distinguish "invalid inputs" from a successful rebuild that legitimately found zero rows. - tests/test_repair.py: test_extract_via_sqlite_returns_all_rows_with_metadata now asserts every backing segment is scope='METADATA', locking in the segment-layout assumption against future regressions that point the JOIN at the VECTOR segment. New test coverage: - test_rebuild_from_sqlite_honors_configured_drawer_collection_name - test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero - test_cmd_repair_from_sqlite_success_does_not_exit Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five small hardening fixes for the from-sqlite rebuild path, all from mjc's review on MemPalace#1310: - repair.py: drawers collection name now resolves from MempalaceConfig().collection_name via _drawers_collection_name() (closets stays fixed by design — AAAK index references drawer IDs by string). Lines up with the broader configured-collection work in MemPalace#1312 so that PR can rebase cleanly on top. - repair.py: create_collection() moved inside the try block in _rebuild_one_collection so a Chroma "Collection already exists" failure surfaces as RebuildPartialError with archive_path, not an unstructured exception that strands the user without recovery instructions. - repair.py: rebuild_from_sqlite wraps backend lifetime in try/finally with backend.close() so PersistentClient handles to dest_palace are released on every exit path. The from-sqlite path post-dates MemPalace#1285's lifecycle hardening of the legacy rebuild, so this needed its own cleanup. - cli.py: cmd_repair (from-sqlite mode) now exits non-zero when rebuild_from_sqlite returns {} (validation refusal sentinel), so unattended scripts/CI distinguish "invalid inputs" from a successful rebuild that legitimately found zero rows. - tests/test_repair.py: test_extract_via_sqlite_returns_all_rows_with_metadata now asserts every backing segment is scope='METADATA', locking in the segment-layout assumption against future regressions that point the JOIN at the VECTOR segment. New test coverage: - test_rebuild_from_sqlite_honors_configured_drawer_collection_name - test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero - test_cmd_repair_from_sqlite_success_does_not_exit Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The collection_name plumbing rebase produced a few unformatted blocks in test_mcp_server.py and test_searcher.py; bringing them in line with the 0.4.x CI pin so test-windows / lint stay green.
34282d0 to
e9aee19
Compare
|
Rebased onto develop tip post-#1285/#1310 merges (force-pushed with `--force-with-lease`). Two non-trivial conflict resolutions, plus a format follow-up:
Local: 1603 tests pass, ruff lint + format clean against the 0.4.x CI pin. Watching CI. |
Three conflicts, all from develop landing MemPalace#1285/MemPalace#1310/MemPalace#1312 after this branch was authored: - mempalace/cli.py: keep both import sets — this branch's maybe_repair_poisoned_max_seq_id_before_rebuild plus develop's RebuildCollectionError / _close_chroma_handles / _extract_drawers / _rebuild_collection_via_temp added in MemPalace#1285. - mempalace/repair.py: keep this branch's maybe_repair_poisoned_max_seq_id_before_rebuild definition; use develop's rebuild_index signature with the collection_name parameter added in MemPalace#1312. Normalized print indent to 2 spaces matching the rest of the file. - tests/test_repair.py: keep both this branch's max_seq_id preflight tests and develop's rebuild_from_sqlite + configured-collection-name tests; they exercise distinct code paths and don't overlap. Local: 1617 tests pass, ruff lint+format clean against 0.4.x CI pin.
Conflicts opened by MemPalace#1285 (temp-staging rebuild) and MemPalace#1312 (collection_name in recovery paths) merging after this branch was authored. mempalace/repair.py: - Kept this branch's sqlite_integrity_errors() and print_sqlite_integrity_abort() helpers; took develop's rebuild_index signature with the collection_name parameter from MemPalace#1312. Normalized the helper's print indent to 2 spaces to match the rest of the file. tests/test_repair.py: - Kept both this branch's sqlite_integrity_errors tests and develop's rebuild_from_sqlite + configured-collection coverage. - Replaced 7 sites of sqlite_path.write_text("fake") with sqlite3.connect(...).close() — write_text("fake") fails PRAGMA quick_check, so the new preflight aborts before the rebuild logic the tests actually exercise. An empty real SQLite DB passes quick_check and lets the tests run as intended. - Took develop's temp-staging assertion shape (delete/create the __repair_tmp collection in addition to the live drawers collection) for the existing test_rebuild_index_success test. Local: 1618 tests pass, ruff lint+format clean against 0.4.x CI pin.
What does this PR do?
This PR is intentionally stacked on #1285. The first five commits are from #1285 and are included only so this branch can be reviewed/tested against the repair staging and rollback implementation already queued for v3.3.5. They are not meant to be reviewed as new work here.
The new work in this PR is the final commit,
34282d0, which fixes configured drawer collection handling in recovery paths.MempalaceConfig().collection_nameconsistently acrosspalace, MCP status/search fallback, repair, SQLite counts, truncation safety checks, and BM25 fallback paths.Without the collection-name fix, a user with a non-default drawer collection could diagnose one collection, extract from another, or pass truncation checks against the wrong SQLite rows during recovery.
Once #1285 lands, this PR should rebase down to the single follow-up commit.
How to test
Focused regression coverage is in
tests/test_mcp_server.py,tests/test_searcher.py,tests/test_repair.py,tests/test_cli.py, andtests/test_backends.py.Checklist
python -m pytest tests/ -v)ruff check .)