fix: harden Chroma repair preflight and rollback recovery#1285
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens MemPalace’s Chroma backend startup/repair paths by proactively quarantining invalid HNSW metadata before opening PersistentClient, and by making repair rebuilds safer via temporary staging + improved rollback/backup restore behavior.
Changes:
- Add
quarantine_invalid_hnsw_metadata()preflight (safe-unpickle + schema validation) and integrate it into Chroma client creation / cold-start quarantine gating. - Rework repair rebuild to stage via a temporary collection before replacing the live collection, with more explicit failure signaling (
RebuildCollectionError.live_replaced). - Update CLI + tests to restore from backup on repair failures and to exit non-zero on repair failure.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
mempalace/backends/chroma.py |
Adds invalid-HNSW-metadata quarantine and wires it into client startup gating. |
mempalace/repair.py |
Introduces temp-staged rebuild helpers + rollback/restore handling during rebuild failures. |
mempalace/cli.py |
Uses shared rebuild helpers and restores from backup on repair failure (exit code 1). |
tests/test_backends.py |
Adds coverage for invalid metadata quarantine behavior + client preflight ordering/gating. |
tests/test_repair.py |
Adds coverage for staging rebuild behavior and multiple rollback scenarios. |
tests/test_cli.py |
Adds coverage for CLI repair staging and backup restore on failure. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Rollback cleanup was instantiating a fresh ChromaBackend, so the live backend that had opened the PersistentClient could keep file handles alive during restore. Close the active backend instance instead so rollback and CLI recovery can release Windows-safe locks before copying the backup back into place.
|
Rebased onto current upstream/develop to pick up the recent HNSW threshold and Chroma reopen fixes. The only semantic conflict was the missing index_metadata.pickle path: I kept upstream’s dynamic sync_threshold scaling, but preserved this PR’s behavior that missing pickle metadata is inconclusive and should not disable vector search by itself. |
Five small hardening fixes for the from-sqlite rebuild path, all from mjc's review on MemPalace#1310: - repair.py: drawers collection name now resolves from MempalaceConfig().collection_name via _drawers_collection_name() (closets stays fixed by design — AAAK index references drawer IDs by string). Lines up with the broader configured-collection work in MemPalace#1312 so that PR can rebase cleanly on top. - repair.py: create_collection() moved inside the try block in _rebuild_one_collection so a Chroma "Collection already exists" failure surfaces as RebuildPartialError with archive_path, not an unstructured exception that strands the user without recovery instructions. - repair.py: rebuild_from_sqlite wraps backend lifetime in try/finally with backend.close() so PersistentClient handles to dest_palace are released on every exit path. The from-sqlite path post-dates MemPalace#1285's lifecycle hardening of the legacy rebuild, so this needed its own cleanup. - cli.py: cmd_repair (from-sqlite mode) now exits non-zero when rebuild_from_sqlite returns {} (validation refusal sentinel), so unattended scripts/CI distinguish "invalid inputs" from a successful rebuild that legitimately found zero rows. - tests/test_repair.py: test_extract_via_sqlite_returns_all_rows_with_metadata now asserts every backing segment is scope='METADATA', locking in the segment-layout assumption against future regressions that point the JOIN at the VECTOR segment. New test coverage: - test_rebuild_from_sqlite_honors_configured_drawer_collection_name - test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero - test_cmd_repair_from_sqlite_success_does_not_exit Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five small hardening fixes for the from-sqlite rebuild path, all from mjc's review on MemPalace#1310: - repair.py: drawers collection name now resolves from MempalaceConfig().collection_name via _drawers_collection_name() (closets stays fixed by design — AAAK index references drawer IDs by string). Lines up with the broader configured-collection work in MemPalace#1312 so that PR can rebase cleanly on top. - repair.py: create_collection() moved inside the try block in _rebuild_one_collection so a Chroma "Collection already exists" failure surfaces as RebuildPartialError with archive_path, not an unstructured exception that strands the user without recovery instructions. - repair.py: rebuild_from_sqlite wraps backend lifetime in try/finally with backend.close() so PersistentClient handles to dest_palace are released on every exit path. The from-sqlite path post-dates MemPalace#1285's lifecycle hardening of the legacy rebuild, so this needed its own cleanup. - cli.py: cmd_repair (from-sqlite mode) now exits non-zero when rebuild_from_sqlite returns {} (validation refusal sentinel), so unattended scripts/CI distinguish "invalid inputs" from a successful rebuild that legitimately found zero rows. - tests/test_repair.py: test_extract_via_sqlite_returns_all_rows_with_metadata now asserts every backing segment is scope='METADATA', locking in the segment-layout assumption against future regressions that point the JOIN at the VECTOR segment. New test coverage: - test_rebuild_from_sqlite_honors_configured_drawer_collection_name - test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero - test_cmd_repair_from_sqlite_success_does_not_exit Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… conflicts) Develop (post-MemPalace#1162 lock-plumbing era) refactored the per-open quarantine pass into ChromaBackend._prepare_palace_for_open. This branch's inline-expansion form added quarantine_invalid_hnsw_metadata as a third check, plus a "discard from _quarantined_paths on inode swap" guard so re-opens against a different physical DB re-run quarantine. Resolution merges both: - _prepare_palace_for_open now also calls quarantine_invalid_hnsw_metadata, gated by the same _quarantined_paths set. - _client keeps the inode_changed -> _quarantined_paths.discard() guard before calling the helper, so a fresh inode triggers a fresh pass. - make_client collapses to a single _prepare_palace_for_open() call. - test_backends.py keeps both the pickle (MemPalace#1285) and shutil (develop) imports — both are used.
|
Maintainer-edit pushed to keep this on track for v3.3.5:
Now also has `quarantine_invalid_hnsw_metadata` reachable from every open path (CLI mining, search, repair, status), not just `make_client`. Watching CI for the cross-platform jobs. |
`backend: ChromaBackend | None = None` evaluates the X | None union eagerly at function-definition time, which Python 3.9 rejects with TypeError: unsupported operand type(s) for |: 'ABCMeta' and 'NoneType' since the new union syntax is 3.10+. Quoting matches the existing forward-reference style in repair.py (sqlite_drawer_count, etc.) and defers evaluation, restoring 3.9 compatibility.
Three conflicts, all from develop landing MemPalace#1285/MemPalace#1310/MemPalace#1312 after this branch was authored: - mempalace/cli.py: keep both import sets — this branch's maybe_repair_poisoned_max_seq_id_before_rebuild plus develop's RebuildCollectionError / _close_chroma_handles / _extract_drawers / _rebuild_collection_via_temp added in MemPalace#1285. - mempalace/repair.py: keep this branch's maybe_repair_poisoned_max_seq_id_before_rebuild definition; use develop's rebuild_index signature with the collection_name parameter added in MemPalace#1312. Normalized print indent to 2 spaces matching the rest of the file. - tests/test_repair.py: keep both this branch's max_seq_id preflight tests and develop's rebuild_from_sqlite + configured-collection-name tests; they exercise distinct code paths and don't overlap. Local: 1617 tests pass, ruff lint+format clean against 0.4.x CI pin.
Two conflicts, both because MemPalace#1339 (bloated link payloads) merged into develop after this branch was authored: - mempalace/backends/chroma.py: _segment_appears_healthy now stacks three checks — bloated-link from MemPalace#1339 (top), missing-metadata-with- data-floor from this branch (middle), pickle format sniff (bottom). All three are complementary; MemPalace#1339 catches structural payload corruption, this branch catches pickle truncation, the original catches pickle protocol-byte corruption. - tests/test_backends.py: kept both new imports (_segment_appears_healthy from this branch, quarantine_invalid_hnsw_metadata from MemPalace#1285). Local: 1618 tests pass, ruff lint+format clean against 0.4.x CI pin.
Conflicts opened by MemPalace#1285 (temp-staging rebuild) and MemPalace#1312 (collection_name in recovery paths) merging after this branch was authored. mempalace/repair.py: - Kept this branch's sqlite_integrity_errors() and print_sqlite_integrity_abort() helpers; took develop's rebuild_index signature with the collection_name parameter from MemPalace#1312. Normalized the helper's print indent to 2 spaces to match the rest of the file. tests/test_repair.py: - Kept both this branch's sqlite_integrity_errors tests and develop's rebuild_from_sqlite + configured-collection coverage. - Replaced 7 sites of sqlite_path.write_text("fake") with sqlite3.connect(...).close() — write_text("fake") fails PRAGMA quick_check, so the new preflight aborts before the rebuild logic the tests actually exercise. An empty real SQLite DB passes quick_check and lets the tests run as intended. - Took develop's temp-staging assertion shape (delete/create the __repair_tmp collection in addition to the live drawers collection) for the existing test_rebuild_index_success test. Local: 1618 tests pass, ruff lint+format clean against 0.4.x CI pin.
…tClient Third bug found during the Phase 4.1 dry-run walkthrough: raw chromadb.PersistentClient(path=...) SIGSEGVs when opening any long-lived palace that has stale HNSW segments or invalid index_metadata.pickle files. This is the exact issue the cascade work filed as chroma-core/chroma#6949 and that mempalace works around via ChromaBackend._prepare_palace_for_open. The daemon serves the live palace fine because mempalace's ChromaBackend calls _prepare_palace_for_open before every chromadb.PersistentClient() construction. That function runs three preflight steps: 1. _fix_blob_seq_ids — repairs BLOB seq_id quirk from older migrations 2. quarantine_invalid_hnsw_metadata — renames aside bad index_metadata files (MemPalace#1266 / PR MemPalace#1285) 3. quarantine_stale_hnsw — quarantines stale HNSW segments (MemPalace#1121, MemPalace#1132, MemPalace#1263 — the SIGSEGV prevention path) The migration tool's phase_2_drawers was using raw chromadb directly, inheriting all three risk classes. Now it calls ChromaBackend._prepare_palace_for_open(chroma_path) before opening the source. Idempotent — safe to call even on freshly-snapshotted palaces. Without this fix, the migration crashes on every long-lived palace (verified live on disks's 160K-drawer palace). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
What does this PR do?
Hardens the Chroma repair path so a bad or partially corrupted palace does not get worse during startup or recovery.
This PR:
This follows up on PR #1124 by covering additional startup, reconnect, rebuild, and rollback failure modes found in practice, without pulling in that PR's unrelated Windows CLI encoding and status pagination changes.
How to test