Skip to content

fix: harden Chroma repair preflight and rollback recovery#1285

Merged
igorls merged 7 commits into
MemPalace:developfrom
mjc:hnsw-repair
May 7, 2026
Merged

fix: harden Chroma repair preflight and rollback recovery#1285
igorls merged 7 commits into
MemPalace:developfrom
mjc:hnsw-repair

Conversation

@mjc

@mjc mjc commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Hardens the Chroma repair path so a bad or partially corrupted palace does not get worse during startup or recovery.

This PR:

  • validates and quarantines invalid HNSW metadata before reopening a client
  • skips destructive quarantine for transient truncated pickle reads during reconnects
  • rebuilds collections through a temporary staging collection so a failed rebuild does not touch the live index
  • restores from backup if repair fails after the live collection has already been replaced
  • makes the CLI exit non-zero when repair recovery fails
  • preserves the original rebuild error if later cleanup also fails
  • closes the active backend before rollback restore so file handles are released
  • makes CLI repair restore use the live backend instance rather than a fresh helper instance

This follows up on PR #1124 by covering additional startup, reconnect, rebuild, and rollback failure modes found in practice, without pulling in that PR's unrelated Windows CLI encoding and status pagination changes.

How to test

  1. Start with a palace whose Chroma HNSW metadata is malformed or partially corrupted, then start MemPalace and confirm startup quarantines the invalid state instead of reopening it.
  2. Simulate a transient truncated pickle read during reconnect and confirm it is treated as a transient read failure rather than triggering quarantine.
  3. Force a rebuild failure while repairing a collection and confirm the live collection remains untouched because rebuild happens through a staging collection.
  4. Force a failure after live replacement during repair and confirm rollback restores the backup successfully.
  5. Run the CLI repair path through a recovery failure and confirm it exits non-zero instead of reporting success.

Copilot AI review requested due to automatic review settings April 30, 2026 15:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens MemPalace’s Chroma backend startup/repair paths by proactively quarantining invalid HNSW metadata before opening PersistentClient, and by making repair rebuilds safer via temporary staging + improved rollback/backup restore behavior.

Changes:

  • Add quarantine_invalid_hnsw_metadata() preflight (safe-unpickle + schema validation) and integrate it into Chroma client creation / cold-start quarantine gating.
  • Rework repair rebuild to stage via a temporary collection before replacing the live collection, with more explicit failure signaling (RebuildCollectionError.live_replaced).
  • Update CLI + tests to restore from backup on repair failures and to exit non-zero on repair failure.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
mempalace/backends/chroma.py Adds invalid-HNSW-metadata quarantine and wires it into client startup gating.
mempalace/repair.py Introduces temp-staged rebuild helpers + rollback/restore handling during rebuild failures.
mempalace/cli.py Uses shared rebuild helpers and restores from backup on repair failure (exit code 1).
tests/test_backends.py Adds coverage for invalid metadata quarantine behavior + client preflight ordering/gating.
tests/test_repair.py Adds coverage for staging rebuild behavior and multiple rollback scenarios.
tests/test_cli.py Adds coverage for CLI repair staging and backup restore on failure.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/cli.py
Comment thread mempalace/repair.py Outdated
@mjc mjc changed the title Harden Chroma startup preflight and repair recovery fix: harden Chroma repair preflight and rollback recovery Apr 30, 2026
@igorls igorls added this to the v3.3.5 milestone May 1, 2026
mjc added 5 commits May 1, 2026 12:42
Rollback cleanup was instantiating a fresh ChromaBackend, so the live backend that had opened the PersistentClient could keep file handles alive during restore. Close the active backend instance instead so rollback and CLI recovery can release Windows-safe locks before copying the backup back into place.
@mjc

mjc commented May 1, 2026

Copy link
Copy Markdown
Contributor Author

Rebased onto current upstream/develop to pick up the recent HNSW threshold and Chroma reopen fixes. The only semantic conflict was the missing index_metadata.pickle path: I kept upstream’s dynamic sync_threshold scaling, but preserved this PR’s behavior that missing pickle metadata is inconclusive and should not disable vector search by itself.

potterdigital added a commit to potterdigital/mempalace that referenced this pull request May 2, 2026
Five small hardening fixes for the from-sqlite rebuild path, all from
mjc's review on MemPalace#1310:

- repair.py: drawers collection name now resolves from
  MempalaceConfig().collection_name via _drawers_collection_name() (closets
  stays fixed by design — AAAK index references drawer IDs by string).
  Lines up with the broader configured-collection work in MemPalace#1312 so that
  PR can rebase cleanly on top.
- repair.py: create_collection() moved inside the try block in
  _rebuild_one_collection so a Chroma "Collection already exists" failure
  surfaces as RebuildPartialError with archive_path, not an unstructured
  exception that strands the user without recovery instructions.
- repair.py: rebuild_from_sqlite wraps backend lifetime in try/finally
  with backend.close() so PersistentClient handles to dest_palace are
  released on every exit path. The from-sqlite path post-dates MemPalace#1285's
  lifecycle hardening of the legacy rebuild, so this needed its own
  cleanup.
- cli.py: cmd_repair (from-sqlite mode) now exits non-zero when
  rebuild_from_sqlite returns {} (validation refusal sentinel), so
  unattended scripts/CI distinguish "invalid inputs" from a successful
  rebuild that legitimately found zero rows.
- tests/test_repair.py: test_extract_via_sqlite_returns_all_rows_with_metadata
  now asserts every backing segment is scope='METADATA', locking in the
  segment-layout assumption against future regressions that point the
  JOIN at the VECTOR segment.

New test coverage:
- test_rebuild_from_sqlite_honors_configured_drawer_collection_name
- test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero
- test_cmd_repair_from_sqlite_success_does_not_exit

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igorls igorls added area/cli CLI commands bug Something isn't working storage labels May 2, 2026
igorls pushed a commit to potterdigital/mempalace that referenced this pull request May 6, 2026
Five small hardening fixes for the from-sqlite rebuild path, all from
mjc's review on MemPalace#1310:

- repair.py: drawers collection name now resolves from
  MempalaceConfig().collection_name via _drawers_collection_name() (closets
  stays fixed by design — AAAK index references drawer IDs by string).
  Lines up with the broader configured-collection work in MemPalace#1312 so that
  PR can rebase cleanly on top.
- repair.py: create_collection() moved inside the try block in
  _rebuild_one_collection so a Chroma "Collection already exists" failure
  surfaces as RebuildPartialError with archive_path, not an unstructured
  exception that strands the user without recovery instructions.
- repair.py: rebuild_from_sqlite wraps backend lifetime in try/finally
  with backend.close() so PersistentClient handles to dest_palace are
  released on every exit path. The from-sqlite path post-dates MemPalace#1285's
  lifecycle hardening of the legacy rebuild, so this needed its own
  cleanup.
- cli.py: cmd_repair (from-sqlite mode) now exits non-zero when
  rebuild_from_sqlite returns {} (validation refusal sentinel), so
  unattended scripts/CI distinguish "invalid inputs" from a successful
  rebuild that legitimately found zero rows.
- tests/test_repair.py: test_extract_via_sqlite_returns_all_rows_with_metadata
  now asserts every backing segment is scope='METADATA', locking in the
  segment-layout assumption against future regressions that point the
  JOIN at the VECTOR segment.

New test coverage:
- test_rebuild_from_sqlite_honors_configured_drawer_collection_name
- test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero
- test_cmd_repair_from_sqlite_success_does_not_exit

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… conflicts)

Develop (post-MemPalace#1162 lock-plumbing era) refactored the per-open quarantine
pass into ChromaBackend._prepare_palace_for_open. This branch's
inline-expansion form added quarantine_invalid_hnsw_metadata as a third
check, plus a "discard from _quarantined_paths on inode swap" guard so
re-opens against a different physical DB re-run quarantine.

Resolution merges both:

- _prepare_palace_for_open now also calls quarantine_invalid_hnsw_metadata,
  gated by the same _quarantined_paths set.
- _client keeps the inode_changed -> _quarantined_paths.discard() guard
  before calling the helper, so a fresh inode triggers a fresh pass.
- make_client collapses to a single _prepare_palace_for_open() call.
- test_backends.py keeps both the pickle (MemPalace#1285) and shutil (develop)
  imports — both are used.
@igorls

igorls commented May 7, 2026

Copy link
Copy Markdown
Member

Maintainer-edit pushed to keep this on track for v3.3.5:

  • Merged `develop` in. Two real conflicts: `mempalace/backends/chroma.py` (post-fix: serialize ChromaCollection writes through palace lock #1162 lock-plumbing era refactored the per-open pass into `ChromaBackend._prepare_palace_for_open`, which collided with this branch's inline-expansion form that added `quarantine_invalid_hnsw_metadata`) and a trivial import collision in `tests/test_backends.py`.
  • Resolution: kept the `_prepare_palace_for_open` helper from develop and folded `quarantine_invalid_hnsw_metadata` into it, gated by the same `_quarantined_paths` set. Kept the `if inode_changed: _quarantined_paths.discard` guard from this branch in `_client` so a fresh inode re-runs quarantine.
  • Linux full suite: 1579 passed, ruff lint+format clean against the 0.4.x CI pin.

Now also has `quarantine_invalid_hnsw_metadata` reachable from every open path (CLI mining, search, repair, status), not just `make_client`. Watching CI for the cross-platform jobs.

`backend: ChromaBackend | None = None` evaluates the X | None union
eagerly at function-definition time, which Python 3.9 rejects with
TypeError: unsupported operand type(s) for |: 'ABCMeta' and 'NoneType'
since the new union syntax is 3.10+. Quoting matches the existing
forward-reference style in repair.py (sqlite_drawer_count, etc.) and
defers evaluation, restoring 3.9 compatibility.
@igorls igorls merged commit 88493ac into MemPalace:develop May 7, 2026
6 checks passed
igorls added a commit to fatkobra/mempalace that referenced this pull request May 7, 2026
Three conflicts, all from develop landing MemPalace#1285/MemPalace#1310/MemPalace#1312 after this
branch was authored:

- mempalace/cli.py: keep both import sets — this branch's
  maybe_repair_poisoned_max_seq_id_before_rebuild plus develop's
  RebuildCollectionError / _close_chroma_handles / _extract_drawers /
  _rebuild_collection_via_temp added in MemPalace#1285.

- mempalace/repair.py: keep this branch's
  maybe_repair_poisoned_max_seq_id_before_rebuild definition; use
  develop's rebuild_index signature with the collection_name parameter
  added in MemPalace#1312. Normalized print indent to 2 spaces matching the
  rest of the file.

- tests/test_repair.py: keep both this branch's max_seq_id preflight
  tests and develop's rebuild_from_sqlite + configured-collection-name
  tests; they exercise distinct code paths and don't overlap.

Local: 1617 tests pass, ruff lint+format clean against 0.4.x CI pin.
igorls added a commit to fatkobra/mempalace that referenced this pull request May 7, 2026
Two conflicts, both because MemPalace#1339 (bloated link payloads) merged into
develop after this branch was authored:

- mempalace/backends/chroma.py: _segment_appears_healthy now stacks
  three checks — bloated-link from MemPalace#1339 (top), missing-metadata-with-
  data-floor from this branch (middle), pickle format sniff (bottom).
  All three are complementary; MemPalace#1339 catches structural payload
  corruption, this branch catches pickle truncation, the original
  catches pickle protocol-byte corruption.

- tests/test_backends.py: kept both new imports (_segment_appears_healthy
  from this branch, quarantine_invalid_hnsw_metadata from MemPalace#1285).

Local: 1618 tests pass, ruff lint+format clean against 0.4.x CI pin.
igorls added a commit to fatkobra/mempalace that referenced this pull request May 7, 2026
Conflicts opened by MemPalace#1285 (temp-staging rebuild) and MemPalace#1312
(collection_name in recovery paths) merging after this branch was
authored.

mempalace/repair.py:
- Kept this branch's sqlite_integrity_errors() and
  print_sqlite_integrity_abort() helpers; took develop's rebuild_index
  signature with the collection_name parameter from MemPalace#1312. Normalized
  the helper's print indent to 2 spaces to match the rest of the file.

tests/test_repair.py:
- Kept both this branch's sqlite_integrity_errors tests and develop's
  rebuild_from_sqlite + configured-collection coverage.
- Replaced 7 sites of sqlite_path.write_text("fake") with
  sqlite3.connect(...).close() — write_text("fake") fails PRAGMA
  quick_check, so the new preflight aborts before the rebuild logic
  the tests actually exercise. An empty real SQLite DB passes
  quick_check and lets the tests run as intended.
- Took develop's temp-staging assertion shape (delete/create the
  __repair_tmp collection in addition to the live drawers collection)
  for the existing test_rebuild_index_success test.

Local: 1618 tests pass, ruff lint+format clean against 0.4.x CI pin.
jphein added a commit to techempower-org/mempalace that referenced this pull request May 13, 2026
…tClient

Third bug found during the Phase 4.1 dry-run walkthrough: raw
chromadb.PersistentClient(path=...) SIGSEGVs when opening any long-lived
palace that has stale HNSW segments or invalid index_metadata.pickle
files. This is the exact issue the cascade work filed as
chroma-core/chroma#6949 and that mempalace works around via
ChromaBackend._prepare_palace_for_open.

The daemon serves the live palace fine because mempalace's ChromaBackend
calls _prepare_palace_for_open before every chromadb.PersistentClient()
construction. That function runs three preflight steps:

  1. _fix_blob_seq_ids — repairs BLOB seq_id quirk from older migrations
  2. quarantine_invalid_hnsw_metadata — renames aside bad index_metadata
     files (MemPalace#1266 / PR MemPalace#1285)
  3. quarantine_stale_hnsw — quarantines stale HNSW segments (MemPalace#1121,
     MemPalace#1132, MemPalace#1263 — the SIGSEGV prevention path)

The migration tool's phase_2_drawers was using raw chromadb directly,
inheriting all three risk classes. Now it calls
ChromaBackend._prepare_palace_for_open(chroma_path) before opening the
source. Idempotent — safe to call even on freshly-snapshotted palaces.

Without this fix, the migration crashes on every long-lived palace
(verified live on disks's 160K-drawer palace).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cli CLI commands bug Something isn't working storage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants