Skip to content

fix: use configured collection in repair recovery paths#1312

Merged
igorls merged 2 commits into
MemPalace:developfrom
mjc:stale-chroma-reconnect
May 7, 2026
Merged

fix: use configured collection in repair recovery paths#1312
igorls merged 2 commits into
MemPalace:developfrom
mjc:stale-chroma-reconnect

Conversation

@mjc

@mjc mjc commented May 2, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR is intentionally stacked on #1285. The first five commits are from #1285 and are included only so this branch can be reviewed/tested against the repair staging and rollback implementation already queued for v3.3.5. They are not meant to be reviewed as new work here.

The new work in this PR is the final commit, 34282d0, which fixes configured drawer collection handling in recovery paths.

  • Uses MempalaceConfig().collection_name consistently across palace, MCP status/search fallback, repair, SQLite counts, truncation safety checks, and BM25 fallback paths.
  • Drops cached MemPalace/Chroma handles on MCP reconnect and reports reset failures.
  • Adds post-write readback validation for MCP drawer writes.
  • Scopes SQLite BM25 candidates to the configured collection and filtered wing/room before limiting.

Without the collection-name fix, a user with a non-default drawer collection could diagnose one collection, extract from another, or pass truncation checks against the wrong SQLite rows during recovery.

Once #1285 lands, this PR should rebase down to the single follow-up commit.

How to test

Focused regression coverage is in tests/test_mcp_server.py, tests/test_searcher.py, tests/test_repair.py, tests/test_cli.py, and tests/test_backends.py.

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

Copilot AI review requested due to automatic review settings May 2, 2026 06:00

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens MemPalace’s ChromaDB recovery/repair and SQLite fallback paths by consistently using the configured drawers collection name (instead of assuming mempalace_drawers) and by adding additional safety checks around stale/damaged vector state.

Changes:

  • Default drawers collection resolution now uses MempalaceConfig().collection_name and threads that value through repair/status/BM25 fallback/reconnect flows.
  • SQLite BM25 fallback is scoped to the configured collection and applies wing/room scoping before candidate limiting.
  • Recovery paths are hardened: reconnect clears shared caches, drawer writes do a post-write readback check, and CLI repair restores from backup + exits non-zero on rebuild failure.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
mempalace/palace.py Makes get_collection() default to the configured collection name.
mempalace/searcher.py Adds collection scoping to BM25 SQLite fallback and threads collection_name through search_memories.
mempalace/repair.py Adds _get_collection_name() and uses it across scan/prune/status/rebuild + SQLite count safety checks.
mempalace/mcp_server.py Uses configured collection for capacity probes/search/status; adds post-write validation and more robust reconnect resets.
mempalace/cli.py Repair command now uses configured collection and restores backup + exits non-zero on rebuild failure.
mempalace/backends/chroma.py Adds invalid/truncated HNSW metadata quarantine and updates capacity status behavior for missing metadata.
tests/test_searcher.py Adds regression test ensuring vector path uses explicit collection_name.
tests/test_repair.py Adds coverage for configured-collection behavior in truncation safety, rebuild, and status.
tests/test_mcp_server.py Adds tests for post-write readback failure and reconnect cache-reset behavior.
tests/test_hnsw_capacity.py Expands BM25 fallback and capacity-status coverage including configured collection scoping.
tests/test_cli.py Adds tests for configured collection usage and backup-restore-on-failure behavior.
tests/test_backends.py Adds test coverage for invalid HNSW metadata quarantine and configured collection defaulting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/palace.py Outdated
Comment thread mempalace/searcher.py
@mjc mjc force-pushed the stale-chroma-reconnect branch 2 times, most recently from 19a8977 to 34282d0 Compare May 2, 2026 06:24
@mjc

mjc commented May 2, 2026

Copy link
Copy Markdown
Contributor Author

Addressed both config lookup comments in 34282d0.

DEFAULT_COLLECTION_NAME already lived in mempalace.config, so I added get_configured_collection_name() there as the shared cached resolver. palace.get_collection(), repair._get_collection_name(), and _bm25_only_via_sqlite() now use that helper instead of constructing MempalaceConfig() directly in fallback paths.

MCP still threads _config.collection_name explicitly where it already has the config object, so the cached helper is only for callers that do not already have one.

potterdigital added a commit to potterdigital/mempalace that referenced this pull request May 2, 2026
Five small hardening fixes for the from-sqlite rebuild path, all from
mjc's review on MemPalace#1310:

- repair.py: drawers collection name now resolves from
  MempalaceConfig().collection_name via _drawers_collection_name() (closets
  stays fixed by design — AAAK index references drawer IDs by string).
  Lines up with the broader configured-collection work in MemPalace#1312 so that
  PR can rebase cleanly on top.
- repair.py: create_collection() moved inside the try block in
  _rebuild_one_collection so a Chroma "Collection already exists" failure
  surfaces as RebuildPartialError with archive_path, not an unstructured
  exception that strands the user without recovery instructions.
- repair.py: rebuild_from_sqlite wraps backend lifetime in try/finally
  with backend.close() so PersistentClient handles to dest_palace are
  released on every exit path. The from-sqlite path post-dates MemPalace#1285's
  lifecycle hardening of the legacy rebuild, so this needed its own
  cleanup.
- cli.py: cmd_repair (from-sqlite mode) now exits non-zero when
  rebuild_from_sqlite returns {} (validation refusal sentinel), so
  unattended scripts/CI distinguish "invalid inputs" from a successful
  rebuild that legitimately found zero rows.
- tests/test_repair.py: test_extract_via_sqlite_returns_all_rows_with_metadata
  now asserts every backing segment is scope='METADATA', locking in the
  segment-layout assumption against future regressions that point the
  JOIN at the VECTOR segment.

New test coverage:
- test_rebuild_from_sqlite_honors_configured_drawer_collection_name
- test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero
- test_cmd_repair_from_sqlite_success_does_not_exit

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igorls igorls added area/cli CLI commands bug Something isn't working storage labels May 2, 2026
@igorls igorls added this to the v3.3.5 milestone May 2, 2026
igorls pushed a commit to potterdigital/mempalace that referenced this pull request May 6, 2026
Five small hardening fixes for the from-sqlite rebuild path, all from
mjc's review on MemPalace#1310:

- repair.py: drawers collection name now resolves from
  MempalaceConfig().collection_name via _drawers_collection_name() (closets
  stays fixed by design — AAAK index references drawer IDs by string).
  Lines up with the broader configured-collection work in MemPalace#1312 so that
  PR can rebase cleanly on top.
- repair.py: create_collection() moved inside the try block in
  _rebuild_one_collection so a Chroma "Collection already exists" failure
  surfaces as RebuildPartialError with archive_path, not an unstructured
  exception that strands the user without recovery instructions.
- repair.py: rebuild_from_sqlite wraps backend lifetime in try/finally
  with backend.close() so PersistentClient handles to dest_palace are
  released on every exit path. The from-sqlite path post-dates MemPalace#1285's
  lifecycle hardening of the legacy rebuild, so this needed its own
  cleanup.
- cli.py: cmd_repair (from-sqlite mode) now exits non-zero when
  rebuild_from_sqlite returns {} (validation refusal sentinel), so
  unattended scripts/CI distinguish "invalid inputs" from a successful
  rebuild that legitimately found zero rows.
- tests/test_repair.py: test_extract_via_sqlite_returns_all_rows_with_metadata
  now asserts every backing segment is scope='METADATA', locking in the
  segment-layout assumption against future regressions that point the
  JOIN at the VECTOR segment.

New test coverage:
- test_rebuild_from_sqlite_honors_configured_drawer_collection_name
- test_cmd_repair_from_sqlite_validation_refusal_exits_nonzero
- test_cmd_repair_from_sqlite_success_does_not_exit

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mjc and others added 2 commits May 7, 2026 09:10
The collection_name plumbing rebase produced a few unformatted blocks
in test_mcp_server.py and test_searcher.py; bringing them in line with
the 0.4.x CI pin so test-windows / lint stay green.
@igorls igorls force-pushed the stale-chroma-reconnect branch from 34282d0 to e9aee19 Compare May 7, 2026 12:10
@igorls

igorls commented May 7, 2026

Copy link
Copy Markdown
Member

Rebased onto develop tip post-#1285/#1310 merges (force-pushed with `--force-with-lease`).

Two non-trivial conflict resolutions, plus a format follow-up:

  1. `mempalace/repair.py` — develop merged fix: harden Chroma repair preflight and rollback recovery #1285 (which introduced `_drawers_collection_name()`) and fix(repair): add --mode from-sqlite to recover palaces with corrupt HNSW (#1308) #1310 (which introduced the `rebuild_from_sqlite` block). This branch's commit `34282d0` was authored against an earlier base and added a parallel helper `_get_collection_name()` that reads the same config through a different API. After rebase, both helpers existed in the working tree and the `status()` function appeared twice. Resolution:

    • Kept `_drawers_collection_name()` (the established name on develop) and removed the duplicate `_get_collection_name()`.
    • Replaced all `_get_collection_name()` call sites in this branch's added code with `_drawers_collection_name()`.
    • Replaced the hardcoded `"mempalace_closets"` literal in `status()` with `CLOSETS_COLLECTION_NAME` for consistency.
    • Updated `tests/test_repair.py` (4 tests) to patch `_drawers_collection_name` instead of the removed name.
  2. `mempalace/searcher.py` + `mempalace/palace.py` — small import / signature collisions; both new keyword-only parameters retained, both new imports retained.

  3. `tests/test_hnsw_capacity.py` — the PR's production change adds `_config.collection_name` access in `_tool_status_via_sqlite`, but the two pre-existing `_Cfg` mocks only set `palace_path`. Added `collection_name = "mempalace_drawers"` to both so the new dependency is exercised.

  4. Final `fix(tests): apply ruff format` commit covers blocks the rebase resolution left unformatted (`test_mcp_server.py`, `test_searcher.py`).

Local: 1603 tests pass, ruff lint + format clean against the 0.4.x CI pin. Watching CI.

@igorls igorls merged commit 72685f3 into MemPalace:develop May 7, 2026
6 checks passed
igorls added a commit to fatkobra/mempalace that referenced this pull request May 7, 2026
Three conflicts, all from develop landing MemPalace#1285/MemPalace#1310/MemPalace#1312 after this
branch was authored:

- mempalace/cli.py: keep both import sets — this branch's
  maybe_repair_poisoned_max_seq_id_before_rebuild plus develop's
  RebuildCollectionError / _close_chroma_handles / _extract_drawers /
  _rebuild_collection_via_temp added in MemPalace#1285.

- mempalace/repair.py: keep this branch's
  maybe_repair_poisoned_max_seq_id_before_rebuild definition; use
  develop's rebuild_index signature with the collection_name parameter
  added in MemPalace#1312. Normalized print indent to 2 spaces matching the
  rest of the file.

- tests/test_repair.py: keep both this branch's max_seq_id preflight
  tests and develop's rebuild_from_sqlite + configured-collection-name
  tests; they exercise distinct code paths and don't overlap.

Local: 1617 tests pass, ruff lint+format clean against 0.4.x CI pin.
igorls added a commit to fatkobra/mempalace that referenced this pull request May 7, 2026
Conflicts opened by MemPalace#1285 (temp-staging rebuild) and MemPalace#1312
(collection_name in recovery paths) merging after this branch was
authored.

mempalace/repair.py:
- Kept this branch's sqlite_integrity_errors() and
  print_sqlite_integrity_abort() helpers; took develop's rebuild_index
  signature with the collection_name parameter from MemPalace#1312. Normalized
  the helper's print indent to 2 spaces to match the rest of the file.

tests/test_repair.py:
- Kept both this branch's sqlite_integrity_errors tests and develop's
  rebuild_from_sqlite + configured-collection coverage.
- Replaced 7 sites of sqlite_path.write_text("fake") with
  sqlite3.connect(...).close() — write_text("fake") fails PRAGMA
  quick_check, so the new preflight aborts before the rebuild logic
  the tests actually exercise. An empty real SQLite DB passes
  quick_check and lets the tests run as intended.
- Took develop's temp-staging assertion shape (delete/create the
  __repair_tmp collection in addition to the live drawers collection)
  for the existing test_rebuild_index_success test.

Local: 1618 tests pass, ruff lint+format clean against 0.4.x CI pin.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cli CLI commands bug Something isn't working storage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants