fix(repair): scale HNSW divergence floor with hnsw:sync_threshold#1287
Conversation
The capacity probe added in MemPalace#1227 hardcoded a 2,000-row floor for the "diverged" decision. The comment justifying that number explicitly tied it to chromadb's *default* sync_threshold of 1,000 — "Two synchronization windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling". MemPalace#1191 then bumped sync_threshold to 50,000 via _HNSW_BLOAT_GUARD without updating the floor. Result: any palace created with the bloat guard flips between OK and DIVERGED on every flush cycle. Steady-state divergence sits at 0–50K (the natural queue depth), and the 2,000 floor trips the guardrail the moment the queue exceeds 10% of sqlite_count. The MCP server then routes search to BM25-only and disables duplicate detection for ~80% of the write cycle on actively-mined ≥100K palaces, even though chromadb is behaving correctly. This change reads the configured `hnsw:sync_threshold` from `collection_metadata` per palace and scales the floor to 2 × that value. The 10% relative term and the original MemPalace#1222 detection capability are unchanged — a 91%-missing-of-192K palace (the actual MemPalace#1222 reproducer) still trips, regardless of whether the collection was created with sync_threshold=1000 or 50000. Behavior summary: | Collection's sync_threshold | New floor | Old floor | |---|---|---| | Missing (legacy palace) | 2000 | 2000 (unchanged) | 1000 (chromadb default) | 2000 | 2000 (unchanged) | 50000 (MemPalace#1191 bloat guard) | 100000 | 2000 (the bug) Tests: - test_capacity_status_tolerates_lag_under_large_sync_threshold (regression for the MemPalace#1191/MemPalace#1227 conflict — 100K sqlite + 50K HNSW + sync=50K → OK) - test_capacity_status_still_flags_real_corruption_under_large_sync (MemPalace#1222 shape with bloat-guard collection — still detects corruption) - test_capacity_status_default_threshold_when_no_sync_metadata (legacy palaces without the metadata row use the 2000 fallback floor) - test_unflushed_path_also_uses_dynamic_floor (the never-flushed branch scales too — 30K under sync_threshold=50000 is no longer flagged) All 18 pre-existing tests in tests/test_hnsw_capacity.py and 45 tests in tests/test_backends.py still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three fixes landed on develop after the initial release-prep cut and were brought in via the develop merge. Document them in the 3.3.4 Bug Fixes section so the release notes reflect what users will actually receive. - #1287 - HNSW divergence floor scales with hnsw:sync_threshold (resolves a silent-fallback regression introduced by the interaction between #1191 and #1227 in this release) - #1262 - ChromaBackend get_or_create_collection split, fixing the stop-hook SIGSEGV class on legacy palaces with mismatched stored metadata (#1089) - #1288 / #1254 - repair --mode max-seq-id heuristic now decodes BLOB-typed embeddings.seq_id, restoring the un-poison path added in #1135 for palaces where chromadb 1.5.x writes seq_ids natively
Catches up xdev-patches with 112 commits from MemPalace/develop, including: - v3.3.4 release - MemPalace#1262/MemPalace#1289 ChromaDB collection-reopen crash fix (relevant to long-running MCP server & mempalace-api) - MemPalace#1287 HNSW divergence floor fix - MemPalace#1288 BLOB seq_id decode in repair - MemPalace#1180 cross-wing tunnels by shared topics - MemPalace#1194 wing-slug normalization for hyphenated dirs Conflict resolution: hooks_cli.py and mcp_server.py both had local patches (6ef44cb route CC transcripts via convo_miner; 3fad61d allow leading dash) that overlap with upstream fixes (MemPalace#1231, MemPalace#1194). Took upstream entirely on those two files — upstream's version handles separate transcript/project ingest, uses _mempalace_python(), and adds _pin_hnsw_threads. The local config.py regex relaxation auto-merged cleanly and is preserved. Safety tag: pre-upstream-merge-20260501-091227 (rollback target). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…MemPalace#1367 (embeddings portion only) Cherry-picks the `_extract_drawers` embeddings-preserve fix from upstream PR MemPalace#1367 (@zhapostolski), per the cherry-pick evaluation in #31 and @messelink's review guidance: take the embeddings fix, SKIP the divergence-floor cap (which would re-create _refresh_vector_disabled_flag false-positives on our sync_threshold= 50000 palace per MemPalace#1287's design). What the embeddings-preserve fix does: - _extract_drawers now includes "embeddings" in the col.get() call - Returns a 4-tuple: ids, docs, metas, embeddings (or None when any embedding slot is missing — HNSW unreadable case) - _rebuild_collection_via_temp accepts an all_embeddings= kwarg and passes embeddings through to upsert() when available - rebuild_index threads embeddings from _extract_drawers into _rebuild_collection_via_temp When embeddings ARE available (the common case), rebuild skips ONNX recomputation entirely. On a 150K+ palace this cuts rebuild from hours to minutes. Fallback to recompute when any embedding is None (the bug condition we're trying to repair). Updates two call sites in mempalace/cli.py (the cmd_repair path) and mempalace/repair.py (the rebuild_index path) for the new 4-tuple signature. Adds 3 tests: - test_extract_drawers_preserves_embeddings_when_all_present - test_extract_drawers_falls_back_to_none_when_any_embedding_missing - test_extract_drawers_embeddings_absent_in_batch_signals_recompute Plus updates the existing 5 _extract_drawers tests to unpack the new 4-tuple. All 75 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Thanks @messelink for the careful review on #1367 — exactly the kind of split-by-concern feedback that makes a "two bugs in one PR" situation tractable. For posterity: we followed your guidance on the cherry-pick. Took only the The fork-side context that confirms the warning was right for us:
#1287's design choice (scale divergence floor with sync_threshold) is the right one for any fork that adopted the high-sync_threshold mitigation. Glad to have it on our path. |
Summary
The capacity probe added in #1227 hardcoded a 2,000-row floor for the "DIVERGED" decision. The comment justifying that number explicitly tied it to chromadb's default
sync_thresholdof 1,000:#1191 then set
hnsw:sync_threshold = 50_000via_HNSW_BLOAT_GUARD(to prevent index bloat) without updating the divergence floor to track. The two changes shipped about a week apart and the dependency wasn't caught.Symptom
Any palace created via
ChromaBackend.create_collectionafter #1191 carrieshnsw:sync_threshold=50_000incollection_metadata. ChromaDB then naturally accumulates up to 50,000 entries in its WAL (embeddings_queue) between HNSW flushes — that's by design. The on-diskindex_metadata.pickleonly updates at flush time, so between flushes:sqlite_countgrows with every writehnsw_count(read from the pickle) stays frozen until the next flushWith the 2,000 floor, the divergence threshold becomes
max(2000, 10% × sqlite_count). For a 100K-drawer palace that's 10,000 — and the queue crosses 10K maybe a fifth of the way through every 50K-write cycle. Result: the MCP server (mcp_server._refresh_vector_disabled_flag) routes vector search to the BM25 fallback and disables duplicate detection for ~80% of the write cycle on any actively-mined ≥100K palace, even though chromadb is behaving correctly.Concretely on a 100K-drawer palace I just rebuilt:
The "49% invisible" framing is misleading: those 48,176 drawers are queued for the next HNSW flush, not corrupt. ChromaDB's WAL replay surfaces them on every fresh open —
count()returns the full 98,176, and a CLImempalace searchreturns hybrid (cosine + BM25) results across the full set. The MCP server's stricter guardrail is what disables vector search.Fix
Read the configured
hnsw:sync_thresholdfromcollection_metadataper palace and scale the floor to2 × sync_threshold. The 10% relative term, the unflushed-pickle branch, and the original #1222 detection capability are unchanged.A 91%-missing-of-192,997 palace (the actual #1222 reproducer) still trips, regardless of whether the collection was created with
sync_threshold=1000or50000.Behavior summary
hnsw:sync_threshold_HNSW_BLOAT_GUARD)Tests
Four new tests in
tests/test_hnsw_capacity.py:test_capacity_status_tolerates_lag_under_large_sync_threshold— regression for the fix: prevent HNSW index bloat from resize+persist cycles #1191/fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222) #1227 conflict: 100K sqlite + 50K HNSW +sync_threshold=50_000→OK. (Pre-fix: would flag DIVERGED.)test_capacity_status_still_flags_real_corruption_under_large_sync— HNSW max_elements frozen at 16K while collection grows to 100K+ entries — MCP server segfaults on every tool call #1222-shape (200K sqlite + 16K HNSW) on async_threshold=50_000collection still trips correctly.test_capacity_status_default_threshold_when_no_sync_metadata— legacy palaces without the metadata row use the 2,000 fallback floor (no behavior change).test_unflushed_path_also_uses_dynamic_floor— the never-flushed branch (no pickle yet) also scales: a 30K-drawer collection undersync_threshold=50_000is no longer flagged DIVERGED before its first flush._seed_chroma_dbin the test fixture grew an optionalsync_thresholdparameter that, when provided, seeds anhnsw:sync_thresholdrow into a newcollection_metadatatable. All 18 pre-existing tests intest_hnsw_capacity.pycontinue to pass; the broadertest_backends.py(45 tests) shows no regressions.Related
_HNSW_BLOAT_GUARDwithsync_threshold=50_000