fix(repair): scale HNSW divergence floor with hnsw:sync_threshold by messelink · Pull Request #1287 · MemPalace/mempalace

messelink · 2026-05-01T00:32:31Z

Summary

The capacity probe added in #1227 hardcoded a 2,000-row floor for the "DIVERGED" decision. The comment justifying that number explicitly tied it to chromadb's default sync_threshold of 1,000:

Two synchronization windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling

#1191 then set hnsw:sync_threshold = 50_000 via _HNSW_BLOAT_GUARD (to prevent index bloat) without updating the divergence floor to track. The two changes shipped about a week apart and the dependency wasn't caught.

Symptom

Any palace created via ChromaBackend.create_collection after #1191 carries hnsw:sync_threshold=50_000 in collection_metadata. ChromaDB then naturally accumulates up to 50,000 entries in its WAL (embeddings_queue) between HNSW flushes — that's by design. The on-disk index_metadata.pickle only updates at flush time, so between flushes:

sqlite_count grows with every write
hnsw_count (read from the pickle) stays frozen until the next flush
Divergence climbs from 0 → ~50K → flush → back to 0 → repeat

With the 2,000 floor, the divergence threshold becomes max(2000, 10% × sqlite_count). For a 100K-drawer palace that's 10,000 — and the queue crosses 10K maybe a fifth of the way through every 50K-write cycle. Result: the MCP server (mcp_server._refresh_vector_disabled_flag) routes vector search to the BM25 fallback and disables duplicate detection for ~80% of the write cycle on any actively-mined ≥100K palace, even though chromadb is behaving correctly.

Concretely on a 100K-drawer palace I just rebuilt:

$ mempalace repair-status
  [drawers]
    sqlite count:   98,176
    hnsw count:     50,000
    divergence:     48,176
    status:         DIVERGED
    note:           HNSW index holds 50,000 elements but sqlite has 98,176
                    embeddings — 48,176 drawers (49%) are invisible to vector
                    search. Run `mempalace repair` to rebuild.

The "49% invisible" framing is misleading: those 48,176 drawers are queued for the next HNSW flush, not corrupt. ChromaDB's WAL replay surfaces them on every fresh open — count() returns the full 98,176, and a CLI mempalace search returns hybrid (cosine + BM25) results across the full set. The MCP server's stricter guardrail is what disables vector search.

Fix

Read the configured hnsw:sync_threshold from collection_metadata per palace and scale the floor to 2 × sync_threshold. The 10% relative term, the unflushed-pickle branch, and the original #1222 detection capability are unchanged.

sync_threshold = _read_sync_threshold(palace_path, collection_name)
divergence_floor = max(_HNSW_DIVERGENCE_FALLBACK_FLOOR, 2 * sync_threshold)
threshold = max(divergence_floor, int(sqlite_count * _HNSW_DIVERGENCE_FRACTION))

A 91%-missing-of-192,997 palace (the actual #1222 reproducer) still trips, regardless of whether the collection was created with sync_threshold=1000 or 50000.

Behavior summary

Collection's `hnsw:sync_threshold`	Floor (after fix)	Floor (before)
Missing (legacy palace from before #1191)	2,000	2,000 (unchanged)
1,000 (chromadb default)	2,000	2,000 (unchanged)
50,000 (`_HNSW_BLOAT_GUARD`)	100,000	2,000 (the bug)

Tests

Four new tests in tests/test_hnsw_capacity.py:

test_capacity_status_tolerates_lag_under_large_sync_threshold — regression for the fix: prevent HNSW index bloat from resize+persist cycles #1191/fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222) #1227 conflict: 100K sqlite + 50K HNSW + sync_threshold=50_000 → OK. (Pre-fix: would flag DIVERGED.)
test_capacity_status_still_flags_real_corruption_under_large_sync — HNSW max_elements frozen at 16K while collection grows to 100K+ entries — MCP server segfaults on every tool call #1222-shape (200K sqlite + 16K HNSW) on a sync_threshold=50_000 collection still trips correctly.
test_capacity_status_default_threshold_when_no_sync_metadata — legacy palaces without the metadata row use the 2,000 fallback floor (no behavior change).
test_unflushed_path_also_uses_dynamic_floor — the never-flushed branch (no pickle yet) also scales: a 30K-drawer collection under sync_threshold=50_000 is no longer flagged DIVERGED before its first flush.

_seed_chroma_db in the test fixture grew an optional sync_threshold parameter that, when provided, seeds an hnsw:sync_threshold row into a new collection_metadata table. All 18 pre-existing tests in test_hnsw_capacity.py continue to pass; the broader test_backends.py (45 tests) shows no regressions.

fix: prevent HNSW index bloat from resize+persist cycles #1191 — the PR that set _HNSW_BLOAT_GUARD with sync_threshold=50_000
fix(repair): detect HNSW capacity divergence and fall back to BM25 (#1222) #1227 — the PR that added the capacity divergence detection (and the hardcoded 2,000 floor this PR makes dynamic)
HNSW max_elements frozen at 16K while collection grows to 100K+ entries — MCP server segfaults on every tool call #1222 — the original failure mode the divergence detection was added to catch (still detected after this fix)

The capacity probe added in MemPalace#1227 hardcoded a 2,000-row floor for the "diverged" decision. The comment justifying that number explicitly tied it to chromadb's *default* sync_threshold of 1,000 — "Two synchronization windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling". MemPalace#1191 then bumped sync_threshold to 50,000 via _HNSW_BLOAT_GUARD without updating the floor. Result: any palace created with the bloat guard flips between OK and DIVERGED on every flush cycle. Steady-state divergence sits at 0–50K (the natural queue depth), and the 2,000 floor trips the guardrail the moment the queue exceeds 10% of sqlite_count. The MCP server then routes search to BM25-only and disables duplicate detection for ~80% of the write cycle on actively-mined ≥100K palaces, even though chromadb is behaving correctly. This change reads the configured `hnsw:sync_threshold` from `collection_metadata` per palace and scales the floor to 2 × that value. The 10% relative term and the original MemPalace#1222 detection capability are unchanged — a 91%-missing-of-192K palace (the actual MemPalace#1222 reproducer) still trips, regardless of whether the collection was created with sync_threshold=1000 or 50000. Behavior summary: | Collection's sync_threshold | New floor | Old floor | |---|---|---| | Missing (legacy palace) | 2000 | 2000 (unchanged) | 1000 (chromadb default) | 2000 | 2000 (unchanged) | 50000 (MemPalace#1191 bloat guard) | 100000 | 2000 (the bug) Tests: - test_capacity_status_tolerates_lag_under_large_sync_threshold (regression for the MemPalace#1191/MemPalace#1227 conflict — 100K sqlite + 50K HNSW + sync=50K → OK) - test_capacity_status_still_flags_real_corruption_under_large_sync (MemPalace#1222 shape with bloat-guard collection — still detects corruption) - test_capacity_status_default_threshold_when_no_sync_metadata (legacy palaces without the metadata row use the 2000 fallback floor) - test_unflushed_path_also_uses_dynamic_floor (the never-flushed branch scales too — 30K under sync_threshold=50000 is no longer flagged) All 18 pre-existing tests in tests/test_hnsw_capacity.py and 45 tests in tests/test_backends.py still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three fixes landed on develop after the initial release-prep cut and were brought in via the develop merge. Document them in the 3.3.4 Bug Fixes section so the release notes reflect what users will actually receive. - #1287 - HNSW divergence floor scales with hnsw:sync_threshold (resolves a silent-fallback regression introduced by the interaction between #1191 and #1227 in this release) - #1262 - ChromaBackend get_or_create_collection split, fixing the stop-hook SIGSEGV class on legacy palaces with mismatched stored metadata (#1089) - #1288 / #1254 - repair --mode max-seq-id heuristic now decodes BLOB-typed embeddings.seq_id, restoring the un-poison path added in #1135 for palaces where chromadb 1.5.x writes seq_ids natively

The release was originally cut on 2026-04-27 but did not tag that day. Three additional bug fixes have been folded in since then (#1262, #1287, #1288) and the actual tag will happen on 2026-04-30. Update the header date to match.

Catches up xdev-patches with 112 commits from MemPalace/develop, including: - v3.3.4 release - MemPalace#1262/MemPalace#1289 ChromaDB collection-reopen crash fix (relevant to long-running MCP server & mempalace-api) - MemPalace#1287 HNSW divergence floor fix - MemPalace#1288 BLOB seq_id decode in repair - MemPalace#1180 cross-wing tunnels by shared topics - MemPalace#1194 wing-slug normalization for hyphenated dirs Conflict resolution: hooks_cli.py and mcp_server.py both had local patches (6ef44cb route CC transcripts via convo_miner; 3fad61d allow leading dash) that overlap with upstream fixes (MemPalace#1231, MemPalace#1194). Took upstream entirely on those two files — upstream's version handles separate transcript/project ingest, uses _mempalace_python(), and adds _pin_hnsw_threads. The local config.py regex relaxation auto-merged cleanly and is preserved. Safety tag: pre-upstream-merge-20260501-091227 (rollback target). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@zhapostolski

…MemPalace#1367 (embeddings portion only) Cherry-picks the `_extract_drawers` embeddings-preserve fix from upstream PR MemPalace#1367 (@zhapostolski), per the cherry-pick evaluation in #31 and @messelink's review guidance: take the embeddings fix, SKIP the divergence-floor cap (which would re-create _refresh_vector_disabled_flag false-positives on our sync_threshold= 50000 palace per MemPalace#1287's design). What the embeddings-preserve fix does: - _extract_drawers now includes "embeddings" in the col.get() call - Returns a 4-tuple: ids, docs, metas, embeddings (or None when any embedding slot is missing — HNSW unreadable case) - _rebuild_collection_via_temp accepts an all_embeddings= kwarg and passes embeddings through to upsert() when available - rebuild_index threads embeddings from _extract_drawers into _rebuild_collection_via_temp When embeddings ARE available (the common case), rebuild skips ONNX recomputation entirely. On a 150K+ palace this cuts rebuild from hours to minutes. Fallback to recompute when any embedding is None (the bug condition we're trying to repair). Updates two call sites in mempalace/cli.py (the cmd_repair path) and mempalace/repair.py (the rebuild_index path) for the new 4-tuple signature. Adds 3 tests: - test_extract_drawers_preserves_embeddings_when_all_present - test_extract_drawers_falls_back_to_none_when_any_embedding_missing - test_extract_drawers_embeddings_absent_in_batch_signals_recompute Plus updates the existing 5 _extract_drawers tests to unpack the new 4-tuple. All 75 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jphein · 2026-05-13T22:38:01Z

Thanks @messelink for the careful review on #1367 — exactly the kind of split-by-concern feedback that makes a "two bugs in one PR" situation tractable.

For posterity: we followed your guidance on the cherry-pick. Took only the _extract_drawers embeddings-preserve fix into our fork (jphein/mempalace@aa84fa4), explicitly skipping the divergence-floor cap. Our sync_threshold=50000 palace would have hit the exact false-positive regime your #1287 was designed to prevent — divergence_floor=10000 flagging DIVERGED on every flush-window in the steady state.

The fork-side context that confirms the warning was right for us:

We carry the _HNSW_BLOAT_GUARD = {batch_size: 50000, sync_threshold: 50000} from PR fix: prevent HNSW index bloat from resize+persist cycles #1191 (intentional, per backends/chroma.py:100 — prevents link_lists.bin bloat that would recur with smaller sync_thresholds).
Hook-driven write rate from 5+ concurrent Claude Code sessions means a normal flush window has 5K–30K queue depth. divergence_floor=10000 would trip routinely there.

#1287's design choice (scale divergence floor with sync_threshold) is the right one for any fork that adopted the high-sync_threshold mitigation. Glad to have it on our path.

messelink requested review from bensig, igorls and milla-jovovich as code owners May 1, 2026 00:32

igorls merged commit 96bb80a into MemPalace:develop May 1, 2026
6 checks passed

briankingdomlight mentioned this pull request May 2, 2026

fix(repair): rebuild_index bails on col.count() — exactly the call HNSW corruption breaks #1307

Closed

potterdigital mentioned this pull request May 2, 2026

fix(repair): rebuild_index bails on col.count() — exactly the call HNSW corruption breaks #1308

Closed

This was referenced May 8, 2026

fix(repair): preserve embeddings on rebuild + cap divergence-floor #1367

Open

Evaluate cherry-pick of upstream #1367 (preserve embeddings on rebuild) techempower-org/mempalace#31

Closed

jphein added area/cli CLI commands bug Something isn't working performance Performance improvements storage labels May 13, 2026

messelink mentioned this pull request May 18, 2026

repair --mode from-sqlite is structurally broken at scale: auto-quarantine destroys rebuild progress mid-run #1545

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(repair): scale HNSW divergence floor with hnsw:sync_threshold#1287

fix(repair): scale HNSW divergence floor with hnsw:sync_threshold#1287
igorls merged 1 commit into
MemPalace:developfrom
messelink:fix/hnsw-divergence-scales-with-sync-threshold

messelink commented May 1, 2026

Uh oh!

Uh oh!

jphein commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

messelink commented May 1, 2026

Summary

Symptom

Fix

Behavior summary

Tests

Related

Uh oh!

Uh oh!

jphein commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants