Skip to content

fix(repair): scale HNSW divergence floor with hnsw:sync_threshold#1287

Merged
igorls merged 1 commit into
MemPalace:developfrom
messelink:fix/hnsw-divergence-scales-with-sync-threshold
May 1, 2026
Merged

fix(repair): scale HNSW divergence floor with hnsw:sync_threshold#1287
igorls merged 1 commit into
MemPalace:developfrom
messelink:fix/hnsw-divergence-scales-with-sync-threshold

Conversation

@messelink

Copy link
Copy Markdown
Contributor

Summary

The capacity probe added in #1227 hardcoded a 2,000-row floor for the "DIVERGED" decision. The comment justifying that number explicitly tied it to chromadb's default sync_threshold of 1,000:

Two synchronization windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling

#1191 then set hnsw:sync_threshold = 50_000 via _HNSW_BLOAT_GUARD (to prevent index bloat) without updating the divergence floor to track. The two changes shipped about a week apart and the dependency wasn't caught.

Symptom

Any palace created via ChromaBackend.create_collection after #1191 carries hnsw:sync_threshold=50_000 in collection_metadata. ChromaDB then naturally accumulates up to 50,000 entries in its WAL (embeddings_queue) between HNSW flushes — that's by design. The on-disk index_metadata.pickle only updates at flush time, so between flushes:

  • sqlite_count grows with every write
  • hnsw_count (read from the pickle) stays frozen until the next flush
  • Divergence climbs from 0 → ~50K → flush → back to 0 → repeat

With the 2,000 floor, the divergence threshold becomes max(2000, 10% × sqlite_count). For a 100K-drawer palace that's 10,000 — and the queue crosses 10K maybe a fifth of the way through every 50K-write cycle. Result: the MCP server (mcp_server._refresh_vector_disabled_flag) routes vector search to the BM25 fallback and disables duplicate detection for ~80% of the write cycle on any actively-mined ≥100K palace, even though chromadb is behaving correctly.

Concretely on a 100K-drawer palace I just rebuilt:

$ mempalace repair-status
  [drawers]
    sqlite count:   98,176
    hnsw count:     50,000
    divergence:     48,176
    status:         DIVERGED
    note:           HNSW index holds 50,000 elements but sqlite has 98,176
                    embeddings — 48,176 drawers (49%) are invisible to vector
                    search. Run `mempalace repair` to rebuild.

The "49% invisible" framing is misleading: those 48,176 drawers are queued for the next HNSW flush, not corrupt. ChromaDB's WAL replay surfaces them on every fresh open — count() returns the full 98,176, and a CLI mempalace search returns hybrid (cosine + BM25) results across the full set. The MCP server's stricter guardrail is what disables vector search.

Fix

Read the configured hnsw:sync_threshold from collection_metadata per palace and scale the floor to 2 × sync_threshold. The 10% relative term, the unflushed-pickle branch, and the original #1222 detection capability are unchanged.

sync_threshold = _read_sync_threshold(palace_path, collection_name)
divergence_floor = max(_HNSW_DIVERGENCE_FALLBACK_FLOOR, 2 * sync_threshold)
threshold = max(divergence_floor, int(sqlite_count * _HNSW_DIVERGENCE_FRACTION))

A 91%-missing-of-192,997 palace (the actual #1222 reproducer) still trips, regardless of whether the collection was created with sync_threshold=1000 or 50000.

Behavior summary

Collection's hnsw:sync_threshold Floor (after fix) Floor (before)
Missing (legacy palace from before #1191) 2,000 2,000 (unchanged)
1,000 (chromadb default) 2,000 2,000 (unchanged)
50,000 (_HNSW_BLOAT_GUARD) 100,000 2,000 (the bug)

Tests

Four new tests in tests/test_hnsw_capacity.py:

_seed_chroma_db in the test fixture grew an optional sync_threshold parameter that, when provided, seeds an hnsw:sync_threshold row into a new collection_metadata table. All 18 pre-existing tests in test_hnsw_capacity.py continue to pass; the broader test_backends.py (45 tests) shows no regressions.

Related

The capacity probe added in MemPalace#1227 hardcoded a 2,000-row floor for the
"diverged" decision. The comment justifying that number explicitly tied
it to chromadb's *default* sync_threshold of 1,000 — "Two synchronization
windows worth (2 × sync_threshold = 2000) is a safe steady-state ceiling".

MemPalace#1191 then bumped sync_threshold to 50,000 via _HNSW_BLOAT_GUARD without
updating the floor. Result: any palace created with the bloat guard
flips between OK and DIVERGED on every flush cycle. Steady-state
divergence sits at 0–50K (the natural queue depth), and the 2,000 floor
trips the guardrail the moment the queue exceeds 10% of sqlite_count.
The MCP server then routes search to BM25-only and disables duplicate
detection for ~80% of the write cycle on actively-mined ≥100K palaces,
even though chromadb is behaving correctly.

This change reads the configured `hnsw:sync_threshold` from
`collection_metadata` per palace and scales the floor to 2 × that value.
The 10% relative term and the original MemPalace#1222 detection capability are
unchanged — a 91%-missing-of-192K palace (the actual MemPalace#1222 reproducer)
still trips, regardless of whether the collection was created with
sync_threshold=1000 or 50000.

Behavior summary:

| Collection's sync_threshold | New floor | Old floor |
|---|---|---|
| Missing (legacy palace)     | 2000      | 2000 (unchanged)
| 1000 (chromadb default)     | 2000      | 2000 (unchanged)
| 50000 (MemPalace#1191 bloat guard)   | 100000    | 2000 (the bug)

Tests:
- test_capacity_status_tolerates_lag_under_large_sync_threshold (regression
  for the MemPalace#1191/MemPalace#1227 conflict — 100K sqlite + 50K HNSW + sync=50K → OK)
- test_capacity_status_still_flags_real_corruption_under_large_sync (MemPalace#1222
  shape with bloat-guard collection — still detects corruption)
- test_capacity_status_default_threshold_when_no_sync_metadata (legacy
  palaces without the metadata row use the 2000 fallback floor)
- test_unflushed_path_also_uses_dynamic_floor (the never-flushed branch
  scales too — 30K under sync_threshold=50000 is no longer flagged)

All 18 pre-existing tests in tests/test_hnsw_capacity.py and 45 tests
in tests/test_backends.py still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igorls igorls merged commit 96bb80a into MemPalace:develop May 1, 2026
6 checks passed
igorls added a commit that referenced this pull request May 1, 2026
Three fixes landed on develop after the initial release-prep cut and
were brought in via the develop merge. Document them in the 3.3.4
Bug Fixes section so the release notes reflect what users will
actually receive.

- #1287 - HNSW divergence floor scales with hnsw:sync_threshold
  (resolves a silent-fallback regression introduced by the
  interaction between #1191 and #1227 in this release)
- #1262 - ChromaBackend get_or_create_collection split, fixing the
  stop-hook SIGSEGV class on legacy palaces with mismatched stored
  metadata (#1089)
- #1288 / #1254 - repair --mode max-seq-id heuristic now decodes
  BLOB-typed embeddings.seq_id, restoring the un-poison path added
  in #1135 for palaces where chromadb 1.5.x writes seq_ids natively
igorls added a commit that referenced this pull request May 1, 2026
The release was originally cut on 2026-04-27 but did not tag that day.
Three additional bug fixes have been folded in since then (#1262,
#1287, #1288) and the actual tag will happen on 2026-04-30. Update
the header date to match.
xcarbo added a commit to xcarbo/mempalace that referenced this pull request May 1, 2026
Catches up xdev-patches with 112 commits from MemPalace/develop, including:
- v3.3.4 release
- MemPalace#1262/MemPalace#1289 ChromaDB collection-reopen crash fix (relevant to long-running
  MCP server & mempalace-api)
- MemPalace#1287 HNSW divergence floor fix
- MemPalace#1288 BLOB seq_id decode in repair
- MemPalace#1180 cross-wing tunnels by shared topics
- MemPalace#1194 wing-slug normalization for hyphenated dirs

Conflict resolution: hooks_cli.py and mcp_server.py both had local patches
(6ef44cb route CC transcripts via convo_miner; 3fad61d allow leading dash)
that overlap with upstream fixes (MemPalace#1231, MemPalace#1194). Took upstream entirely on
those two files — upstream's version handles separate transcript/project
ingest, uses _mempalace_python(), and adds _pin_hnsw_threads. The local
config.py regex relaxation auto-merged cleanly and is preserved.

Safety tag: pre-upstream-merge-20260501-091227 (rollback target).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jphein added a commit to techempower-org/mempalace that referenced this pull request May 13, 2026
…MemPalace#1367 (embeddings portion only)

Cherry-picks the `_extract_drawers` embeddings-preserve fix from upstream
PR MemPalace#1367 (@zhapostolski), per the cherry-pick evaluation in
#31 and @messelink's review guidance: take the
embeddings fix, SKIP the divergence-floor cap (which would re-create
_refresh_vector_disabled_flag false-positives on our sync_threshold=
50000 palace per MemPalace#1287's design).

What the embeddings-preserve fix does:
- _extract_drawers now includes "embeddings" in the col.get() call
- Returns a 4-tuple: ids, docs, metas, embeddings (or None when any
  embedding slot is missing — HNSW unreadable case)
- _rebuild_collection_via_temp accepts an all_embeddings= kwarg and
  passes embeddings through to upsert() when available
- rebuild_index threads embeddings from _extract_drawers into
  _rebuild_collection_via_temp

When embeddings ARE available (the common case), rebuild skips ONNX
recomputation entirely. On a 150K+ palace this cuts rebuild from hours
to minutes. Fallback to recompute when any embedding is None (the bug
condition we're trying to repair).

Updates two call sites in mempalace/cli.py (the cmd_repair path) and
mempalace/repair.py (the rebuild_index path) for the new 4-tuple
signature.

Adds 3 tests:
- test_extract_drawers_preserves_embeddings_when_all_present
- test_extract_drawers_falls_back_to_none_when_any_embedding_missing
- test_extract_drawers_embeddings_absent_in_batch_signals_recompute

Plus updates the existing 5 _extract_drawers tests to unpack the new
4-tuple. All 75 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jphein

jphein commented May 13, 2026

Copy link
Copy Markdown
Collaborator

Thanks @messelink for the careful review on #1367 — exactly the kind of split-by-concern feedback that makes a "two bugs in one PR" situation tractable.

For posterity: we followed your guidance on the cherry-pick. Took only the _extract_drawers embeddings-preserve fix into our fork (jphein/mempalace@aa84fa4), explicitly skipping the divergence-floor cap. Our sync_threshold=50000 palace would have hit the exact false-positive regime your #1287 was designed to prevent — divergence_floor=10000 flagging DIVERGED on every flush-window in the steady state.

The fork-side context that confirms the warning was right for us:

  • We carry the _HNSW_BLOAT_GUARD = {batch_size: 50000, sync_threshold: 50000} from PR fix: prevent HNSW index bloat from resize+persist cycles #1191 (intentional, per backends/chroma.py:100 — prevents link_lists.bin bloat that would recur with smaller sync_thresholds).
  • Hook-driven write rate from 5+ concurrent Claude Code sessions means a normal flush window has 5K–30K queue depth. divergence_floor=10000 would trip routinely there.

#1287's design choice (scale divergence floor with sync_threshold) is the right one for any fork that adopted the high-sync_threshold mitigation. Glad to have it on our path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cli CLI commands bug Something isn't working performance Performance improvements storage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants