
fix: prevent HNSW index bloat from duplicate add() calls (#525) #544

Merged
bensig merged 9 commits into main from fix/525-hnsw-bloat-dedup on Apr 10, 2026

Conversation

@milla-jovovich

Collaborator

Summary

  • Root cause: convo_miner.py used collection.add() instead of upsert(), so repeated mine runs pushed duplicate entries into the HNSW graph. At scale (50K+ drawers) this causes link_lists.bin to grow to terabytes and eventually segfault.
  • convo_miner.py: one-line fix, add() → upsert()
  • repair.py: new module — scan for corrupt IDs, prune them, or rebuild the HNSW index from scratch. Backs up only chroma.sqlite3 (not the bloated HNSW files). Recreates collection with hnsw:space=cosine.
  • dedup.py: new module — detect and remove near-duplicate drawers from the same source file using cosine similarity. No API calls.
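
The difference between the two calls can be illustrated with a toy model. This is a minimal sketch, not ChromaDB's or hnswlib's actual implementation: `ToyIndex`, `mine`, and the drawer IDs are hypothetical names made up for the illustration. It only shows why a blind append grows on every re-run while an ID-keyed upsert stays the same size.

```python
class ToyIndex:
    """Toy stand-in for a vector index (assumption: NOT ChromaDB internals)."""

    def __init__(self):
        self.entries = []  # one (id, document) pair per stored node

    def add(self, ids, documents):
        # Blind append: a repeated ID produces one more entry.
        self.entries.extend(zip(ids, documents))

    def upsert(self, ids, documents):
        # Replace the entry that already has this ID, otherwise append.
        position = {entry_id: i for i, (entry_id, _) in enumerate(self.entries)}
        for entry_id, doc in zip(ids, documents):
            if entry_id in position:
                self.entries[position[entry_id]] = (entry_id, doc)
            else:
                position[entry_id] = len(self.entries)
                self.entries.append((entry_id, doc))


def mine(index, method):
    """Hypothetical mine run that always produces the same two drawers."""
    ids = ["drawer_1", "drawer_2"]
    docs = ["first chunk", "second chunk"]
    getattr(index, method)(ids=ids, documents=docs)


with_add = ToyIndex()
with_upsert = ToyIndex()
for _ in range(3):  # three mine runs over the same directory
    mine(with_add, "add")
    mine(with_upsert, "upsert")

print(len(with_add.entries))     # 6
print(len(with_upsert.entries))  # 2
```

The sketch depends on drawer IDs being deterministic for a given chunk; if IDs changed between runs, an upsert would not deduplicate either.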

Test plan

  • Run mempalace mine twice on the same directory, confirm no duplicate drawers created
  • Run python -m mempalace.repair scan on a palace with known corrupt IDs
  • Run python -m mempalace.dedup --dry-run to verify duplicate detection without deletion
  • Verify rebuild backs up only chroma.sqlite3, not HNSW files
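
The last check could be exercised roughly as follows. This is a sketch under assumptions: `backup_sqlite_only`, the directory layout, and the `segment-uuid` folder name are hypothetical and not taken from repair.py; only the file names chroma.sqlite3 and link_lists.bin come from the description above.

```python
import shutil
import tempfile
from pathlib import Path


def backup_sqlite_only(palace_dir: Path, backup_dir: Path) -> Path:
    """Copy chroma.sqlite3 into backup_dir and ignore everything else."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    source = palace_dir / "chroma.sqlite3"
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    return target


# Build a throwaway palace layout to exercise the helper.
root = Path(tempfile.mkdtemp())
palace = root / "palace"
segment = palace / "segment-uuid"  # hypothetical HNSW segment directory
segment.mkdir(parents=True)
(palace / "chroma.sqlite3").write_bytes(b"sqlite data")
(segment / "link_lists.bin").write_bytes(b"\0" * 1024)

backup_sqlite_only(palace, root / "backup")
backed_up = sorted(p.name for p in (root / "backup").rglob("*"))
print(backed_up)  # ['chroma.sqlite3']

shutil.rmtree(root)  # clean up the temporary directory
```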

Fixes #525

Thank you from Milla and Lu ✨

Root cause: convo_miner.py used collection.add() instead of upsert(),
so repeated mine runs pushed duplicate entries into the HNSW graph.
At scale (50K+ drawers) this causes link_lists.bin to grow to terabytes
and eventually segfault.

Changes:
- convo_miner.py: add() → upsert() (the one-line root cause fix)
- repair.py: new module — scan for corrupt IDs, prune them, or rebuild
  the HNSW index from scratch. Backs up only chroma.sqlite3 (not the
  bloated HNSW files). Recreates collection with hnsw:space=cosine.
- dedup.py: new module — detect and remove near-duplicate drawers from
  the same source file using cosine similarity. No API calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bensig bensig self-requested a review April 10, 2026 15:23

@web3guru888 web3guru888 left a comment

This is exactly the right fix for #525. The add() → upsert() change in convo_miner.py is a one-liner but it's load-bearing — we hit this exact HNSW bloat issue in our own setup around 40K drawers where repeated mine runs were pushing duplicates into the graph.

A few observations on the new modules:

repair.py:

  • The scan → prune → rebuild progression is well-designed. The decision to back up only chroma.sqlite3 and skip the bloated HNSW files is correct — that's where the actual drawer data lives.
  • One gap: rebuild_index() does a full collection delete and recreate, which will also lose any embeddings already computed. Users should know that after rebuild, a full re-mine or re-embedding pass will be triggered by ChromaDB's lazy embedding. This is expected behavior but worth noting in the docstring or output message.
  • shutil is imported in rebuild_index but I don't see it in the imports at the top — worth checking.

dedup.py:

  • The greedy longest-first approach is sound for same-source dedup. One edge case: if the same source file was mined under different wings (multi-domain setup), the source_file grouping won't catch cross-wing duplicates. The --source filter helps but a --wing scope filter like in repair.py would be useful here too.
  • DEFAULT_THRESHOLD = 0.15 (cosine distance) is appropriate for near-identical chunks. For looser dedup of paraphrased content, users may want 0.3–0.4 — worth documenting the threshold semantics (cosine distance, not similarity).
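
To make the threshold semantics concrete, here is a small sketch of cosine distance combined with a greedy longest-first pass. It is an illustration only: the drawer dictionary shape, the `greedy_dedup` name, and the strict `<` comparison are assumptions, not the code in dedup.py. Cosine distance is 1 minus cosine similarity, so a smaller value means the chunks are more alike.

```python
import math

DEFAULT_THRESHOLD = 0.15  # cosine distance, the value quoted in the review


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as unrelated
    return 1.0 - dot / (norm_a * norm_b)


def greedy_dedup(drawers, threshold=DEFAULT_THRESHOLD):
    """Keep the longest drawer of each near-duplicate cluster.

    drawers: list of dicts with 'id', 'text', 'embedding' (hypothetical shape).
    Returns (kept_ids, removed_ids).
    """
    kept, removed_ids = [], []
    # Longest first, so the most complete chunk is the one that survives.
    for drawer in sorted(drawers, key=lambda d: len(d["text"]), reverse=True):
        is_duplicate = any(
            cosine_distance(drawer["embedding"], other["embedding"]) < threshold
            for other in kept
        )
        if is_duplicate:
            removed_ids.append(drawer["id"])
        else:
            kept.append(drawer)
    return [d["id"] for d in kept], removed_ids


drawers = [
    {"id": "b", "text": "the quick brown fox", "embedding": [0.99, 0.05]},
    {"id": "a", "text": "the quick brown fox jumps", "embedding": [1.0, 0.0]},
    {"id": "c", "text": "unrelated", "embedding": [0.0, 1.0]},
]
kept_ids, removed_ids = greedy_dedup(drawers)
print(kept_ids)     # ['a', 'c']
print(removed_ids)  # ['b']
```

Raising the threshold toward 0.3 or 0.4 would make the pass treat more loosely related chunks as duplicates, which matches the guidance above for paraphrased content.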

Overall: the core fix is correct, the new tooling is genuinely useful for anyone who's been running MemPalace long enough to accumulate HNSW bloat. Strong +1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@milla-jovovich

Collaborator Author

@web3guru888 sorry, I'm learning to use GitHub on the fly. Sorry about the "shut" not being listed. I appreciate you bearing with us, it's just Ben and I doing all this and I'm just learning.🤦🏻‍♀️

@milla-jovovich

Collaborator Author

sorry shutil

Addresses community feedback:
- Add --wing flag to scope dedup to a single wing (catches cross-wing
  duplicates when same source mined into multiple wings)
- Document that threshold is cosine distance (not similarity) with
  guidance on values: 0.15 for near-identical, 0.3-0.4 for paraphrased
- Confirmed shutil import is present in repair.py (line 32)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@milla-jovovich

Collaborator Author

@web3guru888 shutil is actually on line 32

- 18 tests for repair (scan, prune, rebuild, edge cases)
- 15 tests for dedup (grouping, dedup logic, wing filter, stats)
- Fixes coverage drop from adding new modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@web3guru888

You're right — I see it now at line 32 of repair.py (import shutil). My mistake, I was scanning the diff and missed it. That nit is a non-issue — the import is present and correct. Apologies for the noise! Everything else in my review stands.

And no need to apologize — you and ben are moving fast and building something people clearly care about. The community response speaks for itself. 👍

MSL and others added 5 commits April 10, 2026 08:49
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…abels

- Add AGENTS.md with build commands, project structure, conventions
- Add .github/dependabot.yml for automated pip + actions updates
- Add .github/CODEOWNERS for review routing
- Expand .gitignore (.env, .DS_Store, IDE configs, coverage, venvs)
- Add C901 complexity rule to ruff (max-complexity=25, benchmarks excluded)
- Add --durations=10 to pytest CI for test performance tracking
- Add docs/schema.sql for knowledge graph schema documentation
- Created P0-P3 priority + area/* + security/performance/docs labels
…sions

Reads documents and metadata directly from ChromaDB's SQLite (bypassing
the API that fails on version-mismatched databases), then reimports into
a fresh palace using the currently installed ChromaDB.

Fixes the 3.0.0 → 3.1.0 upgrade path where chromadb was downgraded from
1.5.x to 0.6.x, breaking the on-disk storage format.

- Detects chromadb version from SQLite schema (0.6.x vs 1.x)
- Extracts all drawers with full metadata via raw SQL
- Builds fresh palace in temp dir, swaps atomically
- Backs up original palace before any changes
- Supports --dry-run to preview without modifying

Fixes #457
Delete existing drawers for a file before re-inserting fresh chunks.
Converts re-mines from upsert (hnswlib updatePoint path, thread-unsafe
on macOS ARM + chromadb 0.6.3) into delete+insert (safe addPoint path).

Credit: @StefanKremen (#523)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bensig bensig merged commit 2e8a5a7 into main Apr 10, 2026
6 checks passed
@bensig bensig deleted the fix/525-hnsw-bloat-dedup branch April 10, 2026 16:26
Comment thread mempalace/miner.py
try:
    collection.delete(where={"source_file": source_file})
except Exception:
    pass

I don't know enough about this collection type, but I notice constant blanket try ... except Exception.

Is there a better exception or exception base class that can be caught instead? Blanket try except (especially with pass) is kinda discouraged as anything can happen (out of RAM, out of disk, etc).
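
One possible shape for a narrower handler is sketched below. The thread does not say which exception class the collection raises, so this sketch does not name one: it re-raises resource and I/O failures and logs everything else instead of silently passing. `delete_drawers_for_file`, `FATAL`, and `StubCollection` are hypothetical names introduced for the illustration.

```python
import logging

logger = logging.getLogger("mempalace.miner")

# Failures that should never be swallowed (out of memory, disk or I/O errors).
FATAL = (MemoryError, OSError)


def delete_drawers_for_file(collection, source_file):
    """Return True if the delete succeeded, False if it failed non-fatally."""
    try:
        collection.delete(where={"source_file": source_file})
        return True
    except FATAL:
        raise  # let resource exhaustion surface to the caller
    except Exception:
        # Still broad, but the failure is recorded with a traceback.
        logger.warning("delete failed for %s", source_file, exc_info=True)
        return False


class StubCollection:
    """Stand-in with the same delete(where=...) call shape, for demonstration."""

    def __init__(self, error=None):
        self.error = error

    def delete(self, where):
        if self.error is not None:
            raise self.error


print(delete_drawers_for_file(StubCollection(), "notes.md"))                   # True
print(delete_drawers_for_file(StubCollection(ValueError("bad")), "notes.md"))  # False
```

If the library documents a specific base exception for failed deletes, catching that class would be tighter than the fallback `except Exception` used here.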

@WalterUpgrade

@milla-jovovich You're killing it!

Development

Successfully merging this pull request may close these issues.

HNSW link_lists.bin grows to terabytes, causes segfault and APFS orphaned blocks on macOS

5 participants