fix: prevent HNSW index bloat from duplicate add() calls (#525) #544
Root cause: convo_miner.py used collection.add() instead of upsert(), so repeated mine runs pushed duplicate entries into the HNSW graph. At scale (50K+ drawers) this causes link_lists.bin to grow to terabytes, and the process eventually segfaults.

Changes:
- convo_miner.py: add() → upsert() (the one-line root cause fix)
- repair.py: new module that scans for corrupt IDs, prunes them, or rebuilds the HNSW index from scratch. Backs up only chroma.sqlite3 (not the bloated HNSW files). Recreates the collection with hnsw:space=cosine.
- dedup.py: new module that detects and removes near-duplicate drawers from the same source file using cosine similarity. No API calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
web3guru888
left a comment
This is exactly the right fix for #525. The add() → upsert() change in convo_miner.py is a one-liner but it's load-bearing — we hit this exact HNSW bloat issue in our own setup around 40K drawers where repeated mine runs were pushing duplicates into the graph.
A few observations on the new modules:
repair.py:
- The scan → prune → rebuild progression is well-designed. The decision to back up only chroma.sqlite3 and skip the bloated HNSW files is correct, since that's where the actual drawer data lives.
- One gap: rebuild_index() does a full collection delete and recreate, which will also lose any embeddings already computed. Users should know that after rebuild, a full re-mine or re-embedding pass will be triggered by ChromaDB's lazy embedding. This is expected behavior but worth noting in the docstring or output message.
- shutil is imported in rebuild_index but I don't see it in the imports at the top; worth checking.
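To make the backup point concrete, here is a minimal sketch of copying only chroma.sqlite3 before a rebuild. It is not the code from repair.py: the function name `backup_sqlite_only`, the `palace-backup-` folder naming, and the directory layout are assumptions made for illustration.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_sqlite_only(palace_dir: Path, backup_root: Path) -> Path:
    """Copy only chroma.sqlite3 into a fresh timestamped backup folder.

    Hypothetical helper: the name and folder layout are illustrative,
    not taken from repair.py.
    """
    sqlite_path = palace_dir / "chroma.sqlite3"
    if not sqlite_path.is_file():
        raise FileNotFoundError(f"no chroma.sqlite3 in {palace_dir}")

    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target_dir = backup_root / f"palace-backup-{stamp}"
    # exist_ok=False so an earlier backup is never silently overwritten.
    target_dir.mkdir(parents=True, exist_ok=False)

    # copy2 keeps file timestamps. The HNSW segment directories
    # (where link_lists.bin lives) are deliberately not copied.
    return Path(shutil.copy2(sqlite_path, target_dir / sqlite_path.name))
```

Because only one file is copied, the backup stays small even when the HNSW files have grown very large.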
dedup.py:
- The greedy longest-first approach is sound for same-source dedup. One edge case: if the same source file was mined under different wings (multi-domain setup), the source_file grouping won't catch cross-wing duplicates. The --source filter helps but a --wing scope filter like in repair.py would be useful here too.
- DEFAULT_THRESHOLD = 0.15 (cosine distance) is appropriate for near-identical chunks. For looser dedup of paraphrased content, users may want 0.3–0.4; worth documenting the threshold semantics (cosine distance, not similarity).
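To illustrate the threshold semantics, here is a rough sketch of greedy longest-first dedup using cosine distance (1 minus cosine similarity). It is not the code from dedup.py: the function names, the `(id, text, embedding)` tuple layout, and the strict `<` comparison are assumptions.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # treat zero vectors as unrelated instead of dividing by zero
    return 1.0 - dot / norm


def greedy_dedup(drawers, threshold=0.15):
    """drawers: list of (id, text, embedding) tuples from one source file.

    Hypothetical sketch: visit drawers longest text first, keep a drawer
    unless it lies within `threshold` cosine distance of one already kept.
    Returns (kept_ids, dropped_ids).
    """
    kept, dropped = [], []
    by_length = sorted(drawers, key=lambda d: len(d[1]), reverse=True)
    for drawer_id, _text, emb in by_length:
        if any(cosine_distance(emb, kept_emb) < threshold for _, kept_emb in kept):
            dropped.append(drawer_id)
        else:
            kept.append((drawer_id, emb))
    return [drawer_id for drawer_id, _ in kept], dropped
```

A smaller threshold is stricter: at 0.15 only nearly parallel embeddings count as duplicates, while 0.3–0.4 would also merge looser paraphrases.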
Overall: the core fix is correct, the new tooling is genuinely useful for anyone who's been running MemPalace long enough to accumulate HNSW bloat. Strong +1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@web3guru888 sorry, I'm learning to use GitHub on the fly. Sorry about the "shut" not being listed. I appreciate you bearing with us, it's just ben and I doing all this and I'm just learning.🤦🏻♀️

sorry shutil
Addresses community feedback:
- Add --wing flag to scope dedup to a single wing (catches cross-wing duplicates when same source mined into multiple wings)
- Document that threshold is cosine distance (not similarity) with guidance on values: 0.15 for near-identical, 0.3-0.4 for paraphrased
- Confirmed shutil import is present in repair.py (line 32)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@web3guru888 shutil is actually on line 32
- 18 tests for repair (scan, prune, rebuild, edge cases)
- 15 tests for dedup (grouping, dedup logic, wing filter, stats)
- Fixes coverage drop from adding new modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
You're right, I see it now at line 32. And no need to apologize: you and ben are moving fast and building something people clearly care about. The community response speaks for itself. 👍
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…abels
- Add AGENTS.md with build commands, project structure, conventions
- Add .github/dependabot.yml for automated pip + actions updates
- Add .github/CODEOWNERS for review routing
- Expand .gitignore (.env, .DS_Store, IDE configs, coverage, venvs)
- Add C901 complexity rule to ruff (max-complexity=25, benchmarks excluded)
- Add --durations=10 to pytest CI for test performance tracking
- Add docs/schema.sql for knowledge graph schema documentation
- Created P0-P3 priority + area/* + security/performance/docs labels
…sions

Reads documents and metadata directly from ChromaDB's SQLite (bypassing the API that fails on version-mismatched databases), then reimports into a fresh palace using the currently installed ChromaDB. Fixes the 3.0.0 → 3.1.0 upgrade path where chromadb was downgraded from 1.5.x to 0.6.x, breaking the on-disk storage format.
- Detects chromadb version from SQLite schema (0.6.x vs 1.x)
- Extracts all drawers with full metadata via raw SQL
- Builds fresh palace in temp dir, swaps atomically
- Backs up original palace before any changes
- Supports --dry-run to preview without modifying

Fixes #457
Delete existing drawers for a file before re-inserting fresh chunks. Converts re-mines from upsert (hnswlib updatePoint path, thread-unsafe on macOS ARM + chromadb 0.6.3) into delete+insert (safe addPoint path).

Credit: @StefanKremen (#523)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    try:
        collection.delete(where={"source_file": source_file})
    except Exception:
        pass
I don't know enough about this collection type, but I notice constant blanket try ... except Exception.
Is there a better exception or exception base class that can be caught instead? Blanket try except (especially with pass) is kinda discouraged as anything can happen (out of RAM, out of disk, etc).
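One way to act on this is sketched below: tolerate only a named set of exceptions around the delete, log them, and let everything else propagate. This is not the project's code. The function name `replace_file_drawers` and the default `expected_errors` tuple are hypothetical, and which exceptions ChromaDB actually raises here would need to be checked against the installed version. Note that `MemoryError` and `OSError` are subclasses of `Exception`, so the blanket handler swallows them too.

```python
import logging

logger = logging.getLogger("mempalace.sketch")  # hypothetical logger name


def replace_file_drawers(collection, source_file, ids, documents,
                         expected_errors=(ValueError, KeyError)):
    """Delete a file's old drawers, then insert fresh chunks.

    `collection` is anything exposing delete(where=...) and
    add(ids=..., documents=..., metadatas=...). The expected_errors
    default is an assumption, not a verified list of ChromaDB errors.
    """
    try:
        collection.delete(where={"source_file": source_file})
    except expected_errors as exc:
        # Only anticipated failures are tolerated, and they are logged
        # instead of being silently dropped.
        logger.warning("delete skipped for %s: %s", source_file, exc)

    # Anything else (MemoryError, OSError, ...) has already propagated
    # to the caller, so we only reach this point in a known state.
    metadatas = [{"source_file": source_file} for _ in ids]
    collection.add(ids=ids, documents=documents, metadatas=metadatas)
```

Passing the tolerated exception types as a parameter keeps the policy visible at the call site and easy to tighten once the real failure modes are known.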
@milla-jovovich You're killing it!
Summary
- convo_miner.py used collection.add() instead of upsert(), so repeated mine runs pushed duplicate entries into the HNSW graph. At scale (50K+ drawers) this causes link_lists.bin to grow to terabytes and eventually segfault.
- convo_miner.py: one-line fix, add() → upsert()
- repair.py: new module that scans for corrupt IDs, prunes them, or rebuilds the HNSW index from scratch. Backs up only chroma.sqlite3 (not the bloated HNSW files). Recreates collection with hnsw:space=cosine.
- dedup.py: new module that detects and removes near-duplicate drawers from the same source file using cosine similarity. No API calls.

Test plan
- mempalace mine twice on the same directory, confirm no duplicate drawers created
- python -m mempalace.repair scan on a palace with known corrupt IDs
- python -m mempalace.dedup --dry-run to verify duplicate detection without deletion
- rebuild backs up only chroma.sqlite3, not HNSW files

Fixes #525
Thank you from Milla and Lu ✨