fix: prevent HNSW index bloat from duplicate add() calls (#525) by milla-jovovich · Pull Request #544 · MemPalace/mempalace

milla-jovovich · 2026-04-10T15:21:48Z

Summary

Root cause: convo_miner.py used collection.add() instead of upsert(), so repeated mine runs pushed duplicate entries into the HNSW graph. At scale (50K+ drawers) this causes link_lists.bin to grow to terabytes and eventually segfault.
convo_miner.py: one-line fix, add() → upsert()
repair.py: new module — scan for corrupt IDs, prune them, or rebuild the HNSW index from scratch. Backs up only chroma.sqlite3 (not the bloated HNSW files). Recreates collection with hnsw:space=cosine.
dedup.py: new module — detect and remove near-duplicate drawers from the same source file using cosine similarity. No API calls.
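The difference between the two calls can be illustrated with a toy model. This is only a sketch: `ToyIndex` and `mine` are hypothetical names, a plain list stands in for an append-only index, and nothing here reflects how ChromaDB or hnswlib store entries internally.

```python
# Toy model only: a list stands in for an append-only vector index.
# ToyIndex and mine() are hypothetical; this is not ChromaDB's implementation.


class ToyIndex:
    def __init__(self):
        self.entries = []  # (id, document) pairs in insertion order

    def add(self, doc_id, document):
        # Appends without checking whether doc_id is already present.
        self.entries.append((doc_id, document))

    def upsert(self, doc_id, document):
        # Replaces the existing entry for doc_id, or appends if it is absent.
        for i, (existing_id, _) in enumerate(self.entries):
            if existing_id == doc_id:
                self.entries[i] = (doc_id, document)
                return
        self.entries.append((doc_id, document))


def mine(index, chunks, use_upsert):
    for doc_id, document in chunks:
        if use_upsert:
            index.upsert(doc_id, document)
        else:
            index.add(doc_id, document)


chunks = [("drawer_1", "hello"), ("drawer_2", "world")]

with_add = ToyIndex()
with_upsert = ToyIndex()
for _ in range(3):  # three repeated mine runs over the same input
    mine(with_add, chunks, use_upsert=False)
    mine(with_upsert, chunks, use_upsert=True)

print(len(with_add.entries))     # 6
print(len(with_upsert.entries))  # 2
```

In the toy model the add path grows with every run while the upsert path stays at one entry per ID, which is the property the one-line fix relies on.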

Test plan

Run mempalace mine twice on the same directory, confirm no duplicate drawers created
Run python -m mempalace.repair scan on a palace with known corrupt IDs
Run python -m mempalace.dedup --dry-run to verify duplicate detection without deletion
Verify rebuild backs up only chroma.sqlite3, not HNSW files
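The last check in the test plan could be exercised along these lines. This is a rough sketch, not the code in repair.py: `backup_sqlite_only` and the `segment-0000` directory name are made up for illustration, and only the file names `chroma.sqlite3` and `link_lists.bin` come from the description above.

```python
import os
import shutil
import tempfile


def backup_sqlite_only(palace_dir, backup_dir):
    """Copy chroma.sqlite3 into backup_dir and nothing else (hypothetical helper)."""
    os.makedirs(backup_dir, exist_ok=True)
    src = os.path.join(palace_dir, "chroma.sqlite3")
    dst = os.path.join(backup_dir, "chroma.sqlite3")
    shutil.copy2(src, dst)  # copy2 also preserves timestamps
    return dst


# Build a fake palace: one SQLite file plus a stand-in HNSW segment directory.
root = tempfile.mkdtemp()
palace = os.path.join(root, "palace")
segment = os.path.join(palace, "segment-0000")  # assumed layout, for the demo only
os.makedirs(segment)
with open(os.path.join(palace, "chroma.sqlite3"), "wb") as fh:
    fh.write(b"sqlite-bytes")
with open(os.path.join(segment, "link_lists.bin"), "wb") as fh:
    fh.write(b"\0" * 1024)

backup = os.path.join(root, "backup")
backup_sqlite_only(palace, backup)
backed_up = sorted(os.listdir(backup))
print(backed_up)  # ['chroma.sqlite3']
```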

Fixes #525

Thank you from Milla and Lu ✨

Root cause: convo_miner.py used collection.add() instead of upsert(), so repeated mine runs pushed duplicate entries into the HNSW graph. At scale (50K+ drawers) this causes link_lists.bin to grow to terabytes and eventually segfault.

Changes:
- convo_miner.py: add() → upsert() (the one-line root cause fix)
- repair.py: new module. Scan for corrupt IDs, prune them, or rebuild the HNSW index from scratch. Backs up only chroma.sqlite3 (not the bloated HNSW files). Recreates collection with hnsw:space=cosine.
- dedup.py: new module. Detect and remove near-duplicate drawers from the same source file using cosine similarity. No API calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

web3guru888

This is exactly the right fix for #525. The add() → upsert() change in convo_miner.py is a one-liner but it's load-bearing — we hit this exact HNSW bloat issue in our own setup around 40K drawers where repeated mine runs were pushing duplicates into the graph.

A few observations on the new modules:

repair.py:

The scan → prune → rebuild progression is well-designed. The decision to back up only chroma.sqlite3 and skip the bloated HNSW files is correct — that's where the actual drawer data lives.
One gap: rebuild_index() does a full collection delete and recreate, which will also lose any embeddings already computed. Users should know that after rebuild, a full re-mine or re-embedding pass will be triggered by ChromaDB's lazy embedding. This is expected behavior but worth noting in the docstring or output message.
shutil is imported in rebuild_index but I don't see it in the imports at the top — worth checking.

dedup.py:

The greedy longest-first approach is sound for same-source dedup. One edge case: if the same source file was mined under different wings (multi-domain setup), the source_file grouping won't catch cross-wing duplicates. The --source filter helps but a --wing scope filter like in repair.py would be useful here too.
DEFAULT_THRESHOLD = 0.15 (cosine distance) is appropriate for near-identical chunks. For looser dedup of paraphrased content, users may want 0.3–0.4 — worth documenting the threshold semantics (cosine distance, not similarity).
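To make the threshold semantics concrete, here is a minimal sketch of greedy longest-first dedup using cosine distance (1 minus cosine similarity). It is not the code in dedup.py: `cosine_distance`, `greedy_dedup`, the tuple layout, and the strict `<` comparison are all assumptions made for illustration. Only the 0.15 default comes from the PR.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 0.0 for identical direction, 1.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def greedy_dedup(items, threshold=0.15):
    """items: list of (id, text, embedding). Returns (kept_ids, dropped_ids).

    Longest text is considered first, so the most complete chunk survives.
    A chunk is dropped when it is closer than `threshold` to any kept chunk.
    """
    kept = []
    dropped = []
    for item in sorted(items, key=lambda it: len(it[1]), reverse=True):
        if any(cosine_distance(item[2], k[2]) < threshold for k in kept):
            dropped.append(item[0])
        else:
            kept.append(item)
    return [k[0] for k in kept], dropped


# Hand-made 2-D "embeddings" for the demo; real embeddings have many dimensions.
items = [
    ("a", "short text", [1.0, 0.0]),
    ("b", "a much longer version of the text", [0.99, 0.05]),
    ("c", "unrelated", [0.0, 1.0]),
]
kept, dropped = greedy_dedup(items)
print(kept)     # ['b', 'c']
print(dropped)  # ['a']
```

Because the value is a distance, raising the threshold makes dedup more aggressive: at 0.3 to 0.4 more loosely related chunks fall under the cutoff and are dropped.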

Overall: the core fix is correct, the new tooling is genuinely useful for anyone who's been running MemPalace long enough to accumulate HNSW bloat. Strong +1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

milla-jovovich · 2026-04-10T15:38:19Z

@web3guru888 sorry, I'm learning to use GitHub on the fly. Sorry about the "shut" not being listed. I appreciate you bearing with us, it's just Ben and I doing all this and I'm just learning.🤦🏻‍♀️

milla-jovovich · 2026-04-10T15:38:52Z

sorry shutil

Addresses community feedback:
- Add --wing flag to scope dedup to a single wing (catches cross-wing duplicates when same source mined into multiple wings)
- Document that threshold is cosine distance (not similarity) with guidance on values: 0.15 for near-identical, 0.3-0.4 for paraphrased
- Confirmed shutil import is present in repair.py (line 32)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

milla-jovovich · 2026-04-10T15:44:27Z

@web3guru888 shutil is actually on line 32

- 18 tests for repair (scan, prune, rebuild, edge cases)
- 15 tests for dedup (grouping, dedup logic, wing filter, stats)
- Fixes coverage drop from adding new modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

web3guru888 · 2026-04-10T15:46:19Z

You're right — I see it now at line 32 of repair.py (import shutil). My mistake, I was scanning the diff and missed it. That nit is a non-issue — the import is present and correct. Apologies for the noise! Everything else in my review stands.

And no need to apologize — you and ben are moving fast and building something people clearly care about. The community response speaks for itself. 👍

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…abels
- Add AGENTS.md with build commands, project structure, conventions
- Add .github/dependabot.yml for automated pip + actions updates
- Add .github/CODEOWNERS for review routing
- Expand .gitignore (.env, .DS_Store, IDE configs, coverage, venvs)
- Add C901 complexity rule to ruff (max-complexity=25, benchmarks excluded)
- Add --durations=10 to pytest CI for test performance tracking
- Add docs/schema.sql for knowledge graph schema documentation
- Created P0-P3 priority + area/* + security/performance/docs labels

…sions

Reads documents and metadata directly from ChromaDB's SQLite (bypassing the API that fails on version-mismatched databases), then reimports into a fresh palace using the currently installed ChromaDB.

Fixes the 3.0.0 → 3.1.0 upgrade path where chromadb was downgraded from 1.5.x to 0.6.x, breaking the on-disk storage format.
- Detects chromadb version from SQLite schema (0.6.x vs 1.x)
- Extracts all drawers with full metadata via raw SQL
- Builds fresh palace in temp dir, swaps atomically
- Backs up original palace before any changes
- Supports --dry-run to preview without modifying

Fixes #457
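The backup, build-in-temp-dir, and swap steps in this commit message might look roughly like the sketch below. It assumes nothing about the real migration code: `rebuild_and_swap`, `build_fn`, and the `.backup` / `.retired` suffixes are hypothetical. Note that swapping one directory for another takes two renames here, so there is a short window in which the palace path does not exist.

```python
import os
import shutil
import tempfile


def rebuild_and_swap(palace_dir, build_fn):
    """Back up palace_dir, build a replacement beside it, then swap it in.

    build_fn(path) is a hypothetical callback that fills an empty directory.
    Returns the backup directory path.
    """
    palace_dir = os.path.abspath(palace_dir)
    parent = os.path.dirname(palace_dir)
    backup_dir = palace_dir + ".backup"
    retired_dir = palace_dir + ".retired"

    # 1. Back up the original before touching anything.
    shutil.copytree(palace_dir, backup_dir)

    # 2. Build in a sibling temp dir so the renames stay on one filesystem.
    staging_dir = tempfile.mkdtemp(prefix="palace_build_", dir=parent)
    try:
        build_fn(staging_dir)
    except BaseException:
        # A failed build leaves the original palace untouched.
        shutil.rmtree(staging_dir, ignore_errors=True)
        raise

    # 3. Swap: move the old palace aside, then move the new one into place.
    os.rename(palace_dir, retired_dir)
    os.rename(staging_dir, palace_dir)
    shutil.rmtree(retired_dir)
    return backup_dir


def _demo():
    root = tempfile.mkdtemp()
    palace = os.path.join(root, "palace")
    os.makedirs(palace)
    with open(os.path.join(palace, "old.txt"), "w") as fh:
        fh.write("old")

    def build(path):
        with open(os.path.join(path, "new.txt"), "w") as fh:
            fh.write("new")

    backup = rebuild_and_swap(palace, build)
    return sorted(os.listdir(palace)), sorted(os.listdir(backup))


demo_result = _demo()
print(demo_result)  # (['new.txt'], ['old.txt'])
```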

@StefanKremen

Delete existing drawers for a file before re-inserting fresh chunks. Converts re-mines from upsert (hnswlib updatePoint path, thread-unsafe on macOS ARM + chromadb 0.6.3) into delete+insert (safe addPoint path).

Credit: @StefanKremen (#523)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

robvdl · 2026-04-12T02:28:05Z

+    try:
+        collection.delete(where={"source_file": source_file})
+    except Exception:
+        pass


I don't know enough about this collection type, but I notice constant blanket try ... except Exception.

Is there a better exception or exception base class that can be caught instead? Blanket try except (especially with pass) is kinda discouraged as anything can happen (out of RAM, out of disk, etc).
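One possible shape for a narrower handler is sketched below. It is only an illustration of the concern raised here: `CollectionError` is a hypothetical stand-in for whatever the vector store actually raises on a failed delete (the thread does not say), and `purge_source` and the fake collection classes are made-up names. The point is that the expected failure is logged and reported, while `OSError`, `MemoryError`, and anything else unexpected still reach the caller.

```python
import logging

logger = logging.getLogger("mempalace.sketch")


class CollectionError(Exception):
    """Hypothetical stand-in for the store's own 'delete failed' exception."""


def purge_source(collection, source_file):
    """Delete drawers for source_file. Returns True on success, False if the
    store reported a recoverable error."""
    try:
        collection.delete(where={"source_file": source_file})
        return True
    except CollectionError as exc:
        # Expected, recoverable failure: record it instead of hiding it.
        logger.warning("could not purge %s: %s", source_file, exc)
        return False
    # No bare `except Exception`: disk-full, out-of-memory, etc. propagate.


# Fake collections used only to exercise the three paths.
class OkCollection:
    def delete(self, where):
        return None


class FailingCollection:
    def delete(self, where):
        raise CollectionError("no such collection")


class DiskFullCollection:
    def delete(self, where):
        raise OSError("no space left on device")


ok_result = purge_source(OkCollection(), "notes.md")
failed_result = purge_source(FailingCollection(), "notes.md")
try:
    purge_source(DiskFullCollection(), "notes.md")
    disk_error_propagated = False
except OSError:
    disk_error_propagated = True

print(ok_result, failed_result, disk_error_propagated)  # True False True
```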

WalterUpgrade · 2026-04-13T10:32:50Z

@milla-jovovich You're killing it!

bensig self-requested a review April 10, 2026 15:23

web3guru888 reviewed Apr 10, 2026

style: ruff check --fix

e641b80

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

test: add 33 tests for repair.py and dedup.py

15c5a52

- 18 tests for repair (scan, prune, rebuild, edge cases)
- 15 tests for dedup (grouping, dedup logic, wing filter, stats)
- Fixes coverage drop from adding new modules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MSL and others added 5 commits April 10, 2026 08:49

style: ruff format

e30c283

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'main' into fix/525-hnsw-bloat-dedup

d0c9f9b

bensig approved these changes Apr 10, 2026

bensig merged commit 2e8a5a7 into main Apr 10, 2026
6 checks passed

bensig deleted the fix/525-hnsw-bloat-dedup branch April 10, 2026 16:26

web3guru888 mentioned this pull request Apr 10, 2026

feat: add LanceDB backend abstraction and migration path #574

Open

robvdl reviewed Apr 12, 2026

joshdeleenheer mentioned this pull request Apr 22, 2026

Feature request: support chromadb.HttpClient mode for multi-process palace access #1096

Open

StefanKremen mentioned this pull request Apr 25, 2026

fix: purge stale drawers before re-mine to avoid hnswlib update-path race #523

Closed

5 tasks

bensig merged 9 commits into main from fix/525-hnsw-bloat-dedup

5 participants
