perf: batch ChromaDB inserts in miner - 10-30x faster mining by midweste · Pull Request #1085 · MemPalace/mempalace

midweste · 2026-04-21T23:01:58Z

perf: batch ChromaDB inserts in miner — 10-30x faster mining

Problem

process_file() called add_drawer() once per chunk, resulting in:

N separate ONNX embedding passes per file (one per chunk)
N separate ChromaDB upsert calls per file
N redundant datetime.now() and os.path.getmtime() syscalls per file

On large files (1000+ chunks), this made mining slow.

Solution

1. DRY metadata construction via _build_drawer() helper

Extracted the drawer ID generation and metadata assembly into a shared _build_drawer() function used by both:

add_drawer() — single-insert path (unchanged contract, used by MCP tools)
add_drawers() — new batch-insert path (used by process_file())

2. Batched upserts with sub-batching

add_drawers() collects all chunks into batch lists and upserts in groups of CHROMA_BATCH_LIMIT (5000) to stay under ChromaDB's hard cap of 5,461.

3. Hoisted syscalls

datetime.now() and os.path.getmtime() are called once per file and shared across all chunks via _build_drawer(now=..., source_mtime=...).

Results

Metric	Before	After
ONNX passes per file	N (per chunk)	⌈N/5000⌉
DB writes per file	N	⌈N/5000⌉
Syscalls per file	2N	2
Metadata construction	duplicated in `add_drawer` + `process_file`	single `_build_drawer` helper

Tested against a 499-file project (~70k drawers) with zero errors, including an 18,803-chunk benchmark file.

Breaking Changes

None. add_drawer() retains its existing signature and behavior. add_drawers() is additive.

…nserts) README fork-change-queue gets a Performance row above the L1 importance pre-filter row. CLAUDE.md gets a row 27 noting the conflict-resolution choices (kept fork's DRAWER_UPSERT_BATCH_SIZE=1000, aliased upstream's CHROMA_BATCH_LIMIT to it). The cherry-pick is already on upstream develop, so this row exists mainly to document the merge-conflict resolution for the next upstream→main sync. Will become a no-op when develop ships. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@midweste

I wrote "already in upstream develop" without running gh pr view, and JP caught it. MemPalace#1085 was created 2026-04-21, still open against develop. Re-framed: fork cherry-picked an open upstream PR; the row clears when MemPalace#1085 lands and we sync develop→main. We don't file a competing fork-side PR — the proposal is @midweste's; we wait. Lesson re-applied from feedback_verify_merged_not_described.md: always read upstream state via gh, never infer from the README's PR list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Upstream CHANGELOG.md tracks released versions of MemPalace and is merged from upstream — mixing fork-ahead changes into it would create upstream-sync conflicts. This file is the supplement: date-based sections (no semver — fork doesn't cut its own release tags), Keep-a-Changelog buckets (Added / Changed / Fixed / Performance). When a fork-ahead row lands upstream, the entry moves to the "Merged into upstream" section at the bottom, kept ~2 weeks then trimmed. Backfilled with 2026-04-25 + 2026-04-26 entries covering: the checkpoint collection split (phases A–D, PreCompact incorporation, the new MCP tool), the kind= filter (transitional safety net), the HNSW cold-start gate + integrity gate, drawer_id surfacing, the deploy script, the MemPalace#1085 cherry-pick, palace_graph None guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Prevent HNSW index corruption by deferring Ctrl+C until the current file's ChromaDB upsert completes, then flushing the background compactor before process exit. - Add safe_mine_session context manager (palace.py) centralizing SIGINT deferral + close_palace flush + signal restoration order - Apply to both mine() and mine_convos() entry points - Replace silent except-pass with logger.warning on close failure - Add 18 tests covering lifecycle, signal handling, and close paths Closes: HNSW corruption on Ctrl+C during mining

- Register atexit.register(_DEFAULT_BACKEND.close) in palace.py to flush ChromaDB on MCP server process termination (SIGTERM, sys.exit, etc.) - Add ChromaBackend._cache_key() to normalize palace paths via Path.resolve(), preventing silent cache misses when close_palace() receives a different path format than _client() used - Add 8 tests covering atexit, close idempotency, and path normalization

midweste · 2026-05-05T09:13:11Z

Closing this PR — the core optimization (batched ChromaDB upserts) was independently implemented in #1185 (perf/batched-upsert-gpu), which expanded on the same approach with GPU acceleration support and hardware-aware embedding functions.

All key improvements from this PR are now covered in develop:

DRY metadata builder (_build_drawer_metadata)
Batched collection.upsert() per file instead of per-chunk
Hoisted os.path.getmtime() out of the per-chunk loop
Sub-batching via DRAWER_UPSERT_BATCH_SIZE to stay under ChromaDB's hard cap
Thanks for landing the expanded version! 🎉

@midweste

…MemPalace#1185 (#36) (#89) Upstream PR MemPalace#1085 was closed by @midweste on 2026-05-05 in favor of [MemPalace#1185](MemPalace#1185) which merged to develop on 2026-04-24. MemPalace#1185 is a wider-scope superseder: same per-chunk batch-insert path plus optional GPU acceleration. Our cherry-pick at 6be6fff is now functionally redundant with develop. Updated three fork docs that previously claimed MemPalace#1085 was open and implied the cherry-pick would become a no-op only when MemPalace#1085 itself merged: - FORK_CHANGELOG.md (the canonical cherry-pick entry) - README.md "Composition layer" callout - README.md "Cherry-picks" comparison table row Cherry-pick can be dropped on the next upstream develop sync; flagged that on all three rows. Closes #36 Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

@midweste

* docs: update jphein/ URLs to techempower-org/ (#118) - render-docs.py: changelog header and commit_link() now point at techempower-org/mempalace so the FORK_CHANGELOG regenerates with the correct repo slug. - check-docs.sh: cross-repo URL filter accepts both jphein/ and techempower-org/ slugs while the YAML body still carries historical jphein/ narrative; the retired-check comment updated for symmetry. - README.md: sweep all https://github.com/jphein/mempalace and https://github.com/jphein/palace-daemon URLs to their techempower-org equivalents (PR links, commit links, deployment-runs-at slug). - FORK_CHANGELOG.md: regenerated from docs/fork-changes.yaml so every rendered commit link resolves to techempower-org/mempalace; the historical body text inside YAML entries is unchanged. Historical narrative ("transferred from jphein/mempalace in May 2026") and links to other personal forks (jphein/familiar.realm.watch, jphein/notebook, etc.) stay as-is — only the two repos that actually moved to techempower-org are rewritten. Closes #118 * docs: mark PR MemPalace#1024 merged, MemPalace#1085 closed; drop stale ChromaDB references - docs/fork-changes.yaml: PR MemPalace#1024 (configurable chunking) moved from OPEN to MERGED with merged_date 2026-05-15; added to merged_upstream entries so the FORK_CHANGELOG recently-merged section reflects the 2026-05-15 upstream land. - docs/fork-changes.yaml: PR MemPalace#1085 (batched ChromaDB inserts cherry-pick) moved from OPEN to CLOSED with body note that @midweste closed it 2026-05-16, superseded by merged upstream MemPalace#1185 (wider GPU-acceleration scope). Fork's 6be6fff is now a no-op against develop. - CLAUDE.md: Key Files block rewrites the stale "ChromaDB vector store. The actual data." line to point at postgres on disks.jphe.in via palace-daemon and reference the 2026-05-14 cutover + RETIRED marker. - CLAUDE.md: Two-Layer Memory Architecture line updates MemPalace storage from "~/.mempalace/palace/, ~183K drawers" to the production postgres + pgvector + AGE on disks.jphe.in with a normalized "300K+" drawer count. README.md drawer-count normalization (160K/273K/274K -> 300K+) and ChromaDB->pgvector edits landed in efcb5c6 alongside the #118 URL sweep. FORK_CHANGELOG.md regenerates clean from the updated YAML. Closes #115, #116 * fix(docs): unbreak check-docs test-count lint and converge on .venv The README test-count drift check in scripts/check-docs.sh anchored its regex at start-of-line, but the count lives mid-paragraph on README:28. The check silently no-op'd, letting "~1850 tests pass on main" drift to 2486 actual without a single warning (#114). De-anchor the regex; the rest of the pipeline already does the right thing (counts via pytest --collect-only). Also converge on .venv/ — the project's actual venv path — instead of the dead venv/ path hard-coded in scripts/check-docs.sh and scripts/preflight.sh. Fixes the silent fallback-to-PATH behavior that masked the wrong path for so long (#117). CLAUDE.md:59 was already inconsistent with :46 — both now say .venv. Refs: #114, #117

…) (#197) Issue #165 asked to drop the MemPalace#1085 cherry-pick on the next upstream sync, on the premise that upstream MemPalace#1185 (merged) had superseded it. That premise held on 2026-05-16 but is no longer accurate: in the 73 commits to miner.py / convo_miner.py / format_miner.py since the cherry-pick landed, the fork built on top of these primitives in ways upstream MemPalace#1185 does not provide: 1. add_drawers() is a fork-only public API that returns (added, batch_ids, warnings) and is wired to room-taxonomy validation (#86). Consumed by test_room_taxonomy.py. 2. DRAWER_UPSERT_BATCH_SIZE / CHROMA_BATCH_LIMIT sub-batching knob — upstream does one giant upsert per file (OOM risk on pathological files); fork sub-batches in groups of 1000. Used from miner.py, convo_miner.py, format_miner.py and monkeypatched by tests. 3. _build_drawer_metadata carries fork-only Tier 6a extensions (line_start, line_end, content_date) that closet pointers depend on. Tested at tests/test_miner.py. Updates the fork-changes.yaml entry to mark the cherry-pick as "absorbed into fork architecture" rather than "drop on next sync", and regenerates FORK_CHANGELOG.md from the YAML. Closes #165.

midweste requested review from bensig and milla-jovovich as code owners April 21, 2026 23:01

midweste force-pushed the midweste branch from 23e15bd to 4ea2804 Compare April 21, 2026 23:36

yangqingggui-a11y mentioned this pull request Apr 24, 2026

Add Apple Silicon acceleration (MPS/explicit CPU) for embedding — up to 256x faster (addresses #1068) #1133

Open

igorls added performance Performance improvements area/mining File and conversation mining storage labels Apr 24, 2026

midwestE added 2 commits May 5, 2026 04:10

midweste force-pushed the midweste branch from 4ea2804 to a9f6b10 Compare May 5, 2026 09:11

midweste requested a review from igorls as a code owner May 5, 2026 09:11

midweste closed this May 5, 2026

This was referenced May 11, 2026

Update CLAUDE.md row 27: upstream #1085 is closed/superseded by merged #1185 techempower-org/mempalace#36

Closed

docs: refresh #1085 cherry-pick status — superseded by merged #1185 (#36) techempower-org/mempalace#89

Merged

jphein mentioned this pull request May 22, 2026

Docs: upstream PR #1024 merged but still listed as open in README + FORK_CHANGELOG techempower-org/mempalace#115

Closed

This was referenced May 23, 2026

Drop stale #1085 cherry-pick on next upstream sync techempower-org/mempalace#165

Closed

docs: reassess cherry-pick-1085 — absorbed into fork architecture (#165) techempower-org/mempalace#197

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: batch ChromaDB inserts in miner - 10-30x faster mining#1085

perf: batch ChromaDB inserts in miner - 10-30x faster mining#1085
midweste wants to merge 2 commits into
MemPalace:developfrom
midweste:midweste

midweste commented Apr 21, 2026 •

edited

Loading

Uh oh!

midweste commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

midweste commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!