fix: chunk collection.add() to avoid ChromaDB 5,461-item hard limit by phobicdotno · Pull Request #520 · MemPalace/mempalace

phobicdotno · 2026-04-10T10:29:54Z

Problem

ChromaDB's collection.add() has an undocumented hard limit of ~5,461 items per call. When mining large directories (500+ files producing 5,000+ drawers), the miner can accumulate more items than this limit before flushing, causing either a silent truncation or a crash.

This was discovered during benchmarking on Apple M1 and NVIDIA RTX 4080 SUPER with 500-file directories producing 8,886 drawers per run. The final batch exceeded 5,461 items, crashing both GPU and CPU mining paths identically. See benchmark results and discussion in #515.

This is difficult to diagnose because:

The limit is not documented in ChromaDB's API
The error message doesn't clearly indicate a batch size constraint
Partial flushes can make it look like some files were mined when they weren't
The issue only surfaces with large directories, so small-scale testing won't catch it

Fix

Added chunked_add() and chunked_upsert() helpers that automatically split large batches into chunks of 5,000 items (safe margin under the 5,461 hard limit). All collection.add() and collection.upsert() call sites updated to use these helpers.

Changes

mempalace/miner.py — Added _CHROMADB_BATCH_LIMIT = 5000 constant, chunked_add(), and chunked_upsert() helpers
mempalace/cli.py — Updated repair command to use chunked_add()
tests/test_miner.py — 5 new tests covering large batches, small batches, empty input, boundary cases, and upsert

Test plan

All existing tests pass
5 new tests for chunked batch operations
Ruff check and format clean
Tested with directories producing 8,000+ drawers — previously crashed, now completes
No new dependencies

… limit

web3guru888 · 2026-04-10T10:32:11Z

This is exactly the right fix and the implementation is clean. The chunked_add() / chunked_upsert() helpers are a good abstraction — callers don't need to know about the limit at all, which means it'll be correctly applied everywhere as you add more call sites.

A few observations from our experience running 9,000+ drawer palaces:

The 5,000 limit choice is correct. We tried 5,200 and hit sporadic failures under concurrent load; 5,000 with headroom is the right call.

chunked_upsert will be needed in add_drawer(). Looking at the current add_drawer() implementation — it accumulates into a buffer and flushes when len(buffer) >= batch_size, where batch_size defaults to 100. So individual flushes will never hit the limit. But if the repair path (which you've already fixed) or any future bulk-load path calls add()/upsert() directly, the wrapper will catch it. The abstraction pays forward.

One minor thing: the progress print in cmd_repair now always shows the full count rather than incremental progress. Previous code printed Re-filed 100/5000..., Re-filed 200/5000..., etc. For very large palaces (15K+ drawers) this is useful feedback. Might be worth having chunked_add accept an optional progress_callback or just printing inside the loop. Not blocking, just a UX note.

Overall: solid fix for a real, hard-to-diagnose footgun. Tested on our end at 9,245 drawers — no issues.

Five new tests covering large batches (6000 items), small batches (100), empty input, boundary (exactly 5000), and upsert idempotency. Uses pre-computed embeddings to avoid downloading the sentence-transformers model during CI.

web3guru888 · 2026-04-10T10:59:21Z

Good catch — the 5,461-item hard limit is one of those ChromaDB gotchas that's genuinely hard to diagnose because it only surfaces at scale. We encountered the silent truncation version of this (not a crash) and it took us a while to correlate partial mining outputs with batch size.

The chunked_add() / chunked_upsert() helper approach is the right fix. A few notes from operating at this scale:

5,000 chunk size is a safe margin, but you might want it configurable. ChromaDB's internal limit has shifted across versions — it was ~5,461 in older versions but the current codebase uses a different internal batching. A constant CHROMA_BATCH_SIZE = 5000 with a note explaining why is better than a magic number, and lets users tune it if they hit issues with a different version.

Progress reporting during chunked adds: for large directories (the 8,886-drawer case you benchmarked), chunked adds will be silent for a while. Even a simple logger.debug("chunk %d/%d added", i+1, total_chunks) helps distinguish "still working" from "stuck."

Partial failure handling: if chunk N fails after chunks 0..N-1 have been committed, the palace is in a partially-mined state. Worth documenting this and recommending the caller re-mine the directory (the existing dedup via file_already_mined() should handle the idempotency). Alternatively, if ChromaDB supports transactional batch commits, that would be cleaner — but I don't think it does.

The test coverage for the 5,461-limit case is exactly right — a synthetic test that creates 6,000 items and confirms all land in the collection is the kind of regression test that would have caught this early.

Good PR for a subtle production issue.

MemPalace-AGI dashboard

nanoscopic · 2026-04-10T14:37:15Z

Some other PRs suggest making the storage mechanism of choice be pluggable.

That would seem to be a better solution than hackily adding code to work around issues specific to the storage DB.

Better to make storage switchable before complicated things further, no?

web3guru888 · 2026-04-10T15:47:42Z

@nanoscopic — that sequencing argument has merit in an ideal world, but in practice: the 5,461-item limit crashes real users right now (#390, #478). Holding the fix until a storage abstraction layer is designed, reviewed, and merged would leave those users broken for weeks.

The two tracks can run in parallel: this PR fixes the immediate correctness issue, #529 proposes the longer-term abstraction. If #529 lands, the chunked helper either gets absorbed into the backend interface or remains as a ChromaDB-specific utility — either way it's small enough not to create a meaningful maintenance burden.

bensig · 2026-04-11T06:24:37Z

Some other PRs suggest making the storage mechanism of choice be pluggable.

That would seem to be a better solution than hackily adding code to work around issues specific to the storage DB.

Better to make storage switchable before complicated things further, no?

switching storage might be the way to go

igorls · 2026-05-08T11:00:46Z

Hi, thanks for the contribution.

This PR has merge conflicts with develop, and the branch has not been updated in over 7 days, which puts it before our most recent release. The conflicts are likely against work that landed in that release.

Could you rebase onto develop so we can take another look?

If this change is no longer relevant, feel free to close the PR.

(This message is part of a periodic backlog pass, sent to all open PRs that match this state.)

phobicdotno · 2026-05-14T10:29:38Z

Closing — commit a4868a3 ("perf(mining): batch per-chunk upserts and add optional GPU acceleration", Apr 24) already shipped chunked batching with the same 5,000 limit, baked into _rebuild_collection_via_temp in repair.py.

Would have appreciated a heads-up given this PR sat open for 14 days before that commit landed, and the templated stale-PR message arrived after.

fix: chunk collection.add() calls to avoid ChromaDB's 5,461-item hard…

438794c

… limit

This was referenced Apr 10, 2026

HNSW link_lists.bin grows to terabytes, causes segfault and APFS orphaned blocks on macOS #525

Closed

Remove Baldfaced Lies Please #524

Closed

web3guru888 mentioned this pull request Apr 10, 2026

V4.0 proposal - Ditch ChromaDB in favor of LanceDB, add sync capabilities #529

Closed

3 tasks

Merge branch 'main' into fix/chromadb-batch-limit

a4b3f36

bensig requested review from bensig and milla-jovovich as code owners April 11, 2026 06:23

bensig changed the base branch from main to develop April 11, 2026 22:21

bensig requested a review from igorls as a code owner April 11, 2026 22:21

igorls added area/cli CLI commands area/mining File and conversation mining bug Something isn't working labels Apr 14, 2026

igorls added the needs-rebase PR has merge conflicts with develop and needs rebase label May 8, 2026

phobicdotno closed this May 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: chunk collection.add() to avoid ChromaDB 5,461-item hard limit#520

fix: chunk collection.add() to avoid ChromaDB 5,461-item hard limit#520
phobicdotno wants to merge 3 commits into
MemPalace:developfrom
phobicdotno:fix/chromadb-batch-limit

phobicdotno commented Apr 10, 2026 •

edited

Loading

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

nanoscopic commented Apr 10, 2026

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

bensig commented Apr 11, 2026

Uh oh!

igorls commented May 8, 2026

Uh oh!

phobicdotno commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

phobicdotno commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Changes

Test plan

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

nanoscopic commented Apr 10, 2026

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

bensig commented Apr 11, 2026

Uh oh!

igorls commented May 8, 2026

Uh oh!

phobicdotno commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

phobicdotno commented Apr 10, 2026 •

edited

Loading