fix: chunk collection.add() to avoid ChromaDB 5,461-item hard limit#520
fix: chunk collection.add() to avoid ChromaDB 5,461-item hard limit#520phobicdotno wants to merge 3 commits into
Conversation
|
This is exactly the right fix and the implementation is clean. The A few observations from our experience running 9,000+ drawer palaces: The 5,000 limit choice is correct. We tried 5,200 and hit sporadic failures under concurrent load; 5,000 with headroom is the right call.
One minor thing: the progress print in Overall: solid fix for a real, hard-to-diagnose footgun. Tested on our end at 9,245 drawers — no issues. |
Five new tests covering large batches (6000 items), small batches (100), empty input, boundary (exactly 5000), and upsert idempotency. Uses pre-computed embeddings to avoid downloading the sentence-transformers model during CI.
|
Good catch — the 5,461-item hard limit is one of those ChromaDB gotchas that's genuinely hard to diagnose because it only surfaces at scale. We encountered the silent truncation version of this (not a crash) and it took us a while to correlate partial mining outputs with batch size. The 5,000 chunk size is a safe margin, but you might want it configurable. ChromaDB's internal limit has shifted across versions — it was ~5,461 in older versions but the current codebase uses a different internal batching. A constant Progress reporting during chunked adds: for large directories (the 8,886-drawer case you benchmarked), chunked adds will be silent for a while. Even a simple Partial failure handling: if chunk N fails after chunks 0..N-1 have been committed, the palace is in a partially-mined state. Worth documenting this and recommending the caller re-mine the directory (the existing dedup via The test coverage for the 5,461-limit case is exactly right — a synthetic test that creates 6,000 items and confirms all land in the collection is the kind of regression test that would have caught this early. Good PR for a subtle production issue. |
|
Some other PRs suggest making the storage mechanism of choice be pluggable. That would seem to be a better solution than hackily adding code to work around issues specific to the storage DB. Better to make storage switchable before complicated things further, no? |
|
@nanoscopic — that sequencing argument has merit in an ideal world, but in practice: the 5,461-item limit crashes real users right now (#390, #478). Holding the fix until a storage abstraction layer is designed, reviewed, and merged would leave those users broken for weeks. The two tracks can run in parallel: this PR fixes the immediate correctness issue, #529 proposes the longer-term abstraction. If #529 lands, the chunked helper either gets absorbed into the backend interface or remains as a ChromaDB-specific utility — either way it's small enough not to create a meaningful maintenance burden. |
switching storage might be the way to go |
|
Hi, thanks for the contribution. This PR has merge conflicts with Could you rebase onto If this change is no longer relevant, feel free to close the PR. (This message is part of a periodic backlog pass, sent to all open PRs that match this state.) |
|
Closing — commit Would have appreciated a heads-up given this PR sat open for 14 days before that commit landed, and the templated stale-PR message arrived after. |
Problem
ChromaDB's
collection.add()has an undocumented hard limit of ~5,461 items per call. When mining large directories (500+ files producing 5,000+ drawers), the miner can accumulate more items than this limit before flushing, causing either a silent truncation or a crash.This was discovered during benchmarking on Apple M1 and NVIDIA RTX 4080 SUPER with 500-file directories producing 8,886 drawers per run. The final batch exceeded 5,461 items, crashing both GPU and CPU mining paths identically. See benchmark results and discussion in #515.
This is difficult to diagnose because:
Fix
Added
chunked_add()andchunked_upsert()helpers that automatically split large batches into chunks of 5,000 items (safe margin under the 5,461 hard limit). Allcollection.add()andcollection.upsert()call sites updated to use these helpers.Changes
mempalace/miner.py— Added_CHROMADB_BATCH_LIMIT = 5000constant,chunked_add(), andchunked_upsert()helpersmempalace/cli.py— Updatedrepaircommand to usechunked_add()tests/test_miner.py— 5 new tests covering large batches, small batches, empty input, boundary cases, and upsertTest plan