Prevent uint16 overflow in segment block_count #144
Closed
Description
Problem
The segment dictionary entry stores block_count as uint16, limiting each term to 65,535 blocks per segment. With TP_BLOCK_SIZE = 128 docs/block, this caps each term at 65,535 × 128 = 8,388,480 (about 8.4M) postings per segment.
For very large corpora (50M+ docs) with common terms appearing in most documents, a single compacted segment could exceed this limit. The current code has no overflow protection - block_count would silently wrap, corrupting query results.
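To make the failure mode concrete, here is a minimal sketch of the arithmetic. The helper names tp_blocks_needed and tp_store_unchecked are hypothetical and do not come from the codebase; only TP_BLOCK_SIZE = 128 and the uint16 field width are taken from the description above.

```c
#include <stdint.h>

#define TP_BLOCK_SIZE 128 /* docs per block, as stated in this issue */

/* Hypothetical helper: blocks needed for a posting list (ceiling division). */
static uint32_t tp_blocks_needed(uint64_t num_postings)
{
    return (uint32_t)((num_postings + TP_BLOCK_SIZE - 1) / TP_BLOCK_SIZE);
}

/* Hypothetical helper: mimics assigning the count to a uint16 field with
 * no range check. Conversion to an unsigned type is modulo 65536. */
static uint16_t tp_store_unchecked(uint32_t num_blocks)
{
    return (uint16_t)num_blocks;
}
```

Under these assumptions, 8,388,480 postings still fit (65,535 blocks), but one more posting needs 65,536 blocks and the stored value wraps to 0. A term with 50M postings needs 390,625 blocks and would be stored as 62,945.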
Affected Code
- src/segment/segment.h - TpDictEntry.block_count is uint16
- src/segment/segment.c:972 - calculates num_blocks without overflow check
- src/segment/merge.c:927 - same issue in merge path
Options
1. Assert/error if block_count exceeds UINT16_MAX during segment write
2. Force segment split when approaching the limit (e.g., at 8M docs)
3. Upgrade to uint32 (increases dictionary size, may affect cache efficiency)
Option 1 is the minimum safety fix. Option 2 would be more graceful for users.
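A rough sketch of what Option 1 could look like follows. The function tp_set_block_count_checked is a hypothetical name, not an existing function in the codebase, and the real fix would presumably raise an error through whatever reporting mechanism the writer and merge paths already use.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical checked store for a uint16 block_count field.
 * Returns false and leaves *out untouched when num_blocks does not fit,
 * so the caller can report an error instead of writing a wrapped value.
 */
static bool tp_set_block_count_checked(uint16_t *out, uint64_t num_blocks)
{
    if (num_blocks > UINT16_MAX)
        return false;

    *out = (uint16_t)num_blocks;
    return true;
}
```

Both the segment write path and the merge path would need the same check, since each computes num_blocks independently.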
Context
Discovered while reviewing memory allocation for skip list pre-loading. The pre-loaded arrays (block_max_scores, block_last_doc_ids) use block_count from the dictionary entry, so overflow would also cause incorrect memory allocation.