Prevent uint16 overflow in segment block_count #144

@tjgreen42

Description

Problem

The segment dictionary entry stores block_count as uint16, limiting each term to 65,535 blocks per segment. With TP_BLOCK_SIZE = 128 docs/block, this caps each term at 65,535 × 128 = 8,388,480 (about 8.4M) postings per segment.

For very large corpora (50M+ docs) with common terms appearing in most documents, a single compacted segment could exceed this limit. The current code has no overflow protection: block_count would silently wrap modulo 65,536, corrupting query results.
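To make the failure mode concrete, here is a minimal standalone sketch of the wrap. It is not the project's code: `blocks_needed` and `store_unchecked` are hypothetical helpers, and only the 128 docs/block figure and the uint16 field width come from this issue.

```c
#include <stdint.h>

/* From the issue: 128 docs per block. */
#define TP_BLOCK_SIZE 128u

/* Hypothetical helper: blocks needed for a posting list, rounding up.
 * The result is kept in 64 bits so this step cannot overflow. */
static uint64_t blocks_needed(uint64_t num_postings)
{
    return num_postings / TP_BLOCK_SIZE +
           (num_postings % TP_BLOCK_SIZE != 0 ? 1 : 0);
}

/* Hypothetical helper: what an unchecked store into a uint16 field keeps.
 * Conversion to an unsigned 16-bit type reduces the value modulo 65,536. */
static uint16_t store_unchecked(uint64_t num_blocks)
{
    return (uint16_t) num_blocks;
}
```

Under these assumptions, a term with 10,000,000 postings needs 78,125 blocks, but the stored count would be 78,125 - 65,536 = 12,589, so readers of the dictionary entry would see only about 1.6M of the term's postings.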

Affected Code

  • src/segment/segment.h - TpDictEntry.block_count is uint16
  • src/segment/segment.c:972 - calculates num_blocks without overflow check
  • src/segment/merge.c:927 - same issue in merge path

Options

  1. Assert/error if block_count exceeds UINT16_MAX during segment write
  2. Force segment split when approaching the limit (e.g., at 8M docs)
  3. Upgrade to uint32 (increases dictionary size, may affect cache efficiency)

Option 1 is the minimum safety fix. Option 2 would be more graceful for users.
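A rough sketch of what the Option 1 check could look like, assuming the block count is derived from a posting count at write time. `tp_block_count_checked` is a hypothetical name, not an existing function in segment.c or merge.c, and how the caller reports the failure (error, assert, or a forced split as in Option 2) is left open.

```c
#include <stdbool.h>
#include <stdint.h>

/* From the issue: 128 docs per block. */
#define TP_BLOCK_SIZE 128u

/*
 * Hypothetical helper: compute the block count for a term and store it
 * in *out only if it fits in the uint16 dictionary field.
 * Returns false (leaving *out untouched) when the count would overflow.
 */
static bool tp_block_count_checked(uint64_t num_postings, uint16_t *out)
{
    /* Round up without adding to num_postings, so this cannot overflow. */
    uint64_t num_blocks = num_postings / TP_BLOCK_SIZE +
                          (num_postings % TP_BLOCK_SIZE != 0 ? 1 : 0);

    if (num_blocks > UINT16_MAX)
        return false;

    *out = (uint16_t) num_blocks;
    return true;
}
```

With this shape, 8,388,480 postings (exactly 65,535 blocks) is accepted and 8,388,481 is rejected. The same check would apply in both the segment write path and the merge path.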

Context

Discovered while reviewing memory allocation for skip list pre-loading. The pre-loaded arrays (block_max_scores, block_last_doc_ids) use block_count from the dictionary entry, so overflow would also cause incorrect memory allocation.
