feat: widen segment offsets from uint32 to uint64 (V4 format) #220

Merged
tjgreen42 merged 7 commits into main from feat/v4-uint64-segment-offsets on Feb 17, 2026

tjgreen42 (Collaborator) commented Feb 11, 2026

Summary

  • Widen all segment logical byte offsets from uint32 to uint64, removing the 4GB segment size limit that causes offset overflow with large indexes (e.g., 138M documents producing a ~33GB merged L1 segment)
  • Introduce V3/V4 format versioning with backward-compatible readers: V3 segments are read and widened transparently, writes always produce V4, and natural compaction upgrades V3 to V4 over time
  • Bound parallel build memory: per-cycle memory context, DSA trim/limit, and worker cap based on maintenance_work_mem
  • Use palloc_extended with MCXT_ALLOC_HUGE in docmap finalization for allocations exceeding MaxAllocSize (~1GB) on very large segments
  • Remove MCXT_ALLOC_HUGE from merge vocabulary arrays (segment merge and parallel build buffer merge) — use plain palloc/repalloc so MaxAllocSize trips visibly if vocabulary ever reaches ~1GB
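
To make the overflow concrete, here is a minimal standalone sketch of what happens when an offset past 4GB is stored in a 32-bit field. The function names are hypothetical and not part of the extension; the 33 GiB value only mirrors the order of magnitude quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the value a uint32 offset field would end up storing. */
uint32_t sketch_narrow_offset(uint64_t logical_offset)
{
    /* Unsigned narrowing keeps only the low 32 bits (value mod 2^32). */
    return (uint32_t) logical_offset;
}

/* Returns 1 when the offset is representable in a uint32 field, else 0. */
int sketch_fits_uint32(uint64_t logical_offset)
{
    return logical_offset <= UINT32_MAX;
}

/* Self-check with an illustrative 33 GiB offset. */
void sketch_overflow_demo(void)
{
    uint64_t big = 33ULL << 30;

    assert(!sketch_fits_uint32(big));
    /* The narrowed value no longer equals the real offset. */
    assert((uint64_t) sketch_narrow_offset(big) != big);
}
```

A reader that trusted the narrowed value would seek to the wrong place in the segment, which matches the corrupted query results described in the commit message.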

Format changes

| Component | V3 (legacy) | V4 (new) | Overhead |
|---|---|---|---|
| TpSegmentHeader | 88 bytes | 128 bytes | 1 per segment |
| TpDictEntry | 12 bytes | 16 bytes | +33% per term |
| TpSkipEntry | 16 bytes | 20 bytes | +25% per block |

Estimated overhead on a 33GB index: ~380MB (~1.1%).
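
The per-component deltas in the table can be turned into a rough overhead formula. The sketch below is only arithmetic over the sizes listed above; the function names are hypothetical, and the segment, term, and block counts are inputs the caller must supply, since this PR does not state them for the 33GB index.

```c
#include <assert.h>
#include <stdint.h>

/* Size deltas taken from the format table (V4 size minus V3 size). */
enum
{
    SKETCH_HEADER_DELTA = 128 - 88, /* bytes per segment */
    SKETCH_DICT_DELTA = 16 - 12,    /* bytes per term */
    SKETCH_SKIP_DELTA = 20 - 16     /* bytes per posting block */
};

/* Extra bytes V4 needs compared to V3 for the given counts. */
uint64_t sketch_v4_overhead_bytes(uint64_t nsegments, uint64_t nterms,
                                  uint64_t nblocks)
{
    return nsegments * SKETCH_HEADER_DELTA
         + nterms * SKETCH_DICT_DELTA
         + nblocks * SKETCH_SKIP_DELTA;
}

/* Relative growth of one struct, in percent. */
double sketch_growth_pct(uint32_t v3_size, uint32_t v4_size)
{
    assert(v3_size > 0);
    return 100.0 * ((double) v4_size - (double) v3_size) / (double) v3_size;
}
```

Because the header delta is paid once per segment, the total is dominated by the per-term and per-block entries on any large index.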

Benchmark: MS MARCO (8.8M passages)

| Metric | main | V4 (this branch) | Delta |
|---|---|---|---|
| Index build time | 6m 4.5s | 5m 50.1s | -4.0% (noise) |
| Index size | 1,189 MB | 1,227 MB | +3.2% (+38 MB) |
| Query latency (avg) | 10.63 ms | 10.33 ms | -2.8% (noise) |

The +38 MB (+3.2%) index size increase is expected from the wider offsets. Build time and query latency differences are within run-to-run variance.

Key design decisions

  1. Dual-struct strategy: V3 legacy structs are preserved read-only. On read, version is detected from the header magic, and V3 fields are widened to V4. On write, V4 is always emitted.
  2. Extern version-aware readers: tp_segment_read_dict_entry() and tp_segment_read_skip_entry() centralize V3→V4 dispatch, avoiding duplicated version logic across scan.c, merge.c, and dump.c.
  3. Packed skip entries: TpSkipEntry remains __attribute__((packed)) to minimize posting index overhead (20 bytes vs 24 if aligned).
  4. Bounded parallel build memory: Each merge cycle uses a dedicated MemoryContext so allocations are truly freed. DSA is trimmed after clearing worker buffers and has a size limit based on worker count.

Files changed

| File | Changes |
|---|---|
| src/segment/segment.h | V3 legacy structs, V4 widened structs, version constants, helpers |
| src/segment/pagemapper.h | Widen inline functions to uint64 |
| src/segment/segment.c | Version-aware open, widened read/write, extern read_dict_entry, PRIu64 format strings |
| src/segment/scan.c | Version-aware dict/skip entry reads via centralized functions |
| src/segment/merge.c | Widened offset fields, version-aware source reads, V4 output, plain palloc for vocabulary |
| src/segment/docmap.c | palloc_extended with MCXT_ALLOC_HUGE for large docmap allocations |
| src/am/build_parallel.c | Widened offset fields, V4 output, per-cycle memory context, DSA trim/limit, worker cap, plain palloc for vocabulary |
| src/query/bmw.c | Updated expected spill count in BMW test |
| test/expected/bmw.out | Updated expected test output |

Testing

  • make compiles without warnings
  • make format-check passes
  • make installcheck passes
  • Benchmark on 8.8M MS MARCO: +3.2% index size, query performance within noise
  • 138M document index build verifies correct uint64 offsets on ~33GB merged segment
  • Queries return correct results on the large index (no offset overflow)

CLAassistant commented Feb 11, 2026

CLA assistant check
All committers have signed the CLA.

@tjgreen42 tjgreen42 force-pushed the feat/v4-uint64-segment-offsets branch 3 times, most recently from cd4cb7c to 1eea22d, February 12, 2026 17:16
tjgreen42 and others added 5 commits February 17, 2026 11:16
The segment format used uint32 for all logical byte offsets, limiting
segments to 4GB of data. With 138M documents, the merged L1 segment
exceeds this at ~33GB, causing offset overflow and corrupted query
results. This widens all offsets to uint64 while preserving backward
compatibility with V3 segments.

Dual-struct strategy: V3 legacy structs (read-only) and V4 structs
(read/write). On read, version is detected from header and V3 fields
are widened to uint64. On write, V4 is always emitted. Natural
compaction upgrades V3 segments to V4 over time.

Changes:
- Add V3 legacy structs (TpSegmentHeaderV3, TpDictEntryV3, TpSkipEntryV3)
- Update V4 structs: TpSegmentHeader (88→128 bytes), TpDictEntry (12→16
  bytes), TpSkipEntry (16→20 bytes)
- Version-aware readers: tp_segment_read_dict_entry(),
  tp_segment_read_skip_entry() handle V3/V4 transparently
- Widen tp_segment_read/get_direct to uint64 logical_offset
- Widen pagemapper inline functions to uint64
- V4 write paths in segment.c, merge.c, and build_parallel.c
- Version-aware dump functions with PRIu64 format strings
- Use palloc_extended with MCXT_ALLOC_HUGE in docmap for allocations
  exceeding MaxAllocSize (~1GB)

Overhead: ~1.1% on a 33GB index (skip entries 16→20 bytes, dict entries
12→16 bytes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
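
As an illustration of the widened pagemapper helpers listed in the commit above, the sketch below splits a 64-bit logical offset into a page index and an offset within that page. The struct, the function name, and the idea of a fixed per-page payload size passed as a parameter are assumptions for illustration; the real pagemapper.h may be organized differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of mapping a logical offset onto pages. */
typedef struct SketchPageLoc
{
    uint64_t page_index;  /* which logical page holds the byte */
    uint32_t page_offset; /* byte position inside that page's payload */
} SketchPageLoc;

/* Map a 64-bit logical offset to (page, offset) for a fixed payload size. */
SketchPageLoc sketch_locate(uint64_t logical_offset, uint32_t page_data_size)
{
    SketchPageLoc loc;

    assert(page_data_size > 0);
    loc.page_index = logical_offset / page_data_size;
    /* The remainder is always smaller than page_data_size, so it fits. */
    loc.page_offset = (uint32_t) (logical_offset % page_data_size);
    return loc;
}
```

With a uint32 input, the division would already be operating on a truncated offset, so widening has to start at these inline helpers for the rest of the read path to be correct.
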
- Replace WRITE_POSTING_BLOCKS macro in build_parallel.c with a static
  write_posting_blocks() function and always-chunked processing. This
  eliminates the two-path branching (bulk vs chunked) by always using
  chunked collection with a large enough chunk size (8M postings) that
  most terms complete in a single iteration. Removes the now-unused
  collect_buffer_term_postings() and estimate_buffer_term_postings()
  functions.

- Fix missing header.data_size assignment in build_parallel.c (pre-
  existing bug where data_size was never set, leaving it as 0).

- Change tp_segment_read_skip_entry() to accept uint64 skip_index_offset
  directly instead of TpDictEntry*. This removes the fragile tmp_dict
  workaround in merge.c where a temporary TpDictEntry was constructed
  just to pass a single field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
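
The always-chunked approach from the commit above can be sketched as a single loop: with a large enough chunk size, most terms finish in one iteration, so no separate bulk path is needed. The callback type and function name are hypothetical, and the chunk size is left as a parameter rather than hard-coding the 8M figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Callback invoked once per chunk covering postings [first, first + count). */
typedef void (*sketch_chunk_fn)(uint64_t first, uint64_t count, void *arg);

/*
 * Process a term's postings in fixed-size chunks.
 * Returns the number of iterations that were needed.
 */
uint64_t sketch_process_chunked(uint64_t total_postings, uint64_t chunk_size,
                                sketch_chunk_fn fn, void *arg)
{
    uint64_t done = 0;
    uint64_t iterations = 0;

    assert(chunk_size > 0);
    while (done < total_postings)
    {
        uint64_t remaining = total_postings - done;
        uint64_t n = remaining < chunk_size ? remaining : chunk_size;

        if (fn != NULL)
            fn(done, n, arg); /* collect and write this chunk */
        done += n;
        iterations++;
    }
    return iterations;
}
```

A term with fewer postings than the chunk size takes exactly one pass through the loop, which is the common case the commit message describes.
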
…it, and worker cap

The parallel index build on large datasets (138M MS MARCO v2 passages)
consumed unbounded memory because: (1) leader-side palloc/pfree never
returned pages to the OS, (2) DSA grew monotonically since dsa_free()
only recycles internally, (3) no DSA size limit was set, and (4)
writer.pages was leaked each merge cycle.

- Wrap each merge cycle in a dedicated MemoryContext so all leader
  allocations are truly freed when the context is deleted
- Call dsa_trim() after clearing worker buffers to release unused DSA
  segments back to the OS
- Set dsa_set_size_limit() based on worker count and spill threshold
  so workers get a clear ERROR instead of OOMing the machine
- Cap worker count to maintenance_work_mem / 32MB per worker
- Free writer.pages after tp_segment_writer_finish() (matching merge.c)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
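
The worker cap in the commit above is simple arithmetic, sketched below. The function name is hypothetical, the input is assumed to be in kilobytes (the unit PostgreSQL uses for maintenance_work_mem), and the floor of one worker is an assumption of this sketch rather than something the commit states.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_KB_PER_WORKER (32 * 1024) /* 32MB per worker, in kB */

/* Cap the requested worker count at maintenance_work_mem / 32MB. */
int sketch_cap_workers(int requested, int64_t maintenance_work_mem_kb)
{
    int64_t cap;

    assert(requested >= 0 && maintenance_work_mem_kb >= 0);
    cap = maintenance_work_mem_kb / SKETCH_KB_PER_WORKER;
    if (cap < 1)
        cap = 1; /* assumption: never go below one worker */
    /* When the cap wins it is no larger than requested, so it fits an int. */
    return requested < cap ? requested : (int) cap;
}
```

For example, a 64MB setting allows at most two workers under this rule, regardless of how many were requested.
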
V4's wider uint64 offsets increase segment overhead slightly,
causing the 500-row hybrid test to produce 8 spilled entries instead of 7.

Use plain palloc/repalloc instead of MCXT_ALLOC_HUGE so that
MaxAllocSize trips visibly rather than silently allowing 1GB+
allocations. Add explicit casts for multilevel pointer conversions.
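
The "trips visibly" behavior relies on the size check that plain palloc applies: requests above MaxAllocSize (0x3fffffff bytes in PostgreSQL) raise an error unless the caller opts in with MCXT_ALLOC_HUGE. The standalone sketch below only mimics that check; the function names are hypothetical, and it does not call any PostgreSQL code.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the value of PostgreSQL's MaxAllocSize (just under 1 GiB). */
#define SKETCH_MAX_ALLOC_SIZE ((uint64_t) 0x3fffffff)

/* Bytes needed for an array, guarding against multiplication overflow. */
uint64_t sketch_array_bytes(uint64_t nentries, uint64_t entry_size)
{
    assert(entry_size == 0 || nentries <= UINT64_MAX / entry_size);
    return nentries * entry_size;
}

/* Returns 1 if a plain (non-huge) request of this size would be accepted. */
int sketch_plain_alloc_ok(uint64_t nbytes)
{
    return nbytes <= SKETCH_MAX_ALLOC_SIZE;
}
```

Under this check, a vocabulary array that grew toward 1GB would fail loudly at allocation time instead of growing without notice.
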
@tjgreen42 tjgreen42 force-pushed the feat/v4-uint64-segment-offsets branch from dfa9870 to 13dc08d, February 17, 2026 19:16
@tjgreen42 tjgreen42 marked this pull request as ready for review February 17, 2026 19:27
The limit was too restrictive: workers ignore maintenance_work_mem
and each buffer can hold up to tp_memtable_spill_threshold postings
(~384 MB), so with 4 workers the 1 GB floor was easily exceeded.
Proper memory bounding requires reworking the spill threshold to
respect maintenance_work_mem, which is a larger follow-up.
@tjgreen42 tjgreen42 merged commit 45437f1 into main Feb 17, 2026
19 checks passed
@tjgreen42 tjgreen42 deleted the feat/v4-uint64-segment-offsets branch February 17, 2026 21:07

2 participants