feat: widen segment offsets from uint32 to uint64 (V4 format) #220
Merged
The segment format used uint32 for all logical byte offsets, limiting segments to 4GB of data. With 138M documents, the merged L1 segment exceeds this at ~33GB, causing offset overflow and corrupted query results.

This widens all offsets to uint64 while preserving backward compatibility with V3 segments. Dual-struct strategy: V3 legacy structs (read-only) and V4 structs (read/write). On read, the version is detected from the header and V3 fields are widened to uint64. On write, V4 is always emitted. Natural compaction upgrades V3 segments to V4 over time.

Changes:
- Add V3 legacy structs (TpSegmentHeaderV3, TpDictEntryV3, TpSkipEntryV3)
- Update V4 structs: TpSegmentHeader (88→128 bytes), TpDictEntry (12→16 bytes), TpSkipEntry (16→20 bytes)
- Version-aware readers: tp_segment_read_dict_entry(), tp_segment_read_skip_entry() handle V3/V4 transparently
- Widen tp_segment_read/get_direct to uint64 logical_offset
- Widen pagemapper inline functions to uint64
- V4 write paths in segment.c, merge.c, and build_parallel.c
- Version-aware dump functions with PRIu64 format strings
- Use palloc_extended with MCXT_ALLOC_HUGE in docmap for >2GB allocations

Overhead: ~1.1% on a 33GB index (skip entries 16→20 bytes, dict entries 12→16 bytes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
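The dual-struct read path might look roughly like this. The struct layouts and field names below are hypothetical stand-ins for the actual TpDictEntry definitions; only the widen-on-read behavior is taken from the description above:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk layouts: V3 keeps a 32-bit posting offset,
 * V4 widens it to 64 bits. Real field names differ. */
typedef struct { uint32_t term_offset; uint32_t posting_offset; uint32_t doc_count; } DictEntryV3;
typedef struct { uint32_t term_offset; uint32_t doc_count; uint64_t posting_offset; } DictEntryV4;

/* Version-aware reader: V3 entries are widened to the in-memory V4
 * representation, so callers only ever see uint64 offsets. */
static void read_dict_entry(const void *raw, uint32_t segment_version, DictEntryV4 *out)
{
    if (segment_version <= 3) {
        DictEntryV3 v3;
        memcpy(&v3, raw, sizeof(v3));
        out->term_offset = v3.term_offset;
        out->doc_count = v3.doc_count;
        out->posting_offset = (uint64_t) v3.posting_offset;  /* widen 32 -> 64 */
    } else {
        memcpy(out, raw, sizeof(*out));
    }
}
```

Centralizing the dispatch in one reader keeps the version check out of every call site, which matches the PR's stated goal of handling V3/V4 transparently.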
- Replace the WRITE_POSTING_BLOCKS macro in build_parallel.c with a static write_posting_blocks() function and always-chunked processing. This eliminates the two-path branching (bulk vs chunked) by always using chunked collection with a chunk size (8M postings) large enough that most terms complete in a single iteration. Removes the now-unused collect_buffer_term_postings() and estimate_buffer_term_postings() functions.
- Fix missing header.data_size assignment in build_parallel.c (pre-existing bug where data_size was never set, leaving it as 0).
- Change tp_segment_read_skip_entry() to accept uint64 skip_index_offset directly instead of TpDictEntry*. This removes the fragile tmp_dict workaround in merge.c, where a temporary TpDictEntry was constructed just to pass a single field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
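The always-chunked shape can be sketched as a single loop whose chunk size covers typical terms in one pass. Everything here except the 8M-posting chunk size is invented for illustration (the real write_posting_blocks() emits posting blocks, not callback invocations):

```c
#include <stdint.h>

/* Large enough that most terms finish in a single iteration, so the old
 * bulk fast path is unnecessary. */
#define CHUNK_POSTINGS (8u * 1024 * 1024)

/* Process a term's postings in fixed-size chunks via a caller-supplied
 * emit callback (may be NULL); returns the number of iterations. */
static uint32_t write_posting_blocks(uint64_t n_postings,
                                     void (*emit)(uint64_t start, uint64_t count))
{
    uint32_t iters = 0;

    for (uint64_t done = 0; done < n_postings; done += CHUNK_POSTINGS) {
        uint64_t count = n_postings - done;

        if (count > CHUNK_POSTINGS)
            count = CHUNK_POSTINGS;
        if (emit)
            emit(done, count);
        iters++;
    }
    return iters;
}
```

A term with a few thousand postings takes one iteration; only the rare very hot term loops, so collapsing the two paths costs essentially nothing in the common case.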
…it, and worker cap

The parallel index build on large datasets (138M MS MARCO v2 passages) consumed unbounded memory because: (1) leader-side palloc/pfree never returned pages to the OS, (2) DSA grew monotonically since dsa_free() only recycles internally, (3) no DSA size limit was set, and (4) writer.pages was leaked each merge cycle.

- Wrap each merge cycle in a dedicated MemoryContext so all leader allocations are truly freed when the context is deleted
- Call dsa_trim() after clearing worker buffers to release unused DSA segments back to the OS
- Set dsa_set_size_limit() based on worker count and spill threshold so workers get a clear ERROR instead of OOMing the machine
- Cap worker count to maintenance_work_mem / 32MB per worker
- Free writer.pages after tp_segment_writer_finish() (matching merge.c)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
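In PostgreSQL terms, the per-cycle context pattern looks roughly like this. This is a sketch against the server's MemoryContext and DSA APIs; merge_one_cycle() and state are hypothetical names:

```c
/* Sketch: give each merge cycle its own short-lived context, so individual
 * pfree bookkeeping inside the cycle stops mattering -- deleting the context
 * returns everything at once. */
MemoryContext merge_ctx = AllocSetContextCreate(CurrentMemoryContext,
                                                "merge cycle",
                                                ALLOCSET_DEFAULT_SIZES);
MemoryContext old_ctx = MemoryContextSwitchTo(merge_ctx);

merge_one_cycle(state);           /* all leader-side pallocs land in merge_ctx */

MemoryContextSwitchTo(old_ctx);
MemoryContextDelete(merge_ctx);   /* truly frees, unlike palloc/pfree churn */

dsa_trim(state->area);            /* return unused DSA segments to the OS */
```

MemoryContextDelete releases whole blocks back to the allocator rather than recycling chunks internally, which is why it fixes the "palloc/pfree never returned pages to the OS" symptom above.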
V4's wider uint64 offsets slightly increase segment overhead, causing the 500-row hybrid test to produce 8 spilled entries instead of 7.
Use plain palloc/repalloc instead of MCXT_ALLOC_HUGE so that MaxAllocSize trips visibly rather than silently allowing 1GB+ allocations. Add explicit casts for multilevel pointer conversions.
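For reference, the two allocation behaviors differ like this (PostgreSQL API; nbytes, vocab, and docmap are illustrative names):

```c
/* Plain palloc: ERRORs with "invalid memory alloc request size" once nbytes
 * exceeds MaxAllocSize (~1GB) -- the visible trip wire the merge vocabulary
 * arrays now rely on instead of silently growing past 1GB. */
char *vocab = palloc(nbytes);

/* Huge variant: allowed past MaxAllocSize; reserved for the one place where
 * >1GB is genuinely expected (docmap finalization on very large segments). */
char *docmap = palloc_extended(nbytes, MCXT_ALLOC_HUGE);
```

Keeping MCXT_ALLOC_HUGE out of paths that should never reach 1GB turns a silent memory blow-up into an immediate, diagnosable ERROR.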
The limit was too restrictive: workers ignore maintenance_work_mem, and each buffer can hold up to tp_memtable_spill_threshold postings (~384 MB), so with 4 workers (~1.5 GB of buffers) the 1 GB floor was easily exceeded. Properly bounding memory requires reworking the spill threshold to respect maintenance_work_mem, which is a larger follow-up.
Summary
- Widen all segment offsets from uint32 to uint64, removing the 4GB segment size limit that causes offset overflow with large indexes (e.g., 138M documents producing a ~33GB merged L1 segment)
- Bound parallel build memory with a per-merge MemoryContext, DSA trimming, and a DSA size limit, with worker count capped via maintenance_work_mem
- Use palloc_extended with MCXT_ALLOC_HUGE in docmap finalization for allocations exceeding MaxAllocSize (~1GB) on very large segments
- Remove MCXT_ALLOC_HUGE from merge vocabulary arrays (segment merge and parallel build buffer merge); use plain palloc/repalloc so MaxAllocSize trips visibly if vocabulary ever reaches ~1GB

Format changes
- TpSegmentHeader: 88 → 128 bytes
- TpDictEntry: 12 → 16 bytes
- TpSkipEntry: 16 → 20 bytes

Estimated overhead on a 33GB index: ~380MB (~1.1%).
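The 20-byte skip entry depends on packing. A hypothetical layout (field names invented; only the sizes come from the PR) shows why:

```c
#include <stdint.h>

/* Hypothetical V4 skip entry: one widened 64-bit offset plus three 32-bit
 * fields. With packing, no padding is inserted: 8 + 4 + 4 + 4 = 20 bytes. */
typedef struct __attribute__((packed)) {
    uint64_t posting_offset;   /* widened from uint32 in V3 */
    uint32_t last_docid;
    uint32_t block_len;
    uint32_t max_weight;
} SkipEntryV4Packed;

/* Same fields without packing: the uint64 member forces 8-byte struct
 * alignment, padding the 20 bytes of data out to 24 bytes. */
typedef struct {
    uint64_t posting_offset;
    uint32_t last_docid;
    uint32_t block_len;
    uint32_t max_weight;
} SkipEntryV4Aligned;

_Static_assert(sizeof(SkipEntryV4Packed) == 20, "packed skip entry is 20 bytes");
_Static_assert(sizeof(SkipEntryV4Aligned) == 24, "aligned variant pads to 24 bytes");
```

Across millions of skip entries, those 4 padding bytes per entry would add roughly 20% to the posting index overhead, which is why the struct stays packed.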
Benchmark: MS MARCO (8.8M passages)
The +38 MB (+3.2%) index size increase is expected from the wider offsets. Build time and query latency differences are within run-to-run variance.
Key design decisions
- tp_segment_read_dict_entry() and tp_segment_read_skip_entry() centralize V3→V4 dispatch, avoiding duplicated version logic across scan.c, merge.c, and dump.c.
- TpSkipEntry remains __attribute__((packed)) to minimize posting index overhead (20 bytes vs 24 if aligned).
- Each merge cycle in the parallel build runs in a dedicated MemoryContext so allocations are truly freed. DSA is trimmed after clearing worker buffers and has a size limit based on worker count.

Files changed
- src/segment/segment.h: V3 legacy structs, widened V4 structs
- src/segment/pagemapper.h: inline functions widened to uint64
- src/segment/segment.c: version-aware read_dict_entry, PRIu64 format strings
- src/segment/scan.c: version-aware dict/skip entry reads
- src/segment/merge.c: V4 write paths
- src/segment/docmap.c: palloc_extended with MCXT_ALLOC_HUGE for large docmap allocations
- src/am/build_parallel.c: V4 write paths, write_posting_blocks() rework
- src/query/bmw.c
- test/expected/bmw.out: 8 spilled entries instead of 7

Testing
- make compiles without warnings
- make format-check passes
- make installcheck passes