
Implement V2 segment format with block storage for BMW optimization#81

Merged
tjgreen42 merged 36 commits into main from block-storage-phase1
Dec 22, 2025

Conversation


tjgreen42 commented Dec 17, 2025

Summary

  • Implements Phase 1 of block storage optimization with V2 segment format
  • Block-based posting storage (128 docs/block) with skip index for future BMW optimization
  • Fieldnorm table using Lucene SmallFloat encoding (1 byte per doc)
  • Segment-local document IDs with CTID map for heap tuple lookup
  • All segment writes switched to V2 format (breaking change, metapage version 5)

Compatibility

  • Read: Supports both V1 and V2 segments (query code branches on header->version)
  • Write: V2 only - all new segments use V2 format
  • Upgrade path: Existing indexes must be rebuilt (standard for pre-1.0 releases)

Key Changes

  • New files: docmap.c/h (CTID mapping), fieldnorm.h (length quantization)
  • segment_merge.c: Complete rewrite to read V2 sources and write V2 output
  • segment_query.c: V2 query iterator with block-based iteration
  • segment.c: V2 writer and dump function updates

Format Details

  • TpBlockPosting (8 bytes): doc_id + frequency (down from 14-byte TpSegmentPosting)
  • TpSkipEntry (16 bytes): per-block statistics for BMW block skipping
  • TpDictEntryV2 (12 bytes): skip_index_offset + block_count + doc_freq
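These records could look roughly like the following C structs. Field names and ordering are assumptions based on the sizes quoted above, not the extension's actual definitions; the inline fieldnorm byte in TpBlockPosting reflects a later commit in this PR.

```c
#include <stdint.h>

/* Illustrative layouts matching the sizes quoted above. */

typedef struct {
    uint32_t doc_id;     /* segment-local document id */
    uint16_t frequency;  /* term frequency */
    uint8_t  fieldnorm;  /* inline encoded length (added later in this PR) */
    uint8_t  reserved;
} TpBlockPosting;        /* 8 bytes, down from the 14-byte TpSegmentPosting */

typedef struct {
    uint32_t first_doc_id;   /* lowest doc_id in the 128-doc block */
    uint32_t last_doc_id;    /* highest doc_id in the block */
    uint32_t posting_offset; /* where the block's postings begin */
    uint8_t  block_max_norm; /* per-block stats for BMW skipping */
    uint8_t  pad[3];
} TpSkipEntry;               /* 16 bytes */

typedef struct {
    uint32_t skip_index_offset;
    uint32_t block_count;
    uint32_t doc_freq;
} TpDictEntryV2;             /* 12 bytes */
```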

Configuration Changes

  • Memtable spill threshold: Bumped from 800K to 32M posting entries (~1M docs/segment)
  • Benchmark config: segments_per_level = 16 for fewer merges during benchmarking
  • Segment statistics: Added bm25_summarize_index() calls to benchmark load scripts

Testing

  • All 29 regression tests pass
  • Merge, segment, and scoring tests validate V2 format correctness

tjgreen42 and others added 17 commits December 17, 2025 13:26
This implements Phase 1 of the block storage optimization plan:

V2 segment format changes:
- Block-based posting storage: 128 docs per block with TpBlockPosting (8 bytes)
- Skip index with TpSkipEntry (16 bytes) enabling BMW block skipping
- Fieldnorm table using Lucene SmallFloat encoding (1 byte per doc)
- CTID map for segment-local doc_id → heap tuple lookup
- TpDictEntryV2 (12 bytes) with skip_index_offset + block_count

New infrastructure:
- docmap.c/h: Document ID mapping with CTID hash table
- fieldnorm.h: Lucene SmallFloat encode/decode with precomputed table
- segment_query.c: V2 query iterator with block-based iteration

Key changes:
- All segment writers switched from V1 to V2 (index build, spill, merge)
- Merge completely rewritten to read V2 sources and write V2 output
- Dump function updated to display V2 format details
- Metapage version bumped to 5 (breaking change)

The V2 format reduces storage (posting: 14→8 bytes) and enables
future BMW optimization by providing block-level statistics for
early termination during top-k queries.

V1 code retained but unused - will be removed in future cleanup.
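The early termination that the skip index enables can be sketched as below. SkipEntry is a simplified stand-in for the 16-byte TpSkipEntry, and block_can_skip is a hypothetical helper, not code from this PR: a block is skipped when its best possible score cannot beat the current top-k threshold.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 128  /* docs per block, as in the V2 format */

/* Simplified stand-in for TpSkipEntry's per-block statistics. */
typedef struct {
    uint32_t last_doc_id;     /* highest doc_id in the block */
    float    block_max_score; /* upper bound on any score in the block */
} SkipEntry;

/* BMW-style skip test: if even the block's best score cannot enter
 * the current top-k, the whole 128-doc block needs no I/O or scoring. */
static bool block_can_skip(const SkipEntry *e, float threshold)
{
    return e->block_max_score <= threshold;
}
```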
V2 segment queries were slow because the iterator performed per-posting
I/O to look up CTIDs and fieldnorms. This change preloads both tables
into memory when opening a V2 segment reader.

Performance improvement on 50K document benchmark:
- Common term queries: 2.1-2.6x faster
- Buffer hits reduced by 23-34x (e.g., 30K → 900)

Memory overhead per segment reader:
- Fieldnorm: 1 byte per document
- CTID map: 6 bytes per document
- Total: ~7 bytes per document (e.g., 700KB for 100K docs)

The caching remains beneficial with BMW optimization since we still
need CTIDs for result output and fieldnorms for scoring on blocks
that aren't skipped.

The unconditional caching caused a severe regression on large segments
(MS MARCO 8.8M docs was 5-7x slower) because:
- Loading 60MB of cache data upfront is expensive
- Top-k queries only access a small fraction of documents
- PostgreSQL's buffer cache efficiently handles sparse access patterns

With 100K threshold:
- Small segments (flush, early compaction): cached for fast iteration
- Large segments (late compaction): use per-posting reads via buffer cache
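A minimal sketch of the size-gated decision described above. The 100K constant comes from this commit; the function and field names are illustrative, not the extension's actual identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_DOC_THRESHOLD 100000  /* 100K docs, per this commit */

/* Small segments: preload CTID map + fieldnorms (~7 bytes/doc) for
 * fast iteration. Large segments: fall back to per-posting reads via
 * PostgreSQL's buffer cache, since top-k queries touch only a sparse
 * subset of documents. */
static bool should_cache_tables(uint32_t doc_count)
{
    return doc_count <= CACHE_DOC_THRESHOLD;
}
```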
Document the fieldnorm storage strategy for V2 block storage:

- Uncompressed format stores fieldnorm inline in TpBlockPosting
  (uses existing padding bytes, zero size overhead)
- Documents the observed buffer manager overhead (~300-500 ns per
  lookup) that motivated this change
- Compressed formats will use separate fieldnorm table with caching
  (inline would cause 40-80% size inflation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Eliminates per-posting fieldnorm lookups during query execution by
storing the encoded fieldnorm directly in each TpBlockPosting entry.
This uses existing padding bytes, so there's zero storage overhead.

Changes:
- TpBlockPosting: uint8 fieldnorm replaces part of reserved padding
- V2 segment writer: populates fieldnorm from docmap during write
- V2 segment reader: reads fieldnorm directly from posting
- Segment merge: reads/writes inline fieldnorm during merge
- Removed cached_fieldnorms from TpSegmentReader (no longer needed)

The separate fieldnorm table is still written for backward compatibility
and potential future use with compressed formats.

Adds a separate 'profile' job that runs on manual dispatch:
- Uses perf to capture CPU profiles during index creation
- Generates flamegraph.svg for visual analysis
- Uses MS MARCO 100K subset for manageable profile size
- Uploads profile artifacts for download
- Shows top functions in job summary

The profile job runs in parallel with the main benchmark job
but produces different artifacts focused on identifying
performance bottlenecks rather than timing measurements.
- Add enable_profile input (default: false) to opt-in to profiling
- Use selected msmarco_size for profiling (not hardcoded 100K)
- Increase timeout to 2 hours for full dataset profiling
- Increase perf recording duration to 60 minutes
- Increase memory settings for larger datasets
- Add disk space cleanup for full MS MARCO download
- Show dataset size in profile summary

Profile job now captures two separate flamegraphs:
- index_flamegraph.svg: CPU profile during index creation
- query_flamegraph.svg: CPU profile during query execution

Query profiling runs 300 iterations of short/medium/long queries
at 499 Hz sampling to capture meaningful query-path bottlenecks.

Artifacts uploaded:
- index_flamegraph.svg, query_flamegraph.svg (interactive SVGs)
- index_profile.txt, query_profile.txt (perf reports)
- index_perf.folded, query_perf.folded (raw stacks)
- Use head -n 75 of load.sql to get proper data loading with
  awk-based CSV conversion (handles special characters)
- Fix column name: passage_text (not passage)
- Fixes 'end-of-copy marker corrupt' error from raw TSV loading

Flamegraphs are now published to gh-pages at:
  /benchmarks/profiles/<branch>/<commit-sha>/

Each profile directory contains:
- index_flamegraph.svg (interactive)
- query_flamegraph.svg (interactive)
- index_profile.txt
- query_profile.txt
- index.html (links to all files)

A master index at /benchmarks/profiles/ lists all profiles
organized by branch and commit.

The job summary now includes a direct link to the published
flamegraphs on GitHub Pages.

Clicking a data point now shows a popup menu with:
- View Commit (GitHub)
- Profile (main) - link to main branch profile
- Profile (block-storage-phase1) - link to dev branch profile
- Browse all profiles - link to profiles index

If a profile doesn't exist for that commit, user gets a 404
which is expected (not all commits have profiles).

1. Fix doc length truncation: Change TpDocMapEntry.doc_length from
   uint16 to uint32 to preserve precision when decoding/re-encoding
   fieldnorms during segment merge. Documents with >65535 terms
   previously had their lengths truncated.

2. Add warnings for missing CTID in docmap: When tp_docmap_lookup
   returns UINT32_MAX (CTID not found), log a WARNING before falling
   back to doc_id 0. This should never happen in practice but if it
   does, the warning makes debugging easier.

3. Eliminate double hash lookups in segment writing: Restructure
   tp_write_segment_v2 and write_merged_segment to build block_postings
   once (including fieldnorm lookup) and reuse for both skip entry
   computation and posting block writing. This roughly halves the
   docmap hash lookups during segment creation.

4. Fix block_max_norm computation in merge: Previously hardcoded to 0,
   now properly computed from fieldnorms in the block.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

A missing CTID in the docmap indicates a data integrity bug that should
not be silently handled. Fail fast with a clear error message instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The formula-based encoder produced indices that didn't match the
Tantivy-derived decode table, causing massive BM25 scoring errors
(e.g., 1438% error for 1000-word documents). Replace with binary
search encoder that correctly finds the bucket for each length.
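The corrected encoder can be sketched as a binary search over a monotone decode table. The toy 8-entry table below stands in for the 256-entry Tantivy-derived SmallFloat table; the invariant is that encode(len) returns the largest id whose decoded length does not exceed len, so decode(encode(len)) <= len always holds.

```c
#include <stdint.h>

/* Toy stand-in for the 256-entry Tantivy-derived decode table. */
#define TABLE_SIZE 8
static const uint32_t decode_table[TABLE_SIZE] = {
    0, 1, 2, 4, 8, 16, 32, 64
};

/* Binary search for the last table entry <= length. Unlike a closed-form
 * formula, this cannot drift out of sync with the decode table. */
static uint8_t fieldnorm_encode(uint32_t length)
{
    int lo = 0, hi = TABLE_SIZE - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (decode_table[mid] <= length)
            lo = mid;      /* mid still a candidate, search right */
        else
            hi = mid - 1;  /* mid too large, search left */
    }
    return (uint8_t) lo;
}

static uint32_t fieldnorm_decode(uint8_t id)
{
    return decode_table[id];
}
```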
tjgreen42 and others added 12 commits December 18, 2025 23:13
- Document doc_length behavior when CTID already exists in docmap
- Add UINT32_MAX guard to prevent doc_id overflow (reserved as sentinel)
- Add finalized flag with assertions in getter functions
- Document return value of posting_source_load_block
- Clarify block buffer reallocation comment

During segment merge, each posting required a hash lookup to convert
CTID → new_doc_id. This showed up as ~10% of CPU time in profiling.

Replace with O(1) array lookups by building per-source mapping arrays
(old_doc_id → new_doc_id) during docmap construction. Also eliminate
unnecessary fieldnorm decode/re-encode by passing through the raw byte.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
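The array-based remapping can be sketched as below. All names (SourceRemap, remap_build, remap_doc_id) are hypothetical, and contiguous new-id assignment is an assumption for the sketch; in the real merge, new ids are assigned as each source's docmap entries are appended.

```c
#include <stdint.h>
#include <stdlib.h>

/* One remap array per merge source: old (source-local) doc_id indexes
 * directly into new_ids, replacing a per-posting CTID hash lookup. */
typedef struct {
    uint32_t *new_ids;   /* indexed by old doc_id */
    uint32_t  doc_count;
} SourceRemap;

static SourceRemap remap_build(uint32_t doc_count, uint32_t first_new_id)
{
    SourceRemap r;
    r.doc_count = doc_count;
    r.new_ids = malloc(doc_count * sizeof(uint32_t));
    /* Assumes contiguous id assignment for this sketch. */
    for (uint32_t i = 0; i < doc_count; i++)
        r.new_ids[i] = first_new_id + i;
    return r;
}

static uint32_t remap_doc_id(const SourceRemap *r, uint32_t old_id)
{
    return r->new_ids[old_id];   /* O(1), no hash traversal */
}
```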
- Show MS-MARCO benchmarks first (most interesting at scale)
- Filter out old metrics without "(X docs)" format
- Explicit ordering instead of relying on object key order

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace hash table lookups with binary search on a sorted array for
CTID → doc_id conversion in tp_write_segment_v2(). The previous
commit optimized the merge path; this optimizes the initial segment
write path.

Changes:
- Add TpCtidLookupEntry struct and ctid_sorted array to TpDocMapBuilder
- Build sorted CTID array in tp_docmap_finalize() for binary search
- Add tp_docmap_lookup_fast() using bsearch on the sorted array
- Use fast lookup in segment write inner loop

Profile showed tp_docmap_lookup at 4.31% of CPU during index build.
Binary search on contiguous sorted array has better cache locality
than hash table traversal.

- Update version numbers to match reality (v0.1.x not v0.0.x)
- Mark completed Phase 1 items: block storage, skip index, docmap, fieldnorms
- Add index build optimizations (direct mapping, binary search)
- Update ROADMAP.md with v0.1.1-dev highlights

- Skip 0.1.1-dev, go directly from 0.1.0 to 0.2.0-dev
- Update roadmaps with bigger version jumps: 0.2.0 -> 0.5.0 -> 1.0.0
- Block storage foundation (current) targets v0.2.0
- Query optimizations (BMW/MAXSCORE) target v0.5.0
- Production ready GA targets v1.0.0 (Feb 2025)
tjgreen42 and others added 7 commits December 19, 2025 18:15
The binary search optimization intended to eliminate hash lookup overhead
during segment writing actually caused an 8% slowdown in index build time
(671s -> 726s for MS MARCO 8.8M docs).

Root cause: For high-volume lookups (8.8M docs * 30 terms = 264M lookups),
O(log n) binary search (23 comparisons per lookup) is slower than O(1)
hash lookup, even with hash overhead.

This reverts the sorted CTID array and tp_docmap_lookup_fast() function,
keeping the simpler and faster hash-based tp_docmap_lookup().

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Change format from [dict] -> [skip index] -> [postings] to
[dict] -> [postings] -> [skip index]. This enables single-pass
streaming writes instead of 9 passes during segment creation.

The new format writes postings immediately after dict entries,
accumulating skip index entries in memory, then writes the skip
index at the end. Dict entries are updated with correct offsets
after all postings are written.

This addresses ~30% performance regression seen in segment merge
benchmarks by eliminating multiple passes over the data.

The V2 segment format's dict entry update code was writing 12-byte
TpDictEntryV2 entries without handling the case where an entry spans
two pages. When page_offset > 8156 (SEGMENT_DATA_PER_PAGE - sizeof(TpDictEntryV2)),
the write would overflow into the next page's header, corrupting the
LSN field and causing 'xlog flush request not satisfied' errors.

Fixed by detecting when an entry would cross a page boundary and
splitting the write across two pages. Applied to both tp_write_segment_v2()
and the merge segment creation code.
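The boundary-safe write can be sketched as follows, assuming an 8168-byte usable payload per page (consistent with the 8156 = SEGMENT_DATA_PER_PAGE - sizeof arithmetic above). write_spanning and the flat page array are illustrative, not the extension's actual buffer API.

```c
#include <stdint.h>
#include <string.h>

#define DATA_PER_PAGE 8168   /* usable payload bytes per page (illustrative) */

/* Copy len bytes starting at (page, page_offset), splitting across the
 * page boundary when the record would otherwise overflow into the next
 * page's header. */
static void write_spanning(uint8_t pages[][DATA_PER_PAGE],
                           uint32_t page, uint32_t page_offset,
                           const void *src, uint32_t len)
{
    if (page_offset + len <= DATA_PER_PAGE) {
        memcpy(&pages[page][page_offset], src, len);     /* fits on one page */
    } else {
        uint32_t first = DATA_PER_PAGE - page_offset;    /* bytes left on this page */
        memcpy(&pages[page][page_offset], src, first);
        memcpy(&pages[page + 1][0],
               (const uint8_t *) src + first, len - first);
    }
}
```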
Change the default memtable_spill_threshold from 800K to 35M posting
entries. This creates segments with ~1M documents each, similar to
Tantivy's default segment size, improving query performance by
reducing the number of segments to search.

- Default memtable spill threshold now 32M posting entries (~1M docs/segment)
- Benchmark workflow uses segments_per_level=16 for fewer merges
- Add segment statistics collection to MS MARCO and Wikipedia benchmarks
tjgreen42 merged commit 15b8df3 into main Dec 22, 2025
13 checks passed
tjgreen42 deleted the block-storage-phase1 branch December 22, 2025 08:47
