Implement V2 segment format with block storage for BMW optimization#81
Merged
Conversation
This implements Phase 1 of the block storage optimization plan.

V2 segment format changes:
- Block-based posting storage: 128 docs per block with TpBlockPosting (8 bytes)
- Skip index with TpSkipEntry (16 bytes) enabling BMW block skipping
- Fieldnorm table using Lucene SmallFloat encoding (1 byte per doc)
- CTID map for segment-local doc_id → heap tuple lookup
- TpDictEntryV2 (12 bytes) with skip_index_offset + block_count

New infrastructure:
- docmap.c/h: document ID mapping with CTID hash table
- fieldnorm.h: Lucene SmallFloat encode/decode with precomputed table
- segment_query.c: V2 query iterator with block-based iteration

Key changes:
- All segment writers switched from V1 to V2 (index build, spill, merge)
- Merge completely rewritten to read V2 sources and write V2 output
- Dump function updated to display V2 format details
- Metapage version bumped to 5 (breaking change)

The V2 format reduces storage (posting: 14 → 8 bytes) and enables future BMW optimization by providing block-level statistics for early termination during top-k queries. V1 code is retained but unused; it will be removed in a future cleanup.
V2 segment queries were slow because the iterator performed per-posting I/O to look up CTIDs and fieldnorms. This change preloads both tables into memory when opening a V2 segment reader.

Performance improvement on a 50K-document benchmark:
- Common-term queries: 2.1-2.6x faster
- Buffer hits reduced 23-34x (e.g., 30K → 900)

Memory overhead per segment reader:
- Fieldnorm: 1 byte per document
- CTID map: 6 bytes per document
- Total: ~7 bytes per document (e.g., 700KB for 100K docs)

The caching remains beneficial with BMW optimization, since we still need CTIDs for result output and fieldnorms for scoring on blocks that aren't skipped.
The unconditional caching caused a severe regression on large segments (MS MARCO, 8.8M docs, was 5-7x slower) because:
- Loading 60MB of cache data upfront is expensive
- Top-k queries only access a small fraction of documents
- PostgreSQL's buffer cache efficiently handles sparse access patterns

With a 100K-document threshold:
- Small segments (flush, early compaction): cached for fast iteration
- Large segments (late compaction): use per-posting reads via the buffer cache
Document the fieldnorm storage strategy for V2 block storage:
- The uncompressed format stores the fieldnorm inline in TpBlockPosting (using existing padding bytes, with zero size overhead)
- Documents the observed buffer-manager overhead (~300-500ns per lookup) that motivated this change
- Compressed formats will use a separate fieldnorm table with caching (inline storage would cause 40-80% size inflation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Eliminates per-posting fieldnorm lookups during query execution by storing the encoded fieldnorm directly in each TpBlockPosting entry. This uses existing padding bytes, so there is zero storage overhead.

Changes:
- TpBlockPosting: uint8 fieldnorm replaces part of the reserved padding
- V2 segment writer: populates the fieldnorm from the docmap during write
- V2 segment reader: reads the fieldnorm directly from the posting
- Segment merge: reads/writes the inline fieldnorm during merge
- Removed cached_fieldnorms from TpSegmentReader (no longer needed)

The separate fieldnorm table is still written for backward compatibility and potential future use with compressed formats.
Adds a separate 'profile' job that runs on manual dispatch:
- Uses perf to capture CPU profiles during index creation
- Generates flamegraph.svg for visual analysis
- Uses the MS MARCO 100K subset for a manageable profile size
- Uploads profile artifacts for download
- Shows top functions in the job summary

The profile job runs in parallel with the main benchmark job but produces different artifacts, focused on identifying performance bottlenecks rather than timing measurements.
- Add enable_profile input (default: false) to opt in to profiling
- Use the selected msmarco_size for profiling (not hardcoded 100K)
- Increase timeout to 2 hours for full-dataset profiling
- Increase perf recording duration to 60 minutes
- Increase memory settings for larger datasets
- Add disk space cleanup for the full MS MARCO download
- Show dataset size in the profile summary
The profile job now captures two separate flamegraphs:
- index_flamegraph.svg: CPU profile during index creation
- query_flamegraph.svg: CPU profile during query execution

Query profiling runs 300 iterations of short/medium/long queries at 499 Hz sampling to capture meaningful query-path bottlenecks.

Artifacts uploaded:
- index_flamegraph.svg, query_flamegraph.svg (interactive SVGs)
- index_profile.txt, query_profile.txt (perf reports)
- index_perf.folded, query_perf.folded (raw stacks)
- Use head -n 75 of load.sql to get proper data loading with awk-based CSV conversion (handles special characters)
- Fix column name: passage_text (not passage)
- Fixes the 'end-of-copy marker corrupt' error from raw TSV loading
Flamegraphs are now published to gh-pages at /benchmarks/profiles/<branch>/<commit-sha>/.

Each profile directory contains:
- index_flamegraph.svg (interactive)
- query_flamegraph.svg (interactive)
- index_profile.txt
- query_profile.txt
- index.html (links to all files)

A master index at /benchmarks/profiles/ lists all profiles organized by branch and commit. The job summary now includes a direct link to the published flamegraphs on GitHub Pages.
Clicking a data point now shows a popup menu with:
- View Commit (GitHub)
- Profile (main): link to the main branch profile
- Profile (block-storage-phase1): link to the dev branch profile
- Browse all profiles: link to the profiles index

If a profile doesn't exist for a commit, the user gets a 404, which is expected (not all commits have profiles).
1. Fix doc length truncation: change TpDocMapEntry.doc_length from uint16 to uint32 to preserve precision when decoding/re-encoding fieldnorms during segment merge. Documents with more than 65535 terms were previously losing precision.

2. Add warnings for missing CTIDs in the docmap: when tp_docmap_lookup returns UINT32_MAX (CTID not found), log a WARNING before falling back to doc_id 0. This should never happen in practice, but if it does, the warning makes debugging easier.

3. Eliminate double hash lookups in segment writing: restructure tp_write_segment_v2 and write_merged_segment to build block_postings once (including the fieldnorm lookup) and reuse it for both skip-entry computation and posting-block writing. This roughly halves the docmap hash lookups during segment creation.

4. Fix block_max_norm computation in merge: previously hardcoded to 0, now properly computed from the fieldnorms in the block.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
A missing CTID in the docmap indicates a data-integrity bug that should not be silently handled. Fail fast with a clear error message instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The formula-based encoder produced indices that didn't match the Tantivy-derived decode table, causing large BM25 scoring errors (e.g., a 1438% error for 1000-word documents). Replace it with a binary-search encoder that correctly finds the bucket for each length.
tjgreen42 commented on Dec 19, 2025
- Document doc_length behavior when the CTID already exists in the docmap
- Add a UINT32_MAX guard to prevent doc_id overflow (reserved as a sentinel)
- Add a finalized flag with assertions in getter functions
- Document the return value of posting_source_load_block
- Clarify the block-buffer reallocation comment
During segment merge, each posting required a hash lookup to convert a CTID to a new_doc_id. This showed up as ~10% of CPU time in profiling.

Replace it with O(1) array lookups by building per-source mapping arrays (old_doc_id → new_doc_id) during docmap construction. Also eliminate an unnecessary fieldnorm decode/re-encode by passing through the raw byte.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Show MS MARCO benchmarks first (most interesting at scale)
- Filter out old metrics without the "(X docs)" format
- Use explicit ordering instead of relying on object key order

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace hash-table lookups with a binary search on a sorted array for CTID → doc_id conversion in tp_write_segment_v2(). The previous commit optimized the merge path; this optimizes the initial segment-write path.

Changes:
- Add a TpCtidLookupEntry struct and a ctid_sorted array to TpDocMapBuilder
- Build the sorted CTID array in tp_docmap_finalize() for binary search
- Add tp_docmap_lookup_fast() using bsearch on the sorted array
- Use the fast lookup in the segment-write inner loop

Profiling showed tp_docmap_lookup at 4.31% of CPU during index build. Binary search on a contiguous sorted array has better cache locality than hash-table traversal.
- Update version numbers to match reality (v0.1.x, not v0.0.x)
- Mark completed Phase 1 items: block storage, skip index, docmap, fieldnorms
- Add index build optimizations (direct mapping, binary search)
- Update ROADMAP.md with v0.1.1-dev highlights
- Skip 0.1.1-dev; go directly from 0.1.0 to 0.2.0-dev
- Update roadmaps with bigger version jumps: 0.2.0 → 0.5.0 → 1.0.0
- Block storage foundation (current) targets v0.2.0
- Query optimizations (BMW/MAXSCORE) target v0.5.0
- Production-ready GA targets v1.0.0 (Feb 2025)
The binary-search optimization intended to eliminate hash-lookup overhead during segment writing actually caused an 8% slowdown in index build time (671s → 726s for MS MARCO, 8.8M docs).

Root cause: for high-volume lookups (8.8M docs × 30 terms = 264M lookups), an O(log n) binary search (23 comparisons per lookup) is slower than an O(1) hash lookup, even accounting for hashing overhead.

This reverts the sorted CTID array and the tp_docmap_lookup_fast() function, keeping the simpler and faster hash-based tp_docmap_lookup().

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change the format from [dict] → [skip index] → [postings] to [dict] → [postings] → [skip index]. This enables single-pass streaming writes instead of 9 passes during segment creation.

The new format writes postings immediately after the dict entries, accumulating skip-index entries in memory, then writes the skip index at the end. Dict entries are updated with the correct offsets after all postings are written.

This addresses the ~30% performance regression seen in segment-merge benchmarks by eliminating multiple passes over the data.
The V2 segment format's dict-entry update code was writing 12-byte TpDictEntryV2 entries without handling the case where an entry spans two pages. When page_offset > 8156 (SEGMENT_DATA_PER_PAGE - sizeof(TpDictEntryV2)), the write would overflow into the next page's header, corrupting the LSN field and causing 'xlog flush request not satisfied' errors.

Fixed by detecting when an entry would cross a page boundary and splitting the write across two pages. Applied to both tp_write_segment_v2() and the merge segment-creation code.
Change the default memtable_spill_threshold from 800K to 35M posting entries. This creates segments with ~1M documents each, similar to Tantivy's default segment size, improving query performance by reducing the number of segments to search.
- Default memtable spill threshold is now 32M posting entries (~1M docs/segment)
- Benchmark workflow uses segments_per_level=16 for fewer merges
- Add segment-statistics collection to MS MARCO and Wikipedia benchmarks
Summary

Compatibility
- Metapage version bumped to 5 (header->version)

Key Changes
- docmap.c/h (CTID mapping), fieldnorm.h (length quantization)

Format Details
- TpBlockPosting (8 bytes): doc_id + frequency (down from the 14-byte TpSegmentPosting)
- TpSkipEntry (16 bytes): per-block statistics for BMW block skipping
- TpDictEntryV2 (12 bytes): skip_index_offset + block_count + doc_freq

Configuration Changes
- segments_per_level = 16 for fewer merges during benchmarking
- bm25_summarize_index() calls added to benchmark load scripts

Testing