
Implement V2 segment format with block storage for BMW optimization#81

Merged
tjgreen42 merged 36 commits into main from block-storage-phase1
Dec 22, 2025

Conversation


tjgreen42 commented Dec 17, 2025

Summary

  • Implements Phase 1 of block storage optimization with V2 segment format
  • Block-based posting storage (128 docs/block) with skip index for future BMW optimization
  • Fieldnorm table using Lucene SmallFloat encoding (1 byte per doc)
  • Segment-local document IDs with CTID map for heap tuple lookup
  • All segment writes switched to V2 format (breaking change, metapage version 5)

Compatibility

  • Read: Supports both V1 and V2 segments (query code branches on header->version)
  • Write: V2 only - all new segments use V2 format
  • Upgrade path: Existing indexes must be rebuilt (standard for pre-1.0 releases)

Key Changes

  • New files: docmap.c/h (CTID mapping), fieldnorm.h (length quantization)
  • segment_merge.c: Complete rewrite to read V2 sources and write V2 output
  • segment_query.c: V2 query iterator with block-based iteration
  • segment.c: V2 writer and dump function updates

Format Details

  • TpBlockPosting (8 bytes): doc_id + frequency (down from 14-byte TpSegmentPosting)
  • TpSkipEntry (16 bytes): per-block statistics for BMW block skipping
  • TpDictEntryV2 (12 bytes): skip_index_offset + block_count + doc_freq
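These records could look roughly like the following C structs. Field names and ordering are assumptions based on the sizes quoted above, not the extension's actual definitions; the inline fieldnorm byte in TpBlockPosting reflects a later commit in this PR.

```c
#include <stdint.h>

/* Illustrative layouts matching the sizes quoted above. */

typedef struct {
    uint32_t doc_id;     /* segment-local document id */
    uint16_t frequency;  /* term frequency */
    uint8_t  fieldnorm;  /* inline encoded length (added later in this PR) */
    uint8_t  reserved;
} TpBlockPosting;        /* 8 bytes, down from the 14-byte TpSegmentPosting */

typedef struct {
    uint32_t first_doc_id;   /* lowest doc_id in the 128-doc block */
    uint32_t last_doc_id;    /* highest doc_id in the block */
    uint32_t posting_offset; /* where the block's postings begin */
    uint8_t  block_max_norm; /* per-block stats for BMW skipping */
    uint8_t  pad[3];
} TpSkipEntry;               /* 16 bytes */

typedef struct {
    uint32_t skip_index_offset;
    uint32_t block_count;
    uint32_t doc_freq;
} TpDictEntryV2;             /* 12 bytes */
```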

Configuration Changes

  • Memtable spill threshold: Bumped from 800K to 32M posting entries (~1M docs/segment)
  • Benchmark config: segments_per_level = 16 for fewer merges during benchmarking
  • Segment statistics: Added bm25_summarize_index() calls to benchmark load scripts

Testing

  • All 29 regression tests pass
  • Merge, segment, and scoring tests validate V2 format correctness

tjgreen42 and others added 17 commits December 17, 2025 13:26
This implements Phase 1 of the block storage optimization plan:

V2 segment format changes:
- Block-based posting storage: 128 docs per block with TpBlockPosting (8 bytes)
- Skip index with TpSkipEntry (16 bytes) enabling BMW block skipping
- Fieldnorm table using Lucene SmallFloat encoding (1 byte per doc)
- CTID map for segment-local doc_id → heap tuple lookup
- TpDictEntryV2 (12 bytes) with skip_index_offset + block_count

New infrastructure:
- docmap.c/h: Document ID mapping with CTID hash table
- fieldnorm.h: Lucene SmallFloat encode/decode with precomputed table
- segment_query.c: V2 query iterator with block-based iteration

Key changes:
- All segment writers switched from V1 to V2 (index build, spill, merge)
- Merge completely rewritten to read V2 sources and write V2 output
- Dump function updated to display V2 format details
- Metapage version bumped to 5 (breaking change)

The V2 format reduces storage (posting: 14→8 bytes) and enables
future BMW optimization by providing block-level statistics for
early termination during top-k queries.

V1 code retained but unused - will be removed in future cleanup.
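The early termination that the skip index enables can be sketched as below. SkipEntry is a simplified stand-in for the 16-byte TpSkipEntry, and block_can_skip is a hypothetical helper, not code from this PR: a block is skipped when its best possible score cannot beat the current top-k threshold.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 128  /* docs per block, as in the V2 format */

/* Simplified stand-in for TpSkipEntry's per-block statistics. */
typedef struct {
    uint32_t last_doc_id;     /* highest doc_id in the block */
    float    block_max_score; /* upper bound on any score in the block */
} SkipEntry;

/* BMW-style skip test: if even the block's best score cannot enter
 * the current top-k, the whole 128-doc block needs no I/O or scoring. */
static bool block_can_skip(const SkipEntry *e, float threshold)
{
    return e->block_max_score <= threshold;
}
```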
V2 segment queries were slow because the iterator performed per-posting
I/O to look up CTIDs and fieldnorms. This change preloads both tables
into memory when opening a V2 segment reader.

Performance improvement on 50K document benchmark:
- Common term queries: 2.1-2.6x faster
- Buffer hits reduced by 23-34x (e.g., 30K → 900)

Memory overhead per segment reader:
- Fieldnorm: 1 byte per document
- CTID map: 6 bytes per document
- Total: ~7 bytes per document (e.g., 700KB for 100K docs)

The caching remains beneficial with BMW optimization since we still
need CTIDs for result output and fieldnorms for scoring on blocks
that aren't skipped.

The unconditional caching caused a severe regression on large segments
(MS MARCO 8.8M docs was 5-7x slower) because:
- Loading 60MB of cache data upfront is expensive
- Top-k queries only access a small fraction of documents
- PostgreSQL's buffer cache efficiently handles sparse access patterns

With 100K threshold:
- Small segments (flush, early compaction): cached for fast iteration
- Large segments (late compaction): use per-posting reads via buffer cache
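A minimal sketch of the size-gated decision described above. The 100K constant comes from this commit; the function and field names are illustrative, not the extension's actual identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_DOC_THRESHOLD 100000  /* 100K docs, per this commit */

/* Small segments: preload CTID map + fieldnorms (~7 bytes/doc) for
 * fast iteration. Large segments: fall back to per-posting reads via
 * PostgreSQL's buffer cache, since top-k queries touch only a sparse
 * subset of documents. */
static bool should_cache_tables(uint32_t doc_count)
{
    return doc_count <= CACHE_DOC_THRESHOLD;
}
```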
Document the fieldnorm storage strategy for V2 block storage:

- Uncompressed format stores fieldnorm inline in TpBlockPosting
  (uses existing padding bytes, zero size overhead)
- Documents the observed buffer manager overhead (~300-500 ns per
  lookup) that motivated this change
- Compressed formats will use separate fieldnorm table with caching
  (inline would cause 40-80% size inflation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Eliminates per-posting fieldnorm lookups during query execution by
storing the encoded fieldnorm directly in each TpBlockPosting entry.
This uses existing padding bytes, so there's zero storage overhead.

Changes:
- TpBlockPosting: uint8 fieldnorm replaces part of reserved padding
- V2 segment writer: populates fieldnorm from docmap during write
- V2 segment reader: reads fieldnorm directly from posting
- Segment merge: reads/writes inline fieldnorm during merge
- Removed cached_fieldnorms from TpSegmentReader (no longer needed)

The separate fieldnorm table is still written for backward compatibility
and potential future use with compressed formats.

Adds a separate 'profile' job that runs on manual dispatch:
- Uses perf to capture CPU profiles during index creation
- Generates flamegraph.svg for visual analysis
- Uses MS MARCO 100K subset for manageable profile size
- Uploads profile artifacts for download
- Shows top functions in job summary

The profile job runs in parallel with the main benchmark job
but produces different artifacts focused on identifying
performance bottlenecks rather than timing measurements.
- Add enable_profile input (default: false) to opt-in to profiling
- Use selected msmarco_size for profiling (not hardcoded 100K)
- Increase timeout to 2 hours for full dataset profiling
- Increase perf recording duration to 60 minutes
- Increase memory settings for larger datasets
- Add disk space cleanup for full MS MARCO download
- Show dataset size in profile summary

Profile job now captures two separate flamegraphs:
- index_flamegraph.svg: CPU profile during index creation
- query_flamegraph.svg: CPU profile during query execution

Query profiling runs 300 iterations of short/medium/long queries
at 499 Hz sampling to capture meaningful query-path bottlenecks.

Artifacts uploaded:
- index_flamegraph.svg, query_flamegraph.svg (interactive SVGs)
- index_profile.txt, query_profile.txt (perf reports)
- index_perf.folded, query_perf.folded (raw stacks)
- Use head -n 75 of load.sql to get proper data loading with
  awk-based CSV conversion (handles special characters)
- Fix column name: passage_text (not passage)
- Fixes 'end-of-copy marker corrupt' error from raw TSV loading

Flamegraphs are now published to gh-pages at:
  /benchmarks/profiles/<branch>/<commit-sha>/

Each profile directory contains:
- index_flamegraph.svg (interactive)
- query_flamegraph.svg (interactive)
- index_profile.txt
- query_profile.txt
- index.html (links to all files)

A master index at /benchmarks/profiles/ lists all profiles
organized by branch and commit.

The job summary now includes a direct link to the published
flamegraphs on GitHub Pages.

Clicking a data point now shows a popup menu with:
- View Commit (GitHub)
- Profile (main) - link to main branch profile
- Profile (block-storage-phase1) - link to dev branch profile
- Browse all profiles - link to profiles index

If a profile doesn't exist for that commit, user gets a 404
which is expected (not all commits have profiles).

1. Fix doc length truncation: Change TpDocMapEntry.doc_length from
   uint16 to uint32 to preserve precision when decoding/re-encoding
   fieldnorms during segment merge. Documents with >65535 terms
   previously had their lengths truncated.

2. Add warnings for missing CTID in docmap: When tp_docmap_lookup
   returns UINT32_MAX (CTID not found), log a WARNING before falling
   back to doc_id 0. This should never happen in practice but if it
   does, the warning makes debugging easier.

3. Eliminate double hash lookups in segment writing: Restructure
   tp_write_segment_v2 and write_merged_segment to build block_postings
   once (including fieldnorm lookup) and reuse for both skip entry
   computation and posting block writing. This roughly halves the
   docmap hash lookups during segment creation.

4. Fix block_max_norm computation in merge: Previously hardcoded to 0,
   now properly computed from fieldnorms in the block.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

A missing CTID in the docmap indicates a data integrity bug that should
not be silently handled. Fail fast with a clear error message instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The formula-based encoder produced indices that didn't match the
Tantivy-derived decode table, causing massive BM25 scoring errors
(e.g., 1438% error for 1000-word documents). Replace with binary
search encoder that correctly finds the bucket for each length.
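The corrected encoder can be sketched as a binary search over a monotone decode table. The toy 8-entry table below stands in for the 256-entry Tantivy-derived SmallFloat table; the invariant is that encode(len) returns the largest id whose decoded length does not exceed len, so decode(encode(len)) <= len always holds.

```c
#include <stdint.h>

/* Toy stand-in for the 256-entry Tantivy-derived decode table. */
#define TABLE_SIZE 8
static const uint32_t decode_table[TABLE_SIZE] = {
    0, 1, 2, 4, 8, 16, 32, 64
};

/* Binary search for the last table entry <= length. Unlike a closed-form
 * formula, this cannot drift out of sync with the decode table. */
static uint8_t fieldnorm_encode(uint32_t length)
{
    int lo = 0, hi = TABLE_SIZE - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (decode_table[mid] <= length)
            lo = mid;      /* mid still a candidate, search right */
        else
            hi = mid - 1;  /* mid too large, search left */
    }
    return (uint8_t) lo;
}

static uint32_t fieldnorm_decode(uint8_t id)
{
    return decode_table[id];
}
```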
tjgreen42 and others added 12 commits December 18, 2025 23:13
- Document doc_length behavior when CTID already exists in docmap
- Add UINT32_MAX guard to prevent doc_id overflow (reserved as sentinel)
- Add finalized flag with assertions in getter functions
- Document return value of posting_source_load_block
- Clarify block buffer reallocation comment

During segment merge, each posting required a hash lookup to convert
CTID → new_doc_id. This showed up as ~10% of CPU time in profiling.

Replace with O(1) array lookups by building per-source mapping arrays
(old_doc_id → new_doc_id) during docmap construction. Also eliminate
unnecessary fieldnorm decode/re-encode by passing through the raw byte.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
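The array-based remapping can be sketched as below. All names (SourceRemap, remap_build, remap_doc_id) are hypothetical, and contiguous new-id assignment is an assumption for the sketch; in the real merge, new ids are assigned as each source's docmap entries are appended.

```c
#include <stdint.h>
#include <stdlib.h>

/* One remap array per merge source: old (source-local) doc_id indexes
 * directly into new_ids, replacing a per-posting CTID hash lookup. */
typedef struct {
    uint32_t *new_ids;   /* indexed by old doc_id */
    uint32_t  doc_count;
} SourceRemap;

static SourceRemap remap_build(uint32_t doc_count, uint32_t first_new_id)
{
    SourceRemap r;
    r.doc_count = doc_count;
    r.new_ids = malloc(doc_count * sizeof(uint32_t));
    /* Assumes contiguous id assignment for this sketch. */
    for (uint32_t i = 0; i < doc_count; i++)
        r.new_ids[i] = first_new_id + i;
    return r;
}

static uint32_t remap_doc_id(const SourceRemap *r, uint32_t old_id)
{
    return r->new_ids[old_id];   /* O(1), no hash traversal */
}
```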
- Show MS-MARCO benchmarks first (most interesting at scale)
- Filter out old metrics without "(X docs)" format
- Explicit ordering instead of relying on object key order

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace hash table lookups with binary search on a sorted array for
CTID → doc_id conversion in tp_write_segment_v2(). The previous
commit optimized the merge path; this optimizes the initial segment
write path.

Changes:
- Add TpCtidLookupEntry struct and ctid_sorted array to TpDocMapBuilder
- Build sorted CTID array in tp_docmap_finalize() for binary search
- Add tp_docmap_lookup_fast() using bsearch on the sorted array
- Use fast lookup in segment write inner loop

Profile showed tp_docmap_lookup at 4.31% of CPU during index build.
Binary search on contiguous sorted array has better cache locality
than hash table traversal.

- Update version numbers to match reality (v0.1.x not v0.0.x)
- Mark completed Phase 1 items: block storage, skip index, docmap, fieldnorms
- Add index build optimizations (direct mapping, binary search)
- Update ROADMAP.md with v0.1.1-dev highlights

- Skip 0.1.1-dev, go directly from 0.1.0 to 0.2.0-dev
- Update roadmaps with bigger version jumps: 0.2.0 -> 0.5.0 -> 1.0.0
- Block storage foundation (current) targets v0.2.0
- Query optimizations (BMW/MAXSCORE) target v0.5.0
- Production ready GA targets v1.0.0 (Feb 2025)
tjgreen42 and others added 7 commits December 19, 2025 18:15
The binary search optimization intended to eliminate hash lookup overhead
during segment writing actually caused an 8% slowdown in index build time
(671s -> 726s for MS MARCO 8.8M docs).

Root cause: For high-volume lookups (8.8M docs * 30 terms = 264M lookups),
O(log n) binary search (23 comparisons per lookup) is slower than O(1)
hash lookup, even with hash overhead.

This reverts the sorted CTID array and tp_docmap_lookup_fast() function,
keeping the simpler and faster hash-based tp_docmap_lookup().

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Change format from [dict] -> [skip index] -> [postings] to
[dict] -> [postings] -> [skip index]. This enables single-pass
streaming writes instead of 9 passes during segment creation.

The new format writes postings immediately after dict entries,
accumulating skip index entries in memory, then writes the skip
index at the end. Dict entries are updated with correct offsets
after all postings are written.

This addresses ~30% performance regression seen in segment merge
benchmarks by eliminating multiple passes over the data.

The V2 segment format's dict entry update code was writing 12-byte
TpDictEntryV2 entries without handling the case where an entry spans
two pages. When page_offset > 8156 (SEGMENT_DATA_PER_PAGE - sizeof(TpDictEntryV2)),
the write would overflow into the next page's header, corrupting the
LSN field and causing 'xlog flush request not satisfied' errors.

Fixed by detecting when an entry would cross a page boundary and
splitting the write across two pages. Applied to both tp_write_segment_v2()
and the merge segment creation code.
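The boundary-safe write can be sketched as follows, assuming an 8168-byte usable payload per page (consistent with the 8156 = SEGMENT_DATA_PER_PAGE - sizeof arithmetic above). write_spanning and the flat page array are illustrative, not the extension's actual buffer API.

```c
#include <stdint.h>
#include <string.h>

#define DATA_PER_PAGE 8168   /* usable payload bytes per page (illustrative) */

/* Copy len bytes starting at (page, page_offset), splitting across the
 * page boundary when the record would otherwise overflow into the next
 * page's header. */
static void write_spanning(uint8_t pages[][DATA_PER_PAGE],
                           uint32_t page, uint32_t page_offset,
                           const void *src, uint32_t len)
{
    if (page_offset + len <= DATA_PER_PAGE) {
        memcpy(&pages[page][page_offset], src, len);     /* fits on one page */
    } else {
        uint32_t first = DATA_PER_PAGE - page_offset;    /* bytes left on this page */
        memcpy(&pages[page][page_offset], src, first);
        memcpy(&pages[page + 1][0],
               (const uint8_t *) src + first, len - first);
    }
}
```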
Change the default memtable_spill_threshold from 800K to 35M posting
entries. This creates segments with ~1M documents each, similar to
Tantivy's default segment size, improving query performance by
reducing the number of segments to search.

- Default memtable spill threshold now 32M posting entries (~1M docs/segment)
- Benchmark workflow uses segments_per_level=16 for fewer merges
- Add segment statistics collection to MS MARCO and Wikipedia benchmarks
tjgreen42 merged commit 15b8df3 into main Dec 22, 2025
13 checks passed
tjgreen42 deleted the block-storage-phase1 branch December 22, 2025 08:47
