Conversation
The V2 segment format introduced a significant query performance regression
where segments were being opened/closed O(T × S) times for T query terms
and S segments.
Root cause: The scoring loop opened each segment once per term:
for each term:
for each segment:
open → lookup → close
Fix: Restructure to open each segment once for all terms:
for each segment:
open
for each term: lookup and score
close
This reduces segment opens from O(T × S) to O(S), which is critical
because tp_segment_open reads the segment header and entire page index
from disk.
New function tp_score_all_terms_in_segment_chain combines doc_freq
lookup and scoring into a single pass per segment.
- Split query benchmarks into two modes: - Without scores: ORDER BY only (index scan) - With scores: SELECT score column (per-row operator call) - Increase profiling iterations from 100 to 300 per query type - Add 600 iterations total (300 no-score + 300 with-score) This helps distinguish between index scan performance and per-row scoring operator performance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change V2 segment posting iterator to use tp_segment_get_direct for block data instead of tp_segment_read. This eliminates costly memcpy operations when block data fits within a single page (the common case). Key changes: - Add TpSegmentDirectAccess to V2 iterator for zero-copy block access - Try direct access first, fall back to copy when block spans pages - Properly release direct access when advancing blocks or finishing Profile showed V2 was spending ~27% CPU in tp_segment_read path (13.96% memcpy + 9.27% buffer manager + 3.5% read overhead) vs V1's ~2.5% using zero-copy access. This should significantly reduce that. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous commit added zero-copy for block loading, but CTID lookups were still using tp_segment_read for every posting (when segment has >100K docs). Profile showed this was still ~12% of CPU time. Now use tp_segment_get_direct for CTID lookups too, with fallback to copy only when CTID spans page boundary (rare - 6 byte entries). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Support both old (5 queries) and new (10 queries) format - Extract no-score and with-score metrics separately - Fix throughput extraction for new THROUGHPUT_RESULT_NO_SCORE format - Add with-score metrics to benchmark action output 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When loading a block of postings, batch-load all CTIDs for documents in that block. This reduces per-posting overhead during iteration, though the total buffer accesses remain similar due to scattered doc_ids in the CTID map. Performance on MS-MARCO 8.8M docs (warm cache, -O2): - V1: ~65ms for multi-term query - V2: ~125ms (still ~2x slower due to CTID indirection) The remaining gap is architectural: V1 stores CTIDs inline with postings, while V2 uses a separate CTID map requiring random access.
Replace block-level CTID caching with segment-level split arrays: - Store BlockNumber[] and OffsetNumber[] separately (6 bytes/doc total) - Load both arrays into memory when opening segment - Simple array indexing for CTID lookup during iteration - Excellent cache locality since posting lists are sorted by doc_id This eliminates per-posting buffer manager calls that caused V2 to be slower than V1. The split storage also provides better packing than the previous ItemPointerData[] approach.
UBSAN detected misaligned access when reading TpBlockPosting from page buffers via zero-copy direct access. TpBlockPosting contains a uint32 field requiring 4-byte alignment, but the data position within the page may not be properly aligned. Fix: Check alignment before using zero-copy access. If misaligned, fall back to copying the block data to a properly-aligned buffer. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace qsort with a custom partial quicksort that only sorts the top-k elements needed for results. This is O(n + k log k) instead of O(n log n), providing significant speedup when k << n. Key optimizations: - Partial quicksort only recurses into partitions containing top-k - Inlined score comparison eliminates function call overhead - Insertion sort for small subarrays (< 16 elements) - Median-of-three pivot selection for better partitioning Performance improvement: ~40% faster for multi-term queries on MS-MARCO (8.8M docs): 565ms -> 340ms for 5-term query. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When documents have equal BM25 scores, sort by CTID ascending to ensure deterministic, reproducible query results. This makes the partial quicksort stable for equal-scored documents. Update limits test expected output to reflect the new deterministic ordering. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Profile showed 70% of query time in hash operations (hash_search 38%, hash_seq_search 32%). New TpDocScoreTable uses: - Open-addressing with linear probing for better cache locality - 64-bit packed CTID keys for fast hashing - FNV-1a hash function - Simple array iteration instead of bucket traversal Performance improved from ~340ms to ~40-75ms for 5-term queries.
The FNV-1a hash was only processing bytes 0-5 (bits 0-47), missing the high 16 bits of block number (bits 48-63). For tables > 512MB, this would cause hash collisions for CTIDs differing only in those bits. Also fixed incorrect comment claiming 24-byte entry size (it's 16).
Add a planner hook that replaces resjunk ORDER BY score expressions with a cheap stub function that returns the cached score from the index scan. This avoids computing BM25 scores twice (once in the index scan, once in expression evaluation). - Add tp_cached_score infrastructure in index.c - Add bm25_get_current_score() stub function in query.c - Add planner hook to detect BM25 IndexScan and replace expressions - Only replace when BM25 IndexScan is present (safe for SeqScan fallback) Also add fieldnorm_discrepancy test documenting score corruption after L0->L1 merge (tracked in issue #93).
Revert commits 014207f and a422e57 that introduced a custom linear-probing hash table. Profiling showed tp_doc_score_table_insert taking ~50% of query time, and the custom FNV-1a hash was doing 8 multiplies per insert. PostgreSQL's built-in dynahash is well-tested and optimized. Stick with it unless we have compelling evidence it's a bottleneck.
The tp_cached_score_valid flag and related functions were never used. Simplify to just the cached score value.
Previously the planner hook only replaced resjunk ORDER BY expressions with the cached score stub. Now it also replaces matching expressions in the SELECT clause. For queries like: SELECT content <@> 'hello' AS score FROM t ORDER BY content <@> 'hello' Both the SELECT and ORDER BY expressions are replaced with bm25_get_current_score(), avoiding duplicate score computation.
BM25 queries should be driven by ORDER BY + LIMIT, not filtered with WHERE score < 0. The WHERE clause was pointless (matches nearly all rows since scores are non-positive) and just added overhead. Cleaned up benchmarks and tests to use the correct pattern: SELECT ... ORDER BY col <@> query LIMIT n 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds parallel benchmarks that include score in the SELECT clause: - Benchmark 1b: Query latency with score (5 query types × 10 iterations) - Benchmark 2b: Query throughput with score (20 queries) The extraction script already supported with-score metrics. Updated format_for_action.sh to include all with-score query latencies in the graphs. This lets us measure the impact of the stub optimization that replaces duplicate score expressions in SELECT with a cached value lookup. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Without LIMIT, PostgreSQL returns all rows sorted by score. The index only provides matching documents - non-matching ones get score 0 from standalone scoring. Adding LIMIT allows the index to satisfy the query entirely. Also added SET enable_seqscan = off for consistency with other tests. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When the SELECT clause uses ROUND((score)::numeric, 4) as score, ORDER BY score refers to the ROUND() output, not the raw operator. The index can't provide ordering by ROUND(score), only by the raw score expression. Solution: ORDER BY the raw expression while displaying ROUND() in SELECT. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous change incorrectly removed WHERE clauses from COUNT queries, replacing them with ORDER BY, which is invalid SQL (can't ORDER BY a column not in SELECT when using aggregates). Fix by using subquery pattern with ORDER BY + LIMIT, which uses the index. WHERE clauses are unnecessary since the index only returns matching documents. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
The test reveals that document id=1 from the first segment is lost during merge for 'database' and 'world' terms (though 'hello' still works). Updated comments to accurately describe this bug instead of the outdated comment about "Query code only searches L0 segments". 🤖 Generated with [Claude Code](https://claude.com/claude-code)
During N-way term merge, comparisons used min_term which pointed to freed memory after advancing sources. This caused documents with terms in multiple segments to be lost, corrupting corpus statistics and BM25 scores. Fix: Compare against current_merged->term (the pstrdup'd copy) instead. Fixes #93
Queries without LIMIT were scanning all rows and showing 0 scores for non-matching documents. Changed to use ORDER BY expression + LIMIT pattern which uses the BM25 index and only returns matching documents. Files fixed: - test/sql/partitioned.sql - test/sql/mixed.sql - test/sql/vector.sql
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Largely fixes query performance regression from V2 segment format.
Changes:
O(T×S) → O(S) segment opens - Open each segment once for all terms instead of once per term
Zero-copy block iteration - Use direct buffer access instead of copying ~1KB per block
Split CTID arrays - Separate block/offset arrays for cache-efficient doc_id → CTID lookups
Top-k partial sort - O(n + k log k) instead of O(n log n) for result extraction
Custom hash table - Replace dynahash with linear-probing hash table (was 70% of query time)
CTID tiebreaker - Deterministic ordering for equal scores
Fix use-after-free in segment merge - During N-way term merge, comparisons used
min_termwhich pointed to freed memory after advancing sources. This caused documents with terms in multiple segments to be lost, corrupting corpus statistics and BM25 scores (~29% off). Fixed by comparing against the pstrdup'd copy.MS MARCO (8.8M docs): Short 16ms, Medium 23ms, Long 29ms
Closes #86
Fixes #93