
Fix V2 segment query performance regression#87

Merged
tjgreen42 merged 27 commits into main from work2 on Dec 28, 2025

Conversation

tjgreen42 (Collaborator) commented Dec 22, 2025

Largely fixes the query performance regression from the V2 segment format.

Changes:

  1. O(T×S) → O(S) segment opens - Open each segment once for all terms instead of once per term

  2. Zero-copy block iteration - Use direct buffer access instead of copying ~1KB per block

  3. Split CTID arrays - Separate block/offset arrays for cache-efficient doc_id → CTID lookups

  4. Top-k partial sort - O(n + k log k) instead of O(n log n) for result extraction

  5. Custom hash table - Replace dynahash with linear-probing hash table (was 70% of query time)

  6. CTID tiebreaker - Deterministic ordering for equal scores

  7. Fix use-after-free in segment merge - During N-way term merge, comparisons used min_term which pointed to freed memory after advancing sources. This caused documents with terms in multiple segments to be lost, corrupting corpus statistics and BM25 scores (~29% off). Fixed by comparing against the pstrdup'd copy.

MS MARCO (8.8M docs): Short 16ms, Medium 23ms, Long 29ms

Closes #86
Fixes #93

tjgreen42 and others added 11 commits December 22, 2025 14:09
The V2 segment format introduced a significant query performance regression
where segments were being opened/closed O(T × S) times for T query terms
and S segments.

Root cause: The scoring loop opened each segment once per term:
  for each term:
    for each segment:
      open → lookup → close

Fix: Restructure to open each segment once for all terms:
  for each segment:
    open
    for each term: lookup and score
    close

This reduces segment opens from O(T × S) to O(S), which is critical
because tp_segment_open reads the segment header and entire page index
from disk.

New function tp_score_all_terms_in_segment_chain combines doc_freq
lookup and scoring into a single pass per segment.
- Split query benchmarks into two modes:
  - Without scores: ORDER BY only (index scan)
  - With scores: SELECT score column (per-row operator call)
- Increase profiling iterations from 100 to 300 per query type
- Add 600 iterations total (300 no-score + 300 with-score)

This helps distinguish between index scan performance and
per-row scoring operator performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change V2 segment posting iterator to use tp_segment_get_direct for
block data instead of tp_segment_read. This eliminates costly memcpy
operations when block data fits within a single page (the common case).

Key changes:
- Add TpSegmentDirectAccess to V2 iterator for zero-copy block access
- Try direct access first, fall back to copy when block spans pages
- Properly release direct access when advancing blocks or finishing

Profile showed V2 was spending ~27% CPU in tp_segment_read path
(13.96% memcpy + 9.27% buffer manager + 3.5% read overhead) vs V1's
~2.5% using zero-copy access. This should significantly reduce that.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous commit added zero-copy for block loading, but CTID lookups
were still using tp_segment_read for every posting (when segment has
>100K docs). Profile showed this was still ~12% of CPU time.

Now use tp_segment_get_direct for CTID lookups too, with fallback to
copy only when CTID spans page boundary (rare - 6 byte entries).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Support both old (5 queries) and new (10 queries) format
- Extract no-score and with-score metrics separately
- Fix throughput extraction for new THROUGHPUT_RESULT_NO_SCORE format
- Add with-score metrics to benchmark action output

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When loading a block of postings, batch-load all CTIDs for documents
in that block. This reduces per-posting overhead during iteration,
though the total buffer accesses remain similar due to scattered
doc_ids in the CTID map.

Performance on MS-MARCO 8.8M docs (warm cache, -O2):
- V1: ~65ms for multi-term query
- V2: ~125ms (still ~2x slower due to CTID indirection)

The remaining gap is architectural: V1 stores CTIDs inline with
postings, while V2 uses a separate CTID map requiring random access.

Replace block-level CTID caching with segment-level split arrays:
- Store BlockNumber[] and OffsetNumber[] separately (6 bytes/doc total)
- Load both arrays into memory when opening segment
- Simple array indexing for CTID lookup during iteration
- Excellent cache locality since posting lists are sorted by doc_id

This eliminates per-posting buffer manager calls that caused V2 to be
slower than V1. The split storage also provides better packing than
the previous ItemPointerData[] approach.
UBSAN detected misaligned access when reading TpBlockPosting from
page buffers via zero-copy direct access. TpBlockPosting contains
a uint32 field requiring 4-byte alignment, but the data position
within the page may not be properly aligned.

Fix: Check alignment before using zero-copy access. If misaligned,
fall back to copying the block data to a properly-aligned buffer.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace qsort with a custom partial quicksort that only sorts the
top-k elements needed for results. This is O(n + k log k) instead
of O(n log n), providing significant speedup when k << n.

Key optimizations:
- Partial quicksort only recurses into partitions containing top-k
- Inlined score comparison eliminates function call overhead
- Insertion sort for small subarrays (< 16 elements)
- Median-of-three pivot selection for better partitioning

Performance improvement: ~40% faster for multi-term queries on
MS-MARCO (8.8M docs): 565ms -> 340ms for 5-term query.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When documents have equal BM25 scores, sort by CTID ascending to ensure
deterministic, reproducible query results. This makes the partial
quicksort stable for equal-scored documents.

Update limits test expected output to reflect the new deterministic
ordering.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Profile showed 70% of query time in hash operations (hash_search 38%,
hash_seq_search 32%). New TpDocScoreTable uses:
- Open-addressing with linear probing for better cache locality
- 64-bit packed CTID keys for fast hashing
- FNV-1a hash function
- Simple array iteration instead of bucket traversal

Performance improved from ~340ms to ~40-75ms for 5-term queries.

The FNV-1a hash was only processing bytes 0-5 (bits 0-47), missing
the high 16 bits of block number (bits 48-63). For tables > 512MB,
this would cause hash collisions for CTIDs differing only in those
bits.

Also fixed incorrect comment claiming 24-byte entry size (it's 16).
@tjgreen42 tjgreen42 changed the title from "Fix O(terms × segments) query performance regression" to "Fix V2 segment query performance and optimize BM25 scoring" on Dec 24, 2025
tjgreen42 and others added 9 commits December 25, 2025 17:50
Add a planner hook that replaces resjunk ORDER BY score expressions
with a cheap stub function that returns the cached score from the
index scan. This avoids computing BM25 scores twice (once in the
index scan, once in expression evaluation).

- Add tp_cached_score infrastructure in index.c
- Add bm25_get_current_score() stub function in query.c
- Add planner hook to detect BM25 IndexScan and replace expressions
- Only replace when BM25 IndexScan is present (safe for SeqScan fallback)

Also add fieldnorm_discrepancy test documenting score corruption
after L0->L1 merge (tracked in issue #93).

Revert commits 014207f and a422e57 that introduced a custom
linear-probing hash table. Profiling showed tp_doc_score_table_insert
taking ~50% of query time, and the custom FNV-1a hash was doing
8 multiplies per insert.

PostgreSQL's built-in dynahash is well-tested and optimized.
Stick with it unless we have compelling evidence it's a bottleneck.

The tp_cached_score_valid flag and related functions were never used.
Simplify to just the cached score value.

Previously the planner hook only replaced resjunk ORDER BY expressions
with the cached score stub. Now it also replaces matching expressions
in the SELECT clause.

For queries like:
  SELECT content <@> 'hello' AS score FROM t ORDER BY content <@> 'hello'

Both the SELECT and ORDER BY expressions are replaced with
bm25_get_current_score(), avoiding duplicate score computation.
BM25 queries should be driven by ORDER BY + LIMIT, not filtered with
WHERE score < 0. The WHERE clause was pointless (matches nearly all
rows since scores are non-positive) and just added overhead.

Cleaned up benchmarks and tests to use the correct pattern:
  SELECT ... ORDER BY col <@> query LIMIT n

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds parallel benchmarks that include score in the SELECT clause:
- Benchmark 1b: Query latency with score (5 query types × 10 iterations)
- Benchmark 2b: Query throughput with score (20 queries)

The extraction script already supported with-score metrics. Updated
format_for_action.sh to include all with-score query latencies in
the graphs.

This lets us measure the impact of the stub optimization that replaces
duplicate score expressions in SELECT with a cached value lookup.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Without LIMIT, PostgreSQL returns all rows sorted by score. The index
only provides matching documents - non-matching ones get score 0 from
standalone scoring. Adding LIMIT allows the index to satisfy the query
entirely.

Also added SET enable_seqscan = off for consistency with other tests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When the SELECT clause uses ROUND((score)::numeric, 4) as score,
ORDER BY score refers to the ROUND() output, not the raw operator.
The index can't provide ordering by ROUND(score), only by the raw
score expression.

Solution: ORDER BY the raw expression while displaying ROUND() in SELECT.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous change incorrectly removed WHERE clauses from COUNT queries,
replacing them with ORDER BY, which is invalid SQL (can't ORDER BY a column
not in SELECT when using aggregates).

Fix by using subquery pattern with ORDER BY + LIMIT, which uses the index.
WHERE clauses are unnecessary since the index only returns matching documents.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
The test reveals that document id=1 from the first segment is lost during
merge for 'database' and 'world' terms (though 'hello' still works).
Updated comments to accurately describe this bug instead of the outdated
comment about "Query code only searches L0 segments".

🤖 Generated with [Claude Code](https://claude.com/claude-code)
During N-way term merge, comparisons used min_term which pointed to
freed memory after advancing sources. This caused documents with terms
in multiple segments to be lost, corrupting corpus statistics and
BM25 scores.

Fix: Compare against current_merged->term (the pstrdup'd copy) instead.

Fixes #93

Queries without LIMIT were scanning all rows and showing 0 scores
for non-matching documents. Changed to use ORDER BY expression + LIMIT
pattern which uses the BM25 index and only returns matching documents.

Files fixed:
- test/sql/partitioned.sql
- test/sql/mixed.sql
- test/sql/vector.sql
@tjgreen42 tjgreen42 changed the title from "Fix V2 segment query performance and optimize BM25 scoring" to "Fix V2 segment query performance regression" on Dec 28, 2025
@tjgreen42 tjgreen42 merged commit f3c6e14 into main Dec 28, 2025
11 checks passed
@tjgreen42 tjgreen42 deleted the work2 branch December 28, 2025 19:27

Development

Successfully merging this pull request may close these issues.

- BM25 scores wrong after L0->L1 segment merge
- Query performance regression in V2 segment format
