
Fix V2 segment query performance regression#87

Merged
tjgreen42 merged 27 commits into main from work2 on Dec 28, 2025

Conversation

tjgreen42 (Collaborator) commented Dec 22, 2025

Largely fixes the query performance regression from the V2 segment format.

Changes:

  1. O(T×S) → O(S) segment opens - Open each segment once for all terms instead of once per term

  2. Zero-copy block iteration - Use direct buffer access instead of copying ~1KB per block

  3. Split CTID arrays - Separate block/offset arrays for cache-efficient doc_id → CTID lookups

  4. Top-k partial sort - O(n + k log k) instead of O(n log n) for result extraction

  5. Custom hash table - Replace dynahash with linear-probing hash table (was 70% of query time)

  6. CTID tiebreaker - Deterministic ordering for equal scores

  7. Fix use-after-free in segment merge - During N-way term merge, comparisons used min_term which pointed to freed memory after advancing sources. This caused documents with terms in multiple segments to be lost, corrupting corpus statistics and BM25 scores (~29% off). Fixed by comparing against the pstrdup'd copy.

MS MARCO (8.8M docs): Short 16ms, Medium 23ms, Long 29ms

Closes #86
Fixes #93

tjgreen42 and others added 11 commits December 22, 2025 14:09
The V2 segment format introduced a significant query performance regression
where segments were being opened/closed O(T × S) times for T query terms
and S segments.

Root cause: The scoring loop opened each segment once per term:
  for each term:
    for each segment:
      open → lookup → close

Fix: Restructure to open each segment once for all terms:
  for each segment:
    open
    for each term: lookup and score
    close

This reduces segment opens from O(T × S) to O(S), which is critical
because tp_segment_open reads the segment header and entire page index
from disk.

New function tp_score_all_terms_in_segment_chain combines doc_freq
lookup and scoring into a single pass per segment.
- Split query benchmarks into two modes:
  - Without scores: ORDER BY only (index scan)
  - With scores: SELECT score column (per-row operator call)
- Increase profiling iterations from 100 to 300 per query type
- Add 600 iterations total (300 no-score + 300 with-score)

This helps distinguish between index scan performance and
per-row scoring operator performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change V2 segment posting iterator to use tp_segment_get_direct for
block data instead of tp_segment_read. This eliminates costly memcpy
operations when block data fits within a single page (the common case).

Key changes:
- Add TpSegmentDirectAccess to V2 iterator for zero-copy block access
- Try direct access first, fall back to copy when block spans pages
- Properly release direct access when advancing blocks or finishing

Profile showed V2 was spending ~27% CPU in tp_segment_read path
(13.96% memcpy + 9.27% buffer manager + 3.5% read overhead) vs V1's
~2.5% using zero-copy access. This should significantly reduce that.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous commit added zero-copy for block loading, but CTID lookups
were still using tp_segment_read for every posting (when segment has
>100K docs). Profile showed this was still ~12% of CPU time.

Now use tp_segment_get_direct for CTID lookups too, with fallback to
copy only when CTID spans page boundary (rare - 6 byte entries).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Support both old (5 queries) and new (10 queries) format
- Extract no-score and with-score metrics separately
- Fix throughput extraction for new THROUGHPUT_RESULT_NO_SCORE format
- Add with-score metrics to benchmark action output

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When loading a block of postings, batch-load all CTIDs for documents
in that block. This reduces per-posting overhead during iteration,
though the total buffer accesses remain similar due to scattered
doc_ids in the CTID map.

Performance on MS-MARCO 8.8M docs (warm cache, -O2):
- V1: ~65ms for multi-term query
- V2: ~125ms (still ~2x slower due to CTID indirection)

The remaining gap is architectural: V1 stores CTIDs inline with
postings, while V2 uses a separate CTID map requiring random access.

Replace block-level CTID caching with segment-level split arrays:
- Store BlockNumber[] and OffsetNumber[] separately (6 bytes/doc total)
- Load both arrays into memory when opening segment
- Simple array indexing for CTID lookup during iteration
- Excellent cache locality since posting lists are sorted by doc_id

This eliminates per-posting buffer manager calls that caused V2 to be
slower than V1. The split storage also provides better packing than
the previous ItemPointerData[] approach.
UBSAN detected misaligned access when reading TpBlockPosting from
page buffers via zero-copy direct access. TpBlockPosting contains
a uint32 field requiring 4-byte alignment, but the data position
within the page may not be properly aligned.

Fix: Check alignment before using zero-copy access. If misaligned,
fall back to copying the block data to a properly-aligned buffer.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace qsort with a custom partial quicksort that only sorts the
top-k elements needed for results. This is O(n + k log k) instead
of O(n log n), providing significant speedup when k << n.

Key optimizations:
- Partial quicksort only recurses into partitions containing top-k
- Inlined score comparison eliminates function call overhead
- Insertion sort for small subarrays (< 16 elements)
- Median-of-three pivot selection for better partitioning

Performance improvement: ~40% faster for multi-term queries on
MS-MARCO (8.8M docs): 565ms -> 340ms for 5-term query.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When documents have equal BM25 scores, sort by CTID ascending to ensure
deterministic, reproducible query results. This makes the partial
quicksort stable for equal-scored documents.

Update limits test expected output to reflect the new deterministic
ordering.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Profile showed 70% of query time in hash operations (hash_search 38%,
hash_seq_search 32%). New TpDocScoreTable uses:
- Open-addressing with linear probing for better cache locality
- 64-bit packed CTID keys for fast hashing
- FNV-1a hash function
- Simple array iteration instead of bucket traversal

Performance improved from ~340ms to ~40-75ms for 5-term queries.

The FNV-1a hash was only processing bytes 0-5 (bits 0-47), missing
the high 16 bits of block number (bits 48-63). For tables > 512MB,
this would cause hash collisions for CTIDs differing only in those
bits.

Also fixed incorrect comment claiming 24-byte entry size (it's 16).
@tjgreen42 tjgreen42 changed the title from "Fix O(terms × segments) query performance regression" to "Fix V2 segment query performance and optimize BM25 scoring" on Dec 24, 2025
tjgreen42 and others added 9 commits December 25, 2025 17:50
Add a planner hook that replaces resjunk ORDER BY score expressions
with a cheap stub function that returns the cached score from the
index scan. This avoids computing BM25 scores twice (once in the
index scan, once in expression evaluation).

- Add tp_cached_score infrastructure in index.c
- Add bm25_get_current_score() stub function in query.c
- Add planner hook to detect BM25 IndexScan and replace expressions
- Only replace when BM25 IndexScan is present (safe for SeqScan fallback)

Also add fieldnorm_discrepancy test documenting score corruption
after L0->L1 merge (tracked in issue #93).

Revert commits 014207f and a422e57 that introduced a custom
linear-probing hash table. Profiling showed tp_doc_score_table_insert
taking ~50% of query time, and the custom FNV-1a hash was doing
8 multiplies per insert.

PostgreSQL's built-in dynahash is well-tested and optimized.
Stick with it unless we have compelling evidence it's a bottleneck.

The tp_cached_score_valid flag and related functions were never used.
Simplify to just the cached score value.

Previously the planner hook only replaced resjunk ORDER BY expressions
with the cached score stub. Now it also replaces matching expressions
in the SELECT clause.

For queries like:
  SELECT content <@> 'hello' AS score FROM t ORDER BY content <@> 'hello'

Both the SELECT and ORDER BY expressions are replaced with
bm25_get_current_score(), avoiding duplicate score computation.
BM25 queries should be driven by ORDER BY + LIMIT, not filtered with
WHERE score < 0. The WHERE clause was pointless (matches nearly all
rows since scores are non-positive) and just added overhead.

Cleaned up benchmarks and tests to use the correct pattern:
  SELECT ... ORDER BY col <@> query LIMIT n

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds parallel benchmarks that include score in the SELECT clause:
- Benchmark 1b: Query latency with score (5 query types × 10 iterations)
- Benchmark 2b: Query throughput with score (20 queries)

The extraction script already supported with-score metrics. Updated
format_for_action.sh to include all with-score query latencies in
the graphs.

This lets us measure the impact of the stub optimization that replaces
duplicate score expressions in SELECT with a cached value lookup.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Without LIMIT, PostgreSQL returns all rows sorted by score. The index
only provides matching documents - non-matching ones get score 0 from
standalone scoring. Adding LIMIT allows the index to satisfy the query
entirely.

Also added SET enable_seqscan = off for consistency with other tests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When the SELECT clause uses ROUND((score)::numeric, 4) as score,
ORDER BY score refers to the ROUND() output, not the raw operator.
The index can't provide ordering by ROUND(score), only by the raw
score expression.

Solution: ORDER BY the raw expression while displaying ROUND() in SELECT.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous change incorrectly removed WHERE clauses from COUNT queries,
replacing them with ORDER BY, which is invalid SQL (can't ORDER BY a column
not in SELECT when using aggregates).

Fix by using subquery pattern with ORDER BY + LIMIT, which uses the index.
WHERE clauses are unnecessary since the index only returns matching documents.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
The test reveals that document id=1 from the first segment is lost during
merge for 'database' and 'world' terms (though 'hello' still works).
Updated comments to accurately describe this bug instead of the outdated
comment about "Query code only searches L0 segments".

🤖 Generated with [Claude Code](https://claude.com/claude-code)
During N-way term merge, comparisons used min_term which pointed to
freed memory after advancing sources. This caused documents with terms
in multiple segments to be lost, corrupting corpus statistics and
BM25 scores.

Fix: Compare against current_merged->term (the pstrdup'd copy) instead.

Fixes #93

Queries without LIMIT were scanning all rows and showing 0 scores
for non-matching documents. Changed to use ORDER BY expression + LIMIT
pattern which uses the BM25 index and only returns matching documents.

Files fixed:
- test/sql/partitioned.sql
- test/sql/mixed.sql
- test/sql/vector.sql
@tjgreen42 tjgreen42 changed the title from "Fix V2 segment query performance and optimize BM25 scoring" to "Fix V2 segment query performance regression" on Dec 28, 2025
@tjgreen42 tjgreen42 merged commit f3c6e14 into main Dec 28, 2025
11 checks passed
@tjgreen42 tjgreen42 deleted the work2 branch December 28, 2025 19:27

Development

Successfully merging this pull request may close these issues.

- BM25 scores wrong after L0->L1 segment merge
- Query performance regression in V2 segment format
