BM25 scores wrong after L0->L1 segment merge #93
Closed
Description
Summary
After an L0->L1 segment merge, BM25 scores are off by about 29%: documents matching 'database' score -0.8938 instead of the expected -0.6931. The same data and query without a merge return the correct scores.
Reproduction
SET pg_textsearch.segments_per_level = 2;
CREATE TABLE test (id SERIAL PRIMARY KEY, content TEXT);
CREATE INDEX test_idx ON test USING bm25(content) WITH (text_config='english');
-- Phase 1: First L0 segment
INSERT INTO test (content) VALUES
('hello world database'), ('goodbye cruel world'),
('hello goodbye friend'), ('world peace harmony');
SELECT bm25_spill_index('test_idx');
-- Phase 2: Triggers L0->L1 merge
INSERT INTO test (content) VALUES
('database indexing query'), ('search engine optimization'),
('database world news'), ('goodbye database friend');
SELECT bm25_spill_index('test_idx');
-- Phase 3: Post-merge data
INSERT INTO test (content) VALUES
('hello new insertion'), ('database transaction log');
-- Query shows incorrect scores
SELECT content, content <@> to_bm25query('database', 'test_idx') AS score
FROM test WHERE content <@> to_bm25query('database', 'test_idx') < 0;
Expected vs Actual
| Document | Expected Score | Actual Score | Error |
|---|---|---|---|
| hello world database | -0.6931 | -0.8938 | 29% |
| database indexing query | -0.6931 | -0.8938 | 29% |
| database world news | -0.6931 | -0.8938 | 29% |
| goodbye database friend | -0.6931 | -0.8938 | 29% |
| database transaction log | -0.6931 | -0.8938 | 29% |
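The Error column can be checked directly from the two score values in the table. A minimal sketch; the variable names are illustrative only:

```python
# Scores copied from the table above (the <@> operator reports negative values).
expected = -0.6931
actual = -0.8938

# Relative error of the observed score against the expected one.
rel_error = abs(actual - expected) / abs(expected)
print(f"{rel_error:.1%}")  # prints 29.0%
```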
Key Finding
Without merge (default segments_per_level): Scores match the expected values.
With merge (segments_per_level = 2): Score magnitudes are ~29% too large (-0.8938 instead of -0.6931).
Likely Cause
The merge operation appears to corrupt corpus statistics (avg_length or total_docs), so the merged segment may be feeding incorrect values into the BM25 formula: avg_length affects the length normalization component, and total_docs affects IDF.
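To narrow down which statistic might be off, candidate values can be plugged into the textbook BM25 formula and compared with the reported scores. The sketch below rests on assumptions not confirmed against the pg_textsearch source: the Lucene-style IDF `ln(1 + (N - n + 0.5) / (n + 0.5))`, `k1 = 1.2`, `b = 0.75`, and every test document tokenizing to exactly three terms. The function names are hypothetical.

```python
import math

def bm25_idf(total_docs, doc_freq):
    """Lucene-style IDF (assumed form, not taken from pg_textsearch)."""
    return math.log(1.0 + (total_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_score(tf, doc_len, avg_len, total_docs, doc_freq, k1=1.2, b=0.75):
    """Single-term BM25 score, negated to match the sign of the reported scores."""
    length_norm = 1.0 - b + b * doc_len / avg_len
    tf_part = tf * (k1 + 1.0) / (tf + k1 * length_norm)
    return -bm25_idf(total_docs, doc_freq) * tf_part

def matches(score, target, tol=1e-4):
    """True if score equals target within the 4-decimal rounding of the report."""
    return abs(score - target) < tol

EXPECTED, ACTUAL = -0.6931, -0.8938

# Baseline: 10 documents, 5 containing 'database', all of length 3.
baseline = bm25_score(tf=1, doc_len=3, avg_len=3.0, total_docs=10, doc_freq=5)

# Which single wrong statistic would reproduce the observed score?
df_hits = [df for df in range(1, 11)
           if matches(bm25_score(1, 3, 3.0, 10, df), ACTUAL)]
n_hits = [n for n in range(5, 21)
          if matches(bm25_score(1, 3, 3.0, n, 5), ACTUAL)]

print("baseline matches expected:", matches(baseline, EXPECTED))
print("doc_freq values reproducing actual:", df_hits)
print("total_docs values reproducing actual:", n_hits)
```

Under these assumptions the baseline reproduces the expected -0.6931, and a document frequency of 4 rather than 5 reproduces the observed -0.8938, while no integer total_docs in the scanned range does. This is a numerical hint about where to look, not a confirmed diagnosis.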
Test File
Added test/sql/fieldnorm_discrepancy.sql to demonstrate this issue.