Optimize Text payload matcher by agourlay · Pull Request #6766 · qdrant/qdrant

agourlay · 2025-06-26T12:20:51Z

The initial goal of this PR was to fix an over-measurement of payload_index_io_read for the Text index.

In practice, the final fix is a performance optimization decreasing the hardware counter as a side effect.

In my tests, the payload_index_io_read value for Text matching is almost 3x higher than what is reported by the read_bytes of the process.

In the case of high-cardinality Text matching, the search is not driven directly by the payload index.
This means a potentially large number of points need to be tested against the payload index to check whether the condition matches.

The current approach in dev lazily recomputes the set of points matching the query tokens again for every point tested.
This set is constant across all points tested.

This artificially inflates payload_index_io_read, because the mmap slice is cached and the repeated accesses are still counted.

The proposed fix is to pre-calculate the intersection of points for the query tokens to reuse it for each point test.
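
A minimal sketch of the idea, using a toy in-memory index rather than qdrant's actual types. `ToyTextIndex`, `posting`, `matching_points`, `compare_reads` and the `reads` counter are hypothetical names; `reads` only stands in for a counter such as payload_index_io_read, and the toy `check_match` does not mirror the real method's signature.

```rust
use std::cell::Cell;
use std::collections::{HashMap, HashSet};

/// Toy inverted index (hypothetical): token -> sorted posting list of point ids.
struct ToyTextIndex {
    postings: HashMap<String, Vec<u32>>,
    /// Stand-in for a hardware counter; incremented on every posting-list access.
    reads: Cell<usize>,
}

impl ToyTextIndex {
    fn new(entries: Vec<(&str, Vec<u32>)>) -> Self {
        let postings = entries
            .into_iter()
            .map(|(token, ids)| (token.to_string(), ids))
            .collect();
        ToyTextIndex { postings, reads: Cell::new(0) }
    }

    /// Fetch one posting list and count the access.
    fn posting(&self, token: &str) -> &[u32] {
        self.reads.set(self.reads.get() + 1);
        match self.postings.get(token) {
            Some(ids) => ids.as_slice(),
            None => &[],
        }
    }

    /// Per-point check: goes back to the posting lists on every call.
    fn check_match(&self, query: &[&str], point_id: u32) -> bool {
        !query.is_empty()
            && query
                .iter()
                .all(|token| self.posting(token).binary_search(&point_id).is_ok())
    }

    /// Precompute once the set of points that contain all query tokens.
    fn matching_points(&self, query: &[&str]) -> HashSet<u32> {
        let mut tokens = query.iter();
        let Some(first) = tokens.next() else {
            return HashSet::new();
        };
        let mut result: HashSet<u32> = self.posting(first).iter().copied().collect();
        for token in tokens {
            let posting = self.posting(token);
            result.retain(|id| posting.binary_search(id).is_ok());
        }
        result
    }
}

/// Returns (accesses with per-point checks, accesses with the precomputed set).
fn compare_reads(candidates: &[u32]) -> (usize, usize) {
    let query = ["fast", "vector"];
    let index = ToyTextIndex::new(vec![
        ("fast", vec![1, 2, 3, 5, 8]),
        ("vector", vec![2, 3, 5, 7]),
    ]);

    let per_point: Vec<bool> = candidates
        .iter()
        .map(|&id| index.check_match(&query, id))
        .collect();
    let per_point_reads = index.reads.replace(0);

    let matching = index.matching_points(&query);
    let precomputed: Vec<bool> = candidates.iter().map(|id| matching.contains(id)).collect();
    let precomputed_reads = index.reads.get();

    // Both strategies must agree on which candidates match.
    assert_eq!(per_point, precomputed);
    (per_point_reads, precomputed_reads)
}
```

In this toy, the per-point path performs one or two posting-list accesses per candidate, while the precomputed path performs two accesses in total regardless of how many candidates are tested.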

I was able to show locally that this change:

fixes the payload_index_io_read to be much closer to read_bytes
makes the query slightly faster

The trade-off is to use more memory once to avoid repeated IO.

The set of points kept in memory is the intersection of the tokens' posting lists, therefore it should not grow out of proportion when adding tokens to the query.
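
To illustrate why the retained set stays bounded, here is a small sketch, assuming sorted posting lists of point ids. `intersect_postings` is a hypothetical helper, not the intersection code used in this PR: the result can never be larger than the shortest posting list, and each extra token can only shrink it.

```rust
/// Intersect sorted posting lists, starting from the shortest one.
/// Hypothetical helper for illustration only.
fn intersect_postings(mut lists: Vec<Vec<u32>>) -> Vec<u32> {
    // Starting from the shortest list bounds the size of the accumulator.
    lists.sort_by_key(|list| list.len());
    let mut iter = lists.into_iter();
    let Some(mut acc) = iter.next() else {
        return Vec::new();
    };
    for list in iter {
        // Keep only the ids also present in the next (sorted) posting list.
        acc.retain(|id| list.binary_search(id).is_ok());
    }
    acc
}
```

With posting lists `[1, 2, 3, 5, 8]` and `[2, 3, 5, 7]` the helper returns `[2, 3, 5]`; adding a third list `[3, 5, 9]` narrows the result to `[3, 5]`.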

Future work

Phrase matching does not leverage the optimization yet.
Hopefully we get to follow a similar pattern where positions are handled properly.

lib/segment/tests/integration/payload_index_test.rs

lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs

lib/segment/src/index/field_index/full_text_index/text_index.rs

generall

Benchmarks are controversial

timvisee

In your comment you mention we lazily compute a list of points 3 times. Could you point to the exact place where we do this? That is not immediately clear to me from this PR.

timvisee · 2025-07-07T09:43:54Z

lib/segment/src/index/query_optimization/condition_converter/match_converter.rs

+            // Build the set intersection of point ids from the query tokens
+            let points_for_token = full_text_index
+                .filter_query(parsed_query.clone(), &hw_counter)
+                .collect::<AHashSet<_>>();
+
            Some(Box::new(move |point_id: PointOffsetType| {
-                full_text_index.check_match(&parsed_query, point_id, &hw_counter)
+                points_for_token.contains(&point_id)
            }))


Naive question: if we can do this here, can't we just do the same inside full_text_index.check_match(..)?

agourlay · 2025-07-08T14:12:28Z

The optimization is working fine when a lot of points are being tested in the matcher.

Otherwise, the cost of preparing the intersection up front could be more expensive than the savings.

To make matters worse, we actually build the filter_context more than once per query.
e.g.:

sampling: https://github.com/qdrant/qdrant/blob/530430fac2a3ca872504f276d2c91a5c91f43fa0/lib/segment/src/index/hnsw_index/hnsw.rs#L1310C21-L1310C35
per query:

qdrant/lib/segment/src/index/hnsw_index/hnsw.rs

Line 940 in 530430f

self.search_with_graph(other, filter, top, params, None, vector_query_context)
leaf call:

qdrant/lib/segment/src/index/hnsw_index/hnsw.rs

Line 890 in 530430f

let filter_context = filter.map(|f| payload_index.filter_context(f, &hw_counter));

Creating a single filter_context would improve the performance and maybe make the tradeoff of this PR much more of a win.

But making this change is not trivial, so I will close the current approach, as it will never work as long as filter_context objects are assumed to be cheap.
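
The trade-off described in this comment can be sketched with a deliberately crude cost model. Everything here is an assumption made for illustration: costs are counted in posting-list element visits, each per-point check is charged one visit per token, and `precompute_pays_off`, `contexts_built` and `total_points_tested` are hypothetical names, not qdrant APIs or measured values.

```rust
/// Hypothetical cost of building the intersection once: visit every posting element.
fn upfront_cost(posting_lens: &[usize]) -> usize {
    posting_lens.iter().sum()
}

/// Hypothetical cost of per-point checks: one visit per token per tested point.
fn per_point_cost(num_tokens: usize, total_points_tested: usize) -> usize {
    num_tokens * total_points_tested
}

/// Precomputing only pays off if its cost, repeated for every filter context
/// built during the query, stays below the cost of the per-point checks.
fn precompute_pays_off(
    posting_lens: &[usize],
    contexts_built: usize,
    total_points_tested: usize,
) -> bool {
    contexts_built * upfront_cost(posting_lens)
        < per_point_cost(posting_lens.len(), total_points_tested)
}
```

Under these assumptions, two posting lists of 1000 entries and 1500 tested points favor precomputing when a single context is built (2000 < 3000), but not when three contexts are built for the same query (6000 > 3000).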

agourlay force-pushed the optimize-text-payload-matcher branch from c4f4006 to 16ae35a Compare June 27, 2025 09:09

agourlay marked this pull request as ready for review June 27, 2025 15:18

agourlay requested review from coszio and timvisee June 27, 2025 15:18

agourlay commented Jun 27, 2025

lib/segment/tests/integration/payload_index_test.rs (outdated, resolved)

coszio reviewed Jun 27, 2025

lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (outdated, resolved)

lib/segment/src/index/field_index/full_text_index/text_index.rs (outdated, resolved)

agourlay force-pushed the optimize-text-payload-matcher branch from 8ca10ad to 3a99f45 Compare June 30, 2025 19:22

agourlay requested a review from coszio June 30, 2025 19:44

agourlay force-pushed the optimize-text-payload-matcher branch from 3a99f45 to 4f3e5ce Compare July 1, 2025 08:50

coszio approved these changes Jul 1, 2025

generall requested changes Jul 1, 2025

timvisee reviewed Jul 7, 2025

agourlay added 13 commits July 7, 2025 15:05

Optimize Text payload matcher

57bcece

intersect postings sets for less memory usage

d044bcf

do not apply the optimization for phrase matching

adfe3fc

cleanup

c8bd382

cleanup bis

a39ebb8

make test fail

11560a0

use correct impl. for mutable inverted index

1ee0f4f

renaming review

56b0cb8

do not always generate the same filters

3cc3969

use existing infra. for intersection

492f9df

clean

6bc6717

make change minimal

3610c35

micro diff

fced9bb

agourlay force-pushed the optimize-text-payload-matcher branch from 1b0c81c to fced9bb Compare July 7, 2025 13:06

agourlay closed this Jul 8, 2025

agourlay mentioned this pull request Jul 9, 2025

Do not track HW measurement when matching over a text index #6833

Merged

coderabbitai bot mentioned this pull request Jul 22, 2025

[map index] use roaring bitmap in mutable map index #6926

Merged

agourlay wants to merge 13 commits into dev from optimize-text-payload-matcher
