Enable rank-unsafe optimizations for MAXSCORE/WAND. #12446

jpountz wants to merge 1 commit into apache:main
Conversation
Both MAXSCORE and WAND can easily be tuned to perform rank-unsafe optimizations by skipping doc IDs that are unlikely to make it into the top-k. The main challenge is how to expose this kind of optimization. One approach could consist of artificially increasing the minimum competitive score, as suggested in the original WAND paper. The approach I'm considering here is to configure a target evaluation cost, giving the scorer a budget of documents that it may visit and asking it to compute the best hits it can identify within this budget.

This draft PR tries to give an idea of what it could look like. It's currently only implemented for our MAXSCORE implementation but could easily be ported to our WAND scorer too.

An interesting follow-up could be to integrate this into the timeout mechanism, so that `IndexSearcher` would progressively reduce the target cost as the amount of remaining time shrinks.

I'm interested in gathering feedback on this approach.
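To make the budget idea concrete, here is a standalone toy sketch (my own illustration, not the actual patch; `topDocsWithinBudget` and its behavior are hypothetical). The scorer is handed a target cost, i.e. a budget of documents it may evaluate, and returns the best hits it found within that budget; a real implementation would instead use the budget to skip low-upper-bound blocks rather than truncate evaluation.

```java
import java.util.ArrayList;
import java.util.List;

public class TargetCostSketch {

  /** Toy "scorer": scores[doc] is the precomputed score of doc. */
  static List<Integer> topDocsWithinBudget(float[] scores, int k, long targetCost) {
    // Visit at most targetCost documents, keeping a running top-k.
    List<Integer> top = new ArrayList<>();
    long visited = 0;
    for (int doc = 0; doc < scores.length && visited < targetCost; doc++, visited++) {
      top.add(doc);
      top.sort((a, b) -> Float.compare(scores[b], scores[a]));
      if (top.size() > k) {
        top.remove(top.size() - 1);
      }
    }
    return top; // rank-unsafe: docs beyond the budget are never considered
  }

  public static void main(String[] args) {
    float[] scores = {0.1f, 0.9f, 0.5f, 0.7f, 0.3f, 0.8f};
    // With an unlimited budget, the true top-2 is docs 1 and 5.
    System.out.println(topDocsWithinBudget(scores, 2, scores.length)); // [1, 5]
    // With a budget of 4 docs, doc 5 is never visited: rank-unsafe result.
    System.out.println(topDocsWithinBudget(scores, 2, 4)); // [1, 3]
  }
}
```

The point of the sketch is the contract, not the implementation: the caller trades ranking safety for a hard bound on evaluation work.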
As an example with this PR: you typically won't see a speedup if you run pure disjunctions of terms, because score upper bounds are so good that the rank-safe MAXSCORE logic already works extremely well. But if some clauses have less optimal score upper bounds, such as in a conjunction as above, then you will see speedups.
```java
public abstract long cost();

/**
 * Optional operation: set the target cost. When set to a value that is less than {@link #cost()},
```
This is a neat idea! I am a little confused about some of the details though. One thing is I don't know if `Scorer.cost()` is always the number of documents the scorer will match, or if it is even necessarily a count of documents? Maybe it is? Not sure if that matters here, but I wonder if we are now imposing a new requirement on the meaning of "cost".
Javadocs of `DocIdSetIterator` say this:

> This is generally an upper bound of the number of documents this iterator might match, but may be a rough heuristic, hardcoded value, or otherwise completely inaccurate.

Taking advantage of this new API indeed relies on the `cost()` being somewhat meaningful.
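For intuition on why `cost()` is only a heuristic, here is a rough standalone illustration (not Lucene code; my understanding is that Lucene's conjunctions typically report the cheapest clause's cost while disjunctions report the sum of their clauses' costs, and neither is the exact match count):

```java
public class CostHeuristicSketch {

  static long conjunctionCost(long... clauseCosts) {
    long min = Long.MAX_VALUE;
    for (long c : clauseCosts) {
      min = Math.min(min, c); // upper bound: can't match more docs than the rarest clause
    }
    return min;
  }

  static long disjunctionCost(long... clauseCosts) {
    long sum = 0;
    for (long c : clauseCosts) {
      sum += c; // over-counts docs that match several clauses
    }
    return sum;
  }

  public static void main(String[] args) {
    // Two terms matching 1_000 and 1_000_000 docs respectively.
    System.out.println(conjunctionCost(1_000, 1_000_000)); // 1000
    System.out.println(disjunctionCost(1_000, 1_000_000)); // 1001000
  }
}
```

Both values can be far from the true match count, which is why the budget can only be interpreted as an approximate amount of work.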
```java
// See if we can further reduce the set of essential scorers while still being above the target
// cost.
while (firstEssentialScorer < allScorers.length - 1
```
Are the sub-scorers sorted in some way so that this will be stable and not dependent on some arbitrary insertion order?
Sub-scorers are sorted by ascending maximum score within a window in this scorer. I could add a tie-break on the cost to make it more stable.
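The partitioning under discussion might look roughly like this standalone sketch (hypothetical names and shape; the real loop lives inside Lucene's MAXSCORE bulk scorer). Scorers are sorted by ascending max score with a cost tie-break, and low-max-score scorers are moved to the non-essential set for as long as the essential set's remaining cost stays at or above the target:

```java
import java.util.Arrays;
import java.util.Comparator;

public class EssentialScorerSketch {

  record ScorerInfo(float maxScore, long cost) {}

  /** Returns the index of the first essential scorer after partitioning. */
  static int firstEssential(ScorerInfo[] scorers, long targetCost) {
    // Sort by ascending max score, tie-broken on cost for stability.
    Arrays.sort(scorers,
        Comparator.comparingDouble(ScorerInfo::maxScore)
            .thenComparingLong(ScorerInfo::cost));
    long remainingCost = 0;
    for (ScorerInfo s : scorers) {
      remainingCost += s.cost();
    }
    int firstEssentialScorer = 0;
    // Move scorers to the non-essential set while the cost of the remaining
    // essential scorers stays above the target cost.
    while (firstEssentialScorer < scorers.length - 1
        && remainingCost - scorers[firstEssentialScorer].cost() >= targetCost) {
      remainingCost -= scorers[firstEssentialScorer].cost();
      firstEssentialScorer++;
    }
    return firstEssentialScorer;
  }

  public static void main(String[] args) {
    ScorerInfo[] scorers = {
      new ScorerInfo(1.0f, 100),
      new ScorerInfo(2.0f, 500),
      new ScorerInfo(3.0f, 1000),
    };
    // With targetCost=1000, the 100- and 500-cost scorers become non-essential.
    System.out.println(firstEssential(scorers, 1000)); // 2
  }
}
```

With a stable sort and the cost tie-break, the partition no longer depends on insertion order for scorers with equal max scores.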
Rank-unsafe optimizations are a neat idea! They'd give another tool for more smoothly trading cost for recall.