Skip to content

Enable rank-unsafe optimizations for MAXSCORE/WAND.#12446

Draft
jpountz wants to merge 1 commit intoapache:mainfrom
jpountz:rank_unsafe
Draft

Enable rank-unsafe optimizations for MAXSCORE/WAND.#12446
jpountz wants to merge 1 commit intoapache:mainfrom
jpountz:rank_unsafe

Conversation

@jpountz
Copy link
Contributor

@jpountz jpountz commented Jul 17, 2023

Both MAXSCORE and WAND can easily be tuned to perform rank-unsafe optimizations, by skipping doc IDs that are unlikely to make it to the top-k. The main challenge is how to expose this kind of optimization. One approach could consist of artificially increasing the minimum competitive score as suggested in the original WAND paper. The approach I'm considering here is to configure a target evaluation cost, giving the scorer a budget of documents that it can visit and asking it to compute the best hits it can identify with this budget.

This draft PR tries to give an idea of how it could look like. It's currently only implemented for our MAXSCORE implementation but could easily be ported to our WAND scorer too.

An interesting follow-up could be to integrate this into the timout mechanism, so that IndexSearcher would progressively reduce the target cost as the amount of remaining time reduces.

I'm interested in gathering feedback on this approach.

Both MAXSCORE and WAND can easily be tuned to perform rank-unsafe
optimizations, by skipping doc IDs that are unlikely to make it to the top-k.
The main challenge is how to expose this kind of optimization. One approach
could consist of artificially increasing the minimum competitive score as
suggested in the original WAND paper. The approach I'm considering here is to
configure a target evaluation cost, giving the scorer a budget of documents
that it can visit and asking it to compute the best hits it can identify with
this budget.

This draft PR tries to give an idea of how it could look like. It's currently
only implemented for our MAXSCORE implementation but could easily be ported to
our WAND scorer too.

An interesting follow-up could be to integrate this into the timout mechanism,
so that `IndexSearcher` would progressively reduce the target cost as the
amount of remaining time reduces.

I'm interested in gathering feedback on this approach.
@jpountz
Copy link
Contributor Author

jpountz commented Jul 17, 2023

As an example, with this PR and calling searcher.setMaxEvaluatedHitRatio(.001f), the query be (+mostly +interview) goes from 7.0ms to 2.7ms while still returning the same top 100 hits on wikimedium10m.

You typically won't see a speedup if you run pure disjunctions of terms because score upper bounds are so good that the rank-safe MAXSCORE logic already works extremely well. But if some clauses have less optimal score upper bounds, such as above with a conjunction, then you will see speedups.

Copy link
Contributor

@msokolov msokolov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exciting!

public abstract long cost();

/**
* Optional operation: set the target cost. When set to a value that is less that {@link #cost()},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a neat idea! I am a little confused about some of the details though. One thing is I don't know if Scorer.cost() is always the number of documents the Scorer will match - or if it is even necessarily a count of documents? Maybe it is? Not sure if that matters here, but I wonder if we are now imposing a new requirement on the meaning of "cost".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadocs of DocIdSetIterator say this:

This is generally an upper bound of the number of documents this iterator might match, but may be a rough heuristic, hardcoded value, or otherwise completely inaccurate.

Taking advantage of this new API indeed relies on the cost() being somewhat meaningful.


// See if we can further reduce the set of essential scorers while still being above the target
// cost.
while (firstEssentialScorer < allScorers.length - 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the sub-scorers sorted in some way so that this will be stable and not dependent on some arbitrary insertion order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sub scorers are sorted by ascending maximum score within a window in this scorer. I could add a tie-break on the cost to make it more stable.

@mikemccand
Copy link
Member

Rank unsafe optimizations is a neat idea! It'd give another tool for maybe more smoothly trading cost for recall.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants