Enable rank-unsafe optimizations for MAXSCORE/WAND. #12446

jpountz wants to merge 1 commit into apache:main
Conversation
Both MAXSCORE and WAND can easily be tuned to perform rank-unsafe optimizations by skipping doc IDs that are unlikely to make it into the top-k. The main challenge is how to expose this kind of optimization. One approach could consist of artificially increasing the minimum competitive score, as suggested in the original WAND paper. The approach I'm considering here is to configure a target evaluation cost, giving the scorer a budget of documents that it may visit and asking it to compute the best hits it can identify within this budget.

This draft PR tries to give an idea of what it could look like. It's currently only implemented for our MAXSCORE implementation but could easily be ported to our WAND scorer too.

An interesting follow-up could be to integrate this into the timeout mechanism, so that `IndexSearcher` would progressively reduce the target cost as the amount of remaining time shrinks.

I'm interested in gathering feedback on this approach.
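To make the budget idea concrete, here is a standalone toy sketch (my own illustration, not the actual patch; `topDocsWithinBudget` and its behavior are hypothetical). The scorer is handed a target cost, i.e. a budget of documents it may evaluate, and returns the best hits it found within that budget; a real implementation would instead use the budget to skip low-upper-bound blocks rather than truncate evaluation.

```java
import java.util.ArrayList;
import java.util.List;

public class TargetCostSketch {

  /** Toy "scorer": scores[doc] is the precomputed score of doc. */
  static List<Integer> topDocsWithinBudget(float[] scores, int k, long targetCost) {
    // Visit at most targetCost documents, keeping a running top-k.
    List<Integer> top = new ArrayList<>();
    long visited = 0;
    for (int doc = 0; doc < scores.length && visited < targetCost; doc++, visited++) {
      top.add(doc);
      top.sort((a, b) -> Float.compare(scores[b], scores[a]));
      if (top.size() > k) {
        top.remove(top.size() - 1);
      }
    }
    return top; // rank-unsafe: docs beyond the budget are never considered
  }

  public static void main(String[] args) {
    float[] scores = {0.1f, 0.9f, 0.5f, 0.7f, 0.3f, 0.8f};
    // With an unlimited budget, the true top-2 is docs 1 and 5.
    System.out.println(topDocsWithinBudget(scores, 2, scores.length)); // [1, 5]
    // With a budget of 4 docs, doc 5 is never visited: rank-unsafe result.
    System.out.println(topDocsWithinBudget(scores, 2, 4)); // [1, 3]
  }
}
```

The point of the sketch is the contract, not the implementation: the caller trades ranking safety for a hard bound on evaluation work.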
As an example with this PR: you typically won't see a speedup if you run pure disjunctions of terms, because score upper bounds are so good that the rank-safe MAXSCORE logic already works extremely well. But if some clauses have less optimal score upper bounds, such as in a conjunction as above, then you will see speedups.
```java
public abstract long cost();

/**
 * Optional operation: set the target cost. When set to a value that is less than {@link #cost()},
```
This is a neat idea! I am a little confused about some of the details though. One thing is I don't know if `Scorer.cost()` is always the number of documents the scorer will match, or if it is even necessarily a count of documents? Maybe it is? Not sure if that matters here, but I wonder if we are now imposing a new requirement on the meaning of "cost".
Javadocs of `DocIdSetIterator` say this:

> This is generally an upper bound of the number of documents this iterator might match, but may be a rough heuristic, hardcoded value, or otherwise completely inaccurate.

Taking advantage of this new API indeed relies on the `cost()` being somewhat meaningful.
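For intuition on why `cost()` is only a heuristic, here is a rough standalone illustration (not Lucene code; my understanding is that Lucene's conjunctions typically report the cheapest clause's cost while disjunctions report the sum of their clauses' costs, and neither is the exact match count):

```java
public class CostHeuristicSketch {

  static long conjunctionCost(long... clauseCosts) {
    long min = Long.MAX_VALUE;
    for (long c : clauseCosts) {
      min = Math.min(min, c); // upper bound: can't match more docs than the rarest clause
    }
    return min;
  }

  static long disjunctionCost(long... clauseCosts) {
    long sum = 0;
    for (long c : clauseCosts) {
      sum += c; // over-counts docs that match several clauses
    }
    return sum;
  }

  public static void main(String[] args) {
    // Two terms matching 1_000 and 1_000_000 docs respectively.
    System.out.println(conjunctionCost(1_000, 1_000_000)); // 1000
    System.out.println(disjunctionCost(1_000, 1_000_000)); // 1001000
  }
}
```

Both values can be far from the true match count, which is why the budget can only be interpreted as an approximate amount of work.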
```java
// See if we can further reduce the set of essential scorers while still being above the target
// cost.
while (firstEssentialScorer < allScorers.length - 1
```
Are the sub-scorers sorted in some way so that this will be stable and not dependent on some arbitrary insertion order?
Sub-scorers are sorted by ascending maximum score within a window in this scorer. I could add a tie-break on the cost to make it more stable.
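The partitioning under discussion might look roughly like this standalone sketch (hypothetical names and shape; the real loop lives inside Lucene's MAXSCORE bulk scorer). Scorers are sorted by ascending max score with a cost tie-break, and low-max-score scorers are moved to the non-essential set for as long as the essential set's remaining cost stays at or above the target:

```java
import java.util.Arrays;
import java.util.Comparator;

public class EssentialScorerSketch {

  record ScorerInfo(float maxScore, long cost) {}

  /** Returns the index of the first essential scorer after partitioning. */
  static int firstEssential(ScorerInfo[] scorers, long targetCost) {
    // Sort by ascending max score, tie-broken on cost for stability.
    Arrays.sort(scorers,
        Comparator.comparingDouble(ScorerInfo::maxScore)
            .thenComparingLong(ScorerInfo::cost));
    long remainingCost = 0;
    for (ScorerInfo s : scorers) {
      remainingCost += s.cost();
    }
    int firstEssentialScorer = 0;
    // Move scorers to the non-essential set while the cost of the remaining
    // essential scorers stays above the target cost.
    while (firstEssentialScorer < scorers.length - 1
        && remainingCost - scorers[firstEssentialScorer].cost() >= targetCost) {
      remainingCost -= scorers[firstEssentialScorer].cost();
      firstEssentialScorer++;
    }
    return firstEssentialScorer;
  }

  public static void main(String[] args) {
    ScorerInfo[] scorers = {
      new ScorerInfo(1.0f, 100),
      new ScorerInfo(2.0f, 500),
      new ScorerInfo(3.0f, 1000),
    };
    // With targetCost=1000, the 100- and 500-cost scorers become non-essential.
    System.out.println(firstEssential(scorers, 1000)); // 2
  }
}
```

With a stable sort and the cost tie-break, the partition no longer depends on insertion order for scorers with equal max scores.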
Rank-unsafe optimizations are a neat idea! They'd give another tool for more smoothly trading cost for recall.