GH#11922: Allow DisjunctionDISIApproximation to short-circuit#11928
GH#11922: Allow DisjunctionDISIApproximation to short-circuit#11928gsmiller wants to merge 3 commits intoapache:mainfrom
Conversation
lucene/MIGRATE.md
Outdated
|
|
||
| See LUCENE-10558 for more details and workarounds. | ||
|
|
||
| ### DisjunctionDISIApproximation behavior change |
There was a problem hiding this comment.
I currently have this marked as a 10.0 release change with a migration guide since DisjunctionDISIApproximation is a public class and this behavior change could impact users. Then again, it is marked as @lucene.internal, so maybe we can be more aggressive and release with a 9.x change?
There was a problem hiding this comment.
Yes we can be more aggressive, it's only public so that it can be reused for CombinedFieldsQuery.
There was a problem hiding this comment.
OK, yeah- makes sense. I'll move to the 9.5 section and remove the MIGRATE entry.
|
Have you checked if this actually made things faster? We're saving calls to |
I'd run some of our internal benchmarks with this change, and it did show a positive impact (reduced cost and latency). I did a quick run with luceneutil benchmark tasks, and it looked pretty flat. These were both kind of hastily run though, and I was guessing that the reason we saw an improvement internally is because we don't rely on scoring, so could take advantage (while I was guessing that the luceneutil tasks do scoring, but I didn't dig in yet). It's a really good callout about the condition check in the tight loop. That hadn't occurred to me, so thank you for mentioning it. For the common case of requiring scores, this change seems like it would just cause a net increase to cost, so I don't love that. As a next step, I'll dig into benchmarking a bit more. If it looks like there is a clear positive impact still to the non-scoring case, I'd like to see if there's a way to avoid the added condition-checking cost for the common scoring case (since all the sub-iterators will need to get advanced anyway). Thanks for the early feedback @jpountz ! |
jpountz
left a comment
There was a problem hiding this comment.
I left some minor suggestions but the change looks correct to me.
| public int nextDoc() throws IOException { | ||
| if (docID == DocIdSetIterator.NO_MORE_DOCS) { | ||
| return docID; | ||
| } |
There was a problem hiding this comment.
This is not necessary, it's illegal to call nextDoc on an exhausted DocIdSetIterator.
There was a problem hiding this comment.
Right, good point. Will simplify.
| if (target == DocIdSetIterator.NO_MORE_DOCS) { | ||
| docID = DocIdSetIterator.NO_MORE_DOCS; | ||
| return docID; | ||
| } |
There was a problem hiding this comment.
In general I'm dubious about such checks for NO_MORE_DOCS since we need to do this check on every candidate match but this condition is true at most once per segment, and often never. So it looks like extra overhead that doesn't bring significant value, and could prevent e.g. inlining due to increased bytecode size.
There was a problem hiding this comment.
That's fair. I removed the check.
|
@jpountz thanks for the implementation feedback! I've updated the PR, but still plan to do more benchmarking to really understand the benefit, etc. before looking to actually merge this. I'll follow up here once I've been able to do that. |
|
@jpountz I re-ran some internal benchmarking with this change to highlight the speedup in cases where scoring isn't needed (at least some specific use-cases I'm looking at). These use-cases all involve a "disjunction filter," meaning a disjunction of terms that is used as a required clause. So something like In these benchmarks, I'm observing a 2.3% QPS improvement, and a 3.5% avg. latency reduction (5.9% p50 reduction / 3.5% p99 reduction). So the change appears to be helping this type of situation. As for whether-or-not this change would actually hurt other common use-cases that require scoring or second-phase checks, I re-ran Given this, I'd prefer to move forward with the change (since it doesn't seem to be doing any harm to common cases covered in |
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
jpountz
left a comment
There was a problem hiding this comment.
Very sorry for the lag, I'm terrible at keeping track of PRs. The explanation of when this change helps and the change itself look good to me.
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
Description
This PR explores allowing
DisjunctionDISIApproximationto short-circuit after "finding" a doc, allowing sub-iterators to only be exhaustively advanced when necessary. This could be useful for non-scoring disjunction clauses.Note: I've only changed the approximation implementation to start.
DisjunctionScorerexhaustively advances all sub-iterators in this implementation if it needs to do two-phase match confirmation (which is what we do today). Ideally, we'd only advance as necessary in this step as well.