Skip to content

GH#11922: Allow DisjunctionDISIApproximation to short-circuit#11928

Open
gsmiller wants to merge 3 commits intoapache:mainfrom
gsmiller:explore/optional-approx-advance-opto-single-queue-pr
Open

GH#11922: Allow DisjunctionDISIApproximation to short-circuit#11928
gsmiller wants to merge 3 commits intoapache:mainfrom
gsmiller:explore/optional-approx-advance-opto-single-queue-pr

Conversation

@gsmiller
Copy link
Copy Markdown
Contributor

Description

This PR explores allowing DisjunctionDISIApproximation to short-circuit after "finding" a doc, allowing sub-iterators to only be exhaustively advanced when necessary. This could be useful for non-scoring disjunction clauses.

Note: I've only changed the approximation implementation to start. DisjunctionScorer exhaustively advances all sub-iterators in this implementation if it needs to do two-phase match confirmation (which is what we do today). Ideally, we'd only advance as necessary in this step as well.


See LUCENE-10558 for more details and workarounds.

### DisjunctionDISIApproximation behavior change
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I currently have this marked as a 10.0 release change with a migration guide since DisjunctionDISIApproximation is a public class and this behavior change could impact users. Then again, it is marked as @lucene.internal, so maybe we can be more aggressive and release with a 9.x change?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we can be more aggressive, it's only public so that it can be reused for CombinedFieldsQuery.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, yeah- makes sense. I'll move to the 9.5 section and remove the MIGRATE entry.

@jpountz
Copy link
Copy Markdown
Contributor

jpountz commented Nov 15, 2022

Have you checked if this actually made things faster? We're saving calls to advance in some cases but also adding conditions to some very tight loops. E.g. we recently got speedups by moving some queries from WAND to MAXSCORE even though WAND is better at saving calls to advance compared to MAXSCORE.

@gsmiller
Copy link
Copy Markdown
Contributor Author

Have you checked if this actually made things faster? We're saving calls to advance in some cases but also adding conditions to some very tight loops.

I'd run some of our internal benchmarks with this change, and it did show a positive impact (reduced cost and latency). I did a quick run with luceneutil benchmark tasks, and it looked pretty flat. These were both kind of hastily run though, and I was guessing that the reason we saw an improvement internally is because we don't rely on scoring, so could take advantage (while I was guessing that the luceneutil tasks do scoring, but I didn't dig in yet).

It's a really good callout about the condition check in the tight loop. That hadn't occurred to me, so thank you for mentioning it. For the common case of requiring scores, this change seems like it would just cause a net increase to cost, so I don't love that.

As a next step, I'll dig into benchmarking a bit more. If it looks like there is a clear positive impact still to the non-scoring case, I'd like to see if there's a way to avoid the added condition-checking cost for the common scoring case (since all the sub-iterators will need to get advanced anyway).

Thanks for the early feedback @jpountz !

Copy link
Copy Markdown
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some minor suggestions but the change looks correct to me.

public int nextDoc() throws IOException {
if (docID == DocIdSetIterator.NO_MORE_DOCS) {
return docID;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary, it's illegal to call nextDoc on an exhausted DocIdSetIterator.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, good point. Will simplify.

if (target == DocIdSetIterator.NO_MORE_DOCS) {
docID = DocIdSetIterator.NO_MORE_DOCS;
return docID;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I'm dubious about such checks for NO_MORE_DOCS since we need to do this check on every candidate match but this condition is true at most once per segment, and often never. So it looks like extra overhead that doesn't bring significant value, and could prevent e.g. inlining due to increased bytecode size.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. I removed the check.

@gsmiller
Copy link
Copy Markdown
Contributor Author

@jpountz thanks for the implementation feedback! I've updated the PR, but still plan to do more benchmarking to really understand the benefit, etc. before looking to actually merge this. I'll follow up here once I've been able to do that.

@gsmiller
Copy link
Copy Markdown
Contributor Author

@jpountz I re-ran some internal benchmarking with this change to highlight the speedup in cases where scoring isn't needed (at least some specific use-cases I'm looking at). These use-cases all involve a "disjunction filter," meaning a disjunction of terms that is used as a required clause. So something like (+ (foo:bar foo:baz foo:zed) (...)), where the foo field must take on one of the specified values to be considered a candidate match. To provide a sense of scale, on average, these filters have 40 different terms in them. Since these "filters" don't participate in scoring at all, it's a good candidate for this short-circuiting.

In these benchmarks, I'm observing a 2.3% QPS improvement, and a 3.5% avg. latency reduction (5.9% p50 reduction / 3.5% p99 reduction). So the change appears to be helping this type of situation.

As for whether-or-not this change would actually hurt other common use-cases that require scoring or second-phase checks, I re-ran luceneutil benchmarks (wikimedium10m) task and don't observe any regressions there (results below). It's possible there's a gap in our benchmarks though, and maybe there are some common use-cases not covered?

                            TaskQPS baseline      StdDevQPS candidate      StdDev                Pct diff p-value
                 MedSloppyPhrase      115.00      (4.9%)      112.91      (5.1%)   -1.8% ( -11% -    8%) 0.249
               HighTermTitleSort      242.48      (3.2%)      238.66      (4.3%)   -1.6% (  -8% -    6%) 0.189
                HighSloppyPhrase       36.47      (3.8%)       36.12      (3.9%)   -0.9% (  -8% -    7%) 0.439
                         LowTerm     1766.82      (3.4%)     1752.12      (3.3%)   -0.8% (  -7% -    6%) 0.436
                      HighPhrase      263.74      (3.4%)      261.71      (2.3%)   -0.8% (  -6% -    5%) 0.404
                       OrHighLow      796.71      (2.7%)      790.71      (2.6%)   -0.8% (  -5% -    4%) 0.367
            BrowseDateSSDVFacets        3.46      (6.4%)        3.44      (7.1%)   -0.7% ( -13% -   13%) 0.755
               HighTermMonthSort     3070.86      (4.5%)     3051.39      (3.9%)   -0.6% (  -8% -    8%) 0.635
                         Prefix3      111.76      (4.6%)      111.08      (4.2%)   -0.6% (  -8% -    8%) 0.658
                   OrNotHighHigh     1249.27      (3.4%)     1242.30      (3.8%)   -0.6% (  -7% -    6%) 0.627
           BrowseMonthTaxoFacets       35.43      (1.6%)       35.23      (2.2%)   -0.6% (  -4% -    3%) 0.367
                 LowSloppyPhrase       61.49      (2.4%)       61.18      (2.5%)   -0.5% (  -5% -    4%) 0.512
                    OrHighNotMed     1139.05      (3.9%)     1133.29      (3.6%)   -0.5% (  -7% -    7%) 0.671
     BrowseRandomLabelTaxoFacets       20.33      (4.5%)       20.23      (5.6%)   -0.5% ( -10% -   10%) 0.760
                        HighTerm     1635.45      (4.2%)     1628.28      (3.9%)   -0.4% (  -8% -    8%) 0.735
                       MedPhrase       46.41      (2.3%)       46.22      (1.6%)   -0.4% (  -4% -    3%) 0.529
                       OrHighMed      193.55      (2.8%)      192.79      (2.9%)   -0.4% (  -5% -    5%) 0.663
                   OrHighNotHigh      865.20      (3.0%)      862.38      (3.9%)   -0.3% (  -6% -    6%) 0.766
                      AndHighLow     1566.83      (2.7%)     1562.47      (2.7%)   -0.3% (  -5% -    5%) 0.745
            MedTermDayTaxoFacets       48.00      (3.5%)       47.89      (3.6%)   -0.2% (  -7% -    7%) 0.836
           HighTermDayOfYearSort      812.55      (2.8%)      811.49      (2.5%)   -0.1% (  -5% -    5%) 0.878
                         MedTerm     2390.59      (3.5%)     2387.70      (3.5%)   -0.1% (  -6% -    7%) 0.912
            BrowseDateTaxoFacets       25.05      (9.2%)       25.03      (9.3%)   -0.1% ( -16% -   20%) 0.984
           BrowseMonthSSDVFacets       16.00     (18.9%)       15.99     (19.0%)   -0.1% ( -31% -   46%) 0.992
             LowIntervalsOrdered      276.77      (3.5%)      276.66      (3.6%)   -0.0% (  -6% -    7%) 0.972
                      OrHighHigh       26.41      (4.4%)       26.40      (4.4%)   -0.0% (  -8% -    9%) 0.978
                     MedSpanNear       35.38      (1.0%)       35.38      (1.1%)   -0.0% (  -2% -    2%) 0.969
                      TermDTSort      648.08      (1.0%)      648.35      (1.3%)    0.0% (  -2% -    2%) 0.909
       BrowseDayOfYearTaxoFacets       25.12      (9.5%)       25.13      (9.5%)    0.0% ( -17% -   21%) 0.988
                    OrNotHighLow     1192.74      (3.1%)     1193.69      (2.6%)    0.1% (  -5% -    5%) 0.929
            HighIntervalsOrdered        1.79      (3.3%)        1.79      (3.3%)    0.1% (  -6% -    6%) 0.934
                         Respell       46.93      (1.3%)       46.97      (1.3%)    0.1% (  -2% -    2%) 0.825
                     LowSpanNear       11.96      (0.8%)       11.97      (0.9%)    0.1% (  -1% -    1%) 0.702
     BrowseRandomLabelSSDVFacets       10.14      (9.0%)       10.16      (9.4%)    0.1% ( -16% -   20%) 0.961
                       LowPhrase      438.44      (1.6%)      439.23      (1.4%)    0.2% (  -2% -    3%) 0.701
        AndHighHighDayTaxoFacets       28.52      (1.9%)       28.57      (1.8%)    0.2% (  -3% -    3%) 0.746
             MedIntervalsOrdered       64.28      (2.3%)       64.42      (2.3%)    0.2% (  -4% -    4%) 0.777
                    OrHighNotLow     1854.91      (2.7%)     1858.96      (3.9%)    0.2% (  -6% -    6%) 0.836
                    HighSpanNear       12.65      (1.4%)       12.69      (1.7%)    0.3% (  -2% -    3%) 0.579
          OrHighMedDayTaxoFacets       14.34      (3.0%)       14.38      (4.3%)    0.3% (  -6% -    7%) 0.809
                          Fuzzy1       81.66      (0.9%)       81.90      (1.1%)    0.3% (  -1% -    2%) 0.357
       BrowseDayOfYearSSDVFacets       13.40     (12.7%)       13.44     (12.5%)    0.3% ( -22% -   29%) 0.942
         AndHighMedDayTaxoFacets       34.33      (2.1%)       34.45      (1.7%)    0.3% (  -3% -    4%) 0.569
                          Fuzzy2       76.36      (0.8%)       76.64      (0.9%)    0.4% (  -1% -    2%) 0.187
            HighTermTitleBDVSort       12.49      (6.6%)       12.55      (6.6%)    0.4% ( -11% -   14%) 0.846
                        PKLookup      179.34      (2.2%)      180.48      (2.3%)    0.6% (  -3% -    5%) 0.364
                      AndHighMed      268.67      (3.8%)      270.48      (3.9%)    0.7% (  -6% -    8%) 0.579
                    OrNotHighMed     1006.62      (4.1%)     1013.59      (4.3%)    0.7% (  -7% -    9%) 0.604
                        Wildcard      133.41      (2.7%)      134.34      (2.7%)    0.7% (  -4% -    6%) 0.417
                     AndHighHigh       74.34      (4.7%)       74.95      (5.2%)    0.8% (  -8% -   11%) 0.608
                          IntNRQ       82.56      (3.5%)       83.70      (4.9%)    1.4% (  -6% -   10%) 0.306

Given this, I'd prefer to move forward with the change (since it doesn't seem to be doing any harm to common cases covered in luceneutil and is helping in the filtering use-cases I'm benchmarking), but I'd like your opinion. Given how central of a component this is to scoring, I want to make sure I'm not being overly biased towards the use-case I'm focused on at the cost of regressing elsewhere. Thanks again for your inputs!

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
Copy link
Copy Markdown
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very sorry for the lag, I'm terrible at keeping track of PRs. The explanation of when this change helps and the change itself look good to me.

@github-actions github-actions bot removed the Stale label Jan 9, 2024
@github-actions
Copy link
Copy Markdown
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 24, 2024
@github-actions github-actions bot removed the Stale label Oct 2, 2025
@github-actions
Copy link
Copy Markdown
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Oct 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants