GH#11922: Allow DisjunctionDISIApproximation to short-circuit by gsmiller · Pull Request #11928 · apache/lucene

gsmiller · 2022-11-14T17:03:35Z

Description

This PR explores allowing DisjunctionDISIApproximation to short-circuit after "finding" a doc, allowing sub-iterators to only be exhaustively advanced when necessary. This could be useful for non-scoring disjunction clauses.

Note: I've only changed the approximation implementation to start. DisjunctionScorer exhaustively advances all sub-iterators in this implementation if it needs to do two-phase match confirmation (which is what we do today). Ideally, we'd only advance as necessary in this step as well.

gsmiller · 2022-11-14T17:04:38Z

lucene/MIGRATE.md


 See LUCENE-10558 for more details and workarounds.

+### DisjunctionDISIApproximation behavior change


I currently have this marked as a 10.0 release change with a migration guide since DisjunctionDISIApproximation is a public class and this behavior change could impact users. Then again, it is marked as @lucene.internal, so maybe we can be more aggressive and release with a 9.x change?

Yes we can be more aggressive, it's only public so that it can be reused for CombinedFieldsQuery.

OK, yeah- makes sense. I'll move to the 9.5 section and remove the MIGRATE entry.

jpountz · 2022-11-15T14:09:17Z

Have you checked if this actually made things faster? We're saving calls to advance in some cases but also adding conditions to some very tight loops. E.g. we recently got speedups by moving some queries from WAND to MAXSCORE even though WAND is better at saving calls to advance compared to MAXSCORE.

gsmiller · 2022-11-15T19:44:17Z

Have you checked if this actually made things faster? We're saving calls to advance in some cases but also adding conditions to some very tight loops.

I'd run some of our internal benchmarks with this change, and it did show a positive impact (reduced cost and latency). I did a quick run with luceneutil benchmark tasks, and it looked pretty flat. These were both kind of hastily run though, and I was guessing that the reason we saw an improvement internally is because we don't rely on scoring, so could take advantage (while I was guessing that the luceneutil tasks do scoring, but I didn't dig in yet).

It's a really good callout about the condition check in the tight loop. That hadn't occurred to me, so thank you for mentioning it. For the common case of requiring scores, this change seems like it would just cause a net increase to cost, so I don't love that.

As a next step, I'll dig into benchmarking a bit more. If it looks like there is a clear positive impact still to the non-scoring case, I'd like to see if there's a way to avoid the added condition-checking cost for the common scoring case (since all the sub-iterators will need to get advanced anyway).

Thanks for the early feedback @jpountz !

jpountz

I left some minor suggestions but the change looks correct to me.

jpountz · 2022-11-16T09:09:39Z

lucene/core/src/java/org/apache/lucene/search/DisjunctionDISIApproximation.java

+  public int nextDoc() throws IOException {
+    if (docID == DocIdSetIterator.NO_MORE_DOCS) {
+      return docID;
+    }


This is not necessary, it's illegal to call nextDoc on an exhausted DocIdSetIterator.

Right, good point. Will simplify.

jpountz · 2022-11-16T09:11:40Z

lucene/core/src/java/org/apache/lucene/search/DisjunctionDISIApproximation.java

+    if (target == DocIdSetIterator.NO_MORE_DOCS) {
+      docID = DocIdSetIterator.NO_MORE_DOCS;
+      return docID;
+    }


In general I'm dubious about such checks for NO_MORE_DOCS since we need to do this check on every candidate match but this condition is true at most once per segment, and often never. So it looks like extra overhead that doesn't bring significant value, and could prevent e.g. inlining due to increased bytecode size.

That's fair. I removed the check.

gsmiller · 2022-11-18T18:02:48Z

@jpountz thanks for the implementation feedback! I've updated the PR, but still plan to do more benchmarking to really understand the benefit, etc. before looking to actually merge this. I'll follow up here once I've been able to do that.

gsmiller · 2022-11-29T20:53:43Z

@jpountz I re-ran some internal benchmarking with this change to highlight the speedup in cases where scoring isn't needed (at least some specific use-cases I'm looking at). These use-cases all involve a "disjunction filter," meaning a disjunction of terms that is used as a required clause. So something like (+ (foo:bar foo:baz foo:zed) (...)), where the foo field must take on one of the specified values to be considered a candidate match. To provide a sense of scale, on average, these filters have 40 different terms in them. Since these "filters" don't participate in scoring at all, it's a good candidate for this short-circuiting.

In these benchmarks, I'm observing a 2.3% QPS improvement, and a 3.5% avg. latency reduction (5.9% p50 reduction / 3.5% p99 reduction). So the change appears to be helping this type of situation.

As for whether-or-not this change would actually hurt other common use-cases that require scoring or second-phase checks, I re-ran luceneutil benchmarks (wikimedium10m) task and don't observe any regressions there (results below). It's possible there's a gap in our benchmarks though, and maybe there are some common use-cases not covered?

                            TaskQPS baseline      StdDevQPS candidate      StdDev                Pct diff p-value
                 MedSloppyPhrase      115.00      (4.9%)      112.91      (5.1%)   -1.8% ( -11% -    8%) 0.249
               HighTermTitleSort      242.48      (3.2%)      238.66      (4.3%)   -1.6% (  -8% -    6%) 0.189
                HighSloppyPhrase       36.47      (3.8%)       36.12      (3.9%)   -0.9% (  -8% -    7%) 0.439
                         LowTerm     1766.82      (3.4%)     1752.12      (3.3%)   -0.8% (  -7% -    6%) 0.436
                      HighPhrase      263.74      (3.4%)      261.71      (2.3%)   -0.8% (  -6% -    5%) 0.404
                       OrHighLow      796.71      (2.7%)      790.71      (2.6%)   -0.8% (  -5% -    4%) 0.367
            BrowseDateSSDVFacets        3.46      (6.4%)        3.44      (7.1%)   -0.7% ( -13% -   13%) 0.755
               HighTermMonthSort     3070.86      (4.5%)     3051.39      (3.9%)   -0.6% (  -8% -    8%) 0.635
                         Prefix3      111.76      (4.6%)      111.08      (4.2%)   -0.6% (  -8% -    8%) 0.658
                   OrNotHighHigh     1249.27      (3.4%)     1242.30      (3.8%)   -0.6% (  -7% -    6%) 0.627
           BrowseMonthTaxoFacets       35.43      (1.6%)       35.23      (2.2%)   -0.6% (  -4% -    3%) 0.367
                 LowSloppyPhrase       61.49      (2.4%)       61.18      (2.5%)   -0.5% (  -5% -    4%) 0.512
                    OrHighNotMed     1139.05      (3.9%)     1133.29      (3.6%)   -0.5% (  -7% -    7%) 0.671
     BrowseRandomLabelTaxoFacets       20.33      (4.5%)       20.23      (5.6%)   -0.5% ( -10% -   10%) 0.760
                        HighTerm     1635.45      (4.2%)     1628.28      (3.9%)   -0.4% (  -8% -    8%) 0.735
                       MedPhrase       46.41      (2.3%)       46.22      (1.6%)   -0.4% (  -4% -    3%) 0.529
                       OrHighMed      193.55      (2.8%)      192.79      (2.9%)   -0.4% (  -5% -    5%) 0.663
                   OrHighNotHigh      865.20      (3.0%)      862.38      (3.9%)   -0.3% (  -6% -    6%) 0.766
                      AndHighLow     1566.83      (2.7%)     1562.47      (2.7%)   -0.3% (  -5% -    5%) 0.745
            MedTermDayTaxoFacets       48.00      (3.5%)       47.89      (3.6%)   -0.2% (  -7% -    7%) 0.836
           HighTermDayOfYearSort      812.55      (2.8%)      811.49      (2.5%)   -0.1% (  -5% -    5%) 0.878
                         MedTerm     2390.59      (3.5%)     2387.70      (3.5%)   -0.1% (  -6% -    7%) 0.912
            BrowseDateTaxoFacets       25.05      (9.2%)       25.03      (9.3%)   -0.1% ( -16% -   20%) 0.984
           BrowseMonthSSDVFacets       16.00     (18.9%)       15.99     (19.0%)   -0.1% ( -31% -   46%) 0.992
             LowIntervalsOrdered      276.77      (3.5%)      276.66      (3.6%)   -0.0% (  -6% -    7%) 0.972
                      OrHighHigh       26.41      (4.4%)       26.40      (4.4%)   -0.0% (  -8% -    9%) 0.978
                     MedSpanNear       35.38      (1.0%)       35.38      (1.1%)   -0.0% (  -2% -    2%) 0.969
                      TermDTSort      648.08      (1.0%)      648.35      (1.3%)    0.0% (  -2% -    2%) 0.909
       BrowseDayOfYearTaxoFacets       25.12      (9.5%)       25.13      (9.5%)    0.0% ( -17% -   21%) 0.988
                    OrNotHighLow     1192.74      (3.1%)     1193.69      (2.6%)    0.1% (  -5% -    5%) 0.929
            HighIntervalsOrdered        1.79      (3.3%)        1.79      (3.3%)    0.1% (  -6% -    6%) 0.934
                         Respell       46.93      (1.3%)       46.97      (1.3%)    0.1% (  -2% -    2%) 0.825
                     LowSpanNear       11.96      (0.8%)       11.97      (0.9%)    0.1% (  -1% -    1%) 0.702
     BrowseRandomLabelSSDVFacets       10.14      (9.0%)       10.16      (9.4%)    0.1% ( -16% -   20%) 0.961
                       LowPhrase      438.44      (1.6%)      439.23      (1.4%)    0.2% (  -2% -    3%) 0.701
        AndHighHighDayTaxoFacets       28.52      (1.9%)       28.57      (1.8%)    0.2% (  -3% -    3%) 0.746
             MedIntervalsOrdered       64.28      (2.3%)       64.42      (2.3%)    0.2% (  -4% -    4%) 0.777
                    OrHighNotLow     1854.91      (2.7%)     1858.96      (3.9%)    0.2% (  -6% -    6%) 0.836
                    HighSpanNear       12.65      (1.4%)       12.69      (1.7%)    0.3% (  -2% -    3%) 0.579
          OrHighMedDayTaxoFacets       14.34      (3.0%)       14.38      (4.3%)    0.3% (  -6% -    7%) 0.809
                          Fuzzy1       81.66      (0.9%)       81.90      (1.1%)    0.3% (  -1% -    2%) 0.357
       BrowseDayOfYearSSDVFacets       13.40     (12.7%)       13.44     (12.5%)    0.3% ( -22% -   29%) 0.942
         AndHighMedDayTaxoFacets       34.33      (2.1%)       34.45      (1.7%)    0.3% (  -3% -    4%) 0.569
                          Fuzzy2       76.36      (0.8%)       76.64      (0.9%)    0.4% (  -1% -    2%) 0.187
            HighTermTitleBDVSort       12.49      (6.6%)       12.55      (6.6%)    0.4% ( -11% -   14%) 0.846
                        PKLookup      179.34      (2.2%)      180.48      (2.3%)    0.6% (  -3% -    5%) 0.364
                      AndHighMed      268.67      (3.8%)      270.48      (3.9%)    0.7% (  -6% -    8%) 0.579
                    OrNotHighMed     1006.62      (4.1%)     1013.59      (4.3%)    0.7% (  -7% -    9%) 0.604
                        Wildcard      133.41      (2.7%)      134.34      (2.7%)    0.7% (  -4% -    6%) 0.417
                     AndHighHigh       74.34      (4.7%)       74.95      (5.2%)    0.8% (  -8% -   11%) 0.608
                          IntNRQ       82.56      (3.5%)       83.70      (4.9%)    1.4% (  -6% -   10%) 0.306

Given this, I'd prefer to move forward with the change (since it doesn't seem to be doing any harm to common cases covered in luceneutil and is helping in the filtering use-cases I'm benchmarking), but I'd like your opinion. Given how central of a component this is to scoring, I want to make sure I'm not being overly biased towards the use-case I'm focused on at the cost of regressing elsewhere. Thanks again for your inputs!

github-actions · 2024-01-08T12:24:53Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

jpountz

Very sorry for the lag, I'm terrible at keeping track of PRs. The explanation of when this change helps and the change itself look good to me.

github-actions · 2024-01-24T00:14:41Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions · 2025-10-17T00:27:49Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

gsmiller added 2 commits November 14, 2022 08:44

GH#11922: Allow DisjunctionDISIApproximation to short-circuit

4757f52

changes and migrate

6b2d5c6

gsmiller commented Nov 14, 2022

View reviewed changes

gsmiller mentioned this pull request Nov 14, 2022

DisjunctionDISIApproximation / DisjunctionScorer can short-circuit #11922

Open

jpountz reviewed Nov 16, 2022

View reviewed changes

pr feedback

420330a

gsmiller mentioned this pull request Jan 2, 2023

Better skipping for multi-term queries with a FILTER rewrite. #12055

Merged

github-actions bot added the Stale label Jan 8, 2024

jpountz approved these changes Jan 8, 2024

View reviewed changes

github-actions bot removed the Stale label Jan 9, 2024

github-actions bot added the Stale label Jan 24, 2024

github-actions bot removed the Stale label Oct 2, 2025

github-actions bot added the Stale label Oct 17, 2025


		See LUCENE-10558 for more details and workarounds.

		### DisjunctionDISIApproximation behavior change

Conversation

gsmiller commented Nov 14, 2022

Description

Uh oh!

gsmiller Nov 14, 2022

Choose a reason for hiding this comment

Uh oh!

jpountz Nov 15, 2022

Choose a reason for hiding this comment

Uh oh!

gsmiller Nov 18, 2022

Choose a reason for hiding this comment

Uh oh!

jpountz commented Nov 15, 2022

Uh oh!

gsmiller commented Nov 15, 2022

Uh oh!

jpountz left a comment

Choose a reason for hiding this comment

Uh oh!

jpountz Nov 16, 2022

Choose a reason for hiding this comment

Uh oh!

gsmiller Nov 18, 2022

Choose a reason for hiding this comment

Uh oh!

jpountz Nov 16, 2022

Choose a reason for hiding this comment

Uh oh!

gsmiller Nov 18, 2022

Choose a reason for hiding this comment

Uh oh!

gsmiller commented Nov 18, 2022

Uh oh!

gsmiller commented Nov 29, 2022

Uh oh!

github-actions bot commented Jan 8, 2024

Uh oh!

jpountz left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Jan 24, 2024

Uh oh!

github-actions bot commented Oct 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants