Speed up conjunctive queries that need scores. by jpountz · Pull Request #14690 · apache/lucene

jpountz · 2025-05-20T21:09:30Z

Calls to DocIdSetIterator#nextDoc, DocIdSetIterator#advance and
SimScorer#score are currently interleaved and include lots of conditionals.
This builds on #14679 and refactors the code a bit to make it eligible for
auto-vectorization and better pipelining.

This effectively speeds up conjunctive queries (e.g. AndHighHigh) but also
disjunctive queries that run as conjunctive queries in practice (e.g.
OrHighHigh).
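
To illustrate the idea only (this is not the code from this PR), here is a minimal sketch that separates a conjunction into two phases: first intersect the postings and copy the matching term frequencies into dense arrays, then score the whole batch in a tight loop with no conditionals, which is the shape of loop a JIT can auto-vectorize. The class name `BatchedConjunctionSketch`, its methods, and the toy saturation formula `tf / (tf + 1)` are assumptions made for the example; Lucene's real scorers and similarities are more involved.

```java
import java.util.Arrays;

// Hypothetical illustration; none of these names exist in Lucene.
final class BatchedConjunctionSketch {

  /**
   * Phase 1: intersect two sorted doc ID lists. For every doc ID present in
   * both lists, record the doc ID and both term frequencies in dense output
   * arrays. Returns the number of matches.
   */
  static int intersect(int[] docsA, int[] freqsA, int[] docsB, int[] freqsB,
                       int[] outDocs, int[] outFreqsA, int[] outFreqsB) {
    int i = 0, j = 0, n = 0;
    while (i < docsA.length && j < docsB.length) {
      if (docsA[i] < docsB[j]) {
        i++;
      } else if (docsA[i] > docsB[j]) {
        j++;
      } else {
        outDocs[n] = docsA[i];
        outFreqsA[n] = freqsA[i];
        outFreqsB[n] = freqsB[j];
        n++;
        i++;
        j++;
      }
    }
    return n;
  }

  /**
   * Phase 2: score all matches at once. The loop body has no branches and
   * only touches dense arrays. The formula is a toy stand-in for a real
   * similarity (assumption).
   */
  static void scoreBatch(int[] freqsA, int[] freqsB, int n, float[] scores) {
    for (int k = 0; k < n; k++) {
      float tfA = freqsA[k];
      float tfB = freqsB[k];
      scores[k] = tfA / (tfA + 1f) + tfB / (tfB + 1f);
    }
  }

  public static void main(String[] args) {
    int[] docsA = {1, 3, 5, 8}, freqsA = {1, 2, 3, 1};
    int[] docsB = {3, 4, 8, 9}, freqsB = {1, 1, 3, 2};
    int[] docs = new int[4], fA = new int[4], fB = new int[4];
    float[] scores = new float[4];
    int n = intersect(docsA, freqsA, docsB, freqsB, docs, fA, fB);
    scoreBatch(fA, fB, n, scores);
    System.out.println(Arrays.toString(Arrays.copyOf(docs, n)));
    System.out.println(Arrays.toString(Arrays.copyOf(scores, n)));
  }
}
```

The interleaved alternative would compute each score inside the `else` branch of the intersection loop, which keeps the scoring arithmetic entangled with data-dependent branches.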

Note that this builds on #14679; only the last commit touches conjunctive queries. I will clean up this PR when #14679 is merged, but I wanted to show the benefits for conjunctive queries as well. Note that, unlike #14679, this change helps when dynamic pruning kicks in.

In the luceneutil run below on wikibigall, the baseline is #14679 and the modified version is this PR:

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
             FilteredAndHighHigh       65.84      (1.7%)       58.24      (4.7%)  -11.5% ( -17% -   -5%) 0.000
     FilteredAnd2Terms2StopWords      178.39      (1.4%)      160.44      (3.1%)  -10.1% ( -14% -   -5%) 0.000
            FilteredAndStopWords       45.13      (1.6%)       40.62      (3.7%)  -10.0% ( -15% -   -4%) 0.000
               FilteredAnd3Terms      180.48      (1.7%)      164.45      (3.5%)   -8.9% ( -13% -   -3%) 0.000
              FilteredAndHighMed      126.06      (2.3%)      122.18      (5.8%)   -3.1% ( -10% -    5%) 0.100
                            Term      443.74      (5.4%)      434.91      (4.0%)   -2.0% ( -10% -    7%) 0.325
                         Prefix3      162.93      (3.6%)      160.08      (3.8%)   -1.7% (  -8% -    5%) 0.269
                        Wildcard       93.80      (2.7%)       92.21      (2.2%)   -1.7% (  -6% -    3%) 0.108
                 FilteredPrefix3      153.82      (3.3%)      151.33      (3.4%)   -1.6% (  -8% -    5%) 0.255
             CombinedAndHighHigh       11.56      (2.4%)       11.38      (1.8%)   -1.5% (  -5% -    2%) 0.088
             CountFilteredPhrase       25.61      (0.7%)       25.29      (1.6%)   -1.2% (  -3% -    0%) 0.014
               TermDayOfYearSort      288.56      (1.8%)      285.55      (3.8%)   -1.0% (  -6% -    4%) 0.408
             CountFilteredOrMany       27.66      (1.4%)       27.37      (2.1%)   -1.0% (  -4% -    2%) 0.176
                    FilteredTerm      160.14      (2.2%)      158.48      (2.3%)   -1.0% (  -5% -    3%) 0.280
                CountAndHighHigh      363.16      (2.1%)      359.46      (3.1%)   -1.0% (  -6% -    4%) 0.365
                     CountOrMany       30.58      (2.0%)       30.29      (3.3%)   -1.0% (  -6% -    4%) 0.400
                 CountOrHighHigh      350.61      (1.7%)      347.24      (3.5%)   -1.0% (  -6% -    4%) 0.412
                    CombinedTerm       31.02      (3.2%)       30.77      (4.5%)   -0.8% (  -8% -    7%) 0.623
                  CountOrHighMed      368.27      (1.7%)      365.32      (2.3%)   -0.8% (  -4% -    3%) 0.346
               FilteredOrHighMed      153.81      (1.1%)      152.60      (1.0%)   -0.8% (  -2% -    1%) 0.082
                       CountTerm     8469.33      (4.2%)     8402.93      (4.6%)   -0.8% (  -9% -    8%) 0.676
                IntervalsOrdered        2.27      (2.9%)        2.25      (2.8%)   -0.8% (  -6% -    5%) 0.525
         CountFilteredOrHighHigh      137.81      (0.8%)      136.75      (1.2%)   -0.8% (  -2% -    1%) 0.088
                 AndHighOrMedMed       46.40      (1.7%)       46.05      (1.6%)   -0.8% (  -4% -    2%) 0.283
              FilteredOrHighHigh       67.91      (2.0%)       67.44      (1.4%)   -0.7% (  -4% -    2%) 0.348
                      DismaxTerm      485.58      (3.1%)      482.44      (3.1%)   -0.6% (  -6% -    5%) 0.621
      FilteredOr2Terms2StopWords      148.50      (1.3%)      147.59      (0.8%)   -0.6% (  -2% -    1%) 0.192
                  FilteredPhrase       32.88      (1.3%)       32.68      (1.4%)   -0.6% (  -3% -    2%) 0.285
                          Fuzzy2       86.28      (2.2%)       85.78      (2.4%)   -0.6% (  -5% -    4%) 0.554
                 CountAndHighMed      312.28      (2.3%)      310.49      (1.6%)   -0.6% (  -4% -    3%) 0.493
                          Fuzzy1      103.00      (2.0%)      102.41      (2.4%)   -0.6% (  -4% -    3%) 0.544
                  FilteredOrMany       16.58      (1.0%)       16.49      (1.7%)   -0.5% (  -3% -    2%) 0.387
          CountFilteredOrHighMed      149.15      (0.6%)      148.43      (0.9%)   -0.5% (  -2% -    1%) 0.156
                          Phrase       14.41      (2.6%)       14.35      (2.6%)   -0.4% (  -5% -    4%) 0.682
                  FilteredIntNRQ      301.78      (0.8%)      300.57      (1.0%)   -0.4% (  -2% -    1%) 0.308
                      TermDTSort      391.99      (3.5%)      390.51      (4.4%)   -0.4% (  -8% -    7%) 0.823
                FilteredOr3Terms      167.33      (1.1%)      166.79      (0.8%)   -0.3% (  -2% -    1%) 0.431
             FilteredOrStopWords       46.47      (2.6%)       46.33      (1.6%)   -0.3% (  -4% -    3%) 0.751
                          IntNRQ      307.66      (0.8%)      307.16      (1.0%)   -0.2% (  -1% -    1%) 0.677
              CombinedOrHighHigh       18.97      (3.0%)       18.95      (1.7%)   -0.1% (  -4% -    4%) 0.902
                   TermTitleSort       88.05      (6.3%)       87.94      (5.0%)   -0.1% ( -10% -   11%) 0.961
                     CountPhrase        4.12      (1.5%)        4.12      (3.0%)    0.1% (  -4% -    4%) 0.899
                   TermMonthSort     3184.22      (2.0%)     3190.71      (2.8%)    0.2% (  -4% -    5%) 0.846
                      OrHighRare      281.18      (7.6%)      281.99      (6.4%)    0.3% ( -12% -   15%) 0.923
                        PKLookup      320.85      (4.9%)      322.76      (3.8%)    0.6% (  -7% -    9%) 0.750
               CombinedOrHighMed       72.21      (3.0%)       73.11      (1.8%)    1.2% (  -3% -    6%) 0.235
                          OrMany       20.50      (5.1%)       20.92      (2.3%)    2.1% (  -5% -    9%) 0.218
              CombinedAndHighMed       38.96      (1.9%)       40.24      (1.9%)    3.3% (   0% -    7%) 0.000
                 DismaxOrHighMed      169.36      (4.7%)      184.90      (2.2%)    9.2% (   2% -   16%) 0.000
                DismaxOrHighHigh      115.69      (5.0%)      127.23      (3.0%)   10.0% (   1% -   18%) 0.000
                       And3Terms      171.52      (4.0%)      206.29      (7.0%)   20.3% (   8% -   32%) 0.000
             And2Terms2StopWords      161.46      (3.6%)      195.68      (5.2%)   21.2% (  12% -   31%) 0.000
                AndMedOrHighHigh       65.34      (2.5%)       79.36      (1.6%)   21.5% (  16% -   26%) 0.000
                        Or3Terms      164.69      (5.1%)      200.39      (5.5%)   21.7% (  10% -   33%) 0.000
              Or2Terms2StopWords      157.62      (5.2%)      194.06      (4.7%)   23.1% (  12% -   34%) 0.000
                       OrHighMed      186.05      (7.2%)      246.49      (3.1%)   32.5% (  20% -   46%) 0.000
                    AndStopWords       29.56      (5.7%)       39.37     (10.9%)   33.2% (  15% -   52%) 0.000
                     OrStopWords       31.83      (8.0%)       44.28     (10.7%)   39.1% (  18% -   62%) 0.000
                      AndHighMed      132.87      (3.3%)      185.85      (1.8%)   39.9% (  33% -   46%) 0.000
                      OrHighHigh       49.97      (7.6%)       71.15      (3.7%)   42.4% (  28% -   58%) 0.000
                     AndHighHigh       42.25      (3.4%)       60.18      (3.2%)   42.4% (  34% -   50%) 0.000
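
As a reading aid for the table, the sketch below recomputes the `Pct diff` column from the two QPS columns, assuming it is the relative change in mean QPS against the baseline (an assumption that matches the rows shown; the bracketed range and the p-value are not reproduced here). The class and method names are hypothetical.

```java
// Hypothetical helper, not part of luceneutil.
final class PctDiffSketch {

  /** Relative change of the candidate QPS against the baseline QPS, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }

  public static void main(String[] args) {
    // QPS values taken from the AndHighHigh and FilteredAndHighHigh rows.
    System.out.printf("AndHighHigh: %.1f%%%n", pctDiff(42.25, 60.18));
    System.out.printf("FilteredAndHighHigh: %.1f%%%n", pctDiff(65.84, 58.24));
  }
}
```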

This change helps speed up exhaustive evaluation of term queries, i.e. calling `DocIdSetIterator#nextDoc()` then `Scorer#score()` in a loop. It helps in two ways:
- Iteration of matching doc IDs gets a bit more efficient, especially in the case when a block of postings is encoded as a bit set.
- Computation of scores now gets (auto-)vectorized.

While this change doesn't help much when dynamic pruning kicks in, I'm hopeful that we can improve this in the future.
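
The two points above can be sketched roughly as follows, under stated assumptions: a block of postings is represented as a `long[]` bit set relative to a base doc ID, its set bits are expanded into a doc ID array in one pass, and scores are then computed over a dense frequency array in a branch-free loop. `BitSetBlockSketch`, its method names, and the toy formula `tf / (tf + 1)` are hypothetical and do not reflect Lucene's actual postings format or similarity code.

```java
import java.util.Arrays;

// Hypothetical illustration; none of these names exist in Lucene.
final class BitSetBlockSketch {

  /**
   * Expands the set bits of a bit-set-encoded block into doc IDs.
   * Bit b of word w stands for doc ID base + 64 * w + b.
   * Returns the number of doc IDs written to docs.
   */
  static int bitsToDocs(long[] bits, int base, int[] docs) {
    int n = 0;
    for (int w = 0; w < bits.length; w++) {
      long word = bits[w];
      while (word != 0) {
        int bit = Long.numberOfTrailingZeros(word);
        docs[n++] = base + (w << 6) + bit;
        word &= word - 1; // clear the lowest set bit
      }
    }
    return n;
  }

  /** Scores a batch of term frequencies; the formula is a toy stand-in (assumption). */
  static void scoreAll(int[] freqs, int n, float[] scores) {
    for (int i = 0; i < n; i++) {
      float tf = freqs[i];
      scores[i] = tf / (tf + 1f);
    }
  }

  public static void main(String[] args) {
    long[] bits = {0b1010L, 1L};
    int[] docs = new int[128];
    int n = bitsToDocs(bits, 100, docs);
    System.out.println(Arrays.toString(Arrays.copyOf(docs, n)));
  }
}
```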

Calls to `DocIdSetIterator#nextDoc`, `DocIdSetIterator#advance` and `SimScorer#score` are currently interleaved and include lots of conditionals. This builds on apache#14679 and refactors the code a bit to make it eligible for auto-vectorization and better pipelining. This effectively speeds up conjunctive queries (e.g. `AndHighHigh`) but also disjunctive queries that run as conjunctive queries in practice (e.g. `OrHighHigh`).

jpountz · 2025-05-22T11:51:02Z

I'm superseding this change with a more general one for now, which doesn't introduce new public APIs: #14701. We can look into taking ideas from this PR as follow-ups.

HUSTERGS · 2025-07-19T04:14:38Z

I created a new PR, #14968, based on this one. Currently the core logic is nearly identical to this PR, but I'm planning to dig deeper into this approach. Hope you don't mind. : )

jpountz · 2025-07-19T08:05:09Z

I don't mind at all. :)

jpountz added 6 commits May 16, 2025 13:26

Simplify.

5b2bb4d

CHANGES

aa24297

Fix name.

fd97a02

Improve docs.

3833804

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking May 20, 2025

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking May 20, 2025

github-actions bot added module:core/index module:core/search module:core/codecs module:test-framework labels May 20, 2025

jpountz closed this May 22, 2025

github-project-automation bot moved this from Open to Closed in OpenSearch Lucene & Core Performance Tracking May 22, 2025

HUSTERGS mentioned this pull request Jul 19, 2025

Brings back Scorer#applyAsRequiredClause #14968

Draft


Speed up conjunctive queries that need scores. #14690

jpountz wants to merge 6 commits into apache:main from jpountz:vectorized_conjunctive_queries


2 participants
