Bump the window size of disjunctions from 2,048 to 4,096. #13605
Merged
jpountz merged 1 commit into apache:main on Jul 25, 2024
It's been pointed out multiple times that one difference between Tantivy and
Lucene is that Tantivy uses windows of 4,096 docs while Lucene uses windows
half that size, 2,048 docs, and that this might explain part of the
performance difference. luceneutil suggests that bumping the window size to
4,096 does indeed improve performance for counting queries, but not for top-k
queries. I'm still suggesting to bump the window size across the board to keep
our disjunction scorer consistent.
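To make the notion of a window concrete, here is a minimal sketch of counting the matches of a disjunction one window of doc IDs at a time. It is an illustration under stated assumptions, not Lucene's or Tantivy's actual code: the class `WindowedDisjunctionCount`, the method `count`, and the representation of each clause as a sorted `int[]` of doc IDs (all smaller than `maxDoc`) are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of window-based disjunction counting; not Lucene's actual code. */
public class WindowedDisjunctionCount {

  /**
   * Counts distinct doc IDs across sorted postings lists, one window at a time.
   * Assumes every list is sorted ascending and contains only IDs below maxDoc.
   */
  static int count(int[][] postings, int maxDoc, int windowSize) {
    long[] bits = new long[(windowSize + 63) >>> 6]; // one bit per doc in the window
    int[] pos = new int[postings.length];            // read cursor per clause
    int total = 0;
    for (int base = 0; base < maxDoc; base += windowSize) {
      int end = Math.min(maxDoc, base + windowSize);
      // Each clause marks its matches that fall inside [base, end).
      for (int i = 0; i < postings.length; i++) {
        int[] docs = postings[i];
        while (pos[i] < docs.length && docs[pos[i]] < end) {
          int offset = docs[pos[i]++] - base;
          bits[offset >>> 6] |= 1L << (offset & 63);
        }
      }
      // Counting a window is a popcount over the bitset; then clear it for reuse.
      for (long word : bits) {
        total += Long.bitCount(word);
      }
      Arrays.fill(bits, 0L);
    }
    return total;
  }

  public static void main(String[] args) {
    int[][] postings = {{1, 5, 4000, 5000}, {5, 3000, 5000, 9000}};
    System.out.println(count(postings, 10_000, 2_048)); // 6 distinct docs
    System.out.println(count(postings, 10_000, 4_096)); // same result, fewer windows
  }
}
```

The result does not depend on the window size; only the number of windows changes. With 4,096-doc windows the outer loop runs half as many times as with 2,048-doc windows, so any fixed per-window overhead is paid half as often, which is one plausible reason a larger window could help counting queries.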
```
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
CountPhrase 3.27 (11.6%) 3.14 (8.0%) -4.1% ( -21% - 17%) 0.189
HighTermMonthSort 3521.28 (3.5%) 3481.74 (2.8%) -1.1% ( -7% - 5%) 0.262
PKLookup 289.42 (1.3%) 286.47 (2.2%) -1.0% ( -4% - 2%) 0.075
TermDTSort 352.01 (6.5%) 348.89 (5.6%) -0.9% ( -12% - 11%) 0.642
Phrase 11.85 (5.3%) 11.76 (5.0%) -0.8% ( -10% - 9%) 0.634
OrHighLow 772.82 (2.4%) 767.24 (2.1%) -0.7% ( -5% - 3%) 0.313
CountAndHighMed 120.78 (2.3%) 120.10 (2.5%) -0.6% ( -5% - 4%) 0.449
HighTermDayOfYearSort 821.48 (3.5%) 818.62 (2.7%) -0.3% ( -6% - 6%) 0.724
HighTermTitleSort 148.84 (2.9%) 148.33 (2.8%) -0.3% ( -5% - 5%) 0.700
AndHighHigh 62.36 (1.7%) 62.17 (1.8%) -0.3% ( -3% - 3%) 0.584
CountAndHighHigh 41.41 (2.5%) 41.34 (2.6%) -0.2% ( -5% - 5%) 0.836
Fuzzy1 96.24 (1.0%) 96.09 (1.2%) -0.2% ( -2% - 2%) 0.667
AndHighLow 827.59 (2.7%) 826.89 (2.4%) -0.1% ( -5% - 5%) 0.918
AndHighMed 93.35 (1.6%) 93.29 (1.7%) -0.1% ( -3% - 3%) 0.903
HighTermTitleBDVSort 16.30 (4.2%) 16.29 (6.7%) -0.0% ( -10% - 11%) 0.984
OrHighMed 153.42 (2.6%) 153.41 (2.2%) -0.0% ( -4% - 4%) 0.994
Respell 46.72 (1.3%) 46.72 (1.4%) 0.0% ( -2% - 2%) 0.975
And3Terms 155.73 (2.2%) 155.95 (1.4%) 0.1% ( -3% - 3%) 0.805
Fuzzy2 58.66 (0.9%) 58.77 (1.1%) 0.2% ( -1% - 2%) 0.566
OrHighHigh 75.70 (2.6%) 75.90 (2.3%) 0.3% ( -4% - 5%) 0.733
CountTerm 9110.00 (4.3%) 9142.10 (3.2%) 0.4% ( -6% - 8%) 0.768
AndStopWords 29.47 (2.6%) 29.57 (1.3%) 0.4% ( -3% - 4%) 0.579
And2Terms2StopWords 150.30 (2.1%) 150.86 (1.1%) 0.4% ( -2% - 3%) 0.487
OrHighRare 237.33 (5.7%) 238.26 (6.2%) 0.4% ( -10% - 13%) 0.837
MedTerm 553.55 (6.0%) 555.97 (7.7%) 0.4% ( -12% - 15%) 0.841
Wildcard 34.08 (3.2%) 34.25 (3.4%) 0.5% ( -5% - 7%) 0.630
OrNotHighLow 761.70 (3.2%) 766.33 (2.6%) 0.6% ( -5% - 6%) 0.511
Or2Terms2StopWords 156.10 (3.2%) 157.14 (1.8%) 0.7% ( -4% - 5%) 0.416
Or3Terms 156.59 (3.0%) 157.70 (1.9%) 0.7% ( -4% - 5%) 0.374
HighTerm 440.27 (5.6%) 443.89 (7.5%) 0.8% ( -11% - 14%) 0.695
LowTerm 892.27 (5.2%) 900.48 (6.8%) 0.9% ( -10% - 13%) 0.632
OrStopWords 31.88 (4.7%) 32.29 (2.6%) 1.3% ( -5% - 9%) 0.276
Prefix3 214.22 (3.4%) 217.48 (2.8%) 1.5% ( -4% - 8%) 0.124
OrHighNotHigh 247.52 (4.8%) 254.52 (5.1%) 2.8% ( -6% - 13%) 0.071
IntNRQ 144.53 (17.2%) 148.66 (17.9%) 2.9% ( -27% - 45%) 0.607
OrNotHighMed 330.23 (6.5%) 340.12 (5.4%) 3.0% ( -8% - 15%) 0.114
OrHighNotMed 285.11 (5.2%) 293.82 (6.2%) 3.1% ( -7% - 15%) 0.092
OrHighNotLow 429.94 (5.4%) 443.15 (6.8%) 3.1% ( -8% - 16%) 0.113
OrNotHighHigh 189.30 (5.9%) 195.25 (5.4%) 3.1% ( -7% - 15%) 0.079
CountOrHighMed 99.90 (22.5%) 121.78 (20.0%) 21.9% ( -16% - 83%) 0.001
CountOrHighHigh 53.76 (35.1%) 70.24 (32.5%) 30.6% ( -27% - 151%) 0.004
```
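For reading the table, the `Pct diff` column appears to be the relative change in mean QPS from the baseline to the modified version. The sketch below assumes that definition (it is consistent with the rows checked here, but it is not taken from luceneutil's source), and the class and method names are hypothetical. Because the table shows rounded QPS values, recomputing a row this way can differ from the printed figure in the last digit.

```java
import java.util.Locale;

/** Hypothetical helper: relative QPS change, assumed to match the "Pct diff" column. */
public class PctDiff {

  /** Returns the change from baseline to candidate as a percentage of the baseline. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }

  public static void main(String[] args) {
    // CountOrHighMed row: 99.90 -> 121.78 QPS
    System.out.println(String.format(Locale.ROOT, "%.1f%%", pctDiff(99.90, 121.78)));
    // OrHighHigh row: 75.70 -> 75.90 QPS
    System.out.println(String.format(Locale.ROOT, "%.1f%%", pctDiff(75.70, 75.90)));
  }
}
```

Only the two `CountOr*` rows combine a large difference with a small p-value (0.001 and 0.004); the differences for the other tasks are small relative to their standard deviations.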
jpountz added a commit that referenced this pull request on Jul 31, 2024