Conversation
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by creating a `DocIdStream` whose `count()` method counts the number of bits that are set in the bit set of matches in the current window, instead of naively iterating over all matches. On wikimedium10m, this yields a ~20% speedup when counting hits for the `title OR 12` query (2.9M hits). Relates apache#12358
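The core idea can be illustrated with a minimal sketch (hypothetical names, not the actual Lucene implementation): a stream backed by the bitset of matches in the current window, whose `count()` uses a popcount per 64-bit word instead of iterating matches one by one.

```java
// Sketch only: a DocIdStream-like abstraction backed by the window bitset.
abstract class DocIdStreamSketch {
    abstract int count();
}

final class BitSetDocIdStream extends DocIdStreamSketch {
    private final long[] matching; // one bit per doc ID in the current window

    BitSetDocIdStream(long[] matching) {
        this.matching = matching;
    }

    @Override
    int count() {
        int count = 0;
        for (long word : matching) {
            count += Long.bitCount(word); // popcount, no per-doc iteration
        }
        return count;
    }
}
```

This is why counting is so much cheaper than collecting: the stream never needs to materialize individual doc IDs at all.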
Note: this is just a proof of concept to discuss the idea of integrating at the collector level; more work is needed to add more tests and docs, and to integrate it into the test framework.
To me a big question with this API is whether we should consider methods on the
I added documentation and tests; it's ready for review. I settled on making consumption of
Wow! I'll try to review soon. Thanks @jpountz!
mikemccand left a comment
This looks great @jpountz! Thank you! It's wonderful to see cross fertilization / inspiration from the ongoing Tantivy <-> Lucene comparison resulting in optimizations like this. I'd love to see the other direction too (Tantivy porting over 2-phase iteration, or pulsing in terms dictionary or so).
Sorry for the slow review.
I'm trying to add count(...) to nightly benchmarks -- it's regolding now. I'd love to start benchmarking charts for these before we land this opto so we can fully appreciate / document the "pop" :)
I wonder what other queries could (later) benefit from DocIdStream bulk collection ...
```java
final Bucket[] buckets = new Bucket[SIZE];
// One bucket per doc ID in the window, non-null if scores are needed or if frequencies need to be
// counted
final Bucket[] buckets;
```
I wonder if switching this to parallel arrays, for maybe better CPU cache locality, would show any speedup (separate issue!). Or maybe "structs" (value objects) when Java finally gets them.
Though, the inlined matching bitset is sort of already a parallel array and maybe gets most of the gains.
It's been this way for a very (very very very) long time, but I agree it would probably perform better with parallel arrays!
LOL +1 to the extra very instances above!
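The parallel-arrays idea discussed above could be sketched as follows (names are illustrative, not the actual `BooleanScorer` fields): instead of one heap object per slot, keep one primitive array per field so a scan over a single field stays contiguous in memory.

```java
// Hypothetical sketch contrasting the two layouts.
final class Layouts {
    static final int SIZE = 2048; // one window of doc IDs

    // Array-of-objects layout: each slot is a separate heap object,
    // so a scan over scores chases one pointer per slot.
    static final class Bucket {
        double score;
        int freq;
    }
    final Bucket[] buckets = new Bucket[SIZE];

    // Parallel-arrays layout: primitives stay contiguous, which can be
    // friendlier to CPU caches when scanning a single field.
    final double[] scores = new double[SIZE];
    final int[] freqs = new int[SIZE];

    void collectParallel(int slot, double score) {
        scores[slot] += score;
        freqs[slot]++;
    }
}
```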
```java
@Override
public void forEach(CheckedIntConsumer<IOException> consumer) throws IOException {
  long[] matching = BooleanScorer.this.matching;
  Bucket[] buckets = BooleanScorer.this.buckets;
```
We should (later!) maybe rename Bucket to OneHit or DocHit or so, to make it clear it represents details of a single doc hit.
```java
}
for (int i = 0; i < buckets.length; i++) {
  buckets[i] = new Bucket();
if (needsScores || minShouldMatch > 1) {
```
Might this also optimize other cases, where we are using BooleanScorer in non-scoring cases (MUST_NOT or FILTER)? Or do we never use BooleanScorer for these clauses and it's really just the count API that we are accelerating here?
Yes, we also use BooleanScorer when there is a mix of SHOULD and MUST_NOT clauses. But not when there are FILTER clauses.
OK. Maybe if the FILTER clause is high enough cardinality, at some point BS1 becomes worth it. Restrictive (low cardinality) filters are where BS2 should win.
```java
if (needsScores == false) {
  // OrCollector calls score() all the time so we have to explicitly
  // disable scoring in order to avoid decoding useless norms
  scorer = BooleanWeight.disableScoring(scorer);
```
Nice -- this change is a more effective way to disable scoring than wrapping in a no-op / fake scorer!
```java
/** Like {@link IntConsumer}, but may throw checked exceptions. */
@FunctionalInterface
public interface CheckedIntConsumer<T extends Exception> {
```
Darned ubiquitous IOException all throughout Lucene!!
How about s/throws IOException//g and s/IOException/IOExceptionUnchecked/g ?
I would like LeafCollector#collect to be a valid method reference that implements this functional interface, and I don't want to change the signature of LeafCollector#collect. If we remove the exception here, it would force the default implementation of LeafCollector#collect(DocIdStream) to change from this:

```java
default void collect(DocIdStream stream) throws IOException {
  stream.forEach(this::collect);
}
```

to that:

```java
default void collect(DocIdStream stream) throws IOException {
  stream.forEach(doc -> {
    try {
      collect(doc);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  });
}
```

which I like less than introducing this functional interface.
My suggestion was global search and replace throughout Lucene, only half-serious
Sooooo tempting. This IOException pollution has been so irritating over the years ... we could maybe make all the entry points (IndexSearcher#search, #count, etc.) throw IOException so callers know "yes you are searching an on-disk index still so stuff could go badly wrong with those transistors", but internally use the unchecked form maybe. Though, that just pushes the virus "up" to our users ...
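For reference, the functional-interface approach being discussed can be sketched end to end (sketch only; `CheckedIntConsumerSketch` and `CollectorSketch` are illustrative names): a method that throws a checked `IOException` becomes a valid method reference without any try/catch wrapping.

```java
import java.io.IOException;

// Like java.util.function.IntConsumer, but allowed to throw a checked
// exception, parameterized on the exception type.
@FunctionalInterface
interface CheckedIntConsumerSketch<T extends Exception> {
    void accept(int value) throws T;
}

class CollectorSketch {
    int sum = 0;

    // Throws a checked exception, like LeafCollector#collect(int doc).
    void collect(int doc) throws IOException {
        sum += doc;
    }

    // Accepts the checked consumer, so collector::collect fits directly.
    static void forEach(int[] docs, CheckedIntConsumerSketch<IOException> consumer)
            throws IOException {
        for (int doc : docs) {
            consumer.accept(doc);
        }
    }
}
```

Usage: `CollectorSketch.forEach(docs, collector::collect)` compiles with no wrapping, which is exactly what the lambda-with-UncheckedIOException alternative above fails to achieve cleanly.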
```java
final long cost;
final boolean needsScores;

final class OrCollector implements LeafCollector {
```
Do we really only use this BooleanScorer for pure disjunctive cases now? I wonder if it might be faster than BS2 for certain conjunctive cases, e.g. if the clauses all have "similar" cost. (Separate issue).
Indeed, we never use it for conjunctions. It's probably faster than BS2 for conjunctions at times, it would be interesting to find a good heuristic.
Query optimization is so tricky!
```java
 * @see LeafCollector#collect(DocIdStream)
 * @lucene.experimental
 */
public abstract class DocIdStream {
```
I'm not sure where to document this, but this stream is not in general (though could be) holding ALL matching hits for a given collection situation (query) right? As used from BooleanScorer it is just one window's worth of hits (a 2048 chunk of docid space) at once? I guess the right place to make this clear is in the new collect(DocIdStream) method?
```java
void collect(int doc) throws IOException;

/**
 * Bulk-collect doc IDs. The default implementation calls {@code stream.forEach(this::collect)}.
```
Can we note that this might be a chunk/window of docids, and it's always sequential/in-order with respect to other calls to collect (e.g. collect(int doc)). Is it valid for a caller to mix & match calls to both collect methods here? I would think so, but we are not yet doing that since this change will always collect with one or the other.
```java
context.reader().getLiveDocs(),
0,
DocIdSetIterator.NO_MORE_DOCS);
assertEquals(expectedCount[0], actualCount[0]);
```
Nice! So we count both fast (DocIdStream#count) and slow (one by one) way and confirm they agree.
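The dual-count check described above can be sketched in isolation (hypothetical helper, not the actual test code): count window matches the fast way (popcount over the bitset words) and the slow way (clearing one set bit, i.e. one "doc", at a time), then confirm the two agree.

```java
// Sketch of counting the same window bitset two ways.
final class DualCount {
    static int fastCount(long[] matching) {
        int count = 0;
        for (long word : matching) {
            count += Long.bitCount(word); // bulk popcount per word
        }
        return count;
    }

    static int slowCount(long[] matching) {
        int count = 0;
        for (long word : matching) {
            while (word != 0) {
                word &= word - 1; // clear the lowest set bit, one "doc" at a time
                count++;
            }
        }
        return count;
    }
}
```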
```java
@Override
public void collect(DocIdStream stream) throws IOException {
  docIdStream[0] = true;
  LeafCollector.super.collect(stream);
```
This then forwards to our collect(int doc) method below right? So we are forcing counting "the slow way" (one by one).
+1 I'll wait for a few data points before merging
I tried to think about this too.
Queries that produce bitsets could also implement a similar optimization, e.g. (numeric) Term queries could theoretically return a `DocIdStream` per block of 128 doc IDs, where decoding would happen lazily at the beginning of

And like you already suggested, we could handle some conjunctions if we ran them through BS1.

In general, deletions will tend to disable this optimization. (BS1 is a notable case when deletions would not disable this optimization.)

It might help to have a
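The deletions point can be sketched as follows (a hypothetical helper, and an assumption worth flagging: it only works when live docs are themselves exposed as a bitset). In that case the window of matches can be intersected with live docs word by word and still counted with popcount; any other liveDocs representation would force per-doc checks and lose the bulk counting.

```java
// Sketch: counting matches that survive deletions, assuming both the
// window matches and the live docs are long[] bitsets over the same range.
final class LiveDocsCount {
    static int countLive(long[] matching, long[] liveDocs) {
        int count = 0;
        for (int i = 0; i < matching.length; i++) {
            // Intersect word by word, then popcount the survivors.
            count += Long.bitCount(matching[i] & liveDocs[i]);
        }
        return count;
    }
}
```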
+1, this is a neat idea! Often deletes are sparse (apps try hard to merge them away) ... at Amazon product search we have insanely aggressively asked
mikemccand left a comment
Thanks @jpountz! What an exciting change! And I love that it comes from cross-fertilizing from Tantivy's awesome search implementation/optimizations.
This is a subset of #12415, which I'm extracting to its own pull request in order to have separate data points in nightly benchmarks. Results on `wikimedium10m` and `wikinightly` counting tasks:

```
CountTerm           4624.91 (6.4%)   4581.34 (6.4%)   -0.9% ( -12% - 12%) 0.640
CountAndHighMed      280.03 (4.5%)    280.15 (4.4%)    0.0% (  -8% -  9%) 0.974
CountPhrase            7.22 (3.0%)      7.24 (1.8%)    0.3% (  -4% -  5%) 0.728
CountAndHighHigh      52.84 (4.9%)     53.12 (5.6%)    0.5% (  -9% - 11%) 0.755
PKLookup             232.01 (3.6%)    235.45 (2.8%)    1.5% (  -4% -  8%) 0.144
CountOrHighHigh       42.37 (6.1%)     56.04 (9.1%)   32.3% (  16% - 50%) 0.000
CountOrHighMed        30.56 (6.5%)     40.46 (9.8%)   32.4% (  15% - 52%) 0.000
```
After merging a subset of this PR in #12475, there remains a ~25% speedup when counting hits on
Counting tasks after integrating #12488:
This attempts to generalize the `IndexSearcher#count` optimization from PR #12415 to histogram facets (#14273) by introducing specialization for counting the number of matching docs in a range of doc IDs. Currently, disjunctions and dense conjunctions both internally collect `DocIdStream`s backed by a bitset. In the future, we could make more queries collect whole `DocIdStream`s at once to speed up collection, e.g. `MatchAllDocsQuery` or doc-value range queries that take advantage of a sparse index.