Conversation
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by creating a `DocIdStream` whose `count()` method counts the number of bits that are set in the bit set of matches in the current window, instead of naively iterating over all matches. On wikimedium10m, this yields a ~20% speedup when counting hits for the `title OR 12` query (2.9M hits). Relates apache#12358
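The core idea can be illustrated with a minimal sketch (hypothetical names, not the actual Lucene implementation): a stream backed by the bitset of matches in the current window, whose `count()` uses a popcount per 64-bit word instead of iterating matches one by one.

```java
// Sketch only: a DocIdStream-like abstraction backed by the window bitset.
abstract class DocIdStreamSketch {
    abstract int count();
}

final class BitSetDocIdStream extends DocIdStreamSketch {
    private final long[] matching; // one bit per doc ID in the current window

    BitSetDocIdStream(long[] matching) {
        this.matching = matching;
    }

    @Override
    int count() {
        int count = 0;
        for (long word : matching) {
            count += Long.bitCount(word); // popcount, no per-doc iteration
        }
        return count;
    }
}
```

This is why counting is so much cheaper than collecting: the stream never needs to materialize individual doc IDs at all.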
Note: this is just a proof of concept to discuss the idea of integrating at the collector level; more work is needed to add more tests and docs, and to integrate it into the test framework.
To me a big question with this API is whether we should consider methods on the
I added documentation and tests; it's ready for review. I settled on making consumption of
Wow! I'll try to review soon. Thanks @jpountz!
mikemccand left a comment
This looks great @jpountz! Thank you! It's wonderful to see cross fertilization / inspiration from the ongoing Tantivy <-> Lucene comparison resulting in optimizations like this. I'd love to see the other direction too (Tantivy porting over 2-phase iteration, or pulsing in terms dictionary or so).
Sorry for the slow review.
I'm trying to add count(...) to nightly benchmarks -- it's regolding now. I'd love to start benchmarking charts for these before we land this opto so we can fully appreciate / document the "pop" :)
I wonder what other queries could (later) benefit from DocIdStream bulk collection ...
```java
final Bucket[] buckets = new Bucket[SIZE];
// One bucket per doc ID in the window, non-null if scores are needed or if frequencies need to be
// counted
final Bucket[] buckets;
```
I wonder if switching this to parallel arrays, for maybe better CPU cache locality, would show any speedup (separate issue!). Or maybe "structs" (value objects) when Java finally gets them.
Though, the inlined matching bitset is sort of already a parallel array and maybe gets most of the gains.
It's been this way for a very (very very very) long time, but I agree it would probably perform better with parallel arrays!
LOL +1 to the extra very instances above!
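The parallel-arrays idea discussed above could be sketched as follows (names are illustrative, not the actual `BooleanScorer` fields): instead of one heap object per slot, keep one primitive array per field so a scan over a single field stays contiguous in memory.

```java
// Hypothetical sketch contrasting the two layouts.
final class Layouts {
    static final int SIZE = 2048; // one window of doc IDs

    // Array-of-objects layout: each slot is a separate heap object,
    // so a scan over scores chases one pointer per slot.
    static final class Bucket {
        double score;
        int freq;
    }
    final Bucket[] buckets = new Bucket[SIZE];

    // Parallel-arrays layout: primitives stay contiguous, which can be
    // friendlier to CPU caches when scanning a single field.
    final double[] scores = new double[SIZE];
    final int[] freqs = new int[SIZE];

    void collectParallel(int slot, double score) {
        scores[slot] += score;
        freqs[slot]++;
    }
}
```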
```java
@Override
public void forEach(CheckedIntConsumer<IOException> consumer) throws IOException {
  long[] matching = BooleanScorer.this.matching;
  Bucket[] buckets = BooleanScorer.this.buckets;
```
We should (later!) maybe rename Bucket to OneHit or DocHit or so, to make it clear it represents details of a single doc hit.
```java
}
for (int i = 0; i < buckets.length; i++) {
  buckets[i] = new Bucket();
if (needsScores || minShouldMatch > 1) {
```
Might this also optimize other cases, where we are using BooleanScorer in non-scoring cases (MUST_NOT or FILTER)? Or do we never use BooleanScorer for these clauses and it's really just the count API that we are accelerating here?
Yes, we also use BooleanScorer when there is a mix of SHOULD and MUST_NOT clauses. But not when there are FILTER clauses.
OK. Maybe if the FILTER clause is high enough cardinality, at some point BS1 becomes worth it. Restrictive (low cardinality) filters are where BS2 should win.
```java
if (needsScores == false) {
  // OrCollector calls score() all the time so we have to explicitly
  // disable scoring in order to avoid decoding useless norms
  scorer = BooleanWeight.disableScoring(scorer);
```
Nice -- this change is a more effective way to disable scoring than wrapping in a no-op / fake scorer!
```java
/** Like {@link IntConsumer}, but may throw checked exceptions. */
@FunctionalInterface
public interface CheckedIntConsumer<T extends Exception> {
```
Darned ubiquitous IOException all throughout Lucene!!
How about s/throws IOException//g and s/IOException/IOExceptionUnchecked/g ?
I would like LeafCollector#collect to be a valid method reference that implements this functional interface, and I don't want to change the signature of LeafCollector#collect. If we remove the exception here, it would force the default implementation of LeafCollector#collect(DocIdStream) to change from this:

```java
default void collect(DocIdStream stream) throws IOException {
  stream.forEach(this::collect);
}
```

to that:

```java
default void collect(DocIdStream stream) throws IOException {
  stream.forEach(doc -> {
    try {
      collect(doc);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  });
}
```

which I like less than introducing this functional interface.
My suggestion was global search and replace throughout Lucene, only half-serious
Sooooo tempting. This IOException pollution has been so irritating over the years ... we could maybe make all the entry points (IndexSearcher#search, #count, etc.) throw IOException so callers know "yes you are searching an on-disk index still so stuff could go badly wrong with those transistors", but internally use the unchecked form maybe. Though, that just pushes the virus "up" to our users ...
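For reference, the functional-interface approach being discussed can be sketched end to end (sketch only; `CheckedIntConsumerSketch` and `CollectorSketch` are illustrative names): a method that throws a checked `IOException` becomes a valid method reference without any try/catch wrapping.

```java
import java.io.IOException;

// Like java.util.function.IntConsumer, but allowed to throw a checked
// exception, parameterized on the exception type.
@FunctionalInterface
interface CheckedIntConsumerSketch<T extends Exception> {
    void accept(int value) throws T;
}

class CollectorSketch {
    int sum = 0;

    // Throws a checked exception, like LeafCollector#collect(int doc).
    void collect(int doc) throws IOException {
        sum += doc;
    }

    // Accepts the checked consumer, so collector::collect fits directly.
    static void forEach(int[] docs, CheckedIntConsumerSketch<IOException> consumer)
            throws IOException {
        for (int doc : docs) {
            consumer.accept(doc);
        }
    }
}
```

Usage: `CollectorSketch.forEach(docs, collector::collect)` compiles with no wrapping, which is exactly what the lambda-with-UncheckedIOException alternative above fails to achieve cleanly.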
```java
final long cost;
final boolean needsScores;

final class OrCollector implements LeafCollector {
```
Do we really only use this BooleanScorer for pure disjunctive cases now? I wonder if it might be faster than BS2 for certain conjunctive cases, e.g. if the clauses all have "similar" cost. (Separate issue).
Indeed, we never use it for conjunctions. It's probably faster than BS2 for conjunctions at times, it would be interesting to find a good heuristic.
Query optimization is so tricky!
```java
 * @see LeafCollector#collect(DocIdStream)
 * @lucene.experimental
 */
public abstract class DocIdStream {
```
I'm not sure where to document this, but this stream is not in general (though could be) holding ALL matching hits for a given collection situation (query) right? As used from BooleanScorer it is just one window's worth of hits (a 2048 chunk of docid space) at once? I guess the right place to make this clear is in the new collect(DocIdStream) method?
```java
void collect(int doc) throws IOException;

/**
 * Bulk-collect doc IDs. The default implementation calls {@code stream.forEach(this::collect)}.
```
Can we note that this might be a chunk/window of docids, and it's always sequential/in-order with respect to other calls to collect (e.g. collect(int doc)). Is it valid for a caller to mix & match calls to both collect methods here? I would think so, but we are not yet doing that since this change will always collect with one or the other.
```java
context.reader().getLiveDocs(),
0,
DocIdSetIterator.NO_MORE_DOCS);
assertEquals(expectedCount[0], actualCount[0]);
```
Nice! So we count both fast (DocIdStream#count) and slow (one by one) way and confirm they agree.
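The dual-count check described above can be sketched in isolation (hypothetical helper, not the actual test code): count window matches the fast way (popcount over the bitset words) and the slow way (clearing one set bit, i.e. one "doc", at a time), then confirm the two agree.

```java
// Sketch of counting the same window bitset two ways.
final class DualCount {
    static int fastCount(long[] matching) {
        int count = 0;
        for (long word : matching) {
            count += Long.bitCount(word); // bulk popcount per word
        }
        return count;
    }

    static int slowCount(long[] matching) {
        int count = 0;
        for (long word : matching) {
            while (word != 0) {
                word &= word - 1; // clear the lowest set bit, one "doc" at a time
                count++;
            }
        }
        return count;
    }
}
```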
```java
@Override
public void collect(DocIdStream stream) throws IOException {
  docIdStream[0] = true;
  LeafCollector.super.collect(stream);
```
This then forwards to our collect(int doc) method below right? So we are forcing counting "the slow way" (one by one).
+1 I'll wait for a few data points before merging
I tried to think about this too.
Queries that produce bitsets could also implement a similar optimization, e.g. (numeric) Term queries could theoretically return a `DocIdStream` per block of 128 doc IDs, where decoding would happen lazily at the beginning of

And like you already suggested, we could handle some conjunctions if we ran them through BS1.

In general, deletions will tend to disable this optimization. (BS1 is a notable case when deletions would not disable this optimization.)

It might help to have a
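The deletions point can be sketched as follows (a hypothetical helper, and an assumption worth flagging: it only works when live docs are themselves exposed as a bitset). In that case the window of matches can be intersected with live docs word by word and still counted with popcount; any other liveDocs representation would force per-doc checks and lose the bulk counting.

```java
// Sketch: counting matches that survive deletions, assuming both the
// window matches and the live docs are long[] bitsets over the same range.
final class LiveDocsCount {
    static int countLive(long[] matching, long[] liveDocs) {
        int count = 0;
        for (int i = 0; i < matching.length; i++) {
            // Intersect word by word, then popcount the survivors.
            count += Long.bitCount(matching[i] & liveDocs[i]);
        }
        return count;
    }
}
```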
+1, this is a neat idea! Often deletes are sparse (apps try hard to merge them away) ... at Amazon product search we have insanely aggressively asked
mikemccand left a comment
Thanks @jpountz! What an exciting change! And I love that it comes from cross-fertilizing from Tantivy's awesome search implementation/optimizations.
This is a subset of #12415, which I'm extracting to its own pull request in order to have separate data points in nightly benchmarks. Results on `wikimedium10m` and `wikinightly` counting tasks:

```
CountTerm           4624.91 (6.4%)   4581.34 (6.4%)   -0.9% ( -12% - 12%) 0.640
CountAndHighMed      280.03 (4.5%)    280.15 (4.4%)    0.0% (  -8% -  9%) 0.974
CountPhrase            7.22 (3.0%)      7.24 (1.8%)    0.3% (  -4% -  5%) 0.728
CountAndHighHigh      52.84 (4.9%)     53.12 (5.6%)    0.5% (  -9% - 11%) 0.755
PKLookup             232.01 (3.6%)    235.45 (2.8%)    1.5% (  -4% -  8%) 0.144
CountOrHighHigh       42.37 (6.1%)     56.04 (9.1%)   32.3% (  16% - 50%) 0.000
CountOrHighMed        30.56 (6.5%)     40.46 (9.8%)   32.4% (  15% - 52%) 0.000
```
After merging a subset of this PR in #12475, there remains a ~25% speedup when counting hits on
Counting tasks after integrating #12488:
This attempts to generalize the `IndexSearcher#count` optimization from PR #12415 to histogram facets (#14273) by introducing specialization for counting the number of matching docs in a range of doc IDs. Currently, disjunctions and dense conjunctions both internally collect `DocIdStream`s backed by a bitset. In the future, we could make more queries collect whole `DocIdStream`s at once to speed up collection, e.g. `MatchAllDocsQuery` or doc-value range queries that take advantage of a sparse index.