Speed up histogram collection in a similar way as disjunction counts. #14273
jpountz merged 8 commits into apache:main from
Conversation
This attempts to generalize the `IndexSearcher#count` optimization from PR apache#12415 to histogram facets by introducing specialization for counting the number of matching docs in a range of doc IDs. Currently, disjunctions and dense conjunctions both internally collect `DocIdStream`s backed by a bitset. In the future, we could make more queries collect whole `DocIdStream`s at once to speed up collection, e.g. `MatchAllDocsQuery` or doc-value range queries that take advantage of a sparse index.
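The shape of the optimization can be sketched outside of Lucene. The classes below are hypothetical, simplified stand-ins (`SimpleDocIdStream` and `BitSetDocIdStream` are made up for illustration, not the actual `DocIdStream` API): the default `count(upTo)` visits every doc, while a bitset-backed stream counts whole 64-bit words with `Long.bitCount` and masks the final partial word.

```java
// Simplified sketch of the idea, not Lucene's actual API.
abstract class SimpleDocIdStream {
  // Default: count by visiting every doc ID below upTo.
  int count(int upTo) {
    int[] count = new int[1];
    forEach(upTo, doc -> count[0]++);
    return count[0];
  }

  abstract void forEach(int upTo, java.util.function.IntConsumer consumer);
}

// Bitset-backed specialization: count whole 64-bit words at once.
class BitSetDocIdStream extends SimpleDocIdStream {
  private final long[] bits; // bit i set <=> doc i matches

  BitSetDocIdStream(long[] bits) {
    this.bits = bits;
  }

  @Override
  void forEach(int upTo, java.util.function.IntConsumer consumer) {
    for (int word = 0; word < bits.length && (word << 6) < upTo; word++) {
      long w = bits[word];
      while (w != 0) {
        int doc = (word << 6) + Long.numberOfTrailingZeros(w);
        if (doc >= upTo) {
          return;
        }
        consumer.accept(doc);
        w &= w - 1; // clear the lowest set bit
      }
    }
  }

  @Override
  int count(int upTo) {
    // Assumes upTo <= bits.length * 64.
    int cardinality = 0;
    int numWords = upTo >> 6;
    for (int i = 0; i < numWords; i++) {
      cardinality += Long.bitCount(bits[i]);
    }
    if ((upTo & 0x3F) != 0) {
      // Keep only the low (upTo % 64) bits of the last partial word.
      cardinality += Long.bitCount(bits[numWords] & ((1L << upTo) - 1));
    }
    return cardinality;
  }
}
```

The point is that the specialized `count` touches one `long` per 64 candidate docs instead of doing per-doc work.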
@epotyom You may be interested in taking a look.
epotyom
left a comment
I like the idea! Looks like we could do a similar trick for range facets and long values facets?
```java
@Override
public int count(int upTo) throws IOException {
  assert fullyConsumed == false : "A terminal operation has already been called";
  int count = stream.count(upTo);
  return count;
}
```
```java
/** Return {@code true} if this stream may have remaining doc IDs. */
```
Maybe I'm nitpicking, but is it worth mentioning that it must eventually return false? Otherwise always returning true may sound like a correct implementation.
This is right.

+1 to this optimization. Love the idea!
Quick update: we now have more queries that collect hits using
I think we could optimize these use-cases even further by potentially skipping over docs that don't fall into any of the ranges/values as well. With the histogram collection use-case, we care about the entire value range of the field we're interested in, but that's not necessarily true of these other use-cases. If we have a skipper, I think we ought to also be able to use competitive iterators to jump over blocks of docs we know we won't collect based on their values? Maybe we need a spin-off issue :). I created #14406
```java
/**
 * Count the number of doc IDs in this stream that are below the given {@code upTo}. These doc IDs
 * may not be consumed again later.
 */
public int count(int upTo) throws IOException {
```
This only becomes an optimization if we specialize this method, right? The specializations I'm aware of rely on `FixedBitSet#cardinality`. Are you thinking of peeking into these bit sets to provide cardinality up to the specific doc? (Or maybe I'm missing something?)
> Are you thinking of peeking into these bit sets to provide cardinality up to the specific doc? (Or maybe I'm missing something?)
Yes exactly. I have something locally already, I need to beef up testing a bit.
The bitset-based DocIdStream is one interesting implementation, the other interesting implementation is the one that is backed by a range of doc IDs that all match. It is internally used by queries that fully match a segment (e.g. PointRangeQuery when all the segment's values are contained in the query range, or MatchAllDocsQuery) or queries on fields that are part of (or correlate with) the index sort fields. See #14312 for reference.
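For the range-backed implementation mentioned above, counting does not even need a bitset: every doc in the range matches, so `count(upTo)` is pure arithmetic. A minimal sketch under assumed class and field names (hypothetical, not Lucene's implementation):

```java
// Sketch of counting for a stream backed by a contiguous range of doc IDs
// [from, max) where every doc matches, e.g. a fully-matching segment.
final class RangeDocIdStreamSketch {
  private int from;      // next doc ID to consume (hypothetical field name)
  private final int max; // exclusive upper bound of the matching range

  RangeDocIdStreamSketch(int from, int max) {
    this.from = from;
    this.max = max;
  }

  int count(int upTo) {
    upTo = Math.min(upTo, max);
    int count = Math.max(0, upTo - from);
    from += count; // these doc IDs may not be consumed again later
    return count;
  }
}
```

This is why fully-matching segments (e.g. `MatchAllDocsQuery`) can contribute to a histogram bucket in constant time per window.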
This is correct. I plan on doing something similar when sorting: it is safe to skip blocks whose values all compare worse than the current k-th value. It's similar to what block-max WAND/MAXSCORE do: when a block's best possible score is less than the k-th best score so far, it can safely be skipped.
It should be ready for review now. Now that

I'll try to run some simple benchmarks next.
I played with the geonames dataset, by filtering out docs that don't have a value for the

I also checked wikibigall, no slowdowns were detected.
gsmiller
left a comment
Left some minor feedback. Looks good though!
```java
/**
 * Count the number of doc IDs in this stream that are below the given {@code upTo}. These doc IDs
 * may not be consumed again later.
 */
// Note: it's abstract rather than having a default impl that delegates to #forEach because doing
```
Thanks for adding this comment! +1 to the rationale as well.
Ah right, it was based on your previous feedback. :)
```java
final class DISIDocIdStream extends DocIdStream {

  private final DocIdSetIterator iterator;
  private final int to;
```
minor: maybe `max` to be consistent with the other implementations?
```java
}
// If the collector is just interested in the count, loading in a bit set and counting bits is
// often faster than incrementing a counter on every call to nextDoc().
assert spare.scanIsEmpty();
```
Nice. I appreciate this assert in here!
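As a toy illustration of that rationale (a hypothetical helper, nothing from the PR): OR the iterator's matches into a `long[]` window, then count set bits one 64-bit word at a time with `Long.bitCount`, rather than incrementing a counter on every doc.

```java
import java.util.PrimitiveIterator;

final class CountViaBitSet {
  // Load matching doc IDs into a bitset window, then count set bits per
  // 64-bit word: one Long.bitCount call covers up to 64 docs at once.
  static int count(PrimitiveIterator.OfInt docs, int maxDoc) {
    long[] window = new long[(maxDoc + 63) >> 6];
    while (docs.hasNext()) {
      int doc = docs.nextInt();
      window[doc >> 6] |= 1L << doc; // Java shifts a long by doc & 63
    }
    int count = 0;
    for (long word : window) {
      count += Long.bitCount(word);
    }
    return count;
  }
}
```

Real code would also bulk-load docs into the bitset rather than one at a time; this only shows the counting shape.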
```java
cardinality += Long.bitCount(bits);
from += numBitsTilNextWord;
}
```
minor: what about `assert (from & 0x3F) == 0;` right here?
```java
forEach(bits, from + base, consumer);
from += numBitsTilNextWord;
}
```
minor: same suggestion of `assert (from & 0x3F) == 0;`
```java
if (acceptDocs != null) {
  // In this case, live docs have not been applied yet.
  acceptDocs.applyMask(matching, base);
}
```
If I'm looking at this diff properly, I don't think you should need this block of code at all since you're still applying acceptDocs immediately prior?
Thanks for catching, this bit of code was mistakenly duplicated.
```java
@Override
public void forEach(int upTo, CheckedIntConsumer<IOException> consumer) throws IOException {
  if (upTo >= this.upTo) {
```
I think you want `>` here instead? (`>=` still functionally works since `forEach` is tolerant of the `==` case, but I think you want to short-circuit if `upTo == this.upTo`?)
(Also possible I'm getting this wrong... but I did make the changes locally and all the testing still seems to pass)
You're right, I'll change this.
(For reference, it's not as much for short-circuiting as for the fact that logic under the if block would fail if trying to move backwards. But since we're doing a check anyway, I agree it's also worth excluding the trivial case when the range of docs to collect is empty.)
```java
@Override
public int count(int upTo) throws IOException {
  if (upTo >= this.upTo) {
```
Another place where I think you want `>`
```java
@Override
public void forEach(int upTo, CheckedIntConsumer<IOException> consumer) throws IOException {
  if (upTo >= this.upTo) {
```
Another place where I think you want `>`
```java
@Override
public int count(int upTo) throws IOException {
  if (upTo >= this.upTo) {
```
And one more spot where I think you want `>`?
```java
// Now handle bits between the last complete word and to
if ((to & 0x3F) != 0) {
  long bits = this.bits[to >> 6] << -to;
```
minor: one other small comment. I noticed in the new `forEach` method you added you use `long bits = this.bits[to >> 6] & ((1L << to) - 1);`. Is the rationale here that you need to persist the correct number of trailing zeros in the `forEach` implementation but not in this case, since you're doing a bit count? Is this approach (shifting by `-to`) more performant? (I ask since you could use the same approach as `forEach` here too for consistency, so I assume you had a reason ;))
You are right. We need the low `to % 64` bits of `bits[to >> 6]`. `x << -to` is nice because it requires fewer instructions, but it also moves bits to a different index. This makes iterating over set bits slightly more cumbersome, but doesn't matter for counting bits. Hence why `forEach` applies a mask instead. I haven't checked actual performance, I suspect that it doesn't matter in practice.
Thanks for the detailed explanation. This makes sense to me, it just left me scratching my head for a little bit initially to figure out the intention behind the different approaches :)
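The equivalence being discussed can be checked directly. Both expressions below count the low `to % 64` bits of a word when `(to & 0x3F) != 0`: Java shifts a `long` by the low 6 bits of the shift amount, so `word << -to` shifts left by `64 - (to % 64)`, moving the low bits to the top of the long (fine for counting, not for iteration), while the mask keeps them in place. This is an illustrative demo, not code from the PR:

```java
// Two equivalent ways to count the low (to % 64) bits of a word, valid
// when (to & 0x3F) != 0, matching the guard in the snippet above.
final class LowBitsDemo {
  // Shift variant: relocates the low bits to the top of the long.
  // Same bit count, different bit positions.
  static int countLowBitsShift(long word, int to) {
    return Long.bitCount(word << -to);
  }

  // Mask variant: keeps the low bits at their original positions,
  // which is what an iterating forEach needs.
  static int countLowBitsMask(long word, int to) {
    return Long.bitCount(word & ((1L << to) - 1));
  }
}
```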
gsmiller
left a comment
Looks good. Thanks for incorporating the minor feedback!