
Speed up histogram collection in a similar way as disjunction counts. #14273

Merged
jpountz merged 8 commits into apache:main from jpountz:speed_up_histogram
Mar 28, 2025

Conversation

@jpountz jpountz (Contributor) commented Feb 21, 2025

This generalizes the IndexSearcher#count optimization from PR #12415 to histogram facets by introducing specialization for counting the number of matching docs in a range of doc IDs.

Currently, disjunctions and dense conjunctions both internally collect DocIdStreams backed by a bitset. In the future, we could make more queries collect whole DocIdStreams at once to speed up collection, e.g. MatchAllDocsQuery or doc-value range queries that take advantage of a sparse index.
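For illustration only (this is not the actual Lucene code, and the class and method names below are hypothetical), the kind of counting specialization described here could look like the following sketch for a bitset-backed stream:

```java
// Hypothetical sketch: count matching docs in a bitset up to an exclusive
// doc ID bound, using Long.bitCount on whole words instead of iterating
// doc by doc.
public final class BitSetCountSketch {

  /** Counts set bits among bit indices [0, upTo) of the given words. */
  public static int countUpTo(long[] words, int upTo) {
    int count = 0;
    int numFullWords = upTo >> 6; // 64 bits per long
    for (int i = 0; i < numFullWords; i++) {
      count += Long.bitCount(words[i]);
    }
    if ((upTo & 0x3F) != 0) {
      // Keep only the low (upTo % 64) bits of the last, partial word.
      count += Long.bitCount(words[numFullWords] & ((1L << upTo) - 1));
    }
    return count;
  }
}
```

Counting whole words at once lets the JIT use the popcount instruction rather than advancing a doc-at-a-time iterator.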

@jpountz jpountz (Contributor Author) commented Feb 21, 2025

@epotyom You may be interested in taking a look.

@epotyom epotyom (Contributor) left a comment:

I like the idea! Looks like we can do a similar trick for range facets and long values facets?

@Override
public int count(int upTo) throws IOException {
assert fullyConsumed == false : "A terminal operation has already been called";
int count = stream.count();
Contributor:

Should it be count(upTo)?

return count[0];
}

/** Return {@code true} if this stream may have remaining doc IDs. */
Contributor:

Maybe I'm nitpicking, but is it worth mentioning that it must eventually return false? Otherwise, always returning true may sound like a correct implementation.

@jpountz jpountz (Contributor Author) commented Feb 24, 2025

Looks like we can do a similar trick for range facets and long values facets?

This is right.

@gsmiller gsmiller (Contributor) commented:

+1 to this optimization. Love the idea!

@jpountz jpountz (Contributor Author) commented Mar 25, 2025

Quick update: we now have more queries that collect hits using collect(DocIdStream), which makes this optimization more appealing.

@gsmiller gsmiller (Contributor) commented:

I like the idea! Looks like we can do a similar trick for range facets and long values facets?

I think we could optimize these use-cases even further by also skipping over docs that don't fall into any of the ranges/values. With the histogram collection use-case, we care about the entire value range of the field we're interested in, but that's not necessarily true of these other use-cases. If we have a skipper, I think we ought to also be able to use competitive iterators to jump over blocks of docs we know we won't collect based on their values?

Maybe we need a spin-off issue :). I created #14406
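The skipping idea described above can be sketched as a simple interval check; the class, method, and parameter names below are made up for illustration and assume a skipper that exposes per-block min/max values:

```java
// Hypothetical sketch of the block-skipping idea: a block is safely
// skippable when its value range does not intersect the requested facet
// range at all, so none of its docs can be collected.
public final class BlockSkipSketch {
  public static boolean canSkipBlock(long blockMin, long blockMax, long rangeMin, long rangeMax) {
    return blockMax < rangeMin || blockMin > rangeMax;
  }
}
```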

* Count the number of doc IDs in this stream that are below the given {@code upTo}. These doc IDs
* may not be consumed again later.
*/
public int count(int upTo) throws IOException {
Contributor:

This only becomes an optimization if we specialize this method, right? The specializations I'm aware of rely on FixedBitSet#cardinality. Are you thinking of peeking into these bit sets to provide cardinality up to the specific doc? (Or maybe I'm missing something?)

Contributor Author:

Are you thinking of peeking into these bit sets to provide cardinality up to the specific doc? (Or maybe I'm missing something?)

Yes exactly. I have something locally already, I need to beef up testing a bit.

The bitset-based DocIdStream is one interesting implementation, the other interesting implementation is the one that is backed by a range of doc IDs that all match. It is internally used by queries that fully match a segment (e.g. PointRangeQuery when all the segment's values are contained in the query range, or MatchAllDocsQuery) or queries on fields that are part of (or correlate with) the index sort fields. See #14312 for reference.
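As a rough sketch (names hypothetical, not the actual Lucene classes), counting in such a range-backed stream needs no iteration at all:

```java
// Hypothetical sketch: a stream backed by a contiguous range [from, to) of
// doc IDs that all match can answer count(upTo) with pure arithmetic,
// clamping the bound to the range.
public final class RangeCountSketch {
  public static int count(int from, int to, int upTo) {
    return Math.max(0, Math.min(to, upTo) - from);
  }
}
```

This is why queries that fully match a segment benefit so much: the per-bucket work collapses to a subtraction.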

@jpountz jpountz (Contributor Author) commented Mar 26, 2025

If we have a skipper, I think we ought to also be able to use competitive iterators to jump over blocks of docs we know we won't collect based on their values?

This is correct. I plan on doing something similar when sorting: it is safe to skip blocks whose values all compare worse than the current k-th value. It's similar to what block-max WAND/MAXSCORE do: when a block's best possible score is less than the k-th best score so far, it can safely be skipped.
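The sort-skipping analogy can be reduced to a toy predicate (hypothetical names; this assumes an ascending numeric sort, where smaller values compare better):

```java
// Hypothetical sketch: under an ascending sort, a block whose minimum value
// already compares worse (greater) than the current k-th best value cannot
// contribute to the top-k and may be skipped entirely.
public final class SortSkipSketch {
  public static boolean canSkip(long blockMinValue, long kthBestValue) {
    return blockMinValue > kthBestValue;
  }
}
```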

@jpountz jpountz marked this pull request as ready for review March 26, 2025 23:18
@jpountz jpountz (Contributor Author) commented Mar 26, 2025

It should be ready for review now. Now that DocIdStream has become more sophisticated, I extracted impls to proper classes that could be better tested. This causes some diffs in our boolean scorers, hence the high number of lines changed.

@jpountz jpountz (Contributor Author) commented Mar 26, 2025

I'll try to run some simple benchmarks next.

@jpountz jpountz (Contributor Author) commented Mar 27, 2025

I played with the geonames dataset, by filtering out docs that don't have a value for the elevation field (2.3M docs left), enabling index sorting on the elevation field and computing histograms on the elevation field with a bucket width of 100.

| Query | Latency on main (ms) | Latency on branch (ms) |
|---|---|---|
| MatchAllDocsQuery (uses RangeDocIdStream under the hood) | 6.9 | 4.3 |
| featureClass:(S P) (matches spots or cities, 1.2M matching docs, uses BitSetDocIdStream under the hood) | 4.8 | 2.4 |

I also checked wikibigall, no slowdowns were detected.

@gsmiller gsmiller (Contributor) left a comment:

Left some minor feedback. Looks good though!

* Count the number of doc IDs in this stream that are below the given {@code upTo}. These doc IDs
* may not be consumed again later.
*/
// Note: it's abstract rather than having a default impl that delegates to #forEach because doing
Contributor:

Thanks for adding this comment! +1 to the rationale as well.

Contributor Author:

Ah right, it was based on your previous feedback. :)

final class DISIDocIdStream extends DocIdStream {

private final DocIdSetIterator iterator;
private final int to;
Contributor:

minor: maybe max to be consistent with the other implementations?

}
// If the collector is just interested in the count, loading in a bit set and counting bits is
// often faster than incrementing a counter on every call to nextDoc().
assert spare.scanIsEmpty();
Contributor:

Nice. I appreciate this assert in here!

cardinality += Long.bitCount(bits);
from += numBitsTilNextWord;
}

Contributor:

minor: what about assert (from & 0x3F) == 0; right here?

forEach(bits, from + base, consumer);
from += numBitsTilNextWord;
}

Contributor:

minor: same suggestion of assert (from & 0x3F) == 0;

Comment on lines +168 to +171
if (acceptDocs != null) {
// In this case, live docs have not been applied yet.
acceptDocs.applyMask(matching, base);
}
Contributor:

If I'm looking at this diff properly, I don't think you should need this block of code at all since you're still applying acceptDocs immediately prior?

Contributor Author:

Thanks for catching, this bit of code was mistakenly duplicated.


@Override
public void forEach(int upTo, CheckedIntConsumer<IOException> consumer) throws IOException {
if (upTo >= this.upTo) {
Contributor:

I think you want > here instead? (>= still functionally works since forEach is tolerant of the == case, but I think you want to short circuit if upTo == this.upTo?)

(Also possible I'm getting this wrong... but I did make the changes locally and all the testing still seems to pass)

Contributor Author:

You're right, I'll change this.

Contributor Author:

(For reference, it's not as much for short-circuiting as for the fact that logic under the if block would fail if trying to move backwards. But since we're doing a check anyway, I agree it's also worth excluding the trivial case when the range of docs to collect is empty.)


@Override
public int count(int upTo) throws IOException {
if (upTo >= this.upTo) {
Contributor:

Another place where I think you want >


@Override
public void forEach(int upTo, CheckedIntConsumer<IOException> consumer) throws IOException {
if (upTo >= this.upTo) {
Contributor:

Another place where I think you want >


@Override
public int count(int upTo) throws IOException {
if (upTo >= this.upTo) {
Contributor:

And one more spot where I think you want >?


// Now handle bits between the last complete word and to
if ((to & 0x3F) != 0) {
long bits = this.bits[to >> 6] << -to;
Contributor:

minor: one other small comment. I noticed that in the new forEach method you added, you use `long bits = this.bits[to >> 6] & ((1L << to) - 1);`. Is the rationale here that you need to preserve the correct number of trailing zeros in the forEach implementation, but not in this case, since you're doing a bit count? Is this approach (shifting by `-to`) more performant? (I ask since you could use the same approach as forEach here too for consistency, so I assume you had a reason ;))

Contributor Author:

You are right. We need the low `to % 64` bits of `bits[to >> 6]`. `x << -to` is nice because it requires fewer instructions, but it also moves bits to a different index. This makes iterating over set bits slightly more cumbersome, but it doesn't matter for counting bits, hence why forEach applies a mask instead. I haven't checked actual performance; I suspect it doesn't matter in practice.
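The equivalence between the shift and the mask for counting can be checked with a small snippet (hypothetical helper names; this relies on Java taking shift amounts for long modulo 64):

```java
// Hypothetical sketch: when (to & 0x3F) != 0, both expressions keep exactly
// the low (to % 64) bits of the word. The left shift by -to moves them to
// the top of the word, while the mask leaves them in place; the popcounts
// therefore agree even though the bit positions differ.
public final class ShiftTrickSketch {
  public static int countViaShift(long word, int to) {
    return Long.bitCount(word << -to); // shift amount is taken mod 64 in Java
  }

  public static int countViaMask(long word, int to) {
    return Long.bitCount(word & ((1L << to) - 1));
  }
}
```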

Contributor:

Thanks for the detailed explanation. This makes sense to me, it just left me scratching my head for a little bit initially to figure out the intention behind the different approaches :)

@gsmiller gsmiller (Contributor) left a comment:

Looks good. Thanks for incorporating the minor feedback!

@jpountz jpountz merged commit 82200c0 into apache:main Mar 28, 2025
7 checks passed
@jpountz jpountz deleted the speed_up_histogram branch March 28, 2025 14:52
@jpountz jpountz added this to the 10.2.0 milestone Mar 28, 2025
jpountz added a commit that referenced this pull request Mar 28, 2025
…#14273)

This attempts to generalize the `IndexSearcher#count` optimization from
PR #12415 to histogram facets by introducing specialization for counting
the number of matching docs in a range of doc IDs.

Currently, disjunctions and dense conjunctions both internally collect
`DocIdStream`s backed by a bitset. In the future, we could make more
queries collect whole `DocIdStream`s at once to speed up collection,
e.g. `MatchAllDocsQuery` or doc-value range queries that take advantage
of a sparse index.


3 participants