
Speed up histogram collection in a similar way as disjunction counts. #14273

Merged
jpountz merged 8 commits into apache:main from jpountz:speed_up_histogram
Mar 28, 2025

Conversation

@jpountz jpountz (Contributor) commented Feb 21, 2025

This generalizes the IndexSearcher#count optimization from PR #12415 to histogram facets by introducing specialization for counting the number of matching docs in a range of doc IDs.

Currently, disjunctions and dense conjunctions both internally collect DocIdStreams backed by a bitset. In the future, we could make more queries collect whole DocIdStreams at once to speed up collection, e.g. MatchAllDocsQuery or doc-value range queries that take advantage of a sparse index.
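For illustration only (this is not the actual Lucene code, and the class and method names below are hypothetical), the kind of counting specialization described here could look like the following sketch for a bitset-backed stream:

```java
// Hypothetical sketch: count matching docs in a bitset up to an exclusive
// doc ID bound, using Long.bitCount on whole words instead of iterating
// doc by doc.
public final class BitSetCountSketch {

  /** Counts set bits among bit indices [0, upTo) of the given words. */
  public static int countUpTo(long[] words, int upTo) {
    int count = 0;
    int numFullWords = upTo >> 6; // 64 bits per long
    for (int i = 0; i < numFullWords; i++) {
      count += Long.bitCount(words[i]);
    }
    if ((upTo & 0x3F) != 0) {
      // Keep only the low (upTo % 64) bits of the last, partial word.
      count += Long.bitCount(words[numFullWords] & ((1L << upTo) - 1));
    }
    return count;
  }
}
```

Counting whole words at once lets the JIT use the popcount instruction rather than advancing a doc-at-a-time iterator.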

@jpountz jpountz (Contributor Author) commented Feb 21, 2025

@epotyom You may be interested in taking a look.

@epotyom epotyom (Contributor) left a comment:

I like the idea! Looks like we can do a similar trick for range facets and long values facets?

@Override
public int count(int upTo) throws IOException {
assert fullyConsumed == false : "A terminal operation has already been called";
int count = stream.count();
Contributor:

Should it be count(upTo)?

return count[0];
}

/** Return {@code true} if this stream may have remaining doc IDs. */
Contributor:

Maybe I'm nitpicking, but is it worth mentioning that it must eventually return false? Otherwise, always returning true may sound like a correct implementation.

@jpountz jpountz (Contributor Author) commented Feb 24, 2025

Looks like we can do a similar trick for range facets and long values facets?

This is right.

@gsmiller gsmiller (Contributor) commented:

+1 to this optimization. Love the idea!

@jpountz jpountz (Contributor Author) commented Mar 25, 2025

Quick update: we now have more queries that collect hits using collect(DocIdStream), which makes this optimization more appealing.

@gsmiller gsmiller (Contributor) commented:

I like the idea! Looks like we can do a similar trick for range facets and long values facets?

I think we could optimize these use-cases even further by also skipping over docs that don't fall into any of the ranges/values. With the histogram collection use-case, we care about the entire value range of the field we're interested in, but that's not necessarily true of these other use-cases. If we have a skipper, I think we ought to also be able to use competitive iterators to jump over blocks of docs we know we won't collect based on their values?

Maybe we need a spin-off issue :). I created #14406
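The skipping idea described above can be sketched as a simple interval check; the class, method, and parameter names below are made up for illustration and assume a skipper that exposes per-block min/max values:

```java
// Hypothetical sketch of the block-skipping idea: a block is safely
// skippable when its value range does not intersect the requested facet
// range at all, so none of its docs can be collected.
public final class BlockSkipSketch {
  public static boolean canSkipBlock(long blockMin, long blockMax, long rangeMin, long rangeMax) {
    return blockMax < rangeMin || blockMin > rangeMax;
  }
}
```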

* Count the number of doc IDs in this stream that are below the given {@code upTo}. These doc IDs
* may not be consumed again later.
*/
public int count(int upTo) throws IOException {
Contributor:

This only becomes an optimization if we specialize this method, right? The specializations I'm aware of rely on FixedBitSet#cardinality. Are you thinking of peeking into these bit sets to provide cardinality up to the specific doc? (Or maybe I'm missing something?)

Contributor Author:

Are you thinking of peeking into these bit sets to provide cardinality up to the specific doc? (Or maybe I'm missing something?)

Yes exactly. I have something locally already, I need to beef up testing a bit.

The bitset-based DocIdStream is one interesting implementation, the other interesting implementation is the one that is backed by a range of doc IDs that all match. It is internally used by queries that fully match a segment (e.g. PointRangeQuery when all the segment's values are contained in the query range, or MatchAllDocsQuery) or queries on fields that are part of (or correlate with) the index sort fields. See #14312 for reference.
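As a rough sketch (names hypothetical, not the actual Lucene classes), counting in such a range-backed stream needs no iteration at all:

```java
// Hypothetical sketch: a stream backed by a contiguous range [from, to) of
// doc IDs that all match can answer count(upTo) with pure arithmetic,
// clamping the bound to the range.
public final class RangeCountSketch {
  public static int count(int from, int to, int upTo) {
    return Math.max(0, Math.min(to, upTo) - from);
  }
}
```

This is why queries that fully match a segment benefit so much: the per-bucket work collapses to a subtraction.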

@jpountz jpountz (Contributor Author) commented Mar 26, 2025

If we have a skipper, I think we ought to also be able to use competitive iterators to jump over blocks of docs we know we won't collect based on their values?

This is correct. I plan on doing something similar when sorting: it is safe to skip blocks whose values all compare worse than the current k-th value. It's similar to what block-max WAND/MAXSCORE do: when a block's best possible score is less than the k-th best score so far, it can safely be skipped.
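The sort-skipping analogy can be reduced to a toy predicate (hypothetical names; this assumes an ascending numeric sort, where smaller values compare better):

```java
// Hypothetical sketch: under an ascending sort, a block whose minimum value
// already compares worse (greater) than the current k-th best value cannot
// contribute to the top-k and may be skipped entirely.
public final class SortSkipSketch {
  public static boolean canSkip(long blockMinValue, long kthBestValue) {
    return blockMinValue > kthBestValue;
  }
}
```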

@jpountz jpountz marked this pull request as ready for review March 26, 2025 23:18
@jpountz jpountz (Contributor Author) commented Mar 26, 2025

It should be ready for review now. Now that DocIdStream has become more sophisticated, I extracted impls to proper classes that could be better tested. This causes some diffs in our boolean scorers, hence the high number of lines changed.

@jpountz jpountz (Contributor Author) commented Mar 26, 2025

I'll try to run some simple benchmarks next.

@jpountz jpountz (Contributor Author) commented Mar 27, 2025

I played with the geonames dataset, by filtering out docs that don't have a value for the elevation field (2.3M docs left), enabling index sorting on the elevation field and computing histograms on the elevation field with a bucket width of 100.

| Query | Latency on main (ms) | Latency on branch (ms) |
|---|---|---|
| MatchAllDocsQuery (uses RangeDocIdStream under the hood) | 6.9 | 4.3 |
| featureClass:(S P) (matches spots or cities, 1.2M matching docs, uses BitSetDocIdStream under the hood) | 4.8 | 2.4 |

I also checked wikibigall, no slowdowns were detected.

@gsmiller gsmiller (Contributor) left a comment:

Left some minor feedback. Looks good though!

* Count the number of doc IDs in this stream that are below the given {@code upTo}. These doc IDs
* may not be consumed again later.
*/
// Note: it's abstract rather than having a default impl that delegates to #forEach because doing
Contributor:

Thanks for adding this comment! +1 to the rationale as well.

Contributor Author:

Ah right, it was based on your previous feedback. :)

final class DISIDocIdStream extends DocIdStream {

private final DocIdSetIterator iterator;
private final int to;
Contributor:

minor: maybe max to be consistent with the other implementations?

}
// If the collector is just interested in the count, loading in a bit set and counting bits is
// often faster than incrementing a counter on every call to nextDoc().
assert spare.scanIsEmpty();
Contributor:

Nice. I appreciate this assert in here!

cardinality += Long.bitCount(bits);
from += numBitsTilNextWord;
}

Contributor:

minor: what about assert (from & 0x3F) == 0; right here?

forEach(bits, from + base, consumer);
from += numBitsTilNextWord;
}

Contributor:

minor: same suggestion of assert (from & 0x3F) == 0;

Comment on lines +168 to +171
if (acceptDocs != null) {
// In this case, live docs have not been applied yet.
acceptDocs.applyMask(matching, base);
}
Contributor:

If I'm looking at this diff properly, I don't think you should need this block of code at all since you're still applying acceptDocs immediately prior?

Contributor Author:

Thanks for catching, this bit of code was mistakenly duplicated.


@Override
public void forEach(int upTo, CheckedIntConsumer<IOException> consumer) throws IOException {
if (upTo >= this.upTo) {
Contributor:

I think you want > here instead? (>= still functionally works since forEach is tolerant of the == case, but I think you want to short circuit if upTo == this.upTo?)

(Also possible I'm getting this wrong... but I did make the changes locally and all the testing still seems to pass)

Contributor Author:

You're right, I'll change this.

Contributor Author:

(For reference, it's not as much for short-circuiting as for the fact that logic under the if block would fail if trying to move backwards. But since we're doing a check anyway, I agree it's also worth excluding the trivial case when the range of docs to collect is empty.)


@Override
public int count(int upTo) throws IOException {
if (upTo >= this.upTo) {
Contributor:

Another place where I think you want >


@Override
public void forEach(int upTo, CheckedIntConsumer<IOException> consumer) throws IOException {
if (upTo >= this.upTo) {
Contributor:

Another place where I think you want >


@Override
public int count(int upTo) throws IOException {
if (upTo >= this.upTo) {
Contributor:

And one more spot where I think you want >?


// Now handle bits between the last complete word and to
if ((to & 0x3F) != 0) {
long bits = this.bits[to >> 6] << -to;
Contributor:

minor: one other small comment. I noticed that in the new forEach method you added, you use `long bits = this.bits[to >> 6] & ((1L << to) - 1);`. Is the rationale here that you need to preserve the correct number of trailing zeros in the forEach implementation, but not in this case, since you're doing a bit count? Is this approach (shifting by `-to`) more performant? (I ask since you could use the same approach as forEach here too for consistency, so I assume you had a reason ;))

Contributor Author:

You are right. We need the low `to % 64` bits of `bits[to >> 6]`. `x << -to` is nice because it requires fewer instructions, but it also moves bits to a different index. This makes iterating over set bits slightly more cumbersome, but it doesn't matter for counting bits, hence why forEach applies a mask instead. I haven't checked actual performance; I suspect it doesn't matter in practice.
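The equivalence between the shift and the mask for counting can be checked with a small snippet (hypothetical helper names; this relies on Java taking shift amounts for long modulo 64):

```java
// Hypothetical sketch: when (to & 0x3F) != 0, both expressions keep exactly
// the low (to % 64) bits of the word. The left shift by -to moves them to
// the top of the word, while the mask leaves them in place; the popcounts
// therefore agree even though the bit positions differ.
public final class ShiftTrickSketch {
  public static int countViaShift(long word, int to) {
    return Long.bitCount(word << -to); // shift amount is taken mod 64 in Java
  }

  public static int countViaMask(long word, int to) {
    return Long.bitCount(word & ((1L << to) - 1));
  }
}
```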

Contributor:

Thanks for the detailed explanation. This makes sense to me, it just left me scratching my head for a little bit initially to figure out the intention behind the different approaches :)

@gsmiller gsmiller (Contributor) left a comment:

Looks good. Thanks for incorporating the minor feedback!

@jpountz jpountz merged commit 82200c0 into apache:main Mar 28, 2025
7 checks passed
@jpountz jpountz deleted the speed_up_histogram branch March 28, 2025 14:52
@jpountz jpountz added this to the 10.2.0 milestone Mar 28, 2025
jpountz added a commit that referenced this pull request Mar 28, 2025
…#14273)

This attempts to generalize the `IndexSearcher#count` optimization from
PR #12415 to histogram facets by introducing specialization for counting
the number of matching docs in a range of doc IDs.

Currently, disjunctions and dense conjunctions both internally collect
`DocIdStream`s backed by a bitset. In the future, we could make more
queries collect whole `DocIdStream`s at once to speed up collection,
e.g. `MatchAllDocsQuery` or doc-value range queries that take advantage
of a sparse index.


3 participants