
Fix HistogramCollector to not create zero-count buckets. #14421

Merged
jpountz merged 4 commits into apache:main from jpountz:dont_create_zero_count_buckets
Mar 31, 2025

Conversation

@jpountz (Contributor) commented Mar 28, 2025

If a bucket in the middle of the range doesn't match any docs, it is returned with a count of zero. Better not to return it at all.
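The behavioral change can be illustrated with a minimal, self-contained sketch (hypothetical names; this is not Lucene's actual HistogramCollector code): when per-leaf counts are folded into the global map, buckets whose count is zero are simply skipped, so they never appear in the result.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ZeroCountBucketSketch {

  // counts[i] is the number of docs in bucket (leafMinBucket + i).
  // Zero-count entries are skipped instead of being added with count 0.
  static Map<Long, Integer> foldCounts(long leafMinBucket, int[] counts) {
    Map<Long, Integer> collectorCounts = new LinkedHashMap<>();
    for (int i = 0; i < counts.length; i++) {
      if (counts[i] != 0) { // the fix: don't create zero-count buckets
        collectorCounts.merge(leafMinBucket + i, counts[i], Integer::sum);
      }
    }
    return collectorCounts;
  }

  public static void main(String[] args) {
    // Docs fall in buckets 0 and 3; buckets 1 and 2 match nothing.
    Map<Long, Integer> result = foldCounts(0, new int[] {2, 0, 0, 5});
    System.out.println(result); // {0=2, 3=5}: no zero-count entries
  }
}
```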
@gsmiller (Contributor) left a comment


Good catch!

```java
.add(NumericDocValuesField.newSlowRangeQuery("f", Long.MIN_VALUE, 2), Occur.SHOULD)
.add(NumericDocValuesField.newSlowRangeQuery("f", 10, Long.MAX_VALUE), Occur.SHOULD)
.build();
actualCounts = searcher.search(query, new HistogramCollectorManager("f", 4));
```
Contributor

Initially, I was wondering if checkMaxBuckets(collectorCounts.size(), maxBuckets) might cause some of the earlier tests that expect an exception to fail. But all the tests with counts[i] == 0 use 1024 as maxBuckets, which is comfortably above collectorCounts.size().

Maybe we can specify 3 as maxBuckets for an even stronger condition:

```java
searcher.search(query, new HistogramCollectorManager("f", 4, 3));
```

```java
      collectorCounts.addTo(leafMinBucket + i, counts[i]);
    }
  }
  checkMaxBuckets(collectorCounts.size(), maxBuckets);
```
Contributor

While unrelated to this change, I am wondering if we should check eagerly to prevent unnecessary iterations:

```java
if (counts[i] != 0) {
  collectorCounts.addTo(leafMinBucket + i, counts[i]);
  checkMaxBuckets(collectorCounts.size(), maxBuckets);
}
```

We are doing similar validation in other places:

```java
if (bucket != prevBucket) {
  counts.addTo(bucket, 1);
  checkMaxBuckets(counts.size(), maxBuckets);
  prevBucket = bucket;
}
```

Contributor Author

I think it's ok the way it is. The end goal is to prevent unbounded heap allocation. In this case, the amount of excess heap we may allocate is bounded by 1024 entries, so I'd err on the side of simplicity by not checking the number of buckets in the loop?
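The trade-off discussed in this exchange can be sketched as follows (hypothetical helper names, not Lucene's actual code): checking once after the loop keeps the loop body simple, at the cost of a bounded overshoot of at most one leaf's worth of buckets before the exception fires.

```java
import java.util.HashMap;
import java.util.Map;

public class MaxBucketsSketch {

  // Throws if the global bucket map has grown past the configured cap.
  static void checkMaxBuckets(int size, int maxBuckets) {
    if (size > maxBuckets) {
      throw new IllegalStateException("too many buckets: " + size);
    }
  }

  // Post-loop variant: simpler loop body; excess heap is bounded by the
  // number of buckets a single leaf can contribute (1024 in the PR).
  static Map<Long, Integer> foldThenCheck(long minBucket, int[] counts, int maxBuckets) {
    Map<Long, Integer> global = new HashMap<>();
    for (int i = 0; i < counts.length; i++) {
      if (counts[i] != 0) {
        global.merge(minBucket + i, counts[i], Integer::sum);
      }
    }
    checkMaxBuckets(global.size(), maxBuckets); // one check per leaf
    return global;
  }
}
```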

Contributor

Sounds fair. I was expecting a low bound like 1024; just wanted to confirm!

jpountz added 3 commits March 31, 2025 12:23
If a bucket in the middle of the range doesn't match docs, it would be returned
with a count of zero. Better not return it at all.
@jpountz jpountz merged commit 076f4e4 into apache:main Mar 31, 2025
7 checks passed
@jpountz jpountz deleted the dont_create_zero_count_buckets branch March 31, 2025 14:41
jpountz added a commit that referenced this pull request Mar 31, 2025
If a bucket in the middle of the range doesn't match docs, it would be returned
with a count of zero. Better not return it at all.
jpountz added a commit that referenced this pull request Mar 31, 2025
If a bucket in the middle of the range doesn't match docs, it would be returned
with a count of zero. Better not return it at all.
@jpountz (Contributor, Author) commented Mar 31, 2025

I backported to branch_10_2 since this is a bugfix. cc @iverase
