Fix HistogramCollector to not create zero-count buckets.#14421
Fix HistogramCollector to not create zero-count buckets.#14421jpountz merged 4 commits intoapache:mainfrom
Conversation
If a bucket in the middle of the range doesn't match docs, it would be returned with a count of zero. Better not return it at all.
| .add(NumericDocValuesField.newSlowRangeQuery("f", Long.MIN_VALUE, 2), Occur.SHOULD) | ||
| .add(NumericDocValuesField.newSlowRangeQuery("f", 10, Long.MAX_VALUE), Occur.SHOULD) | ||
| .build(); | ||
| actualCounts = searcher.search(query, new HistogramCollectorManager("f", 4)); |
There was a problem hiding this comment.
Initially, I was wondering if checkMaxBuckets(collectorCounts.size(), maxBuckets) might cause some of the tests earlier expecting exception to fail. But, all the tests with counts[i] == 0, use 1024 as maxBuckets which is comfortably over collectorCounts.size()
Maybe, we can specify 3 as maxBuckets for even stronger condition:
searcher.search(query, new HistogramCollectorManager("f", 4, 3));
| collectorCounts.addTo(leafMinBucket + i, counts[i]); | ||
| } | ||
| } | ||
| checkMaxBuckets(collectorCounts.size(), maxBuckets); |
There was a problem hiding this comment.
While unrelated to this change, I am wondering if we should check eagerly to prevent unnecessary iterations:
if (counts[i] != 0) {
collectorCounts.addTo(leafMinBucket + i, counts[i]);
checkMaxBuckets(collectorCounts.size(), maxBuckets);
}
We are doing similar validation in other places:
if (bucket != prevBucket) {
counts.addTo(bucket, 1);
checkMaxBuckets(counts.size(), maxBuckets);
prevBucket = bucket;
}
There was a problem hiding this comment.
I think it's ok the way it is. The end goal is to prevent unbounded heap allocation. In this case, the amount of excess heap we may allocate is bounded by 1024 entries, so I'd err on the side of simplicity by not checking the number of buckets in the loop?
There was a problem hiding this comment.
Sounds fair. I was expecting low bound like 1024, just wanted to confirm!
If a bucket in the middle of the range doesn't match docs, it would be returned with a count of zero. Better not return it at all.
If a bucket in the middle of the range doesn't match docs, it would be returned with a count of zero. Better not return it at all.
If a bucket in the middle of the range doesn't match docs, it would be returned with a count of zero. Better not return it at all.
|
I backported to branch_10_2 since this is a bugfix cc @iverase |
If a bucket in the middle of the range doesn't match docs, it would be returned with a count of zero. Better not return it at all.