Save memory when rare_terms is not on top #57948
Conversation
This uses the optimization that we started making in elastic#55873 for `rare_terms` to save a bit of memory when that aggregation is not on the top level.
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
polyfractal
left a comment
Left a few comments, mostly around naming :) I only left them on the Long agg, but they apply equally to the String one.
Otherwise LGTM :)
```java
 * Can definitively say if a member does not exist (no false negatives), but may say an item exists
 * when it does not (has false positives). Similar in usage to a Bloom Filter.
 *
 * <p>
```
I hate javadocs so much :( optimizing rendered readability while sacrificing IDE readability :(
Thanks for fixing this :)
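The javadoc quoted above describes approximate set membership: a key that was added is always reported as possibly present, while a key that was never added may still be falsely reported present. A toy sketch of those semantics follows; `ToyApproxFilter` and its two-hash scheme are invented for illustration and are not the CuckooFilter that `rare_terms` actually uses:

```java
import java.util.BitSet;

// Toy approximate-membership filter: no false negatives, possible false
// positives. Hypothetical sketch, not the real rare_terms filter.
class ToyApproxFilter {
    private final BitSet bits = new BitSet(1 << 16);

    void add(long key) {
        // Set two hash-derived bits; bit collisions between different keys
        // are what cause false positives.
        bits.set(hash1(key));
        bits.set(hash2(key));
    }

    boolean mightContain(long key) {
        // If either bit is clear, the key was definitely never added.
        return bits.get(hash1(key)) && bits.get(hash2(key));
    }

    private int hash1(long key) {
        return (int) (key * 0x9E3779B97F4A7C15L >>> 48); // top 16 bits
    }

    private int hash2(long key) {
        return (int) (key * 0xC2B2AE3D27D4EB4FL >>> 48); // top 16 bits
    }
}
```

By construction, every added key sets both of its bits, so `mightContain` can never return `false` for an added key; only the reverse error is possible.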
```java
long keepCount = 0;
long[] mergeMap = new long[(int) bucketOrds.size()];
Arrays.fill(mergeMap, -1);
long size = 0;
```
Hmm, this is a bit confusingly named I think? Maybe `currentOffset` or something? Not sure, but `size` feels a bit confusing.
```java
LongRareTerms.Bucket bucket = new LongRareTerms.Bucket(ordsEnum.value(), docCount, null, format);
bucket.bucketOrd = mergeMap[(int) ordsEnum.ord()] = size + ordsToCollect.add(ordsEnum.value());
buckets.add(bucket);
keepCount++;
```
Should we just change this to a boolean flag? `hasDeletions` or whatever?
I think we need to perform the merge if we don't keep all the buckets. We can remove buckets for two reasons now:
- The key is above the threshold.
- The `owningBucketOrd` isn't selected.

This counter will catch both cases. I couldn't come up with a cleaner way to do it.
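The counting pattern under discussion might be sketched as follows. `mergeMap` and `keepCount` follow the quoted snippet; the keep/drop predicate is a hypothetical stand-in for the aggregator's real threshold and owning-ordinal checks:

```java
import java.util.Arrays;
import java.util.function.LongPredicate;

// Hedged sketch: count surviving buckets, and only signal a merge when some
// were dropped, for either of the two reasons above. Not the real aggregator.
class MergeSketch {
    /** Returns a merge map, or null when every bucket survived (no merge needed). */
    static long[] buildMergeMap(long[] bucketKeys, LongPredicate keep) {
        long[] mergeMap = new long[bucketKeys.length];
        Arrays.fill(mergeMap, -1);           // -1 marks dropped buckets
        long keepCount = 0;
        long nextOrd = 0;
        for (int ord = 0; ord < bucketKeys.length; ord++) {
            if (keep.test(bucketKeys[ord])) {
                mergeMap[ord] = nextOrd++;   // survivors get dense new ordinals
                keepCount++;
            }
        }
        // A single counter catches both drop reasons without tracking which applied.
        return keepCount < bucketKeys.length ? mergeMap : null;
    }
}
```

The point of the counter over a boolean is that it doubles as the number of surviving buckets, which the dense re-numbering needs anyway.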
```java
// need to take care of dups
for (int i = 0; i < valuesCount; ++i) {
    BytesRef bytes = values.nextValue();
    if (filter != null && !filter.accept(bytes)) {
```
Nit: `!filter.accept()` :)
(Also I realize the irony since the original code had that and it was my fault :) )
```java
// Make a note when one of the ords has been deleted
deletionCount += 1;
filter.add(oldKey);
```

```java
public InternalAggregation[] buildAggregations(long[] owningBucketOrds) throws IOException {
```
General comment about this method: we have a lot of "ords" being referenced and it's hard to keep track of which ord is which. E.g. we have the bucket ordinals that our parent is requesting we build, and then we have the bucket ordinals from each of those instances that we are collecting into buckets.

Not sure how, but if we could find a way to rename the variables to help identify or disambiguate them, I think it would help a bunch.
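One way to picture the disambiguation being asked for is below. The names `owningOrd` and `collectedOrd`, and the `Map` standing in for the aggregator's `bucketOrds` structure, are all invented for illustration, not the actual Elasticsearch variables:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: separate the ordinals the parent asks us to build ("owning" ords)
// from the ordinals of the buckets we collected under each of them.
class OrdNamingSketch {
    static List<String> buildAll(long[] owningOrds, Map<Long, List<Long>> collectedByOwner) {
        List<String> out = new ArrayList<>();
        for (long owningOrd : owningOrds) {
            // collectedOrd: ordinal this aggregator assigned to a bucket
            // collected under owningOrd.
            for (long collectedOrd : collectedByOwner.getOrDefault(owningOrd, List.of())) {
                out.add(owningOrd + ":" + collectedOrd);
            }
        }
        return out;
    }
}
```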
run elasticsearch-ci/default-distro

run elasticsearch-ci/1
nik9000
left a comment
I'll see about cleaning up the "ords ords ords ords" stuff too.
Thanks @polyfractal!