Save memory on numeric sig terms when not top#56789
Conversation
This saves memory when running numeric significant terms which are not at the top level by merging its collection into numeric terms and relying on the optimization that we made in elastic#55873.
|
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
|
Interesting! I ran some performance tests of this change and they don't look good. Something is up! before: https://gist.github.com/nik9000/28dc2a9e3309a33ea39d1c169462535e#file-gistfile1-txt-L110 This is pretty odd because when I was running some smaller tests locally everything looked faster. |
|
OK! I tracked it down. And I found a nice little improvement I could make. Before: https://gist.github.com/nik9000/28dc2a9e3309a33ea39d1c169462535e#file-gistfile1-txt-L110 About 66% performance improvement when the numeric |
|
@not-napoleon, I'm fairly comfortable with this now! I'll run some performance tests on the latest code, but I think it is an improvement, |
polyfractal
left a comment
There was a problem hiding this comment.
Left a few comments, pretty ok with it but allowing buckets to become mutable makes me a bit nervous. It was previously just contained to Sigterms and still not great, but now it'll "leak" to the whole terms family. We've had some hairy bugs in the past due to buckets being mutable (accidentally get's changed during reduction, or not understand that the bucket isn't "final" until later, etc).
It cleans up a lot of not-so-nice bits, so I don't want to throw the baby out with the bath water, but did want to mention it in case you had an idea. Not sure it'll be possible to "fix" since sigterms needs that functionality as it's currently designed.
|
|
||
| abstract Supplier<B> emptyBucketBuilder(long owningBucketOrd); | ||
|
|
||
| abstract void updateBucket(B spare, BucketOrdsEnum ordsEnum, long docCount) throws IOException; |
There was a problem hiding this comment.
Could we slap a TODO/warning on this? I put a warning on the old updateScore() method because it violates the bucket immutability that pretty much all other aggs have, and can lead to some really subtle bugs if you're not careful. And the new setup means it could affect any of the terms aggs, not just sigterms, so the surface area of accidental misuse is a lot larger.
There was a problem hiding this comment.
All of the buckets for all of the terms family are mutable. They all use that PriorityQueue/spare/updateTop stuff.
There was a problem hiding this comment.
Yeah, you're right I think I confused myself, mixing up the reason for that warning. I think it might have come about from Rollup or a pipeline issue because the score is only updated/added at reduction and not ctor, so anyone artificially creating/recreating a bucket will get bad data.
or something, but yeah you're right, term family buckets are already mutable 👍
| return result; | ||
| } | ||
|
|
||
| abstract String describe(); |
There was a problem hiding this comment.
Could we add some light javadocs to a few of these abstract methods? Still working through them so perhaps they are self-explanatory in code (or by method name). buildResult / buildBuckets buildSubAggs / buildResult / buildEmptyResult, etc
| * Lookup frequencies for terms. | ||
| */ | ||
| private static class LookupBackgroundFrequencies implements BackgroundFrequencies { | ||
| // TODO a reference to the factory is weird - probably should be reference to what we need from it. |
| boolean collectsFromSingleBucket, | ||
| Map<String, Object> metadata) throws IOException { | ||
|
|
||
| assert collectsFromSingleBucket; |
There was a problem hiding this comment.
Should we just throw an exception here? I'm assuming relatively bad things happen if this condition is violated and the user will get garbage results (or a different exception)?
There was a problem hiding this comment.
Garbage, I think. It enforced by the code below and it mirrors what we do with terms. And I plan to drop it before release anyway, but I can certainly switch it. Because plans don't always work out.
There was a problem hiding this comment.
Ok, cool. If it's enforced already no big deal 👍
) This saves memory when running numeric significant terms which are not at the top level by merging its collection into numeric terms and relying on the optimization that we made in elastic#55873.
This saves memory when running numeric significant terms which are not
at the top level by merging its collection into numeric terms and relying
on the optimization that we made in #55873.