Save a little space in agg tree#53730
Conversation
This drop the "top level" pipeline aggregators from the aggregation result tree which should save a little memory and a few serialization bytes. Perhaps more imporantly, this provides a mechanism by which we can remove *all* pipelines from the aggregation result tree. This will save quite a bit of space when pipelines are deep in the tree. Sadly, doing this isn't simple because of backwards compatibility. Nodes before 7.7.0 *need* those pipelines. We provide them by setting passing a `Supplier<PipelineTree>` into the root of the aggregation tree that we only call if we need to serialize to a version before 7.7.0. This solution works for cross cluster search because we always reduce the aggregations in each remote cluster and then forward them back to the coordinating node. Its quite possible that the coordinating node needs the pipeline (say it is version 7.1.0) and the gateway node in the remote cluster doesn't (version 7.7.0). In that case the data nodes won't send the pipeline aggregations back to the gateway node. Critically, the gateway node *will* send the pipeline aggregations back to the coordinating node. This is all managed with that `Supplier<PipelineTree>`, but *how* it is managed is a bit tricky.
|
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
polyfractal
left a comment
There was a problem hiding this comment.
Left a few comments mainly around testing. Think this looks good! Serialization changes always give me the cold sweats but I can't see any flaws currently with the approach.
| - match: { aggregations.cluster.buckets.0.key: "local_cluster" } | ||
| - match: { aggregations.cluster.buckets.0.doc_count: 5 } | ||
|
|
||
| # once more, this time with a pipeline agg |
There was a problem hiding this comment.
Hmm, should we maybe split this out into it's own test? Just thinking that long multi-step yaml tests can be tricky to debug sometimes.
Haven't looked at the rest of the tests in this yaml though, so it might not be easy to move the indexing (or whatever else) steps up to the setup though.
There was a problem hiding this comment.
I'll move it, sure! They can get hard to debug.
There was a problem hiding this comment.
Ok - this is a test that doesn't clear indices after it runs. So moving things around is a bit more complex than we'd like to be honest. I can do it, but I think it should wait for a followup.
There was a problem hiding this comment.
Ah ok, no worries then. not worth re-arranging everything for 👍
| * Setting the pipeline tree source to null is here is correct but | ||
| * only because we don't immediately pass the InternalAggregations | ||
| * off to another node. Instead, we always reduce together with | ||
| * many aggregations and that always adds the |
There was a problem hiding this comment.
Looks like the comment trails off without finishing it's thought :)
| - match: { aggregations.cluster.buckets.0.doc_count: 5 } | ||
|
|
||
| # once more, this time with a pipeline agg | ||
| - do: |
There was a problem hiding this comment.
Should we have a similar test for the rolling-upgrade module? Theoretically it should be the same as CCS, but it might also smoke out different issues due to heterogeneous serialization inside the same cluster (instead of funneling through a gateway).
There was a problem hiding this comment.
It'd be great to have a "mixed cluster CCS" test. I talked that one through with @javanna and we don't have one now and probably don't want to build one just for this.
| terms: | ||
| field: f1.keyword | ||
| aggs: | ||
| s: |
There was a problem hiding this comment.
Should we add a non-top-level pipeline agg just to confirm they aren't affected?
There was a problem hiding this comment.
I figured I'd get it in my next PR about non-top-level pipeline aggs, but I'm happy to do it now!
|
@polyfractal, I think this is ready for another round! |
|
Thanks @polyfractal ! |
|
I've updated all of the version numbers in the description to 7.8.0 because this is not making the 7.7.0 release train. |
This drop the "top level" pipeline aggregators from the aggregation result tree which should save a little memory and a few serialization bytes. Perhaps more imporantly, this provides a mechanism by which we can remove *all* pipelines from the aggregation result tree. This will save quite a bit of space when pipelines are deep in the tree. Sadly, doing this isn't simple because of backwards compatibility. Nodes before 7.7.0 *need* those pipelines. We provide them by setting passing a `Supplier<PipelineTree>` into the root of the aggregation tree that we only call if we need to serialize to a version before 7.7.0. This solution works for cross cluster search because we always reduce the aggregations in each remote cluster and then forward them back to the coordinating node. Its quite possible that the coordinating node needs the pipeline (say it is version 7.1.0) and the gateway node in the remote cluster doesn't (version 7.7.0). In that case the data nodes won't send the pipeline aggregations back to the gateway node. Critically, the gateway node *will* send the pipeline aggregations back to the coordinating node. This is all managed with that `Supplier<PipelineTree>`, but *how* it is managed is a bit tricky.
This drop the "top level" pipeline aggregators from the aggregation result tree which should save a little memory and a few serialization bytes. Perhaps more imporantly, this provides a mechanism by which we can remove *all* pipelines from the aggregation result tree. This will save quite a bit of space when pipelines are deep in the tree. Sadly, doing this isn't simple because of backwards compatibility. Nodes before 7.7.0 *need* those pipelines. We provide them by setting passing a `Supplier<PipelineTree>` into the root of the aggregation tree that we only call if we need to serialize to a version before 7.7.0. This solution works for cross cluster search because we always reduce the aggregations in each remote cluster and then forward them back to the coordinating node. Its quite possible that the coordinating node needs the pipeline (say it is version 7.1.0) and the gateway node in the remote cluster doesn't (version 7.7.0). In that case the data nodes won't send the pipeline aggregations back to the gateway node. Critically, the gateway node *will* send the pipeline aggregations back to the coordinating node. This is all managed with that `Supplier<PipelineTree>`, but *how* it is managed is a bit tricky.
|
Backport done! I'm going to leave the |
Updates a few versions in serialization because we didn't make the 7.7.0 release train.
This fixes pipeline aggregations used in cross cluster search from an older version of Elasticsearch to a newer version of Elasticsearch. I broke this in elastic#53730 when I was too aggressive in shutting off serialization of pipeline aggs. In particular, this comes up when the coordinating node is pre-7.8.0 and the gateway node is on or after 7.8.0. The fix is another step down the line to remove pipeline aggregators from the aggregation tree. Sort of. It create a new `List<PipelineAggregator>` member in `InternalAggregation` *but* it is only used for bwc serialization and it is fed by the mechanism established in elastic#53730 to read the pipelines from the
This fixes pipeline aggregations used in cross cluster search from an older version of Elasticsearch to a newer version of Elasticsearch. I broke this in #53730 when I was too aggressive in shutting off serialization of pipeline aggs. In particular, this comes up when the coordinating node is pre-7.8.0 and the gateway node is on or after 7.8.0. The fix is another step down the line to remove pipeline aggregators from the aggregation tree. Sort of. It create a new `List<PipelineAggregator>` member in `InternalAggregation` *but* it is only used for bwc serialization and it is fed by the mechanism established in #53730 to read the pipelines from the
This fixes pipeline aggregations used in cross cluster search from an older version of Elasticsearch to a newer version of Elasticsearch. I broke this in elastic#53730 when I was too aggressive in shutting off serialization of pipeline aggs. In particular, this comes up when the coordinating node is pre-7.8.0 and the gateway node is on or after 7.8.0. The fix is another step down the line to remove pipeline aggregators from the aggregation tree. Sort of. It create a new `List<PipelineAggregator>` member in `InternalAggregation` *but* it is only used for bwc serialization and it is fed by the mechanism established in elastic#53730 to read the pipelines from the
This fixes pipeline aggregations used in cross cluster search from an older version of Elasticsearch to a newer version of Elasticsearch. I broke this in #53730 when I was too aggressive in shutting off serialization of pipeline aggs. In particular, this comes up when the coordinating node is pre-7.8.0 and the gateway node is on or after 7.8.0. The fix is another step down the line to remove pipeline aggregators from the aggregation tree. Sort of. It create a new `List<PipelineAggregator>` member in `InternalAggregation` *but* it is only used for bwc serialization and it is fed by the mechanism established in #53730 to read the pipelines from the
This drop the "top level" pipeline aggregators from the aggregation
result tree which should save a little memory and a few serialization
bytes. Perhaps more imporantly, this provides a mechanism by which we
can remove all pipelines from the aggregation result tree. This will
save quite a bit of space when pipelines are deep in the tree.
Sadly, doing this isn't simple because of backwards compatibility. Nodes
before 7.8.0 need those pipelines. We provide them by setting passing
a
Supplier<PipelineTree>into the root of the aggregation tree that weonly call if we need to serialize to a version before 7.8.0.
This solution works for cross cluster search because we always reduce
the aggregations in each remote cluster and then forward them back to
the coordinating node. Its quite possible that the coordinating node
needs the pipeline (say it is version 7.1.0) and the gateway node in the
remote cluster doesn't (version 7.8.0). In that case the data nodes
won't send the pipeline aggregations back to the gateway node.
Critically, the gateway node will send the pipeline aggregations back
to the coordinating node. This is all managed with that
Supplier<PipelineTree>, but how it is managed is a bit tricky.