ESQL: Sum, Min, Max and Avg of constants#105454
Conversation
| // TODO: also allow aggregates once aggs on constants are supported. | ||
| // C.f. https://github.com/elastic/elasticsearch/issues/100634 |
There was a problem hiding this comment.
To be useful with aggs, we'd need to move PropagateEvalFoldables into the Substitutions batch and run this more than once - but we can worry about that later.
There was a problem hiding this comment.
Sure - it would be good to see whether it should be ran more than once since Propagate runs across the whole plan.
|
Hi @alex-spies, I've created a changelog YAML for you. |
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
costin
left a comment
There was a problem hiding this comment.
LGTM so far but before commit to this approach, add other common aggs as well such as Min/Max and a non regular one such as percentile so we use the same approach across all of them.
x-pack/plugin/esql/qa/testFixtures/src/main/resources/stats.csv-spec
Outdated
Show resolved
Hide resolved
...k/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/SurrogateExpression.java
Show resolved
Hide resolved
...k/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizer.java
Outdated
Show resolved
Hide resolved
|
Implemented I'd like to do |
| /** | ||
| * Expects | ||
| * Project[[s{r}#3, s_expr{r}#5, s_null{r}#7, w{r}#10]] | ||
| * \_Eval[[$$COUNT$s$0{r}#25 * 3[INTEGER] AS s, $$COUNT$s$0{r}#25 * 3.14[DOUBLE] AS s_expr, null[LONG] AS s_null]] |
There was a problem hiding this comment.
Ideally, we need to see in a unit test that sum([1,2]) is in fact mv_sum * count.
This test here is ok, but it doesn't really check the transformation we are doing for sum([1,2]).
There was a problem hiding this comment.
Updated - now the tests only check the result of the substitutions, and in particular we check whether we replace avg by mv_avg etc.
|
|
||
| var project = as(plan, Project.class); | ||
| var limit = as(project.child(), Limit.class); | ||
| var row = as(limit.child(), Row.class); |
There was a problem hiding this comment.
This is probably an error here with row.
There was a problem hiding this comment.
Discussed with Andrei:
- We both thought that it would be nicer to use a
LocalRelationwith 0 attributes and a subsequentEvalon top
that containss = mv_avg(...), ...- esp. given thatRowis mostly used in tests and not production. - It turns out that
LocalRelationwith 0 attributes is not correct, though, as this would automatically have 0 rows as well; it seems thatRowis the most fitting concept, here, after all.
There was a problem hiding this comment.
To stick closer to the original code in SubstituteSurrogates, I handle this case now by
- creating a placeholder
Rowwhen we end up not having any actual aggs, and - handling the evaluation of
s = mv_avg(...)in a subsequenteval- this required less change, actually, as that's how we handle theavgsurrogate substitution today.
There was a problem hiding this comment.
Use a LocalRelation instead of Row which returns the necessary field - in fact, we should retrofit Row on top of LocalRelation sooner rather than later (and look into removing the row operator as well).
This avoids special casing too much.
...lugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Max.java
Outdated
Show resolved
Hide resolved
| // TODO: also allow aggregates once aggs on constants are supported. | ||
| // C.f. https://github.com/elastic/elasticsearch/issues/100634 |
There was a problem hiding this comment.
Sure - it would be good to see whether it should be ran more than once since Propagate runs across the whole plan.
|
|
||
| var project = as(plan, Project.class); | ||
| var limit = as(project.child(), Limit.class); | ||
| var row = as(limit.child(), Row.class); |
There was a problem hiding this comment.
Use a LocalRelation instead of Row which returns the necessary field - in fact, we should retrofit Row on top of LocalRelation sooner rather than later (and look into removing the row operator as well).
| } else { | ||
| // All aggs actually have been surrogates for evals - empty aggregation, but there still needs to be 1 row. | ||
| // Put a bogus literal into the row since empty rows are not allowed. | ||
| plan = new Row(source, List.of(new Alias(Source.EMPTY, "$$placeholder", Literal.NULL))); |
There was a problem hiding this comment.
No need for this hack - create a LocalRelation with an output schema and no values:
plan = new LocalRelation(source, new Alias...,, LocalSupplier.EMPTY)
Btw, what is the point of the alias - who's the consumer? Why not use the initial plan output?
There was a problem hiding this comment.
Attempted using LocalRelation this way already - the problem is that new LocalRelation(source, new Alias...,, LocalSupplier.EMPTY) creates 0 rows, but I need exactly 1 row instead.
What I can do is doing a similar hack, using a placeholder attribute (or the original output schema), with a LocalRelation with just 1 row; this will avoid further relying on RowExec.
The point of the $$placeholder attribute is that currently, plans with empty rows do not seem to be allowed, resp. not handled correctly; it'd be cleaner if that worked, though, but making that work is beyond the scope of this PR IMHO.
| } else { | ||
| // All aggs actually have been surrogates for (foldable) expressions, e.g. | ||
| // | ||
| // ... | ||
| // | STATS a = AVG([1,2]), m = min([1,2,3]) | ||
| // | ||
| // The output needs to be 1 row containing the results of the folding. | ||
| // To achieve this, we replace the aggregation with a single row and let the subsequent Eval produce the values, so | ||
| // the plan becomes essentially | ||
| // | ||
| // ROW placeholder = null | ||
| // | EVAL a = MV_AVG([1,2]), m = MV_MIN([1,2,3]) | ||
| plan = emptyRow(source); | ||
| } |
There was a problem hiding this comment.
@costin and I discussed another, similar approach using LocalRelation, namely replacing
... | STATS a = AVG([1,2]), m = min([1,2,3])
by
LocalRelation(source, attributes, singleRowLocalSupplier)`
where attributes consists of a and m, and singleRowLocalSupplier has a single row containing the values MV_AVG([1,2]) and MV_MIN([1,2,3]) (folded).
I don't think this is the best approach for the following reasons:
- Creating the
singleRowLocalSupplierrequires manually creating blocks with the folded values - this means that in addition to ourConstantFoldingoptimizer rule, we have another place that does constant folding; and we need to manually create proper blocks this time rather than literals, meaning that more low-level block handling leaks into the planner code. - This does not benefit from optimization rules that prune unused
Evalexpressions, while the present approach does. - This opens up 1 special edge case in which we do not use an
Evalto compute the values of the surrogate expressions but need to do this manually; this may require additional edge case handling later on.
astefan
left a comment
There was a problem hiding this comment.
LGTM
Left one minor comment
| // All aggs actually have been surrogates for (foldable) expressions, e.g. | ||
| // | ||
| // ... | ||
| // | STATS a = AVG([1,2]), m = min([1,2,3]) | ||
| // | ||
| // The output needs to be 1 row containing the results of the folding. | ||
| // To achieve this, we replace the aggregation with a single row and let the subsequent Eval produce the values, so | ||
| // the plan becomes essentially | ||
| // | ||
| // ROW placeholder = null | ||
| // | EVAL a = MV_AVG([1,2]), m = MV_MIN([1,2,3]) |
There was a problem hiding this comment.
Imo, this a too long explanation which is also a bit misleading. Could be shorter, but it's ok.
ROW placeholder = null is the misleading part. There is essentially no row (row in itself a command and can be confusing) node created. And also there is no placeholder element. If you really want an esql command in there as an explanation: from _nothing_ | eval .....
costin
left a comment
There was a problem hiding this comment.
Left some minor comments however LGTM.
| /** | ||
| * Create an (essentially) empty row; the row still has 1 placeholder attribute since plans with empty schemas are pruned. | ||
| */ | ||
| private static LocalRelation emptyRow(Source source) { |
There was a problem hiding this comment.
This isn't necessary as the block is always the same; I'm pretty sure we have something similar declared somewhere else inside the code.
Just create the relationship in place no need for an emptyRow method.
Speaking of, propagate the source instead of passing in Source.EMPTY - that's never a good solution inside the planner.
There was a problem hiding this comment.
Inlined and shortened!
| // All aggs actually have been surrogates for (foldable) expressions, e.g. | ||
| // | ||
| // ... | ||
| // | STATS a = AVG([1,2]), m = min([1,2,3]) | ||
| // | ||
| // The output needs to be 1 row containing the results of the folding. | ||
| // To achieve this, we replace the aggregation with a single row and let the subsequent Eval produce the values, so | ||
| // the plan becomes essentially | ||
| // | ||
| // ROW placeholder = null | ||
| // | EVAL a = MV_AVG([1,2]), m = MV_MIN([1,2,3]) |
There was a problem hiding this comment.
What Andrei said. Better to have this comment inside SurrogateExpression for future guidance - here it should be just one or two lines, explaining what's going on.
Addresses part of #100634.
Allow expressions like
... | STATS sum([1, -9]), sum(null), min(21.0*3), avg([1,2,3]).This is achieved by substituting
sum(const)bymv_sum(const)*count(*)andmin(const)bymv_min(const)(and similarly formaxandavg).