Skip to content

ESQL: Sum, Min, Max and Avg of constants#105454

Merged
alex-spies merged 27 commits intoelastic:mainfrom
alex-spies:sum-of-const-in-esql
Mar 26, 2024
Merged

ESQL: Sum, Min, Max and Avg of constants#105454
alex-spies merged 27 commits intoelastic:mainfrom
alex-spies:sum-of-const-in-esql

Conversation

@alex-spies
Copy link
Copy Markdown
Contributor

@alex-spies alex-spies commented Feb 13, 2024

Addresses part of #100634.

Allow expressions like ... | STATS sum([1, -9]), sum(null), min(21.0*3), avg([1,2,3]).

This is achieved by substituting sum(const) by mv_sum(const)*count(*) and min(const) by mv_min(const) (and similarly for max and avg).

Comment on lines +504 to +505
// TODO: also allow aggregates once aggs on constants are supported.
// C.f. https://github.com/elastic/elasticsearch/issues/100634
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be useful with aggs, we'd need to move PropagateEvalFoldables into the Substitutions batch and run this more than once - but we can worry about that later.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure - it would be good to see whether it should be ran more than once since Propagate runs across the whole plan.

@alex-spies alex-spies marked this pull request as ready for review February 14, 2024 16:53
@alex-spies alex-spies marked this pull request as draft February 14, 2024 16:54
@alex-spies alex-spies marked this pull request as ready for review February 14, 2024 16:54
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @alex-spies, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Feb 14, 2024
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Copy Markdown
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so far but before commit to this approach, add other common aggs as well such as Min/Max and a non regular one such as percentile so we use the same approach across all of them.

@alex-spies
Copy link
Copy Markdown
Contributor Author

Implemented avg(const) and plan to add min/max as well.

I'd like to do percentile in a separate PR, though, as this PR's approach would require adding a mv_percentile function first: percentile([1,2,3], 75) would be a surrogate for the corresponding mv_-function then. I can implement median, though.

/**
* Expects
* Project[[s{r}#3, s_expr{r}#5, s_null{r}#7, w{r}#10]]
* \_Eval[[$$COUNT$s$0{r}#25 * 3[INTEGER] AS s, $$COUNT$s$0{r}#25 * 3.14[DOUBLE] AS s_expr, null[LONG] AS s_null]]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we need to see in a unit test that sum([1,2]) is in fact mv_sum * count.
This test here is ok, but it doesn't really check the transformation we are doing for sum([1,2]).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated - now the tests only check the result of the substitutions, and in particular we check whether we replace avg by mv_avg etc.


var project = as(plan, Project.class);
var limit = as(project.child(), Limit.class);
var row = as(limit.child(), Row.class);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably an error here with row.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with Andrei:

  • We both thought that it would be nicer to use a LocalRelation with 0 attributes and a subsequent Eval on top
    that contains s = mv_avg(...), ... - esp. given that Row is mostly used in tests and not production.
  • It turns out that LocalRelation with 0 attributes is not correct, though, as this would automatically have 0 rows as well; it seems that Row is the most fitting concept, here, after all.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To stick closer to the original code in SubstituteSurrogates, I handle this case now by

  • creating a placeholder Row when we end up not having any actual aggs, and
  • handling the evaluation of s = mv_avg(...) in a subsequent eval - this required less change, actually, as that's how we handle the avg surrogate substitution today.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a LocalRelation instead of Row which returns the necessary field - in fact, we should retrofit Row on top of LocalRelation sooner rather than later (and look into removing the row operator as well).

Copy link
Copy Markdown
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a round of comments

Comment on lines +504 to +505
// TODO: also allow aggregates once aggs on constants are supported.
// C.f. https://github.com/elastic/elasticsearch/issues/100634
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure - it would be good to see whether it should be ran more than once since Propagate runs across the whole plan.


var project = as(plan, Project.class);
var limit = as(project.child(), Limit.class);
var row = as(limit.child(), Row.class);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a LocalRelation instead of Row which returns the necessary field - in fact, we should retrofit Row on top of LocalRelation sooner rather than later (and look into removing the row operator as well).

} else {
// All aggs actually have been surrogates for evals - empty aggregation, but there still needs to be 1 row.
// Put a bogus literal into the row since empty rows are not allowed.
plan = new Row(source, List.of(new Alias(Source.EMPTY, "$$placeholder", Literal.NULL)));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this hack - create a LocalRelation with an output schema and no values:
plan = new LocalRelation(source, new Alias...,, LocalSupplier.EMPTY)

Btw, what is the point of the alias - who's the consumer? Why not use the initial plan output?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attempted using LocalRelation this way already - the problem is that new LocalRelation(source, new Alias...,, LocalSupplier.EMPTY) creates 0 rows, but I need exactly 1 row instead.

What I can do is doing a similar hack, using a placeholder attribute (or the original output schema), with a LocalRelation with just 1 row; this will avoid further relying on RowExec.

The point of the $$placeholder attribute is that currently, plans with empty rows do not seem to be allowed, resp. not handled correctly; it'd be cleaner if that worked, though, but making that work is beyond the scope of this PR IMHO.

@alex-spies alex-spies changed the title ESQL: Sum of constants ESQL: Sum, Min, Max and Avg of constants Mar 21, 2024
Comment on lines +253 to +266
} else {
// All aggs actually have been surrogates for (foldable) expressions, e.g.
//
// ...
// | STATS a = AVG([1,2]), m = min([1,2,3])
//
// The output needs to be 1 row containing the results of the folding.
// To achieve this, we replace the aggregation with a single row and let the subsequent Eval produce the values, so
// the plan becomes essentially
//
// ROW placeholder = null
// | EVAL a = MV_AVG([1,2]), m = MV_MIN([1,2,3])
plan = emptyRow(source);
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@costin and I discussed another, similar approach using LocalRelation, namely replacing

... | STATS a = AVG([1,2]), m = min([1,2,3])

by

LocalRelation(source, attributes, singleRowLocalSupplier)`

where attributes consists of a and m, and singleRowLocalSupplier has a single row containing the values MV_AVG([1,2]) and MV_MIN([1,2,3]) (folded).

I don't think this is the best approach for the following reasons:

  • Creating the singleRowLocalSupplier requires manually creating blocks with the folded values - this means that in addition to our ConstantFolding optimizer rule, we have another place that does constant folding; and we need to manually create proper blocks this time rather than literals, meaning that more low-level block handling leaks into the planner code.
  • This does not benefit from optimization rules that prune unused Eval expressions, while the present approach does.
  • This opens up 1 special edge case in which we do not use an Eval to compute the values of the surrogate expressions but need to do this manually; this may require additional edge case handling later on.

Copy link
Copy Markdown
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Left one minor comment

Comment on lines +254 to +264
// All aggs actually have been surrogates for (foldable) expressions, e.g.
//
// ...
// | STATS a = AVG([1,2]), m = min([1,2,3])
//
// The output needs to be 1 row containing the results of the folding.
// To achieve this, we replace the aggregation with a single row and let the subsequent Eval produce the values, so
// the plan becomes essentially
//
// ROW placeholder = null
// | EVAL a = MV_AVG([1,2]), m = MV_MIN([1,2,3])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo, this a too long explanation which is also a bit misleading. Could be shorter, but it's ok.
ROW placeholder = null is the misleading part. There is essentially no row (row in itself a command and can be confusing) node created. And also there is no placeholder element. If you really want an esql command in there as an explanation: from _nothing_ | eval .....

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified!

Copy link
Copy Markdown
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments however LGTM.

/**
* Create an (essentially) empty row; the row still has 1 placeholder attribute since plans with empty schemas are pruned.
*/
private static LocalRelation emptyRow(Source source) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary as the block is always the same; I'm pretty sure we have something similar declared somewhere else inside the code.
Just create the relationship in place no need for an emptyRow method.
Speaking of, propagate the source instead of passing in Source.EMPTY - that's never a good solution inside the planner.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inlined and shortened!

Comment on lines +254 to +264
// All aggs actually have been surrogates for (foldable) expressions, e.g.
//
// ...
// | STATS a = AVG([1,2]), m = min([1,2,3])
//
// The output needs to be 1 row containing the results of the folding.
// To achieve this, we replace the aggregation with a single row and let the subsequent Eval produce the values, so
// the plan becomes essentially
//
// ROW placeholder = null
// | EVAL a = MV_AVG([1,2]), m = MV_MIN([1,2,3])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What Andrei said. Better to have this comment inside SurrogateExpression for future guidance - here it should be just one or two lines, explaining what's going on.

@alex-spies alex-spies merged commit 829ea4d into elastic:main Mar 26, 2024
@alex-spies alex-spies deleted the sum-of-const-in-esql branch March 26, 2024 08:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.14.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants