ESQL: median, count and count_distinct over constants by alex-spies · Pull Request #107414 · elastic/elasticsearch

alex-spies · 2024-04-12T12:32:45Z

Second batch of aggregations over constants for #100634. Makes COUNT(constant) consistent and adds MEDIAN(const) and COUNT_DISTINCT(const).

Fix ESQL: COUNT inconsistent for multi-values #105248.
Fix ESQL: Inconsistent/buggy behavior for stats with null expression #104900.
Fix wrong stats pushdown when multiple COUNT aggs are in the same STATS, e.g. for FROM testidx | STATS s1 = count(1), rows = count(*)

elasticsearchmachine · 2024-04-12T12:33:08Z

Hi @alex-spies, I've created a changelog YAML for you.

elasticsearchmachine · 2024-04-17T09:13:35Z

Hi @alex-spies, I've updated the changelog YAML for you.

elasticsearchmachine · 2024-04-17T09:23:04Z

Hi @alex-spies, I've updated the changelog YAML for you.

elasticsearchmachine · 2024-04-17T09:40:44Z

Hi @alex-spies, I've updated the changelog YAML for you.

alex-spies · 2024-04-17T10:39:47Z

...ql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizerTests.java

-     *       \_EsStatsQueryExec[test], stats[Stat[name=*, type=COUNT, query=null], Stat[name=*, type=COUNT, query=null]]],
-     *         query[{"esql_single_value":{"field":"emp_no","next":{"range":{"emp_no":{"gt":10010,"boost":1.0}}},
-     *         "source":"emp_no > 10010@2:9"}}][count{r}#23, seen{r}#24, count{r}#25, seen{r}#26], limit[],


| stats c = count(), call = count(*), c_literal = count(1) produced an EsStatsQueryExec that wasn't plannable by the LocalExecutionPlanner (EsStatsQuery currently supports only one field statistic).

I think a way to properly fix this is:

* Normalize count(*), count(1), count(), count("foobar") all to count(*) as part of surrogate substitution.
* Deduplicate the counts. This requires refactoring SubstituteExpressions, as its deduplication doesn't currently work well enough for this approach, and there's overlap with the deduplication happening in ReplaceStatsAggExpressionWithEval.

Once the count normalization happens, the deduplication should kick in (since all counts will be the same) and this type of query should work.
The critical bit here is normalization, so that count(), count(1) and count(1+1+2) all fold back to the same expression, count(1).
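
The normalize-then-deduplicate idea from this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: `CountAgg`, `normalize` and `deduplicate` are hypothetical names, not classes or methods from the Elasticsearch code base, and the sketch picks count(*) as the canonical form (the thread also mentions count(1) as an option).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CountNormalizationSketch {
    // Hypothetical stand-in for one COUNT aggregate in a STATS command:
    // the output column name, the argument as written, and whether that
    // argument is a constant that is known not to be null.
    record CountAgg(String outputName, String argument, boolean nonNullConstant) {}

    // Rewrite count(), count(*) and count(<non-null constant>) to the canonical "*".
    // Anything else (for example a field name) is left untouched.
    static String normalize(CountAgg agg) {
        boolean countsRows = agg.argument().isEmpty()
            || agg.argument().equals("*")
            || agg.nonNullConstant();
        return countsRows ? "*" : agg.argument();
    }

    // Group output names by normalized argument. Each key needs to be
    // computed only once; the names in its list can all reuse the result.
    static Map<String, List<String>> deduplicate(List<CountAgg> aggs) {
        Map<String, List<String>> byArgument = new LinkedHashMap<>();
        for (CountAgg agg : aggs) {
            byArgument.computeIfAbsent(normalize(agg), k -> new ArrayList<>()).add(agg.outputName());
        }
        return byArgument;
    }

    public static void main(String[] args) {
        List<CountAgg> aggs = List.of(
            new CountAgg("c", "", false),
            new CountAgg("call", "*", false),
            new CountAgg("c_literal", "1", true),
            new CountAgg("c_field", "emp_no", false)
        );
        // Three of the four aggregates collapse onto the single key "*".
        System.out.println(deduplicate(aggs));
    }
}
```

With the three row-counting aggregates collapsed onto one key, only a single count(*) would remain to be pushed down, which is the situation EsStatsQueryExec is described as supporting.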

elasticsearchmachine · 2024-04-17T16:35:16Z

Hi @alex-spies, I've updated the changelog YAML for you.

elasticsearchmachine · 2024-04-17T16:35:16Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine · 2024-04-17T17:19:42Z

Hi @alex-spies, I've updated the changelog YAML for you.

fang-xing-esql · 2024-04-17T21:48:19Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/stats.csv-spec

+;
+
+s1:l | s_mv:l | s_null:l | s_param:l
+1    | 4      | 0        | 4


s_mv = s_param looks weird.

The optional second argument (precision) tricked me before I knew that count_distinct is an approximate distinct count. It is irrelevant to the issue this PR addresses, though.

Some databases offer two ways to count distinct values; perhaps we can consider this later:
count(distinct ): an exact count, which may be slow on large datasets
approximate_count_distinct function: an estimate
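
To illustrate the trade-off between the two styles mentioned above, here is a minimal sketch. The approximate variant uses linear counting over a fixed-size bitmap, chosen only because it fits in a few lines; it is an assumption made for illustration and not the estimator behind count_distinct. All names (`DistinctCountSketch`, `exactCountDistinct`, `approxCountDistinct`, `mix`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;

class DistinctCountSketch {
    // Exact: remembers every distinct value, so memory grows with cardinality.
    static long exactCountDistinct(List<String> values) {
        return new HashSet<>(values).size();
    }

    // Approximate: linear counting. Memory is fixed at bitmapBits bits,
    // and the result is an estimate rather than an exact count.
    static long approxCountDistinct(List<String> values, int bitmapBits) {
        BitSet bitmap = new BitSet(bitmapBits);
        for (String v : values) {
            bitmap.set(Math.floorMod(mix(v.hashCode()), bitmapBits));
        }
        int zeroBits = bitmapBits - bitmap.cardinality();
        if (zeroBits == 0) {
            return bitmapBits; // bitmap saturated: the true count is at least this large
        }
        // n is estimated as -m * ln(V), where V is the fraction of bits still zero.
        return Math.round(-bitmapBits * Math.log((double) zeroBits / bitmapBits));
    }

    // 32-bit finalizer from MurmurHash3, used here to spread String.hashCode() bits.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>();
        for (int round = 0; round < 3; round++) {
            for (int i = 0; i < 1000; i++) {
                values.add("user-" + i);
            }
        }
        System.out.println("exact:  " + exactCountDistinct(values));
        System.out.println("approx: " + approxCountDistinct(values, 1 << 16));
    }
}
```

The exact variant always returns the true cardinality but has to hold every distinct value; the approximate variant trades a small error for bounded memory, which is the kind of choice the optional precision argument exposes.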

elasticsearchmachine · 2024-04-18T07:40:41Z

Hi @alex-spies, I've updated the changelog YAML for you.

costin

Looks good overall. Since count(null) is problematic, it makes sense to extract that piece of code and backport it to 8.14 as a bug fix.

costin · 2024-04-22T07:12:26Z

...gin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Count.java

+        var field = field();
+
+        if (field.foldable()) {
+            if (field instanceof Literal l) {


Use field.fold(), since the field can be an expression such as 1 + 1 and you want to return the surrogate before the rest of the optimizer rules kick in.

Would love to just fold, but folding here is not safe unless we check if this will fold to null, first. That would require duplicating logic from FoldNull, though, which I'd prefer to avoid.

(Our evaluators cannot deal with null literals in general, and folding here results in the following exception when running e.g. count(to_double(null)):

illegal data type [null]
org.elasticsearch.xpack.esql.EsqlIllegalArgumentException: illegal data type [null]
    at __randomizedtesting.SeedInfo.seed([DBCF40F13A40EC92:539B7F2B94BC816A]:0)
    at org.elasticsearch.xpack.esql.CsvTests.stats.Asdf(stats.csv-spec:1708)
    at org.elasticsearch.xpack.esql.EsqlIllegalArgumentException.illegalDataType(EsqlIllegalArgumentException.java:43)
    at org.elasticsearch.xpack.esql.EsqlIllegalArgumentException.illegalDataType(EsqlIllegalArgumentException.java:39)
    at org.elasticsearch.xpack.esql.expression.function.scalar.convert.AbstractConvertFunction.evaluator(AbstractConvertFunction.java:65)
    at org.elasticsearch.xpack.esql.expression.function.scalar.convert.AbstractConvertFunction.toEvaluator(AbstractConvertFunction.java:105)
    at org.elasticsearch.xpack.esql.evaluator.mapper.EvaluatorMapper.fold(EvaluatorMapper.java:48)
    at org.elasticsearch.xpack.esql.expression.function.scalar.EsqlScalarFunction.fold(EsqlScalarFunction.java:30)

)

Consider the rare case of count(1+1) - there might be count(1+null). This can be moved to a separate issue though to not block this one.

It makes sense to add more tests to account for these cases where an expression is evaluated, esp. to null. I'll do that in another PR, as this one should be backported to 8.14 as soon as possible.

These cases are covered in this PR, it's just that they result in an extra eval+projection.

| STATS count(1+null)

becomes something equivalent to

| STATS x = count(*) | EVAL `count(1+null)` = COALESCE(MV_COUNT(1+null), 0) * x | KEEP `count(1+null)`

which after folding becomes a multiplication of count(*) with 0. If we wanted to, we could further optimize this case to avoid the count(*) as well.
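
The arithmetic behind this rewrite can be checked with a small sketch. It models a constant as a possibly null list of values and mirrors COALESCE(MV_COUNT(const), 0) * count(*); the class and method names are hypothetical and only stand in for the ES|QL functions of the same spirit.

```java
import java.util.List;

class CountOverConstantSketch {
    // MV_COUNT analogue: number of values in a (possibly multi-valued) constant,
    // or null when the constant itself is null.
    static Integer mvCount(List<?> constant) {
        return constant == null ? null : constant.size();
    }

    // COALESCE(x, 0) analogue: replace null with 0.
    static int coalesceToZero(Integer value) {
        return value == null ? 0 : value;
    }

    // count(const) = COALESCE(MV_COUNT(const), 0) * count(*)
    static long countOverConstant(List<?> constant, long rowCount) {
        return (long) coalesceToZero(mvCount(constant)) * rowCount;
    }

    public static void main(String[] args) {
        long rows = 100;
        System.out.println(countOverConstant(List.of(1), rows));       // single value: one per row
        System.out.println(countOverConstant(List.of(1, 2, 3), rows)); // multi-value: three per row
        System.out.println(countOverConstant(null, rows));             // null (e.g. 1 + null): zero
    }
}
```

A single-valued constant yields the row count, a multi-valued constant yields the row count times its number of values, and a null constant yields 0 regardless of how many rows there are.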

Opened a PR with additional tests: #107888

costin · 2024-04-22T07:13:53Z

...gin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Count.java

+            return new Mul(
+                s,
+                new Coalesce(s, new MvCount(s, field), List.of(new Literal(s, 0, DataTypes.INTEGER))),
+                new Count(s, new Literal(s, StringUtils.WILDCARD, DataTypes.KEYWORD))


Elsewhere in the code we use the convention of Count(1) - let's apply that here as well.

Hm, I tried to look for other places where Count(1) is used, but couldn't verify this (for ESQL at least).

Parsing count() results in new Count(source, "*", KEYWORD), we also use the same in Sum.surrogate() and the (currently disabled) NormalizeAggregate used to replace all count(literal) by count(*) as well.

For consistency, I'd like to stick to count(*) here.

costin · 2024-04-22T07:17:32Z

...ql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LocalPhysicalPlanOptimizerTests.java

-     *       \_EsStatsQueryExec[test], stats[Stat[name=*, type=COUNT, query=null], Stat[name=*, type=COUNT, query=null]]],
-     *         query[{"esql_single_value":{"field":"emp_no","next":{"range":{"emp_no":{"gt":10010,"boost":1.0}}},
-     *         "source":"emp_no > 10010@2:9"}}][count{r}#23, seen{r}#24, count{r}#25, seen{r}#26], limit[],


Once the count normalization happens, the deduplication should kick in (since all counts will be the same) and this type of query should work.
The critical bit here is normalization, so that count(), count(1) and count(1+1+2) all fold back to the same expression, count(1).

* Make COUNT(constant) consistent.
* Add MEDIAN(const) and COUNT_DISTINCT(const).
* Fix wrong stats pushdown when multiple COUNT aggs are in the same STATS

elasticsearchmachine · 2024-04-23T08:08:38Z

💚 Backport successful

Status	Branch	Result
✅	8.14

#107749)
* Make COUNT(constant) consistent.
* Add MEDIAN(const) and COUNT_DISTINCT(const).
* Fix wrong stats pushdown when multiple COUNT aggs are in the same STATS

alex-spies added 6 commits April 12, 2024 14:27

Cherry-pick: Fix leftover references to pruned columns in aggs

b15c7c9

Needed to make this work, needs to be removed via rebasing.

Cherry-pick: Update LogicalPlanOptimizerTests

3732546

This needs to be removed by rebasing before merging.

median(const)

ba6f831

count(const)

1612edd

count_distinct(const)

0713bf8

Attempt: Normalize count(), count(*), count(1) etc.

c9cd16d

Failed because then duplicates of count(*) occur which need to be pruned.

alex-spies added the >enhancement, :Analytics/ES|QL (AKA ESQL), and v8.14.0 labels Apr 12, 2024

Update docs/changelog/107414.yaml

a32e88e

alex-spies changed the title ~~ESQL: median, count and count_distinct over constans~~ ESQL: median, count and count_distinct over constants Apr 12, 2024

alex-spies mentioned this pull request Apr 12, 2024

Support aggregations across constants #100634

Open

4 tasks

Fix PushStatsToSource

491b5ba

This pushed stats to source iff there is only one stat name among all the stats. This can be wrong if multiple stats share the same name, e.g. when encountering two copies of COUNT(*).
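
The failure mode described in this commit message can be sketched as below. The actual change to PushStatsToSource is not shown on this page, so both checks are assumptions written for illustration; `Stat` here is a hypothetical record that only mimics the `Stat[name=*, type=COUNT, ...]` entries seen in the plan output.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PushStatsSketch {
    // Hypothetical mirror of the Stat entries printed in the physical plan.
    record Stat(String name, String type) {}

    // Flawed check as described: it looks at distinct names only,
    // so two copies of COUNT(*) look like a single stat.
    static boolean canPushByDistinctNames(List<Stat> stats) {
        Set<String> names = new HashSet<>();
        for (Stat stat : stats) {
            names.add(stat.name());
        }
        return names.size() == 1;
    }

    // Stricter check: push down only when there is exactly one stat.
    static boolean canPushBySingleStat(List<Stat> stats) {
        return stats.size() == 1;
    }

    public static void main(String[] args) {
        List<Stat> twoCounts = List.of(new Stat("*", "COUNT"), new Stat("*", "COUNT"));
        System.out.println(canPushByDistinctNames(twoCounts)); // true: wrongly treated as pushable
        System.out.println(canPushBySingleStat(twoCounts));    // false: two stats are present
    }
}
```

Counting stats rather than distinct names keeps a plan with two COUNT(*) entries away from a pushdown path that is described as supporting only one field statistic.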

alex-spies added the >bug label Apr 17, 2024

Update docs/changelog/107414.yaml

581ade2

Update docs/changelog/107414.yaml

c47bab1

alex-spies added 3 commits April 17, 2024 11:26

Make the changelog valid again

dfe53fa

Fix typo in changelog

cf1ef3c

Update docs/changelog/107414.yaml

ae879a4

alex-spies added 2 commits April 17, 2024 11:46

Fix changelog, again...

a77dc08

Update test

7706d5f

alex-spies commented Apr 17, 2024

Merge remote-tracking branch 'upstream/main' into esql-more-aggs-on-c…

6cea094

…onsts

alex-spies marked this pull request as ready for review April 17, 2024 16:34

alex-spies requested review from astefan and costin April 17, 2024 16:35

alex-spies requested a review from fang-xing-esql April 17, 2024 16:35

elasticsearchmachine added the Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)) label Apr 17, 2024

Update docs/changelog/107414.yaml

7ac993a

Fix changelog, yet again

626d150

elasticsearchmachine added v8.15.0 and removed v8.14.0 labels Apr 17, 2024

Update docs/changelog/107414.yaml

340c10f

fang-xing-esql reviewed Apr 17, 2024

alex-spies added v8.14.0 auto-backport-and-merge and removed >enhancement labels Apr 18, 2024

alex-spies added 2 commits April 18, 2024 09:40

Update docs/changelog/107414.yaml

d7fff22

Merge remote-tracking branch 'upstream/main' into esql-more-aggs-on-c…

2dc3eea

…onsts

alex-spies added auto-backport (Automatically create backport pull requests when merged) and removed auto-backport-and-merge labels Apr 18, 2024

alex-spies added 2 commits April 18, 2024 09:50

Update csv test skips for 8.15

dd47b9c

One more skip update

496a244

costin approved these changes Apr 22, 2024

Merge remote-tracking branch 'upstream/main' into esql-more-aggs-on-c…

cf10cf9

…onsts

alex-spies merged commit d966147 into elastic:main Apr 23, 2024

alex-spies deleted the esql-more-aggs-on-consts branch April 23, 2024 08:07

alex-spies mentioned this pull request Apr 23, 2024

[8.14] ESQL: median, count and count_distinct over constants (#107414) #107749

Merged
