Skip to content

ESQL: median, count and count_distinct over constants#107414

Merged
alex-spies merged 24 commits intoelastic:mainfrom
alex-spies:esql-more-aggs-on-consts
Apr 23, 2024
Merged

ESQL: median, count and count_distinct over constants#107414
alex-spies merged 24 commits intoelastic:mainfrom
alex-spies:esql-more-aggs-on-consts

Conversation

@alex-spies
Copy link
Copy Markdown
Contributor

@alex-spies alex-spies commented Apr 12, 2024

Second batch of aggregations over constants for #100634 . Makes COUNT(constant) consistent and adds MEDIAN(const) and COUNT_DISTINCT(const).

Needed to make this work, needs to be removed via rebasing.
This needs to be removed by rebasing before merging.
Failed because then duplicates of count(*) occur which need to be
pruned.
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @alex-spies, I've created a changelog YAML for you.

@alex-spies alex-spies changed the title ESQL: median, count and count_distinct over constans ESQL: median, count and count_distinct over constants Apr 12, 2024
This pushed stats to source iff there is only one stat name among all
the stats. This can be wrong if multiple stats share the same name, e.g.
when encountering two copies of COUNT(*).
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @alex-spies, I've updated the changelog YAML for you.

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @alex-spies, I've updated the changelog YAML for you.

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @alex-spies, I've updated the changelog YAML for you.

Comment on lines -379 to -381
* \_EsStatsQueryExec[test], stats[Stat[name=*, type=COUNT, query=null], Stat[name=*, type=COUNT, query=null]]],
* query[{"esql_single_value":{"field":"emp_no","next":{"range":{"emp_no":{"gt":10010,"boost":1.0}}},
* "source":"emp_no > 10010@2:9"}}][count{r}#23, seen{r}#24, count{r}#25, seen{r}#26], limit[],
Copy link
Copy Markdown
Contributor Author

@alex-spies alex-spies Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| stats c = count(), call = count(*), c_literal = count(1) produced an EsStatsQueryExec that wasn't plannable by the LocalExecutionPlanner (EsStatsQuery currently supports only one field statistic).

I think a way to properly fix this is:

  • normalize count(*), count(1), count(), count("foobar") all to count(*) as part of surrogate substitution
  • deduplicate the counts - this requires refactoring SubstituteExpressions as its deduplication doesn't currently work well enough for this approach, and there's overlap with the deduplication happening in ReplaceStatsAggExpressionWithEval.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the count normalization happens, the deduplication should kick in (since all counts will be the same) and this type of query should work.
The critical bit here is normalization so count(), count(1), count(), count() and count(1+1+2) all fold back to the same expression of count(1).

@alex-spies alex-spies marked this pull request as ready for review April 17, 2024 16:34
@alex-spies alex-spies requested review from astefan and costin April 17, 2024 16:35
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 17, 2024
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @alex-spies, I've updated the changelog YAML for you.

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @alex-spies, I've updated the changelog YAML for you.

;

s1:l | s_mv:l | s_null:l | s_param:l
1 | 4 | 0 | 4
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s_mv = s_param looks weird.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s_mv = s_param looks weird.

The second/optional argument(precision), tricked me, before knowing the count_distinct is approximate count distinct. It is irrelevant to the issue this PR addresses though.

Some databases offers two ways to count distinct, perhaps we can consider later:
count(distinct ): this is an exact count, which may perform slow on large dataset
approximate_count_distinct function: this is an estimation.

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @alex-spies, I've updated the changelog YAML for you.

@alex-spies alex-spies added auto-backport Automatically create backport pull requests when merged and removed auto-backport-and-merge labels Apr 18, 2024
Copy link
Copy Markdown
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good over all. Since count(null) is problematic it makes sense to extract that piece of code and backport it to 8.14 as a bug fix.

var field = field();

if (field.foldable()) {
if (field instanceof Literal l) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use field.fold() since the field can be an expression such as 1 + 1 and you want to return the surrogate before the rest of the optimizer rules kick in.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would love to just fold, but folding here is not safe unless we check if this will fold to null, first. That would require duplicating logic from FoldNull, though, which I'd prefer to avoid.

(Our evaluators cannot deal with null literals in general, and folding here results in the following exception when running e.g. count(to_double(null)):

illegal data type [null]
org.elasticsearch.xpack.esql.EsqlIllegalArgumentException: illegal data type [null]
	at __randomizedtesting.SeedInfo.seed([DBCF40F13A40EC92:539B7F2B94BC816A]:0)
	at org.elasticsearch.xpack.esql.CsvTests.stats.Asdf(stats.csv-spec:1708)
	at org.elasticsearch.xpack.esql.EsqlIllegalArgumentException.illegalDataType(EsqlIllegalArgumentException.java:43)
	at org.elasticsearch.xpack.esql.EsqlIllegalArgumentException.illegalDataType(EsqlIllegalArgumentException.java:39)
	at org.elasticsearch.xpack.esql.expression.function.scalar.convert.AbstractConvertFunction.evaluator(AbstractConvertFunction.java:65)
	at org.elasticsearch.xpack.esql.expression.function.scalar.convert.AbstractConvertFunction.toEvaluator(AbstractConvertFunction.java:105)
	at org.elasticsearch.xpack.esql.evaluator.mapper.EvaluatorMapper.fold(EvaluatorMapper.java:48)
	at org.elasticsearch.xpack.esql.expression.function.scalar.EsqlScalarFunction.fold(EsqlScalarFunction.java:30)

)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the rare case of count(1+1) - there might be count(1+null). This can be moved to a separate issue though to not block this one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to add more tests to account for these cases where an expression is evaluated, esp. to null. I'll do that in another PR, as this one should be backported to 8.14 as soon as possible.

These cases are covered in this PR, it's just that they result in an extra eval+projection.

| STATS count(1+null)

becomes something equivalent to

| STATS x = count(*)
| EVAL `count(1+null)` = COALESCE(MV_COUNT(1+null), 0) * x
| KEEP `count(1+null)`

which after folding becomes a multiplication of count(*) with 0. If we wanted to, we could further optimize this case to avoid the count(*) as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened a PR with additional tests: #107888

return new Mul(
s,
new Coalesce(s, new MvCount(s, field), List.of(new Literal(s, 0, DataTypes.INTEGER))),
new Count(s, new Literal(s, StringUtils.WILDCARD, DataTypes.KEYWORD))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere in the code we use the convention of Count(1) - let's apply that here as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I tried to look for other places where Count(1) is used, but couldn't verify this (for ESQL at least).

Parsing count() results in new Count(source, "*", KEYWORD), we also use the same in Sum.surrogate() and the (currently disabled) NormalizeAggregate used to replace all count(literal) by count(*) as well.

For consistency, I'd like to stick to count(*) here.

Comment on lines -379 to -381
* \_EsStatsQueryExec[test], stats[Stat[name=*, type=COUNT, query=null], Stat[name=*, type=COUNT, query=null]]],
* query[{"esql_single_value":{"field":"emp_no","next":{"range":{"emp_no":{"gt":10010,"boost":1.0}}},
* "source":"emp_no > 10010@2:9"}}][count{r}#23, seen{r}#24, count{r}#25, seen{r}#26], limit[],
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the count normalization happens, the deduplication should kick in (since all counts will be the same) and this type of query should work.
The critical bit here is normalization so count(), count(1), count(), count() and count(1+1+2) all fold back to the same expression of count(1).

@alex-spies alex-spies merged commit d966147 into elastic:main Apr 23, 2024
@alex-spies alex-spies deleted the esql-more-aggs-on-consts branch April 23, 2024 08:07
alex-spies added a commit to alex-spies/elasticsearch that referenced this pull request Apr 23, 2024
* Make COUNT(constant) consistent.
* Add MEDIAN(const) and COUNT_DISTINCT(const).
* Fix wrong stats pushdown when multiple COUNT aggs are in the same STATS
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

💚 Backport successful

Status Branch Result
8.14

alex-spies added a commit that referenced this pull request Apr 23, 2024
#107749)

* Make COUNT(constant) consistent.
* Add MEDIAN(const) and COUNT_DISTINCT(const).
* Fix wrong stats pushdown when multiple COUNT aggs are in the same STATS
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.14.0 v8.15.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ESQL: COUNT inconsistent for multi-values ESQL: Inconsistent/buggy behavior for stats with null expression

4 participants