ESQL: median, count and count_distinct over constants#107414
ESQL: median, count and count_distinct over constants#107414alex-spies merged 24 commits intoelastic:mainfrom
Conversation
Needed to make this work, needs to be removed via rebasing.
This needs to be removed by rebasing before merging.
Failed because then duplicates of count(*) occur which need to be pruned.
|
Hi @alex-spies, I've created a changelog YAML for you. |
This pushed stats to source iff there is only one stat name among all the stats. This can be wrong if multiple stats share the same name, e.g. when encountering two copies of COUNT(*).
|
Hi @alex-spies, I've updated the changelog YAML for you. |
|
Hi @alex-spies, I've updated the changelog YAML for you. |
|
Hi @alex-spies, I've updated the changelog YAML for you. |
| * \_EsStatsQueryExec[test], stats[Stat[name=*, type=COUNT, query=null], Stat[name=*, type=COUNT, query=null]]], | ||
| * query[{"esql_single_value":{"field":"emp_no","next":{"range":{"emp_no":{"gt":10010,"boost":1.0}}}, | ||
| * "source":"emp_no > 10010@2:9"}}][count{r}#23, seen{r}#24, count{r}#25, seen{r}#26], limit[], |
There was a problem hiding this comment.
| stats c = count(), call = count(*), c_literal = count(1) produced an EsStatsQueryExec that wasn't plannable by the LocalExecutionPlanner (EsStatsQuery currently supports only one field statistic).
I think a way to properly fix this is:
- normalize
count(*), count(1), count(), count("foobar")all tocount(*)as part of surrogate substitution - deduplicate the counts - this requires refactoring
SubstituteExpressionsas its deduplication doesn't currently work well enough for this approach, and there's overlap with the deduplication happening inReplaceStatsAggExpressionWithEval.
There was a problem hiding this comment.
Once the count normalization happens, the deduplication should kick in (since all counts will be the same) and this type of query should work.
The critical bit here is normalization so count(), count(1), count(), count() and count(1+1+2) all fold back to the same expression of count(1).
|
Hi @alex-spies, I've updated the changelog YAML for you. |
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Hi @alex-spies, I've updated the changelog YAML for you. |
| ; | ||
|
|
||
| s1:l | s_mv:l | s_null:l | s_param:l | ||
| 1 | 4 | 0 | 4 |
There was a problem hiding this comment.
s_mv = s_param looks weird.
There was a problem hiding this comment.
s_mv = s_param looks weird.
The second/optional argument(precision), tricked me, before knowing the count_distinct is approximate count distinct. It is irrelevant to the issue this PR addresses though.
Some databases offers two ways to count distinct, perhaps we can consider later:
count(distinct ): this is an exact count, which may perform slow on large dataset
approximate_count_distinct function: this is an estimation.
|
Hi @alex-spies, I've updated the changelog YAML for you. |
costin
left a comment
There was a problem hiding this comment.
Looks good over all. Since count(null) is problematic it makes sense to extract that piece of code and backport it to 8.14 as a bug fix.
| var field = field(); | ||
|
|
||
| if (field.foldable()) { | ||
| if (field instanceof Literal l) { |
There was a problem hiding this comment.
use field.fold() since the field can be an expression such as 1 + 1 and you want to return the surrogate before the rest of the optimizer rules kick in.
There was a problem hiding this comment.
Would love to just fold, but folding here is not safe unless we check if this will fold to null, first. That would require duplicating logic from FoldNull, though, which I'd prefer to avoid.
(Our evaluators cannot deal with null literals in general, and folding here results in the following exception when running e.g. count(to_double(null)):
illegal data type [null]
org.elasticsearch.xpack.esql.EsqlIllegalArgumentException: illegal data type [null]
at __randomizedtesting.SeedInfo.seed([DBCF40F13A40EC92:539B7F2B94BC816A]:0)
at org.elasticsearch.xpack.esql.CsvTests.stats.Asdf(stats.csv-spec:1708)
at org.elasticsearch.xpack.esql.EsqlIllegalArgumentException.illegalDataType(EsqlIllegalArgumentException.java:43)
at org.elasticsearch.xpack.esql.EsqlIllegalArgumentException.illegalDataType(EsqlIllegalArgumentException.java:39)
at org.elasticsearch.xpack.esql.expression.function.scalar.convert.AbstractConvertFunction.evaluator(AbstractConvertFunction.java:65)
at org.elasticsearch.xpack.esql.expression.function.scalar.convert.AbstractConvertFunction.toEvaluator(AbstractConvertFunction.java:105)
at org.elasticsearch.xpack.esql.evaluator.mapper.EvaluatorMapper.fold(EvaluatorMapper.java:48)
at org.elasticsearch.xpack.esql.expression.function.scalar.EsqlScalarFunction.fold(EsqlScalarFunction.java:30)
)
There was a problem hiding this comment.
Consider the rare case of count(1+1) - there might be count(1+null). This can be moved to a separate issue though to not block this one.
There was a problem hiding this comment.
It makes sense to add more tests to account for these cases where an expression is evaluated, esp. to null. I'll do that in another PR, as this one should be backported to 8.14 as soon as possible.
These cases are covered in this PR, it's just that they result in an extra eval+projection.
| STATS count(1+null)
becomes something equivalent to
| STATS x = count(*)
| EVAL `count(1+null)` = COALESCE(MV_COUNT(1+null), 0) * x
| KEEP `count(1+null)`
which after folding becomes a multiplication of count(*) with 0. If we wanted to, we could further optimize this case to avoid the count(*) as well.
There was a problem hiding this comment.
Opened a PR with additional tests: #107888
| return new Mul( | ||
| s, | ||
| new Coalesce(s, new MvCount(s, field), List.of(new Literal(s, 0, DataTypes.INTEGER))), | ||
| new Count(s, new Literal(s, StringUtils.WILDCARD, DataTypes.KEYWORD)) |
There was a problem hiding this comment.
Elsewhere in the code we use the convention of Count(1) - let's apply that here as well.
There was a problem hiding this comment.
Hm, I tried to look for other places where Count(1) is used, but couldn't verify this (for ESQL at least).
Parsing count() results in new Count(source, "*", KEYWORD), we also use the same in Sum.surrogate() and the (currently disabled) NormalizeAggregate used to replace all count(literal) by count(*) as well.
For consistency, I'd like to stick to count(*) here.
| * \_EsStatsQueryExec[test], stats[Stat[name=*, type=COUNT, query=null], Stat[name=*, type=COUNT, query=null]]], | ||
| * query[{"esql_single_value":{"field":"emp_no","next":{"range":{"emp_no":{"gt":10010,"boost":1.0}}}, | ||
| * "source":"emp_no > 10010@2:9"}}][count{r}#23, seen{r}#24, count{r}#25, seen{r}#26], limit[], |
There was a problem hiding this comment.
Once the count normalization happens, the deduplication should kick in (since all counts will be the same) and this type of query should work.
The critical bit here is normalization so count(), count(1), count(), count() and count(1+1+2) all fold back to the same expression of count(1).
* Make COUNT(constant) consistent. * Add MEDIAN(const) and COUNT_DISTINCT(const). * Fix wrong stats pushdown when multiple COUNT aggs are in the same STATS
💚 Backport successful
|
#107749) * Make COUNT(constant) consistent. * Add MEDIAN(const) and COUNT_DISTINCT(const). * Fix wrong stats pushdown when multiple COUNT aggs are in the same STATS
Second batch of aggregations over constants for #100634 . Makes
COUNT(constant)consistent and addsMEDIAN(const)andCOUNT_DISTINCT(const).COUNTaggs are in the sameSTATS, e.g. forFROM testidx | STATS s1 = count(1), rows = count(*)