Skip to content

SQL: Implement scripting inside aggs#55241

Merged
matriv merged 13 commits intoelastic:masterfrom
matriv:impl-29980
Apr 17, 2020
Merged

SQL: Implement scripting inside aggs#55241
matriv merged 13 commits intoelastic:masterfrom
matriv:impl-29980

Conversation

@matriv
Copy link
Copy Markdown
Contributor

@matriv matriv commented Apr 15, 2020

Implement the use of scalar functions inside aggregate functions.
This allows for complex expressions inside aggregations, with or without
GROUBY as well as with or without a HAVING clause. e.g.:

SELECT MAX(CASE WHEN a IS NULL then -1 ELSE abs(a * 10) + 1 END) AS max, b
FROM test
GROUP BY b
HAVING MAX(CASE WHEN a IS NULL then -1 ELSE abs(a * 10) + 1 END) > 5

Scalar functions are still not allowed for KURTOSIS and SKEWNESS as
this is currently not implemented on the Elasticsearch side.

Fixes: #29980
Fixes: #36865
Fixes: #37271

@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-ql (:Query Languages/SQL)

Copy link
Copy Markdown
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one comment

Comment on lines 527 to 534
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this code becomes:

Object argument = asFieldOrLiteralOrScript(c.field());

           if (c.distinct()) {
                    return new CardinalityAgg(id, argument);
                } else {
                    return new FilterExistsAgg(id, argument);
                }

@matriv matriv force-pushed the impl-29980 branch 2 times, most recently from 1ae462d to 8bbfaa1 Compare April 15, 2020 19:36
Implement the use of scalar functions inside aggregate functions.
This allows for complex expressions inside aggregations, with or without
GROUBY as well as with or without a HAVING clause. e.g.:

```
SELECT MAX(CASE WHEN a IS NULL then -1 ELSE abs(a * 10) + 1 END) AS max, b
FROM test
GROUP BY b
HAVING MAX(CASE WHEN a IS NULL then -1 ELSE abs(a * 10) + 1 END) > 5
```

Scalar functions are still not allowed for `KURTOSIS` and `SKEWNESS` as
this is currently not implemented on the ElasticSearch side.

Fixes: elastic#29980
Fixes: elastic#36865
Fixes: elastic#37271
@matriv matriv requested review from astefan, bpintea and costin April 15, 2020 19:43
@matriv matriv marked this pull request as ready for review April 15, 2020 19:44
Copy link
Copy Markdown
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
Left some comments regarding testing and potentially improvements in the agg hierarchy- if they are unclear, we can address that through a different PR.

DataTypes.BOOLEAN);
}

public static ScriptTemplate isNotNullCardinality(ScriptTemplate script) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method seems to be used in one place as such extracting it in a different class doesn't add much value since there's no reuse.
The name also indicates that - a method specific to cardinality/count.


// aggregates with scalars
aggregateFunctionsWithScalars
SELECT MAX(CASE WHEN (salary - 10) > 70000 THEN (salary + 12345) * 1.2 ELSE (salary - 12345) * 2.7 END) AS "max",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice



// aggregates with scalars
aggregateFunctionsWithScalars
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a couple of tests where the aggs with scalars are inside having?
Even better have scalars over aggs over scalars inside Having.

Also another that sorts by an aggregation with an internal scalar.

Copy link
Copy Markdown
Contributor Author

@matriv matriv Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a couple of tests where the aggs with scalars are inside having?
There is already:

https://github.com/elastic/elasticsearch/pull/55241/files/b8bc29f350f24b22bde8b9096a95d4273dd822fe#diff-002044c6b4e1de5a6960847971f85532R986

Even better have scalars over aggs over scalars inside Having.

I will add one with scalars also on top of the agg function.

Also another that sorts by an aggregation with an internal scalar.

I have already: https://github.com/elastic/elasticsearch/pull/55241/files/b8bc29f350f24b22bde8b9096a95d4273dd822fe#diff-002044c6b4e1de5a6960847971f85532R960

Copy link
Copy Markdown
Member

@costin costin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already:

I was thinking of a test with the agg declared just inside the order and not referencing an already declared aggregation.

static String field(AggregateFunction af) {
Expression arg = af.field();
if (arg.foldable()) {
return String.valueOf(((Literal) arg).value());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String.valueOf(arg.fold())
There's rarely a reason to cast to a Literal

}

private static Object getFieldOrLiteralOrScript(AggregateFunction af) {
if (isFieldOrLiteral(af)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return isFieldOrLiteral(af) ? field(af) : ((ScalarFunction) af.field()).asScript()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic from topAggsFieldOrScript and this method is similar and should be reused - I like the instanceof check for Scalar instead of the direct cast; a bit verbose but safe.

return af.field().foldable() || af.field() instanceof FieldAttribute;
}

private static Object getFieldOrLiteralOrScript(AggregateFunction af) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get prefix is for getter - use something like toField.. or asField..

}

private static String topAggsField(AggregateFunction af, Expression e) {
private static Object topAggsFieldOrScript(AggregateFunction af, Expression e) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be moved under the TopHit agg class

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consolidated into one method.

private final ScriptTemplate scriptTemplate;

Agg(String id, String fieldName) {
Agg(String id, Object fieldOrScript) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a bit more about this, a better solution might be creating a simple DTO specific for Agg that can hold either a String or a ScriptTemplate.
That one would handle the hashCode, equals, etc.
Further more maybe extend it to accept behavior by accepting a ValuesSourceAggregationBuilder and internally call either field or script based on its content.

Suggested change
Agg(String id, Object fieldOrScript) {
Agg(String id, AggTarget target) {
this.target = target;
}
...
class AggTarget {
private final String fieldName;
private final ScriptTemplate script;
...
ValuesSourceAggregationBuilder with(ValuesSourceAggregationBuilder aggBuilder) {
if (field != null) {
aggBuilder.field(field);
}
else {
aggBuilder.script(scriptTemplate.asScript());
}
}
}

This would improve LeafAgg which would improve toBuilder and call AggTarget so that subclasses would only have to create the ValueSourceAggregationBuilder. In fact we could just declare them as a method reference for a string:

public abstract class LeafAgg extends Agg {

    LeafAgg(String id, AggTarget target) {
        super(id, target);
    }

    ValuesSourceAggregationBuilder builder() {
         return aggBuilder.apply(id()).with(aggTarget);
    }

    protected abstract Function<String, ValuesSourceAggregationBuilder> aggBuilder();

so that subclasses would now look like:

public class CardinalityAgg extends LeafAgg {

    public CardinalityAgg(String id, AggTarget aggTarget) {
        super(id, aggTarget);
    }

    @Override
    ValuesSourceAggregationBuilder aggBuilder() {
        return AggregationBuilders::cardinality;
    }
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @costin, It's cleaner this way!



// aggregates with scalars
aggregateFunctionsWithScalars
Copy link
Copy Markdown
Member

@costin costin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already:

I was thinking of a test with the agg declared just inside the order and not referencing an already declared aggregation.

Copy link
Copy Markdown
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Override
public int hashCode() {
return Objects.hash(id, fieldName, scriptTemplate);
return Objects.hash(id) + target.hashCode();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Objects.hash(id, target)

Copy link
Copy Markdown
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AggSource conveys better the intent than AggTarget.

abstract AggregationBuilder toBuilder();

@SuppressWarnings("rawtypes")
protected ValuesSourceAggregationBuilder addFieldOrScript(ValuesSourceAggregationBuilder builder) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is leaking implementation details - why not addTarget()?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, forgot to rename that, thx.

Copy link
Copy Markdown
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This PR will make a lot of users happy.
Left few minor comments.

not allowed and an error is returned:
[source, sql]
---------------------------------------
SELECT KURTOSIS(salary), gender FROM emp GROUP BY gender
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the NOTE I would have expected to see "KURTOSIS ... on top of scalar functions" but I'm seeing KURTOSIS applied on a field.


aggregatesWithScalarsAndGroupByOrderByAggWithoutProjection
schema::gender:s
SELECT gender FROM test_emp GROUP BY gender ORDER BY MAX(salary % 100) DESC;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add MAX(salary % 100) to the result.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a special case, requested by @costin, ORDER BY without having it in the projection.

18601.965319409886|3.4461553130896095E10|3.460331137445282E8 |null
17811.071545718776|1.2151168881502939E11|3.1723426960671306E8|F
15904.093950318531|1.699198993070239E11 |2.529402043805585E8 |M
;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the complexities of these tests, but the queries are too complex for me to assess that the example is relevant :-). I trust that you checked the numbers against the data and made sure everything is ok.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested all the queries against PostgreSQL and the results are not always the same but very similar (double-rounding diffs).

return new MatrixStatsAgg(id, singletonList(field(m, m.field())));
}
throw new SqlIllegalArgumentException(
"Cannot use scalar functions or operators: [{}] in matrix stats " + "aggregate functions [KURTOSIS] and [SKEWNESS]",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this error message be improved?

  • why is it mentioning matrix stats, isn't this an implementation detail?
  • not a biggie and can be left as is, but can the message be more targeted to the exact aggregate function? (mention only KURTOSIS or SKEWNESS)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • why is it mentioning matrix stats, isn't this an implementation detail?

Yep, I can remove that.

  • not a biggie and can be left as is, but can the message be more targeted to the exact aggregate function? (mention only KURTOSIS or SKEWNESS)

Not really, I've tried it, because we don't have the original function anymore here, only MatrixStats. I was considering checking that earlier on the functions themselves, but then resolveType() didn't seem to me a good place to add such a verification, since it's regarding the DataType and not if the arg is field or function.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better handled in the verifier by checking that Kurtosis & co (namely MatrixStatsEnclosed) don't have a scalar expression as argument.
Further more I would add an issue to extend the aggregation to support scripts not just fields.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created issue: #55344

Agg(String id, AggTarget target) {
this.id = id;
this.fieldName = fieldName;
Objects.requireNonNull(target, "AggTarget must not be null");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could move this call as the first in the constructor, no?

@@ -397,6 +397,16 @@ https://en.wikipedia.org/wiki/Kurtosis[Quantify] the shape of the distribution o
include-tagged::{sql-specs}/docs/docs.csv-spec[aggKurtosis]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth adding examples in the docs (either a dedicated one or for most aggregations) that shows aggregations alongside a scalar function as it might not be obvious, especially for existing users.

This would also address @astefan point of having a test that can be easily be verified.

Copy link
Copy Markdown
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@matriv matriv merged commit 506d1be into elastic:master Apr 17, 2020
@matriv matriv deleted the impl-29980 branch April 17, 2020 09:22
costin added a commit to costin/elasticsearch that referenced this pull request Apr 17, 2020
Avoid repetition of the aggregation builder setup

Relates elastic#55241
matriv added a commit that referenced this pull request Apr 17, 2020
Implement the use of scalar functions inside aggregate functions.
This allows for complex expressions inside aggregations, with or without
GROUBY as well as with or without a HAVING clause. e.g.:

```
SELECT MAX(CASE WHEN a IS NULL then -1 ELSE abs(a * 10) + 1 END) AS max, b
FROM test
GROUP BY b
HAVING MAX(CASE WHEN a IS NULL then -1 ELSE abs(a * 10) + 1 END) > 5
```

Scalar functions are still not allowed for `KURTOSIS` and `SKEWNESS` as
this is currently not implemented on the ElasticSearch side.

Fixes: #29980
Fixes: #36865
Fixes: #37271

(cherry picked from commit 506d1be)
costin added a commit that referenced this pull request Apr 17, 2020
Avoid repetition of the aggregation builder setup

Relates #55241
costin added a commit that referenced this pull request Apr 17, 2020
Avoid repetition of the aggregation builder setup

Relates #55241

(cherry picked from commit 6cfe130)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

6 participants