ESQL: Added mv_percentile function by ivancea · Pull Request #111749 · elastic/elasticsearch

ivancea · 2024-08-09T12:32:47Z

Added the mv_percentile(values, percentile) function
Used as a surrogate in the percentile(column, percentile) aggregation
Updated docs to specify that the surrogate should be implemented if possible

As mv_median does, this yields exact results (ignoring floating-point error in double operations).
To achieve that, some decisions were made, especially in the long evaluator (check the in-context comments in MvPercentile.java).

Closes #111591

elasticsearchmachine · 2024-08-09T12:35:06Z

Hi @ivancea, I've created a changelog YAML for you.

ivancea · 2024-08-09T14:03:03Z

...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvPercentile.java

+                assert lowerIndex >= 0 && upperIndex < valueCount;
+                var lowerValue = valuesBlock.getInt(lowerIndex);
+                var upperValue = valuesBlock.getInt(upperIndex);
+                var difference = (long) upperValue - lowerValue;


To avoid overflowing ints, I'm casting to long. Should be as trivial as the double trick
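For readers following along, here is a minimal sketch of why the cast matters. It is not the PR's evaluator: the class name, the method signature, and the index and truncation rules are assumptions made for illustration. Only the `(long) upperValue - lowerValue` step mirrors the diff above.

```java
public final class IntPercentileSketch {
    // Hypothetical helper: linear interpolation over an already sorted int array.
    // The index formula and truncation toward zero are assumptions of this sketch.
    static int percentile(int[] sorted, double percentile) {
        if (sorted.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        if (!(percentile >= 0 && percentile <= 100)) {
            throw new IllegalArgumentException("percentile must be in [0, 100]");
        }
        double index = percentile / 100.0 * (sorted.length - 1);
        int lowerIndex = (int) index; // index is non-negative, so this is a floor
        int upperIndex = Math.min(lowerIndex + 1, sorted.length - 1);
        double fraction = index - lowerIndex;

        int lowerValue = sorted[lowerIndex];
        int upperValue = sorted[upperIndex];
        // Widen before subtracting: upperValue - lowerValue can exceed the int range.
        long difference = (long) upperValue - lowerValue;
        // The sum lies between lowerValue and upperValue, so narrowing back is safe.
        return (int) (lowerValue + (long) (fraction * difference));
    }
}
```

With `{Integer.MIN_VALUE, Integer.MAX_VALUE}` and a percentile of 50, an int subtraction would wrap around to -1, while the widened difference gives the midpoint.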

ivancea · 2024-08-09T14:06:26Z

...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvPercentile.java

+            return lowerValue + (long) (fraction * difference);
+        }
+
+        var lowerValueBigDecimal = new BigDecimal(lowerValue);


To avoid overflowing or having precision issues on large longs, I'm operating with BigDecimals instead.

I have doubts about this. Do we prefer speed or precision for big longs?
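To make the trade-off concrete, here is a small sketch of the BigDecimal path, assuming a hypothetical `interpolate` helper. The names and the truncation toward zero are illustrative choices, not taken from MvPercentile.java.

```java
import java.math.BigDecimal;

public final class LongPercentileSketch {
    // Hypothetical helper: interpolates between two longs without overflowing
    // and without rounding the operands to the 53-bit precision of a double.
    static long interpolate(long lowerValue, long upperValue, double fraction) {
        BigDecimal lower = BigDecimal.valueOf(lowerValue);
        // The difference of two longs may need 65 bits; BigDecimal holds it exactly.
        BigDecimal difference = BigDecimal.valueOf(upperValue).subtract(lower);
        BigDecimal result = lower.add(difference.multiply(BigDecimal.valueOf(fraction)));
        // longValue() discards the fractional part (an assumption of this sketch).
        return result.longValue();
    }

    // The faster alternative, shown only for comparison: exact for small values,
    // but the double product loses low-order bits on large longs.
    static long interpolateWithDoubles(long lowerValue, long upperValue, double fraction) {
        return lowerValue + (long) (fraction * (upperValue - lowerValue));
    }
}
```

For `interpolate(0, Long.MAX_VALUE, 0.5)` the BigDecimal version returns `Long.MAX_VALUE / 2`, whereas the double version is off by one because `Long.MAX_VALUE` is not representable as a double. That is the precision being paid for with the slower arithmetic.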

ivancea · 2024-08-09T14:09:49Z

...c/test/java/org/elasticsearch/xpack/esql/expression/function/MultivalueTestCaseSupplier.java

I think it makes sense to have such cases here, like the ones we have for single values and aggregations

But if you are going to do it you'll need to migrate the callers in a follow up change. Not good to have two ways to do it.

Sure. Added it to my to-dos, and created an issue just in case: #112021

ivancea · 2024-08-09T14:11:26Z

...va/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvPercentileTests.java

+        throw new IllegalArgumentException("Unsupported type: " + rawValues.get(0).getClass());
+    }
+
+    private static BigDecimal calculatePercentile(double fraction, BigDecimal lowerValue, BigDecimal upperValue) {


To avoid duplicating the exact logic of the original function, I'm always using BigDecimals here to generate the cases (except for ints, which are trivial).

Big chunk of code, but I didn't find a better way.

elasticsearchmachine · 2024-08-12T11:17:27Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine · 2024-08-12T11:17:27Z

Pinging @elastic/kibana-esql (ES|QL-ui)

ivancea · 2024-08-14T10:04:43Z

@elasticmachine update branch

ivancea · 2024-08-14T10:44:42Z

...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvPercentile.java

+            }
+        }
+
+        var values = new double[valueCount];


Maybe move this to a @Fixed? With circuit breaking

If you want to @Fixed it then it'd be a little scratch object so you can mutate the length of the array. Generally I like that.

FWIW if percentile is foldable and 0 you can run this as MV_MIN, right? Same for 100 and MV_MAX? That's kind of cute.

If by "running as" you mean using the surrogate system, I think it was only designed for aggregations, and it only works on them right now.
I think it would be nice to have it everywhere as a generic rule for any expression, but that requires some extra work first.
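A rough sketch of the two ideas in this thread, under stated assumptions: `DoubleScratch` is a hypothetical stand-in for the reusable scratch object (the real one would presumably be wired up through @Fixed and accounted against a circuit breaker, which is not modeled here), and the 0 and 100 shortcuts are done inside the method rather than through any surrogate mechanism. Overflow when subtracting very large doubles is deliberately not handled.

```java
import java.util.Arrays;

public final class ScratchPercentileSketch {
    /** Hypothetical reusable buffer: it grows when needed and is never shrunk. */
    static final class DoubleScratch {
        double[] values = new double[0];
    }

    static double percentile(DoubleScratch scratch, double[] input, double percentile) {
        if (input.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        if (!(percentile >= 0 && percentile <= 100)) {
            throw new IllegalArgumentException("percentile must be in [0, 100]");
        }
        // Percentiles 0 and 100 are just the minimum and maximum: no copy, no sort.
        if (percentile == 0) {
            return Arrays.stream(input).min().getAsDouble();
        }
        if (percentile == 100) {
            return Arrays.stream(input).max().getAsDouble();
        }
        // Reuse the scratch array instead of allocating a new double[] per call.
        if (scratch.values.length < input.length) {
            scratch.values = new double[input.length];
        }
        System.arraycopy(input, 0, scratch.values, 0, input.length);
        Arrays.sort(scratch.values, 0, input.length);

        double index = percentile / 100.0 * (input.length - 1);
        int lowerIndex = (int) index;
        int upperIndex = Math.min(lowerIndex + 1, input.length - 1);
        double fraction = index - lowerIndex;
        double lower = scratch.values[lowerIndex];
        double upper = scratch.values[upperIndex];
        return lower + fraction * (upper - lower);
    }
}
```

Because only the first `input.length` slots are sorted and read, a scratch array left over from a larger multivalue can be reused as-is for a smaller one.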

nik9000 · 2024-08-14T14:11:57Z

...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvPercentile.java

+            }
+        }
+
+        var values = new double[valueCount];


If you want to @Fixed it then it'd be a little scratch object so you can mutate the length of the array. Generally I like that.

FWIW if percentile is foldable and 0 you can run this as MV_MIN, right? Same for 100 and MV_MAX? That's kind of cute.

astefan

LGTM

nik9000

I've had some opinions at it.

nik9000 · 2024-08-19T17:44:37Z

...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvPercentile.java

+        var percentileEval = toEvaluator.apply(percentile);
+
+        return switch (PlannerUtils.toElementType(field.dataType())) {
+            case INT -> switch (PlannerUtils.toElementType(percentile.dataType())) {


@not-napoleon has convinced me we should probably migrate from this kind of switch statement to a static Map<List<DataType>, SomeClosure> for this sort of thing. That way resolveType can check if the data type is in the Map and we don't have to maintain a sort of parallel infrastructure for functions.

Not sure it's worth doing now, but I think it's worth thinking about.

I'll leave it out of this PR. The idea looks nice, but there are edge cases that check other things apart from DataTypes (like SpatialCentroid), so it's better to think about it separately.
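For illustration only, the map-based dispatch discussed above could look roughly like the sketch below. Everything in it is a placeholder: the `DataType` enum is a stand-in for the real one, the string-returning suppliers stand in for `SomeClosure`, and the method names merely echo `resolveType` and `toEvaluator`.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

public final class TypeDispatchSketch {
    // Stand-in for the real DataType; only a few members are listed.
    enum DataType { INTEGER, LONG, DOUBLE, KEYWORD }

    // One entry per supported signature. The values are placeholders for
    // whatever closure would build the evaluator for that signature.
    static final Map<List<DataType>, Supplier<String>> EVALUATORS = Map.of(
        List.of(DataType.INTEGER, DataType.DOUBLE), () -> "int evaluator",
        List.of(DataType.LONG, DataType.DOUBLE), () -> "long evaluator",
        List.of(DataType.DOUBLE, DataType.DOUBLE), () -> "double evaluator"
    );

    // Type resolution becomes a lookup in the same table used for dispatch.
    static boolean isSupported(List<DataType> argumentTypes) {
        return EVALUATORS.containsKey(argumentTypes);
    }

    static String toEvaluator(List<DataType> argumentTypes) {
        Supplier<String> factory = EVALUATORS.get(argumentTypes);
        if (factory == null) {
            throw new IllegalArgumentException("Unsupported types: " + argumentTypes);
        }
        return factory.get();
    }
}
```

The appeal is that the supported signatures live in one place. The limitation raised in the reply still applies: a table keyed purely on data types cannot express checks that depend on anything else.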

nik9000 · 2024-08-19T17:46:36Z

...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvPercentile.java

+    }
+
+    @Evaluator(extraName = "IntegerLong", warnExceptions = IllegalArgumentException.class)
+    static void process(


I think I might have used Cast.cast to force percentile to always be a double. I'm not sure it's worth having an evaluator for each one.

Oh, I didn't know we had such magic. Checking it! It may simplify things a bit. At least removing a layer of functions 👀

nik9000 · 2024-08-19T17:49:59Z

...c/test/java/org/elasticsearch/xpack/esql/expression/function/MultivalueTestCaseSupplier.java

+    public static List<TypedDataSupplier> intCases(int min, int max, boolean includeZero) {
+        List<TypedDataSupplier> cases = new ArrayList<>();
+
+        for (Block.MvOrdering ordering : Block.MvOrdering.values()) {


Interesting! I can see why you'd do it. What's the runtime difference on one of the tests? I can imagine it's a few seconds so it's worth it.

Another option is to randomize this bit. That's fewer cases at least.

nik9000 · 2024-08-19T17:50:56Z

...c/test/java/org/elasticsearch/xpack/esql/expression/function/MultivalueTestCaseSupplier.java

nik9000 · 2024-08-19T17:52:07Z

...c/test/java/org/elasticsearch/xpack/esql/expression/function/MultivalueTestCaseSupplier.java

But if you are going to do it you'll need to migrate the callers in a follow up change. Not good to have two ways to do it.

nik9000 · 2024-08-19T17:53:01Z

...c/test/java/org/elasticsearch/xpack/esql/expression/function/MultivalueTestCaseSupplier.java

+    public static List<TypedDataSupplier> intCases(int min, int max, boolean includeZero) {
+        List<TypedDataSupplier> cases = new ArrayList<>();
+
+        for (Block.MvOrdering ordering : Block.MvOrdering.values()) {


On second inspection AbstractMultivalueTestCase's generators already have this. So I'd just keep it and not worry about it. For MV functions it's quite an important bit of their behavior.

nik9000 · 2024-08-20T13:01:41Z

...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvPercentile.java

+    @Override
+    public final ExpressionEvaluator.Factory toEvaluator(Function<Expression, ExpressionEvaluator.Factory> toEvaluator) {
+        var fieldEval = toEvaluator.apply(field);
+        var percentileEval = Cast.cast(source(), percentile.dataType(), DOUBLE, toEvaluator.apply(percentile));


See how nice and short this one is!

ivancea added 6 commits August 8, 2024 16:25

Added MvPercentile function and tests (WIP)

bf5b318

Fixed int and long calculations

e0d533c

Fixed overflows for doubles

e0b2588

Remove BigDecimals from double calculation, and simplify tests

1306d49

Registered function and added basic CSV tests

0bc7950

Fixed capabilities and added fraction multi-type tests

112822d

ivancea added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL ES|QL-ui Impacts ES|QL UI labels Aug 9, 2024

elasticsearchmachine added the v8.16.0 label Aug 9, 2024

ivancea added the >feature label Aug 9, 2024

Update docs/changelog/111749.yaml

ca82495

Updated meta tests

dc64dbf

ivancea commented Aug 9, 2024

Update comment for doubles

97833fa

ivancea commented Aug 9, 2024

fang-xing-esql reviewed Aug 9, 2024

...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvPercentile.java (outdated, resolved)

ivancea added 5 commits August 12, 2024 11:35

Fixed foldability and forbidden API

538936c

Merge branch 'main' into mv-percentile

0c1f7d9

Added surrogate to Percentile, and adjusted docs

29e1126

Added docs

441bc3c

Added some extra csv cases

b96fee3

ivancea requested review from astefan and nik9000 August 12, 2024 11:16

ivancea marked this pull request as ready for review August 12, 2024 11:17

ivancea added 4 commits August 12, 2024 18:22

Added docs on the MvPercentile percentile parameter limits

81f85d5

Extra tests and minor improvements

dd4f4fe

Merge branch 'main' into mv-percentile

c0cdd10

Added warnings to wrong percentile values

a8f38d6

Merge branch 'main' into mv-percentile

9f04a41

ivancea commented Aug 14, 2024

ivancea added 2 commits August 14, 2024 13:30

Added percentile surrogate csv tests

a57d09a

Require mv_percentile on percentile constants tests

ab0b1cd

nik9000 reviewed Aug 14, 2024

ivancea added 3 commits August 14, 2024 17:23

Added extra checks on tests to verify blocks position count

ab8ee5c

Added missing returns to sorted cases

820acdc

Minor refactor and improved algorithm for 0 and 100 percentiles

a07b068

ivancea requested review from astefan, fang-xing-esql and nik9000 August 14, 2024 16:19

ivancea added 2 commits August 19, 2024 11:58

Merge branch 'main' into mv-percentile

8ce2172

Added scratch for arrays

28abad8

ivancea mentioned this pull request Aug 19, 2024

[ES|QL] Cast mixed numeric types to the first not null numeric type for Coalesce at Analyzer #111917

Merged

astefan approved these changes Aug 19, 2024

nik9000 reviewed Aug 19, 2024

Use Cast.cast() instead of multiple evaluators

ee27d0d

ivancea requested a review from nik9000 August 20, 2024 12:29

ivancea mentioned this pull request Aug 20, 2024

ESQL: Migrate MV function test cases to use MultivalueTestCaseSupplier #112021

Open

nik9000 approved these changes Aug 20, 2024

ivancea merged commit 65ce50c into elastic:main Aug 20, 2024

ivancea deleted the mv-percentile branch August 20, 2024 13:29

6 participants
