ESQL: Added mv_percentile function#111749
Conversation
|
Hi @ivancea, I've created a changelog YAML for you. |
| assert lowerIndex >= 0 && upperIndex < valueCount; | ||
| var lowerValue = valuesBlock.getInt(lowerIndex); | ||
| var upperValue = valuesBlock.getInt(upperIndex); | ||
| var difference = (long) upperValue - lowerValue; |
There was a problem hiding this comment.
To avoid overflowing ints, I'm casting to long. Should be as trivial as the double trick
| return lowerValue + (long) (fraction * difference); | ||
| } | ||
|
|
||
| var lowerValueBigDecimal = new BigDecimal(lowerValue); |
There was a problem hiding this comment.
To avoid overflowing or having precision issues on large longs, I'm operating with BigDecimals instead.
I have doubts about this. Do we prefer speed or precision for big longs?
There was a problem hiding this comment.
I think it makes sense to have such cases here, like the ones we have for single values and aggregations
There was a problem hiding this comment.
But if you are going to do it you'll need to migrate the callers in a follow up change. Not good to have two ways to do it.
There was a problem hiding this comment.
Sure. Added it to my to-dos, and created an issue just in case: #112021
| throw new IllegalArgumentException("Unsupported type: " + rawValues.get(0).getClass()); | ||
| } | ||
|
|
||
| private static BigDecimal calculatePercentile(double fraction, BigDecimal lowerValue, BigDecimal upperValue) { |
There was a problem hiding this comment.
As to avoid duplicating the exact logic there's in the original function, I'm always using BigDecimals gere to generate the cases (Except for ints, which is trivial).
Big chunk of code, but didn't find a better way
...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvPercentile.java
Outdated
Show resolved
Hide resolved
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Pinging @elastic/kibana-esql (ES|QL-ui) |
|
@elasticmachine update branch |
| } | ||
| } | ||
|
|
||
| var values = new double[valueCount]; |
There was a problem hiding this comment.
Maybe move this to a @fixed? With circuit breaking
There was a problem hiding this comment.
If you want to @Fixed it then it'd be a little scatch object so you can mutate the length of the array. Generally I like that.
FWIW if percentile is foldable and 0 you can run this as MV_MIN, right? Same for 100 and MV_MAX? That's kind of cute.
There was a problem hiding this comment.
If by "running as" you mean using the surrogate system, I think it was only thought for aggregations, and only work on them right now.
I think it would be nice to have it everywhere as a generic rule for any expression, but it requires some extra work first
| } | ||
| } | ||
|
|
||
| var values = new double[valueCount]; |
There was a problem hiding this comment.
If you want to @Fixed it then it'd be a little scatch object so you can mutate the length of the array. Generally I like that.
FWIW if percentile is foldable and 0 you can run this as MV_MIN, right? Same for 100 and MV_MAX? That's kind of cute.
...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvPercentile.java
Show resolved
Hide resolved
...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvPercentile.java
Outdated
Show resolved
Hide resolved
nik9000
left a comment
There was a problem hiding this comment.
I've had some opinions at it.
| var percentileEval = toEvaluator.apply(percentile); | ||
|
|
||
| return switch (PlannerUtils.toElementType(field.dataType())) { | ||
| case INT -> switch (PlannerUtils.toElementType(percentile.dataType())) { |
There was a problem hiding this comment.
@not-napoleon has convinced me we should probably migrate from these kind of switch statements to a static Map<List<DataType>, SomeClosure> for this sort of thing. That way resolveType can check if the data type is in the Map and we don't have to maintain a sort of parallel infrastructure for functions.
Not sure it's worth doing now, but I think it's worth thinking about.
There was a problem hiding this comment.
I'll leave it out of this PR, as that idea looks nice, but there are edge cases that check other things apart from DataTypes (Like SpatialCentroid). So better to think it in parallel
| } | ||
|
|
||
| @Evaluator(extraName = "IntegerLong", warnExceptions = IllegalArgumentException.class) | ||
| static void process( |
There was a problem hiding this comment.
I think i might have used Cast.cast to force percentile to always be a double. I'm not sure it's worth having an evaluator for each one.
There was a problem hiding this comment.
Oh, I didn't know we had such magic. Checking it! It may simplify things a bit. At least removing a layer of functions 👀
| public static List<TypedDataSupplier> intCases(int min, int max, boolean includeZero) { | ||
| List<TypedDataSupplier> cases = new ArrayList<>(); | ||
|
|
||
| for (Block.MvOrdering ordering : Block.MvOrdering.values()) { |
There was a problem hiding this comment.
Interesting! I can see why you'd do it. What's the runtime difference on one of the tests? I can imagine it's a few seconds so it's worth it.
Another option is to randomize this bit. That's less cases at least.
There was a problem hiding this comment.
But if you are going to do it you'll need to migrate the callers in a follow up change. Not good to have two ways to do it.
| public static List<TypedDataSupplier> intCases(int min, int max, boolean includeZero) { | ||
| List<TypedDataSupplier> cases = new ArrayList<>(); | ||
|
|
||
| for (Block.MvOrdering ordering : Block.MvOrdering.values()) { |
There was a problem hiding this comment.
On second inspection AbstractMultivalueTestCase's generators already have this. So I'd just keep it and not worry about it. For MV functions it's quite an important bit of their behavior.
| @Override | ||
| public final ExpressionEvaluator.Factory toEvaluator(Function<Expression, ExpressionEvaluator.Factory> toEvaluator) { | ||
| var fieldEval = toEvaluator.apply(field); | ||
| var percentileEval = Cast.cast(source(), percentile.dataType(), DOUBLE, toEvaluator.apply(percentile)); |
There was a problem hiding this comment.
See how nice and short this one is!
- Added the `mv_percentile(values, percentile)` function - Used as a surrogate in the `percentile(column, percentile)` aggregation - Updated docs to specify that the surrogate _should_ be implemented if possible The same way as mv_median does, this yields exact results (Ignoring double operations error). For that, some decisions were made, specially in the long evaluator (Check the comments in context in `MvPercentile.java`) Closes elastic#111591
- Added the `mv_percentile(values, percentile)` function - Used as a surrogate in the `percentile(column, percentile)` aggregation - Updated docs to specify that the surrogate _should_ be implemented if possible The same way as mv_median does, this yields exact results (Ignoring double operations error). For that, some decisions were made, specially in the long evaluator (Check the comments in context in `MvPercentile.java`) Closes elastic#111591
- Added the `mv_percentile(values, percentile)` function - Used as a surrogate in the `percentile(column, percentile)` aggregation - Updated docs to specify that the surrogate _should_ be implemented if possible The same way as mv_median does, this yields exact results (Ignoring double operations error). For that, some decisions were made, specially in the long evaluator (Check the comments in context in `MvPercentile.java`) Closes elastic#111591
mv_percentile(values, percentile)functionpercentile(column, percentile)aggregationThe same way as mv_median does, this yields exact results (Ignoring double operations error).
For that, some decisions were made, specially in the long evaluator (Check the comments in context in
MvPercentile.java)Closes #111591