Skip to content

ESQL: Added mv_percentile function#111749

Merged
ivancea merged 27 commits intoelastic:mainfrom
ivancea:mv-percentile
Aug 20, 2024
Merged

ESQL: Added mv_percentile function#111749
ivancea merged 27 commits intoelastic:mainfrom
ivancea:mv-percentile

Conversation

@ivancea
Copy link
Copy Markdown
Contributor

@ivancea ivancea commented Aug 9, 2024

  • Added the mv_percentile(values, percentile) function
  • Used as a surrogate in the percentile(column, percentile) aggregation
  • Updated docs to specify that the surrogate should be implemented if possible

The same way as mv_median does, this yields exact results (Ignoring double operations error).
For that, some decisions were made, specially in the long evaluator (Check the comments in context in MvPercentile.java)

Closes #111591

@ivancea ivancea added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL ES|QL-ui Impacts ES|QL UI labels Aug 9, 2024
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @ivancea, I've created a changelog YAML for you.

assert lowerIndex >= 0 && upperIndex < valueCount;
var lowerValue = valuesBlock.getInt(lowerIndex);
var upperValue = valuesBlock.getInt(upperIndex);
var difference = (long) upperValue - lowerValue;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid overflowing ints, I'm casting to long. Should be as trivial as the double trick

return lowerValue + (long) (fraction * difference);
}

var lowerValueBigDecimal = new BigDecimal(lowerValue);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid overflowing or having precision issues on large longs, I'm operating with BigDecimals instead.

I have doubts about this. Do we prefer speed or precision for big longs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to have such cases here, like the ones we have for single values and aggregations

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if you are going to do it you'll need to migrate the callers in a follow up change. Not good to have two ways to do it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Added it to my to-dos, and created an issue just in case: #112021

throw new IllegalArgumentException("Unsupported type: " + rawValues.get(0).getClass());
}

private static BigDecimal calculatePercentile(double fraction, BigDecimal lowerValue, BigDecimal upperValue) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As to avoid duplicating the exact logic there's in the original function, I'm always using BigDecimals gere to generate the cases (Except for ints, which is trivial).

Big chunk of code, but didn't find a better way

@ivancea ivancea requested review from astefan and nik9000 August 12, 2024 11:16
@ivancea ivancea marked this pull request as ready for review August 12, 2024 11:17
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/kibana-esql (ES|QL-ui)

@ivancea
Copy link
Copy Markdown
Contributor Author

ivancea commented Aug 14, 2024

@elasticmachine update branch

}
}

var values = new double[valueCount];
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move this to a @fixed? With circuit breaking

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to @Fixed it then it'd be a little scatch object so you can mutate the length of the array. Generally I like that.

FWIW if percentile is foldable and 0 you can run this as MV_MIN, right? Same for 100 and MV_MAX? That's kind of cute.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If by "running as" you mean using the surrogate system, I think it was only thought for aggregations, and only work on them right now.
I think it would be nice to have it everywhere as a generic rule for any expression, but it requires some extra work first

}
}

var values = new double[valueCount];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to @Fixed it then it'd be a little scatch object so you can mutate the length of the array. Generally I like that.

FWIW if percentile is foldable and 0 you can run this as MV_MIN, right? Same for 100 and MV_MAX? That's kind of cute.

Copy link
Copy Markdown
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had some opinions at it.

var percentileEval = toEvaluator.apply(percentile);

return switch (PlannerUtils.toElementType(field.dataType())) {
case INT -> switch (PlannerUtils.toElementType(percentile.dataType())) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@not-napoleon has convinced me we should probably migrate from these kind of switch statements to a static Map<List<DataType>, SomeClosure> for this sort of thing. That way resolveType can check if the data type is in the Map and we don't have to maintain a sort of parallel infrastructure for functions.

Not sure it's worth doing now, but I think it's worth thinking about.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave it out of this PR, as that idea looks nice, but there are edge cases that check other things apart from DataTypes (Like SpatialCentroid). So better to think it in parallel

}

@Evaluator(extraName = "IntegerLong", warnExceptions = IllegalArgumentException.class)
static void process(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i might have used Cast.cast to force percentile to always be a double. I'm not sure it's worth having an evaluator for each one.

Copy link
Copy Markdown
Contributor Author

@ivancea ivancea Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't know we had such magic. Checking it! It may simplify things a bit. At least removing a layer of functions 👀

public static List<TypedDataSupplier> intCases(int min, int max, boolean includeZero) {
List<TypedDataSupplier> cases = new ArrayList<>();

for (Block.MvOrdering ordering : Block.MvOrdering.values()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! I can see why you'd do it. What's the runtime difference on one of the tests? I can imagine it's a few seconds so it's worth it.

Another option is to randomize this bit. That's less cases at least.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if you are going to do it you'll need to migrate the callers in a follow up change. Not good to have two ways to do it.

public static List<TypedDataSupplier> intCases(int min, int max, boolean includeZero) {
List<TypedDataSupplier> cases = new ArrayList<>();

for (Block.MvOrdering ordering : Block.MvOrdering.values()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second inspection AbstractMultivalueTestCase's generators already have this. So I'd just keep it and not worry about it. For MV functions it's quite an important bit of their behavior.

@Override
public final ExpressionEvaluator.Factory toEvaluator(Function<Expression, ExpressionEvaluator.Factory> toEvaluator) {
var fieldEval = toEvaluator.apply(field);
var percentileEval = Cast.cast(source(), percentile.dataType(), DOUBLE, toEvaluator.apply(percentile));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See how nice and short this one is!

@ivancea ivancea merged commit 65ce50c into elastic:main Aug 20, 2024
@ivancea ivancea deleted the mv-percentile branch August 20, 2024 13:29
lkts pushed a commit to lkts/elasticsearch that referenced this pull request Aug 20, 2024
- Added the `mv_percentile(values, percentile)` function
- Used as a surrogate in the `percentile(column, percentile)` aggregation
- Updated docs to specify that the surrogate _should_ be implemented if possible

The same way as mv_median does, this yields exact results (Ignoring double operations error).
For that, some decisions were made, specially in the long evaluator (Check the comments in context in `MvPercentile.java`)

Closes elastic#111591
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
- Added the `mv_percentile(values, percentile)` function
- Used as a surrogate in the `percentile(column, percentile)` aggregation
- Updated docs to specify that the surrogate _should_ be implemented if possible

The same way as mv_median does, this yields exact results (Ignoring double operations error).
For that, some decisions were made, specially in the long evaluator (Check the comments in context in `MvPercentile.java`)

Closes elastic#111591
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this pull request Sep 5, 2024
- Added the `mv_percentile(values, percentile)` function
- Used as a surrogate in the `percentile(column, percentile)` aggregation
- Updated docs to specify that the surrogate _should_ be implemented if possible

The same way as mv_median does, this yields exact results (Ignoring double operations error).
For that, some decisions were made, specially in the long evaluator (Check the comments in context in `MvPercentile.java`)

Closes elastic#111591
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL ES|QL-ui Impacts ES|QL UI >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.16.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ESQL: implement mv_percentile function

6 participants