Skip to content

ESQL: LEAST and GREATEST functions#98630

Merged
elasticsearchmachine merged 13 commits intoelastic:mainfrom
nik9000:column_min
Aug 22, 2023
Merged

ESQL: LEAST and GREATEST functions#98630
elasticsearchmachine merged 13 commits intoelastic:mainfrom
nik9000:column_min

Conversation

@nik9000
Copy link
Copy Markdown
Member

@nik9000 nik9000 commented Aug 17, 2023

Adds LEAST and GREATEST functions to find the min or max of the values in many columns.

Adds `LEAST` and `GREATEST` functions to find the min or max of
the values in many colunms.
@github-actions
Copy link
Copy Markdown
Contributor

Documentation preview:

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine elasticsearchmachine added the Team:QL (Deprecated) Meta label for query languages team label Aug 17, 2023
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

Copy link
Copy Markdown
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I believe more CSV tests are needed, especially for real (non-row command) data in employees table.

For example, from employees | eval min = least(still_hired), max = greatest(still_hired) | sort emp_no | limit 10 | keep still_hired, min, max fails with

{
    "error": {
        "root_cause": [
            {
                "type": "unsupported_operation_exception",
                "reason": null
            }
        ],
        "type": "unsupported_operation_exception",
        "reason": null
    },
    "status": 500
}

Wondering if a nicer error message isn't more appropriate. If I replace least and greatest with min and max I get something more informative:

{
    "error": {
        "root_cause": [
            {
                "type": "verification_exception",
                "reason": "Found 2 problems\nline 1:29: argument of [min(still_hired)] must be [numeric or datetime], found value [min(still_hired)] type [boolean]\nline 1:53: argument of [max(still_hired)] must be [numeric or datetime], found value [max(still_hired)] type [boolean]"
            }
        ],
        "type": "verification_exception",
        "reason": "Found 2 problems\nline 1:29: argument of [min(still_hired)] must be [numeric or datetime], found value [min(still_hired)] type [boolean]\nline 1:53: argument of [max(still_hired)] must be [numeric or datetime], found value [max(still_hired)] type [boolean]"
    },
    "status": 400
}

I think it would be useful to test the null default handling for multi-values. For example from employees | eval min = least(job_positions), max = greatest(job_positions) | sort emp_no | limit 10 | keep job_positions, min, max.


least
// tag::least[]
ROW a = 10, b=20
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whitespacing is inconsistent.


greatest
// tag::greatest[]
ROW a = 10, b=20
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here about inconsistency.

include::{esql-specs}/math.csv-spec[tag=greatest-result]
|===

NOTE: When run on `keyword` or `text` fields, this'll return the last string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this about multi-values support?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are like most of our other functions at this point and don't support multivalued fields. This is a case where it'd make a lot of sense to take, like, mv_min for least.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's about what LEAST and GREATEST mean when you use them with string style fields. Let me see about clarifying.

Copy link
Copy Markdown
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM, I only wonder about handling/testing of some edge cases like empty varargs or multi-value fields.

[[esql-greatest]]
=== `GREATEST`

Returns the maximum value from many columns.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit/question: does it make sense to mention behavior for multi-value columns? I'd be tempted to do something like

row a=[1,2,3], b = 0 | eval greatest(a,b)

and would expect to receive either 0 or an error, or null with a warning.

Comment on lines +1133 to +1138
static final Map<String, BiFunction<Source, List<Expression>, ScalarFunction>> VARARG_CTORS = Map.ofEntries(
entry(name(Case.class), Case::new),
entry(name(Coalesce.class), Coalesce::new),
entry(name(Greatest.class), Greatest::new),
entry(name(Least.class), Least::new)
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (I'm not fully familiar with the code base yet, so this is going to sound naive): It seems like instead of a map, one could also define an enum representing the vararg functions, or maybe have them all implement an interface that is used by vararg functions. Why do we go with a map?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They could all implement an interface, namely Writeable. Then we wouldn't have any of these. But we ended up going this way to prevent us from having to implement that interface on QL code. To be honest, I wonder if, now that we've merged to main, we should back that out.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementing Writable means touching all the QL code (expression, attribute, etc..) instead of having one class that handles all the messy details of writing and another for reading.

/**
* Builds test cases for variable argument functions.
*/
public class VaragsTestCases {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming bikeshed (might be inconsistent with other names, is meant absolutely optional): Wouldn't VarargsTestCaseBuilder be more precise? At first glance, I thought that an instance of this represents a fixed test case, rather than supplying an arbitrary number of test cases.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VarargsTestCaseBuilder is more precise I think.


import static org.hamcrest.Matchers.equalTo;

public class GreatestTests extends AbstractFunctionTestCase {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe a test case would be nice that documents behavior unter multi-value fields, even if it just documents that multi-value fueilds results in an error/warning.


@Evaluator(extraName = "Int")
static int process(int[] values) {
int max = values[0];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can't this be out of bounds? Not sure the case is handled when the number of varargs is zero.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function requires at least one argument. Or, at least, it should. Let me make sure there's a test for that case.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I've made this error a few times!


g:integer
1234
;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question/nit: do we allow varargs functions to take 0 arguments? IMHO, a test case for this would be good, either as int test (if it's allowed) or as unit test (if it's supposed to error out).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding some more tests asserting that we don't support 0 arguments. Actually, none of our varargs functions should allow 0 arguments. That requires a bit of fiddling on the constructors which I'll do.

@astefan
Copy link
Copy Markdown
Contributor

astefan commented Aug 21, 2023

I'd also add one more test that's a bit more complex: from employees | eval min_plus_max = to_int(least(salary_change) + greatest(salary_change)), double = to_int(salary_change * 2), are_equal = to_int(least(salary_change) + greatest(salary_change)) == to_int(salary_change * 2) | sort emp_no | limit 10 | keep salary_change, min_plus_max, double, are_equal

Copy link
Copy Markdown
Member Author

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alex-spies and @astefan this is ready for another round I think! I've added multivalue fields to the VarargsTestCaseBuilder. I've added a little bit to the docs. But just a little. Let me know what you think!

Copy link
Copy Markdown
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Only left remarks/questions that are all meant to be highly optional.

import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.IntStream;
import java.util.stream.Stream;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: yay for complie-time enforcement of at least 1 argument.

);
}
throw new UnsupportedOperationException();
throw new QlIllegalArgumentException("unsupported type [" + dataType + "]");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can use EsqlIllegalArgumentException::illegalDataType, although this code should likely never be reached, anyway.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know this was there. Happy to use it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alex-spies Better to move that utility method from the exception class to an util class specialized for the topic - see TypeResolutions/ SqlTypeResolution(we would have to make on for ESQL) orEsqlDataTypes`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will move that @costin!

@Override
protected Greatest build(Source source, List<Expression> args) {
return new Greatest(Source.EMPTY, args.stream().toList());
return new Greatest(Source.EMPTY, args.get(0), args.subList(1, args.size()));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe worth adding a comment why the source argument is unused/ignored.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't ignore it. It doesn't really add much to the test, but if we have it we should pass it.

return this;
}

public VaragsTestCaseBuilder expectBoolean(Function<Stream<boolean[]>, Optional<boolean[]>> expectedBoolean) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: I wonder if we could reduce the code duplication in this file via generics, although that may be more pain than is worth it given that we probably need unboxed primitives.

|===

NOTE: When run on `keyword` or `text` fields, this'll return the first string
in alphabetical order. When run on `boolean` columns this will return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be "... this will return false if any values are false".

10
;

leastGreatestMany
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

* 2.0.
*/

package org.elasticsearch.xpack.esql.expression.function.scalar.math;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't really matter, but the package this is in is kind of an unusual choice. Is not really a great option, but if I would have to choose I'd go with org.elasticsearch.xpack.esql.expression.function.scalar.conditional instead.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is like max, I can move it.

Copy link
Copy Markdown
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@nik9000 nik9000 added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Aug 22, 2023
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine merged commit 65ea90d into elastic:main Aug 22, 2023
@nik9000 nik9000 deleted the column_min branch August 22, 2023 18:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >feature Team:QL (Deprecated) Meta label for query languages team v8.11.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants