SQL: Implement FIRST/LAST aggregate functions#37936
Conversation
FIRST and LAST can be used with one argument and work similarly to MIN and MAX but they are implemented using a Top Hits aggregation and therefore can also operate on keyword fields. When a second argument is provided then they return the first/last value of the first arg when its values are ordered ascending/descending (respectively) by the values of the second argument. Currently because of the usage of a Top Hits aggregation FIRST and LAST cannot be used in the HAVING clause of a GROUP BY query to filter on the results of the aggregation. Closes: elastic#35639
|
Pinging @elastic/es-search |
costin
left a comment
There was a problem hiding this comment.
Looks good overall however there some things that need improving.
In particular the rewrite optimization I don't think is complete and probably needs a follow-up PR.
|
|
||
| .Synopsis: | ||
| [source, sql] | ||
| -------------------------------------------------- |
There was a problem hiding this comment.
I think for optional arguments, it would be nice to add an example with FIRST just with the first argument.
There was a problem hiding this comment.
You mean here in the synopsis? because we have an example here: https://github.com/elastic/elasticsearch/pull/37936/files/1be7ba194c950ab175ca0e35d927ab0b54306be2#diff-f71f0c2cd27202241aec808b033b8510R166
There was a problem hiding this comment.
@costin our documentation, until now, lists one version of a function in synopsis where the optional arguments are between square brackets. See https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/string.asciidoc#locate and https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/math.asciidoc#round (and others). Listing all versions of a function (with/without optional params) imo would look a bit crowded and not-so-clean and sleek. In this particular case of FIRST/FIRST_VALUE/LAST/LAST_VALUE means we need to list four versions of the same function in the synopsis and I personally believe it doesn't look that great.
Do you believe that, maybe, it is not clear that parameter is optional? (the square brackets and the description saying "optional")
There was a problem hiding this comment.
It's not the optional parameter but the method name that I'd like to be exemplified through a query to make the function name more obvious.
It might just be easier to create a separate section for the alias, mentioning that it's an alias for ...
This can be done in a separate PR though since it's just documentation.
There was a problem hiding this comment.
| .Synopsis: | ||
| [source, sql] | ||
| -------------------------------------------------- | ||
| FIRST(field_name<1>, sort_by_field_name<2>) |
There was a problem hiding this comment.
sort_by_field_name -> ordering field
|
|
||
| *Input*: | ||
|
|
||
| <1> a field name |
There was a problem hiding this comment.
<1> target field for the aggregation
<2> optional field used for ordering
|
|
||
| [cols="<,<"] | ||
| |=== | ||
| s|a|b |
There was a problem hiding this comment.
Please use indentation to have the table columns aligned.
| -------------------------------------------------- | ||
|
|
||
| [[sql-functions-aggs-first]] | ||
| ===== `FIRST/FIRST_VALUE` |
There was a problem hiding this comment.
Additionally please add an example for FIRST_VALUE.
| return parameters().isEmpty() ? null : parameters().get(0); | ||
| } | ||
|
|
||
| public DataType sortFieldDataType() { |
There was a problem hiding this comment.
I don't think this method is needed - one can always say sortField().dataType().
There was a problem hiding this comment.
I did that to avoid the caller checking if the sortField() is null.
There was a problem hiding this comment.
All expressions return only one datatype - of their result, adding a new method is confusing. It's simple to do the null check - in fact, it's is mandatory for sorting since one cannot just add a null, null directive (if the sort field would be null).
| super(source, field, sortField != null ? Collections.singletonList(sortField) : Collections.emptyList()); | ||
| } | ||
|
|
||
| public Expression sortField() { |
There was a problem hiding this comment.
Any reason why sort is used instead of order/orderby?
There was a problem hiding this comment.
Not really, can change.
| AggregationBuilder toBuilder() { | ||
| // Sort missing values (NULLs) as last to get the first/last non-null value | ||
| List<SortBuilder<?>> sortBuilderList; | ||
| if (sortField != null) { |
There was a problem hiding this comment.
The code here is redundant since the field ordering code is duplicated.
if (sortField!=null) {
sort.add(fieldSortBuilder(sortField);
}
sort.add(fieldSortBuilder(fieldName());
| // Sort missing values (NULLs) as last to get the first/last non-null value | ||
| List<SortBuilder<?>> sortBuilderList; | ||
| if (sortField != null) { | ||
| sortBuilderList = Arrays.asList( |
There was a problem hiding this comment.
To increase readability please use one method per line.
| } | ||
| } | ||
|
|
||
| public String format() { |
There was a problem hiding this comment.
Not okay conceptually and implementation wise since the base class is aware of its child type.
Make format() return DocValueFieldsContext by default and override DateTime and Date to return DATE_PARSE_FORMA - basic overriding.
There was a problem hiding this comment.
But this is an enum, I don't get it.
There was a problem hiding this comment.
Let me rephrase this - a dataType can have different formats. Currently we store that not in the DataType but rather inside the DateEsField as it can vary from field to field.
Ironically due to our usage of field_caps this information is not accessible anymore however it might be in the future.
DateEsField allows a custom format to be specified though I'm not sure this is something we need to use necessarily.
If the format could be used on DateEsField I would opt for that, if not, it can stay as is in the DataType but the two classes need to be aligned so the format is defined and stored only in one place.
There was a problem hiding this comment.
The getFormats() on DateEsField is currently not used anywhere so I'll remove it from there to avoid any confusion.
astefan
left a comment
There was a problem hiding this comment.
Looking good. Left few minor comments.
Also, I would like to see tests for all values NULL. Something like SELECT FIRST(whatever) FROM bla WHERE whatever IS NULL GROUP BY X.
Also, tests that catch the error generated when FIRST/LAST/MIN/MAX is used in HAVING.
| .Synopsis: | ||
| [source, sql] | ||
| -------------------------------------------------- | ||
| FIRST(field_name<1>, sort_by_field_name<2>) |
There was a problem hiding this comment.
If sort_by_field_name is optional, I think you need to put it between square brackets to remain consistent with the rest of the documentation.
|
|
||
| .Description: | ||
|
|
||
| When only one argument is provided it returns the first **non-NULL** value across input values in the field |
There was a problem hiding this comment.
"When only the first argument is provided".... (only the second one is optional).
There was a problem hiding this comment.
The first non-null value is returned regardless of the presence of the optional argument.
I would rephrase this to "Returns the first non-NULL value (if it exists) of the input column sorted by the sort_column. If sort_column is missing, the field_name column is used for sorting`. or something along those lines.
| ----------------------------------------------------------- | ||
|
|
||
| [NOTE] | ||
| `FIRST` cannot be used in to create a filter in a `HAVING` clause of a `GROUP BY` query. |
There was a problem hiding this comment.
I think "in" is not needed here?
Also, worth putting this as a limitation in the docs? (meaning, in the dedicated Limitations page)
There was a problem hiding this comment.
FIRST cannot be used in a HAVING clause (no need to specify GROUP BY as that's understood and also might be implicit).
| .Synopsis: | ||
| [source, sql] | ||
| -------------------------------------------------- | ||
| LAST(field_name<1>, sort_by_field_name<2>) |
There was a problem hiding this comment.
Same as above for optional fields and square brackets.
|
|
||
| .Description: | ||
|
|
||
| It's the inverse of <<sql-functions-aggs-first>>. When only one argument is provided it returns the |
There was a problem hiding this comment.
"first argument" instead of "one argument".
...rc/main/java/org/elasticsearch/xpack/sql/execution/search/extractor/TopHitsAggExtractor.java
Show resolved
Hide resolved
|
|
||
| import static org.elasticsearch.common.logging.LoggerMessageFormat.format; | ||
|
|
||
| public abstract class TopHits extends AggregateFunction { |
There was a problem hiding this comment.
Please, add a comment describing what the class is used for.
| List<SortBuilder<?>> sortBuilderList; | ||
| if (sortField != null) { | ||
| sortBuilderList = Arrays.asList( | ||
| new FieldSortBuilder(sortField).order(sortOrder).missing(Sort.Missing.LAST.position()) |
There was a problem hiding this comment.
And maybe import static for LAST would make the code a bit more readable.
| } | ||
| TopHitsAgg that = (TopHitsAgg) o; | ||
| return Objects.equals(sortField, that.sortField) && | ||
| sortOrder == that.sortOrder && |
There was a problem hiding this comment.
Picky, and I apologize, but please put the && at start of the line. I think this is the consistent approach for this in code overall.
| -------------------------------------------------- | ||
|
|
||
| [NOTE] | ||
| `MIN` on a field of type <<text, `text`>> or <<keyword, `keyword`>> is translated into <<sql-functions-aggs-first>>. |
There was a problem hiding this comment.
And I'm assuming MIN/MAX on keywords cannot be used in HAVINGs...
There was a problem hiding this comment.
Yes, that's why I made the link here. Do you think I should add the info about HAVING here too?
astefan
left a comment
There was a problem hiding this comment.
Left some minor comments.
| FIRST_VALUE(field_name<1>) | ||
|
|
||
| FIRST(field_name<1>, ordering_field_name<2>) | ||
| FIRST_VALUE(field_name<1>, ordering_field_name<2>) |
There was a problem hiding this comment.
Square brackets for optional fields, please.
There was a problem hiding this comment.
But now we list both with 1 arg and 2args, do you think is still necessary?
There was a problem hiding this comment.
If we do it like this (all versions listed - with/without the optional parameter), our documentation will be inconsistent: https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/string.asciidoc#locate. If I would have to choose between the two approaches, I would go with [ ].
There was a problem hiding this comment.
In my opinion you either have one signature with the 2nd arg in square brackets or you list the 2 variants. Personally I prefer the one signature and the square brackets. @costin what do you think?
There was a problem hiding this comment.
I opt for the signature with the second arguments in square brackets.
|
|
||
| *Input*: | ||
|
|
||
| <1> target field for the aggregation |
There was a problem hiding this comment.
"aggregation" refers to technical implementation in the background. My personal approach would be not to expose this, but try to explain what the field is used for from the end-user perspective. Again, personal preference.
There was a problem hiding this comment.
Not necessarily - for the aggregate function (aggregation is ES terminology, aggregate is SQL).
| LAST_VALUE(field_name<1>) | ||
|
|
||
| LAST(field_name<1>, ordering_field_name<2>) | ||
| LAST_VALUE(field_name<1>, ordering_field_name<2>) |
There was a problem hiding this comment.
Square brackets for optional fields, please.
| "\"explain\":false,\"docvalue_fields\":[{\"field\":\"keyword\",\"format\":\"use_field_mapping\"}]," + | ||
| "\"explain\":false,\"docvalue_fields\":[{\"field\":\"keyword\"}]," + | ||
| "\"sort\":[{\"int\":{\"order\":\"desc\",\"missing\":\"_last\",\"unmapped_type\":\"integer\"}}," + | ||
| "{\"keyword\":{\"order\":\"desc\",\"missing\":\"_last\",\"unmapped_type\":\"keyword\"}}]}}}}}")); |
There was a problem hiding this comment.
I cannot wrap my head around missing option for top_hits in the context of first/last/min/max... It's about how null/missing values are treated when sorting: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html#_missing_values. Is it ok to keep missing with the default (_last), or I'm missing something, or there are no tests that have missing: _first scenario included...?
There was a problem hiding this comment.
After I clicked "add comment" I realized these are about first/last non-NULL values. So yeah, we don't care about missing and we like the fact that the default is _last.
FIRST and LAST can be used with one argument and work similarly to MIN and MAX but they are implemented using a Top Hits aggregation and therefore can also operate on keyword fields. When a second argument is provided then they return the first/last value of the first arg when its values are ordered ascending/descending (respectively) by the values of the second argument. Currently because of the usage of a Top Hits aggregation FIRST and LAST cannot be used in the HAVING clause of a GROUP BY query to filter on the results of the aggregation. Closes: #35639
|
Backported to |
…ersion * elastic/master: Do not set up NodeAndClusterIdStateListener in test (elastic#38110) ML: better handle task state race condition (elastic#38040) Soft-deletes policy should always fetch latest leases (elastic#37940) Handle scheduler exceptions (elastic#38014) Minor logging improvements (elastic#38084) Fix Painless void return bug (elastic#38046) Update PutFollowAction serialization post-backport (elastic#37989) fix a few versionAdded values in ElasticsearchExceptions (elastic#37877) Reenable BWC tests after backport of elastic#37899 (elastic#38093) Mute failing test Mute failing test Fail start on obsolete indices documentation (elastic#37786) SQL: Implement FIRST/LAST aggregate functions (elastic#37936) Un-mute NoMasterNodeIT.testNoMasterActionsWriteMasterBlock remove unused parser fields in RemoteResponseParsers
Although the translation rule was implemented in the `Optimizer`, the rule was not added in the list of rules to be executed. Relates to elastic#41195 Follows elastic#37936
Although the translation rule was implemented in the `Optimizer`, the rule was not added in the list of rules to be executed. Relates to elastic#41195 Follows elastic#37936
FIRST and LAST can be used with one argument and work similarly to MIN
and MAX but they are implemented using a Top Hits aggregation and
therefore can also operate on keyword fields. When a second argument is
provided then they return the first/last value of the first arg when its
values are ordered ascending/descending (respectively) by the values of
the second argument. Currently because of the usage of a Top Hits
aggregation FIRST and LAST cannot be used in the HAVING clause of a
GROUP BY query to filter on the results of the aggregation.
Closes: #35639