Skip to content

[ES|QL] Support first(), last() functions #108385

@wchaparro

Description

@wchaparro

Support aggregate functions - FIRST(<field>, <@timestamp>) and LAST(<field>, <@timestamp>)

Open questions - Available in Tech Preview

  • Do we cover sorting by things other than dates? [ES|QL] Selecting top count for an aggregation doesn't seem to work as expected #128612 + [ES|QL] Support first(), last() functions #108385 (comment)

    • All sortable types. Eventually. Good enough to do just longs for a while.
    • The first parameter to either FIRST or LAST (i.e.: <field>) shall support the following types:
      • long / int / double
      • keyword / text
      • boolean
      • IP
  • If <field> is NULL at the highest sort key, do we return that, or the value at the next highest sort key?

    • Security wants the behavior above. But does everyone? No. Let's allow nulls.
  • What about ties in the sort key? The prototype presumes it's allowed to return any of the values with the lowest sort key.

    • Figure out what sort does and probably do the same thing.
      • If we push to Lucene AND there is one shard THEN we do break ties! We use the doc _id. Which isn't perfect, but it's something.
      • Otherwise it's not consistent which one you get.
    • We might be ok if the sort isn't stable.
    • We could use the VALUE as a tie breaker.
    • For now, I'm going to return any value.
  • What about a multivalued sort key?

    • Should first use the minimum value and last use the maximum value? That's what the prototype does and it feels quite natural. It's as though FIRST(<field>, <@timestamp>) was implied to be FIRST(<field>, MV_MIN(<@timestamp>)).
    • We should double check what we do for sort and do that.
  • What if the value at the min sort key is multivalued? The prototype returns one of those values.

    • It's much less efficient if we have to return multivalued results. And the security folks know for lots of uses that the value is always single valued.
    • We could keep all of the speed the prototype has by specializing MV_FIRST(FIRST(<field>, <@timestamp>)) or FIRST(MV_FIRST(<field>), <@timestamp>).
    • We should return multivalued fields.

Open questions - Not Required for Tech Preview

  • Should @timestamp be implied if not provided? So you could call it with FIRST(<field>).
    • We think we should make an alias - EARLIEST/LATEST - that defaults to @timestamp.
  • Is FIRST and LAST the right name? Splunk has EARLIEST and LATEST for this concept. SQL has a FIRST_VALUE and LAST_VALUE but it's very window-function-y. And we're not window-function-y.
    • Leave the name FIRST/LAST and make a shortcut for EARLIEST/LATEST that operates on just @timestamp.

Tests we'll need

  • Huge number of multivalues in <field> and <@timestamp>
  • Parameter for the function name
  • The regular tests that usually ship with scalar functions
  • Unit tests for each type of aggregators, for both ungrouped and grouped cases. E.g.: FirstLongByTimestampAggregatorFunctionTests and grouping equivalent.

Things to circle back to after tech preview

  • Vector-only dispatch.
    Special handling for block values that can be vectorized has been disabled in #138390; it was a no-op. At some point, we'd like to enable it back and actually have it implemented for performance reasons as the vector path won't have to check for nulls, or mult-ivalues.
  • Add support for date / data_nanos / float in the value parameter
  • Make aliases that default to @timestamp. Call them EARLIEST/LATEST.

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions