Support for unmapped fields #120072

@GalLalouche

Work plan

ES|QL Field Loading for LogsAI

We'd like to enable the LogsAI folks to build a sort of "tree" of data streams with many fields left unmapped at index time and loaded at query time. ES|QL is capable of loading fields from source today, but only if they are mapped. Let's make this smooth for LogsAI users.

The Pretty Way

We expect to add some pretty syntax to "insist" that a field exists. It is a noop if the field exists and is mapped correctly; otherwise it "forces" the field to exist, either by loading it from _source or by loading it from an existing mapped field and casting its values on the fly. Imagine a syntax like:

FROM logs | INSIST field::LONG | WHERE field == 200
  • If field is mapped as a long then we push the WHERE clause to the index.
  • If field is unmapped we load it from _source, parse it to a long, and compare it for each document.
  • If field is mapped as a keyword then we load it like a normal keyword, parse each value to a long, and do the comparison on the fly.

Note that the INSIST syntax is almost certainly not what we want in the end. But we want some syntax for this behavior that's short.

The hard way or: How we get there from here

The most natural way for ES|QL to support unmapped fields is to create a function to load from source. Something like:

FROM logs
| EVAL field = EXTRACT_FROM_SOURCE("field")

This is close already! If we just build EXTRACT_FROM_SOURCE we could really unblock some folks. The trouble is that ES|QL will always run EXTRACT_FROM_SOURCE, even if field is mapped. And with the "tree" shaped mapping, we expect it to be mapped sometimes.

But ES|QL is a whole language, so if we made an IS_MAPPED function we could load from _source if the field isn't mapped. Something like:

FROM logs
| EVAL field = CASE(IS_MAPPED("field"), field, EXTRACT_FROM_SOURCE("field"))

This doesn't handle the case where the field is mapped with a different type than the one we want, but we already support that with conversion functions and union types. That'd just look like:

FROM logs
| EVAL field = CASE(IS_MAPPED("field"), field, EXTRACT_FROM_SOURCE("field"))::LONG

This is nice because ES|QL already has constant folding. So:

  • If field is mapped as a long, the query is rewritten to FROM logs.
  • If field is mapped as a keyword, the query is rewritten to FROM logs | EVAL field = field::long.
  • If field is not mapped, the query is rewritten to: FROM logs | EVAL field = EXTRACT_FROM_SOURCE("field")::long.

Those are pretty much exactly what we want! We get them by implementing two functions and a bunch of tests and maybe a rewrite rule.
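To make the three rewrites above concrete, here is a minimal Python sketch of the folding outcome. It is not ES|QL planner code: `fold_insist` and its parameters are hypothetical names, and it only models the result of treating IS_MAPPED as a constant per index, assuming the mapped type is known at planning time.

```python
from typing import Optional


def fold_insist(field: str, target: str, mapped_type: Optional[str]) -> str:
    """Model what is left of the CASE query once IS_MAPPED is a constant.

    mapped_type is None when the field is unmapped (hypothetical convention).
    """
    if mapped_type is None:
        # IS_MAPPED folds to false, so only the _source branch survives.
        expr = f'EXTRACT_FROM_SOURCE("{field}")::{target}'
    elif mapped_type == target:
        # IS_MAPPED folds to true and the cast is a no-op: the EVAL disappears.
        return "FROM logs"
    else:
        # IS_MAPPED folds to true but the cast still has to run.
        expr = f"{field}::{target}"
    return f"FROM logs | EVAL {field} = {expr}"


if __name__ == "__main__":
    for mapped in ("long", "keyword", None):
        print(mapped, "->", fold_insist("field", "long", mapped))
```

Each branch corresponds to one bullet in the list above, which is roughly the set of cases the tests would need to cover.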

The bow

No one is going to want to write out all that stuff for every field. I propose we make the syntax we decide on in "The Pretty Way" section syntactic sugar for this CASE sequence. So

FROM logs | INSIST field::LONG | WHERE field == 200

Becomes

FROM logs
| EVAL field = CASE(IS_MAPPED("field"), field, EXTRACT_FROM_SOURCE("field"))::LONG
| WHERE field == 200
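The desugaring above is a purely textual transformation, so it can be sketched in a few lines of Python. This is only an illustration: `desugar` is a hypothetical helper, the INSIST syntax is a placeholder as noted earlier, and splitting on `|` is a simplification that a real parser would not rely on (it breaks on pipes inside string literals, for example).

```python
import re

# Matches the placeholder syntax "INSIST <field>::<TYPE>".
INSIST = re.compile(r"^INSIST\s+(\w+)::(\w+)$")


def desugar(query: str) -> str:
    """Rewrite every INSIST command into the equivalent EVAL/CASE command."""
    commands = []
    for command in (part.strip() for part in query.split("|")):
        match = INSIST.match(command)
        if match:
            field, type_name = match.groups()
            command = (
                f'EVAL {field} = CASE(IS_MAPPED("{field}"), {field}, '
                f'EXTRACT_FROM_SOURCE("{field}"))::{type_name}'
            )
        commands.append(command)
    return "\n| ".join(commands)


if __name__ == "__main__":
    print(desugar("FROM logs | INSIST field::LONG | WHERE field == 200"))
```

Commands other than INSIST pass through unchanged, so the sugar composes with the rest of the query.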

Performance considerations

Parsing _source is expensive! We should make sure we parse it as few times as we can manage.

First and foremost, you don't want to parse _source once per field you extract from it. We did that in very very old versions of ES|QL, long before GA. It was devastating to performance. We've since built a row-wise loading mechanism for extracting fields from source. We should figure out a way to piggy-back on that when running EXTRACT_FROM_SOURCE.
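As a rough illustration of why row-wise loading matters, the Python sketch below parses each document's _source once and pulls every requested field out of that single parse. It is not the ES|QL loader: `load_fields_row_wise` is a hypothetical name, _source is assumed to be a flat JSON string, and the parse counter exists only to show the cost.

```python
import json
from typing import Any, Dict, List, Tuple


def load_fields_row_wise(
    raw_sources: List[str], fields: List[str]
) -> Tuple[Dict[str, List[Any]], int]:
    """Extract several fields per document with one _source parse per document."""
    columns: Dict[str, List[Any]] = {name: [] for name in fields}
    parses = 0
    for raw in raw_sources:
        doc = json.loads(raw)  # the expensive step, done once per row
        parses += 1
        for name in fields:
            # Missing fields become None (null) in the output column.
            columns[name].append(doc.get(name))
    return columns, parses


if __name__ == "__main__":
    sources = ['{"a": 1, "b": "x"}', '{"a": 2}']
    print(load_fields_row_wise(sources, ["a", "b"]))
```

Loading field by field instead would parse each document once per requested field, so the number of parses would grow with the number of EXTRACT_FROM_SOURCE calls rather than staying at one per row.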

The point of this "tree" shape of data streams is to separate data that's very different from one another. If field exists in only a handful of documents it's going to be super inefficient to load and parse _source to figure that out. Maybe Elasticsearch could make a field that lists all of the unmapped fields included in _source. It wouldn't be that expensive. There are a lot of neat things we could do with it, I think.
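Here is a small Python sketch of how such a listing field could let us skip parsing. Everything in it is an assumption for illustration: the `unmapped_fields` name, the `index_doc` and `extract_from_source` helpers, and the flat JSON documents are all hypothetical, and no such field exists in Elasticsearch today.

```python
import json
from typing import Any, Dict, Iterable, List, Set


def index_doc(raw_source: str, mapped: Set[str]) -> Dict[str, Any]:
    """At index time, record which top-level fields of _source are unmapped."""
    source = json.loads(raw_source)
    return {
        "_source": raw_source,
        # Hypothetical listing field: names present in _source but not mapped.
        "unmapped_fields": sorted(name for name in source if name not in mapped),
    }


def extract_from_source(docs: Iterable[Dict[str, Any]], field: str) -> List[Any]:
    """At query time, only parse _source for documents that list the field."""
    values = []
    for doc in docs:
        if field in doc["unmapped_fields"]:
            values.append(json.loads(doc["_source"]).get(field))
        else:
            values.append(None)  # no parse needed: the field is not there
    return values


if __name__ == "__main__":
    mapped = {"message"}
    docs = [
        index_doc('{"message": "ok", "status": 200}', mapped),
        index_doc('{"message": "no status here"}', mapped),
    ]
    print(extract_from_source(docs, "status"))
```

With a listing like this, the query-time cost is proportional to the number of documents that actually contain the field rather than to the size of the index.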
