Work plan
ES|QL Field Loading for LogsAI
We'd like to enable the LogsAI folks to build a sort of "tree" of data streams with many fields left unmapped at index time and loaded at query time. ES|QL is capable of loading fields from source today, but only if they are mapped. Let's make this smooth for LogsAI users.
The Pretty Way
We expect to add some pretty syntax to "insist" that a field exists. It is a no-op if the field exists and is mapped correctly. Otherwise it "forces" the field to exist, either by loading it from _source or by loading it from an existing mapped field and casting its values on the fly. Imagine a syntax like:
FROM logs | INSIST field::LONG | WHERE field == 200
- If field is mapped as a long then we push the WHERE clause to the index.
- If field is unmapped we load it from _source, parse it to a long, and compare it for each document.
- If field is mapped as a keyword then it's loaded like a normal keyword, parsed to a long, and the comparison is done on the fly.
Note that the INSIST syntax is almost certainly not what we want in the end. But we want some syntax for this behavior that's short.
The hard way or: How we get there from here
The most natural way for ES|QL to support unmapped fields is to create a function to load from source. Something like:
FROM logs
| EVAL field = EXTRACT_FROM_SOURCE("field")
This is close already! If we just build EXTRACT_FROM_SOURCE we could really unblock some folks. The trouble is that this query always runs EXTRACT_FROM_SOURCE, even if field is mapped. And with the "tree" shaped mapping, we expect it to be mapped sometimes.
But ES|QL is a whole language, so if we made an IS_MAPPED function we could load from _source if the field isn't mapped. Something like:
FROM logs
| EVAL field = CASE(IS_MAPPED("field"), field, EXTRACT_FROM_SOURCE("field"))
This doesn't handle the case where the field is mapped as a different type than the one we want, but we already support that with conversion functions and union types. That'd just look like:
FROM logs
| EVAL field = CASE(IS_MAPPED("field"), field, EXTRACT_FROM_SOURCE("field"))::LONG
This is nice because ES|QL already has constant folding. So:
- If field is mapped as a long, the query is rewritten to FROM logs.
- If field is mapped as a keyword, the query is rewritten to FROM logs | EVAL field = field::long.
- If field is not mapped, the query is rewritten to FROM logs | EVAL field = EXTRACT_FROM_SOURCE("field")::long.
Those are pretty much exactly what we want! We get them by implementing two functions and a bunch of tests and maybe a rewrite rule.
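To make the folding concrete, here is a minimal Python sketch of the three outcomes listed above. It is only an illustration of the intended rewrite, not how the ES|QL optimizer is implemented. The function name fold_insist and the idea of passing the mapped type as a string (or None for unmapped) are assumptions made for this example, and IS_MAPPED and EXTRACT_FROM_SOURCE are still only proposed functions.

```python
from typing import Optional


def fold_insist(field: str, mapped_type: Optional[str], target: str = "long") -> str:
    """Return the query left over once IS_MAPPED(field) is folded to a constant.

    mapped_type is the field's mapped type, or None when the field is unmapped.
    Hypothetical helper: it models the rewrite with strings, not a real plan.
    """
    if mapped_type is None:
        # IS_MAPPED folds to false, so CASE keeps only the _source branch.
        return f'FROM logs | EVAL {field} = EXTRACT_FROM_SOURCE("{field}")::{target}'
    if mapped_type == target:
        # IS_MAPPED folds to true and the cast is a no-op, so the EVAL goes away.
        return "FROM logs"
    # IS_MAPPED folds to true but the type differs, so only the cast survives.
    return f"FROM logs | EVAL {field} = {field}::{target}"
```

Calling fold_insist("field", "keyword") yields the second rewrite in the list, and fold_insist("field", None) yields the third.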
The bow
No one is going to want to write out all that stuff for every field. I propose we make whatever syntax we settle on in "The Pretty Way" section syntactic sugar for this CASE sequence. So:
FROM logs | INSIST field::LONG | WHERE field == 200
Becomes
FROM logs
| EVAL field = CASE(IS_MAPPED("field"), field, EXTRACT_FROM_SOURCE("field"))::LONG
| WHERE field == 200
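The desugaring above can be sketched as a purely textual rewrite. This Python example is an assumption-heavy illustration: the real rewrite would happen on the parsed plan, not on strings, and the desugar function, its regular expression, and the naive split on "|" (which would break on pipes inside string literals) are all made up for this sketch.

```python
import re

# Matches one pipeline stage of the hypothetical form: INSIST <name>::<TYPE>
_INSIST = re.compile(r"^INSIST\s+(\w+)::(\w+)$")


def desugar(query: str) -> str:
    """Rewrite each INSIST stage into the equivalent EVAL ... CASE stage."""
    out = []
    # Naive stage split; good enough for queries without '|' in literals.
    for stage in (s.strip() for s in query.split("|")):
        m = _INSIST.match(stage)
        if m:
            name, typ = m.groups()
            out.append(
                f'EVAL {name} = CASE(IS_MAPPED("{name}"), {name}, '
                f'EXTRACT_FROM_SOURCE("{name}"))::{typ}'
            )
        else:
            # Every other stage passes through untouched.
            out.append(stage)
    return " | ".join(out)
```

Stages that are not INSIST are left as they are, so the sugar composes with the rest of the query.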
Performance considerations
Parsing _source is expensive! We should make sure we parse it as few times as we can manage.
First and foremost, you don't want to parse _source once per field you extract from it. We did that in very very old versions of ES|QL, long before GA. It was devastating to performance. We've since built a row-wise loading mechanism for extracting fields from source. We should figure out a way to piggy-back on that when running EXTRACT_FROM_SOURCE.
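As a rough illustration of why per-field parsing hurts, the Python sketch below counts how often each approach parses _source. It models _source as a JSON string per document and is not the actual ES|QL loader; both function names are hypothetical.

```python
import json


def load_per_field(sources, fields):
    """Naive approach: parse every document's _source once per field."""
    parses = 0
    columns = {f: [] for f in fields}
    for f in fields:
        for raw in sources:
            doc = json.loads(raw)  # repeated for every requested field
            parses += 1
            columns[f].append(doc.get(f))
    return columns, parses


def load_row_wise(sources, fields):
    """Row-wise approach: parse each _source once, then pull out all fields."""
    parses = 0
    columns = {f: [] for f in fields}
    for raw in sources:
        doc = json.loads(raw)  # parsed exactly once per document
        parses += 1
        for f in fields:
            columns[f].append(doc.get(f))
    return columns, parses
```

Both functions return the same columns, but the naive one does len(fields) times as many parses, which is the cost that several EXTRACT_FROM_SOURCE calls in one query should avoid.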
The point of this "tree" shape of data streams is to separate data that's very different from one another. If field exists in only a handful of documents it's going to be super inefficient to load and parse _source to figure that out. Maybe Elasticsearch could make a field that lists all of the unmapped fields included in _source. It wouldn't be that expensive. There's a lot of neat things we could do with it, I think.
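A minimal sketch of how such a field could help, assuming each document carries a set of its unmapped field names next to its raw _source. No such field exists in Elasticsearch today; the function name and the data layout here are invented for illustration only.

```python
import json


def extract_with_field_list(docs, field):
    """Extract `field` from _source, skipping documents that cannot contain it.

    docs is a list of (unmapped_field_names, raw_source) pairs, where
    unmapped_field_names stands in for the hypothetical per-document list.
    """
    values, parses = [], 0
    for names, raw in docs:
        if field not in names:
            # The list says the field is absent, so _source is never parsed.
            values.append(None)
            continue
        parses += 1
        values.append(json.loads(raw).get(field))
    return values, parses
```

When the field shows up in only a handful of documents, the number of parses drops to roughly that handful instead of the full document count.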
Work plan
- [...] INSIST parameter, without casting or conflict resolution (ESQL: Initial support for unmapped fields #119886).
- [...] INSIST parameters, still without casting (ESQL: Initial support for unmapped fields #119886).
- [...] INSIST clauses, one on top of the other (ESQL: Initial support for unmapped fields #119886).
- [...] EVAL.
- [...] EVAL on top of the INSIST.
- [...] INSIST clauses, i.e., not directly on top of a FROM.
- [...] INSIST clauses when all indices are mapped. Note that we currently do not maintain this information during index retrieval, so it isn't as trivial as one might assume (ESQL: Initial support for unmapped fields #119886).