One issue I keep hearing about is that it's too hard to define a runtime field that extracts some information from a message field with Painless. Something like extracting the HTTP status code from a log line of an Apache access log.
I think that this issue has been put into the general meta issue of "doing simple things with Painless should be simpler" but in my opinion this particular issue has more to do with mappings than with Painless. Historically, fielddata on analyzed string fields would uninvert the inverted index in memory and Elasticsearch would consider that the value of a field is the set of analyzed terms that it contains. This would require lots of memory, and over time we've increasingly discouraged users from doing it.
These semantics don't work well with runtime extraction of data. If you try to extract data using a regular expression that applies to doc['message'], you'll get an exception that fielddata is disabled by default on text fields. And even if Elasticsearch returned values, you'd get individual terms, which you cannot leverage to properly extract data from the message.
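To make the difference concrete, here is a minimal Python sketch of the two semantics. It is not Elasticsearch code: the `analyze` function is a hypothetical, rough stand-in for an analyzer (lowercase, split on non-alphanumeric characters), and the log line and regular expression are only illustrative.

```python
import re

# Illustrative Apache access log line (made up for this sketch).
LOG_LINE = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'


def analyze(value):
    """Hypothetical, rough stand-in for an analyzer: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", value.lower())


def fielddata_terms(value):
    """Approximates the historical semantics: the field value is the sorted set of analyzed terms."""
    return sorted(set(analyze(value)))


def extract_status(message):
    """Extracts the status code from the whole message, using the closing quote of the request as context."""
    match = re.search(r'" (\d{3}) ', message)
    return match.group(1) if match else None


# With the whole value, the surrounding context makes the extraction unambiguous.
status_from_whole_value = extract_status(LOG_LINE)

# With individual terms, position and context are lost, so several terms look like a status code.
status_candidates_from_terms = [t for t in fielddata_terms(LOG_LINE) if re.fullmatch(r"\d{3}", t)]
```

In this sketch the whole-value extraction finds a single status code, while the term-based view leaves two equally plausible candidates: the first octet of the IP address and the actual status code.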
I suggest that we change the semantics of fielddata on fields of the text family (including text and match_only_text) so that it returns whole values instead. This will enable us to give a more intuitive experience with scripts, where doc could read data from _source on text fields (#80504).
Note that this brings a downside: in order to make it easy to slice and dice the data, Elasticsearch allows users to use terms produced by terms aggregations in term filters, in order to dig further into the data that falls within a given bucket. This would not work on text fields. I don't think it's the end of the world, since terms aggregations do not work on text fields today anyway given that we disallow fielddata, but I wanted to highlight it since it would create an exception to a rule that is otherwise honored by keyword, ip or numeric fields.
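The round trip that would break can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Elasticsearch is implemented: the two dictionaries stand in for the inverted index of a keyword field and of a text field, and `analyze` and `term_filter` are hypothetical helpers.

```python
import re
from collections import defaultdict


def analyze(value):
    """Hypothetical, rough stand-in for an analyzer: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", value.lower())


docs = {1: "GET /index.html 200", 2: "GET /missing 404"}

keyword_index = defaultdict(set)  # keyword-like field: the whole value is the indexed term
text_index = defaultdict(set)     # text-like field: analyzed terms are indexed

for doc_id, value in docs.items():
    keyword_index[value].add(doc_id)
    for term in analyze(value):
        text_index[term].add(doc_id)


def term_filter(index, term):
    """Toy term filter: an exact lookup of the term in the index."""
    return index.get(term, set())


# Under the proposed semantics, a bucket key on a text field would be the whole value.
bucket_key = docs[1]

keyword_hits = term_filter(keyword_index, bucket_key)  # the bucket key round-trips
text_hits = term_filter(text_index, bucket_key)        # the whole value is not an indexed term
```

In this model the keyword lookup returns the document that produced the bucket, while the text lookup returns nothing, because the index only contains the analyzed terms.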