Skip to content

ESQL: Union Types Support (take2)#107255

Closed
craigtaverner wants to merge 38 commits intoelastic:mainfrom
craigtaverner:union_types_take2
Closed

ESQL: Union Types Support (take2)#107255
craigtaverner wants to merge 38 commits intoelastic:mainfrom
craigtaverner:union_types_take2

Conversation

@craigtaverner
Copy link
Copy Markdown
Contributor

@craigtaverner craigtaverner commented Apr 9, 2024

If the query sources multiple indexes, and the same field exists in multiple indexes with different types, this would normally fail the query. However, if the query includes a conversion function to resolve the field to a single type before it is used in other functions or aggregations, then this should work.

The following query works in this second prototype:

FROM sample_data* METADATA _index
| EVAL client_ip = TO_IP(client_ip)
| KEEP _index, @timestamp, client_ip, event_duration, message
| SORT _index ASC, @timestamp DESC

The client_ip field is an IP in the sample_data index, but a keyword in the sample_data_str index.

The first prototype did stuff to the drivers to create an index specific DriverContext to use during field evaluator construction so that the conversion function would be index/type aware. However, that abuses the idea of multi-threaded drivers. So this new approach instead re-plans the logical plan to extract the converter from the EVAL expressions, setting them as resolved (claiming the input type is already the converted type), and stores the converter in the EsRelation for later use in Physical planning. We no longer need an aggregation to force the conversion function onto the data node, as the 'real' conversion is now done at field extraction time using the converter function previously saved in the EsRelation and replanned into the EsQueryExec.

Fixes #100603

Tasks to do:

  • Develop prototype that plans type conversion from the eval to the field extraction
  • Cleanup classes and types
    • Move MultiTypeField to esql/types
    • Add serialization of MultiTypeField to PlanNamedTypes for multi-node cluster support
  • Generate correct error messages when plan cannot be resolved (mimic previous behaviour)
  • Support rowStrideReader (probably TypeConvertingBlockLoader.builder needs to wrap the delegate with a converting function)
  • Change solution to always create a new field (same name, different id) to not clash with existing field semantics
  • Support conversion expression in any part of the plan (not just EVAL)
  • More tests cases
    • get union_types.csv-spec to work in CsvTests (or disable like ENRICH)
    • Test with conversion function in WHERE clause
    • test more complex cases (multiple evals, convert types back and forth, with and without stats)
    • tests covering row-stride-reader (synthetic source?)
  • Documentation

@craigtaverner craigtaverner added >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL labels Apr 9, 2024
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @craigtaverner, I've created a changelog YAML for you.

nik9000 and others added 23 commits April 10, 2024 10:59
Landing elastic#107018 broke running ESQL unit tests in IntelliJ. It has
*something* to do with turning on the stringtemplate plugin in the esql
project but I don't really know what. After that PR we'd often get
errors about trying to regenerate evaluators twice. I dunno. This fixes
it. But I don't really know why.

The way this fixes it is by making the `esql` project more like the
`copmute` project. It makes sense that that would help - they both have
the same code generation configuration. Anyway, the operative change is
landing the generated files in the same place as the `compute` project.
Thus all of the file moves.

Again, I have no idea why this works. It's build black magic and I just
shook it until it worked. Most of the credit goes to git-bisect for
finding the commit that broke this.
Specifying `?master_timeout=-1` on an API which performs a cluster state
update means that the cluster state update task will never time out
while waiting in the pending tasks queue. However this parameter is also
re-used in a few places where a timeout of `-1` means something else,
typically to timeout immediately. This commit fixes those places so that
`?master_timeout=-1` consistently means to wait forever.
👋 @dakrone mentioned baby-update for docs. 

Sometimes users believe they should point client-side (e.g. Logstash's Elasticsearch output `index`) to the bootstrapped index. This highlights they point ingest towards the alias as we're expecting.
This PR adds the ability to modify the failure store indices on a data
stream using the modify data stream API.

These options are available in the event that we need to pull indices
out of a failure store or add them back to the failure store for any
reason. The operations are done using the existing modify data stream
actions with a new flag on the action body to denote if the action
should be done on the failure stores or not.
`.` is not a wildcard character....

Closes elastic#106791
On some systems Java appears to return amd64 (even if not an amd
processor), but on others it returns x86_64. This commit handles the
latter case to correctly associate the arch with the appropriate
platform dir.
* Change rerank result types, add named writables, new transportVersion

* Add description for namedWriteables

* fix RankedDocsResults toString

---------

Co-authored-by: David Kyle <david.kyle@elastic.co>
This commit allows the cli access to sending SIGKILL to the underlying
jvm process.
* string literal casting for scalar functions and arithmetic operations.
This adds a new SPI based `LoggingDataProvider` service that can be
implemented in order to add new fields to the main JSON log
For now this just sends a random version; later, we will want to specify
applicable versions in the csv tests themselves.
This is no longer needed in testRollupIndex, no test failures over a
week.

Fixes elastic#105437 Related to elastic#105485
jedrazb and others added 8 commits April 11, 2024 18:20
We've run into heap dumps that had instances of this class consume tens
and in one case more than a hundred MB of heap.
It seems reasonable to use a thread-local for the `tmp` long array
and trade the cost of looking up the thread-local for the memory
savings and cycles saved for allocating and assigning instances.
The first prototype did stuff to the drivers to create an index specific DriverContext to use during field evaluator construction so that the conversion function would be index/type aware. However, that abuses the idea of multi-threaded drivers. So this new approach instead re-plans the logical plan to extract the converter from the EVAL expressions, setting them as resolved (claiming the input type is already the converted type), and stores the converter in the EsRelation for later use in Physical planning.

The following query works in this second prototype:

```
multiIndexIpString
FROM sample_data* METADATA _index
| EVAL client_ip = TO_IP(client_ip)
| KEEP _index, @timestamp, client_ip, event_duration, message
| SORT _index ASC, @timestamp DESC
```

We no longer need an aggregation to force the conversion function onto the data node, as the 'real' conversion is now done at field extraction time using the converter function previously saved in the EsRelation and replanned into the EsQueryExec.
So the plan can be communicated across the network to data nodes.
The resolution is almost identical, but not only do we move it to the analyser, so failed resolutions will still generate errors, but instead of replacing all previous unresolved MultiTypeField with EsField, we replace them with resolved MultiTypeField, so we no longer need a special EsUnionTypesQueryExec for finding the union types from the FieldExractExec, because the attributes already in that object contain the necessary information for type resolution.
When multiple conversions functions are required, for multiple fields with union types, we had a bug in the fields replacement code.

These tests also assert on the correct error messages when type conversion does not occur. For now we use the same error message was had before union types was introduced.
This was for debugging only, and now we have much better negative test coverage in the yaml tests 140_union_types.yml
@craigtaverner craigtaverner mentioned this pull request Apr 16, 2024
21 tasks
@craigtaverner
Copy link
Copy Markdown
Contributor Author

Replaced by #107545

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.14.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ESQL: Mismatching mapping cannot be worked around with eval (Support for union types)