
Apache arrow support for ES|QL#104877

Closed
nik9000 wants to merge 6 commits into elastic:main from nik9000:arrow

Conversation

@nik9000
Member

@nik9000 nik9000 commented Jan 29, 2024

This adds support for Apache Arrow's streaming
format as a response from ES|QL. It triggers based on the Accept
header or a format request parameter.

This is neat because Arrow is an efficient format to read: the consumer has
to do very little to take the data from the wire and put
it on the screen. Because Arrow is efficient, folks are excited about
it and are making it available in lots of tools. ES|QL could piggyback
on that. For example, to use ES|QL in a Jupyter notebook with this PR
you'd do this:

import base64
import urllib3
import pyarrow as pa
import matplotlib.pyplot as plt

plt.close("all")
timeout = urllib3.Timeout(connect=2.0, read=300.0)
http = urllib3.PoolManager(timeout=timeout)

def basic(login, password):
    encoded = base64.b64encode((login + ":" + password).encode()).decode()
    return "Basic %s" % encoded

resp = http.request(
    "POST",
    "http://localhost:9200/_query?format=arrow",
    headers={
        "authorization": basic("elastic", "password")
    },
    json={
        "query": """
        FROM nyc_taxis
      | WHERE trip_distance > 0 AND trip_distance < 100 AND fare_amount > 1
      | STATS min=MIN(fare_amount), avg=AVG(fare_amount), max=MAX(fare_amount) BY pickup_datetime=DATE_TRUNC(1 DAY, pickup_datetime)
      | EVAL LOG10(min), LOG10(max), LOG10(avg)
      | DROP min, max, avg
      | SORT pickup_datetime
      | LIMIT 100000"""
    })

if resp.status != 200:
    print("failed: %s" % resp.data)
else:
    with pa.ipc.open_stream(resp.data) as reader:
        df = reader.read_pandas()
        df.plot.line(x="pickup_datetime")

That's about a third ES|QL and half ceremony. The last three lines are
super dense:

  1. Convert the stream of bytes into an Arrow reader
  2. Convert the reader into pandas
  3. Plot the line using matplotlib

That's a little import antigravity. But that's great when
it works! And Kibana should be able to integrate with it. Probably not as
simply as pandas, but we can get it!

Arrow is kind of a perfect output format for ES|QL in that it's a
batched columnar format. The output from this PR looks like:

<SCHEMA>
<BATCH 0>
<BATCH 1>
<BATCH 2>
<BATCH 3>
...
<0xFFFFFFFF 0x00000000>

Each batch looks like:

<HEADER>
<VALIDITY 0>
<DATA 0>
<VALIDITY 1>
<DATA 1>
<VALIDITY 2>
<DATA 2>
<VALIDITY 3>
<LENGTH 3>
<DATA 3>

Dense data like int, long, and double always have a VALIDITY and a
DATA buffer. Variable-length data like keywords have VALIDITY, LENGTH,
and DATA buffers. This format is pretty close to ES|QL's own in-memory
representation. And, the best part is that ES|QL's internal
response is also batched. We have batches, and, if we're very lucky, we
could integrate this with Pausable Chunked Responses (#104851). It
should integrate perfectly. Should.

// return false;
// }
// NOCOMMIT disabled security manager
return true;
Member Author


We can't commit this. But I spent half a day messing with the security manager and gave up. We should be able to fix this.

};
}

static ChunkedRestResponseBody fromMany(ChunkedRestResponseBody first, Iterator<? extends ChunkedRestResponseBody> rest) {
Member Author


This'll want javadoc. But I think it's generally useful.

@Override
public ReleasableBytesReference encodeChunk(int sizeHint, Recycler<BytesRef> recycler) throws IOException {
try {
return current.encodeChunk(sizeHint, recycler);
Member Author


I haven't double checked this, but I'm worried that this'll send a chunk over the wire no matter how big the reference is. If it's less than the sizeHint we should probably give the next one a chance.

* Initialize the Arrow shim. Arrow does some interesting reflection stuff on
* initialization. We can avoid it if we
*/
public static void init() {
Member Author


Arrow has some "interesting" code in its initialization. In an effort to be magic it looks around the classpath and calls setAccessible on something. That something is already public, but the security manager still fails.

We can avoid the whole classpath scanning and reflection magic with some contained magic of our own - this stuff. But we need security manager privileges, so I moved this to its own tiny jar.

Lastly, this replaces everything Arrow does with Unsafe with a shim that does nothing. That's fine for how we're using it.

@nik9000
Member Author

nik9000 commented Jul 2, 2024

#109873 finished this one.

@nik9000 nik9000 closed this Jul 2, 2024
swallez added a commit that referenced this pull request Jul 3, 2024
Initial support for Apache Arrow's streaming format as a response for ES|QL. It triggers based on the Accept header or the format request parameter.

Arrow has implementations in every mainstream language and is a backend of the Python Pandas library, which is extremely popular among data scientists and data analysts. Arrow's streaming format has also become the de facto standard for dataframe interchange. It is an efficient binary format that allows zero-cost deserialization by adding data access wrappers on top of memory buffers received from the network.

This PR builds on the experiment made by @nik9000 in PR #104877

Features/limitations:
- all ES|QL data types are supported
- multi-valued fields are not supported
- fields of type _source are output as JSON text in a varchar array. In a future iteration we may want to offer the choice of the more efficient CBOR and SMILE formats.

Technical details:

Arrow comes with its own memory management to handle vectors with direct memory, reference counting, etc. We don't want to use this as it conflicts with Elasticsearch's own memory management.

We therefore use the Arrow library only for the metadata objects describing the dataframe schema and the structure of the streaming format. The Arrow vector data is produced directly from ES|QL blocks.

---------

Co-authored-by: Nik Everett <nik9000@gmail.com>
tvernum pushed a commit that referenced this pull request Feb 25, 2025
tvernum pushed a commit that referenced this pull request Feb 25, 2025