Conversation
// return false;
// }
// NOCOMMIT disabled security manager
return true;
We can't commit this. But I spent half a day messing with the security manager and gave up. We should be able to fix this.
};
}
static ChunkedRestResponseBody fromMany(ChunkedRestResponseBody first, Iterator<? extends ChunkedRestResponseBody> rest) {
This'll want javadoc. But I think it's generally useful.
@Override
public ReleasableBytesReference encodeChunk(int sizeHint, Recycler<BytesRef> recycler) throws IOException {
    try {
        return current.encodeChunk(sizeHint, recycler);
I haven't double checked this, but I'm worried that this'll send a chunk over the wire no matter how big the reference is. If it's less than the sizeHint we should probably give the next one a chance.
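The concern above can be sketched in Python (the real code is Java; the function name and shapes here are illustrative, not the PR's API): instead of flushing whatever the current body encodes, a combiner could keep pulling from successive bodies and only emit a chunk once it reaches the size hint.

```python
from typing import Iterator


def encode_chunk_from_many(parts: Iterator[bytes], size_hint: int) -> Iterator[bytes]:
    """Buffer small encoded chunks from successive bodies until each emitted
    chunk reaches size_hint, instead of sending every sub-chunk over the wire
    no matter how small it is."""
    buf = bytearray()
    for piece in parts:
        buf += piece
        if len(buf) >= size_hint:
            yield bytes(buf)  # big enough: flush to the wire
            buf.clear()
    if buf:
        yield bytes(buf)  # trailing partial chunk at end of stream
```

With a size hint of 4, two 2-byte pieces coalesce into one chunk rather than going out as two tiny writes.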
 * Initialize the Arrow shim. Arrow does some interesting reflection stuff on
 * initialization. We can avoid it if we
 */
public static void init() {
Arrow has some "interesting" code in its initialization. In an effort to be magic it looks around in the classpath and calls setAccessible on something. That something is public already, but the security manager still fails.
We can avoid the whole classpath scanning and reflection magic with some contained magic of our own - this stuff. But we need security manager privileges, so I moved this to its own tiny jar.
Lastly, this replaces all the things Arrow does with Unsafe with a shim that does nothing. That's fine for how we're using it.
#109873 finished this one.
Initial support for Apache Arrow's streaming format as a response for ES|QL. It triggers based on the Accept header or the `format` request parameter. Arrow has implementations in every mainstream language and is a backend of the Python Pandas library, which is extremely popular among data scientists and data analysts. Arrow's streaming format has also become the de facto standard for dataframe interchange. It is an efficient binary format that allows zero-cost deserialization by adding data access wrappers on top of memory buffers received from the network. This PR builds on the experiment made by @nik9000 in PR #104877

Features/limitations:
- all ES|QL data types are supported
- multi-valued fields are not supported
- fields of type _source are output as JSON text in a varchar array

In a future iteration we may want to offer the choice of the more efficient CBOR and SMILE formats.

Technical details: Arrow comes with its own memory management to handle vectors with direct memory, reference counting, etc. We don't want to use this as it conflicts with Elasticsearch's own memory management. We therefore use the Arrow library only for the metadata objects describing the dataframe schema and the structure of the streaming format. The Arrow vector data is produced directly from ES|QL blocks.

Co-authored-by: Nik Everett <nik9000@gmail.com>
This adds support for Apache Arrow's streaming format as a response from ES|QL. It triggers based on the Accept header or a `format` request parameter.

This is neat because Arrow is an efficient format to read - as in the processor has to do very little to take the data from the wire and put it on the screen. Because Arrow is efficient, folks are excited about it and are making it available in lots of tools. ES|QL could piggyback on that. For example, to use ES|QL in a Jupyter Notebook with this PR you'd do this:
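The original notebook snippet isn't reproduced here; a rough Python sketch of the idea follows. The endpoint path, media type, and helper name are assumptions for illustration, not the PR's exact code, and `pyarrow` is imported lazily since it's a third-party dependency.

```python
import json
from urllib.request import Request, urlopen


def esql_to_dataframe(es_url: str, query: str):
    """Hypothetical helper: run an ES|QL query and load the result as a
    Pandas DataFrame by asking Elasticsearch for Arrow's streaming format."""
    import pyarrow.ipc  # third-party; imported lazily so the module loads without it

    req = Request(
        es_url + "/_query",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed media type for Arrow's streaming format
            "Accept": "application/vnd.apache.arrow.stream",
        },
    )
    with urlopen(req) as resp:
        # pyarrow wraps the received bytes with zero-copy readers
        reader = pyarrow.ipc.open_stream(resp.read())
        return reader.read_all().to_pandas()
```

The last step is where the density lives: `open_stream` → `read_all` → `to_pandas` takes the raw wire bytes all the way to a DataFrame.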
That's about a third ES|QL and half ceremony. The last three lines are
super dense:
That's a little `import antigravity`. But that's great when it works! And Kibana should be able to integrate with it. Probably not as simply as Pandas but we can get it!
Arrow is kind of a perfect output format for ES|QL in that it's a
batched columnar format. The output from this PR looks like:
Each batch looks like:
Dense data like `int` and `long` and `double` always have a VALIDITY and a DATA buffer. Variable length data like `keyword`s have a VALIDITY, a LENGTH, and a DATA buffer. This format is pretty close to ES|QL's own in-memory representation. It's close. And, the best part is that ES|QL's internal response is also batched. We have batches, and, if we're very lucky, we could integrate this with Pauseable Chunked Responses (#104851). It should integrate perfectly. Should.
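The buffer layout described above can be sketched in pure Python (illustrative only - Arrow's real buffers are built by the library, and what the text calls the LENGTH buffer is stored as cumulative offsets):

```python
import struct


def int_array_buffers(values):
    """Dense data: a VALIDITY bitmap plus a fixed-width DATA buffer."""
    validity = bytearray((len(values) + 7) // 8)
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)  # set bit i: slot is non-null
        data += struct.pack("<i", v if v is not None else 0)  # null slots still occupy space
    return bytes(validity), bytes(data)


def string_array_buffers(values):
    """Variable-length data: VALIDITY, LENGTH (offsets), and DATA buffers."""
    validity = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)
            data += v.encode("utf-8")
        offsets.append(len(data))  # a null repeats the previous offset
    return bytes(validity), struct.pack(f"<{len(offsets)}i", *offsets), bytes(data)
```

For `["a", "bc", None]` this yields a one-byte validity bitmap `0b011`, offsets `0, 1, 3, 3`, and the data bytes `abc` - the same three buffers the batch diagram above describes for keywords.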