Patch around pandas ArrowStringArray pickling #9024
Conversation
The (experimental) pandas `string[pyarrow]` dtype has some major performance benefits that we'd like to experiment with in dask. However, currently `pyarrow.StringArray` objects have a bug in their pickle implementation where a small slice of the array still serializes the full (potentially very large) backing buffers (see https://issues.apache.org/jira/browse/ARROW-10739). Hopefully this is fixed upstream in pyarrow at some point, but for now we patch around it by overriding the pickling implementation for `ArrowStringArray` in pandas. This implementation is efficient, resulting in zero-copy serialization in most cases. There is still more work to do to fully support the `string[pyarrow]` dtype, but I think this PR can go in as is for now.
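A minimal sketch of the technique involved (not the actual dask implementation, and using a hypothetical `SlicedArray` stand-in rather than the real Arrow buffer layout): overriding `__reduce__` so that pickling a small slice serializes only the visible data rather than the full backing buffer.

```python
import pickle


class SlicedArray:
    """Stand-in for an array that is a view over a large shared buffer,
    analogous to the Arrow buffers behind ArrowStringArray."""

    def __init__(self, buffer: bytes, start: int, stop: int):
        self.buffer = buffer  # potentially very large backing buffer
        self.start = start
        self.stop = stop

    def __reduce__(self):
        # Copy out only the visible slice before pickling, so the
        # serialized payload is proportional to the slice length,
        # not to the size of the full backing buffer.
        visible = self.buffer[self.start:self.stop]
        return (SlicedArray, (visible, 0, len(visible)))


# A 10-element view over a ~1 MB buffer pickles to a tiny payload.
view = SlicedArray(b"x" * 1_000_000, 0, 10)
payload = pickle.dumps(view)
assert len(payload) < 1000
```

The real patch registers a custom reduction for pandas' `ArrowStringArray` so that pickle (and therefore dask's serialization machinery) picks it up automatically; the sliced-copy idea is the same.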
cc @jorisvandenbossche - would it be worthwhile to upstream this patch to pandas, or should the real fix come at the pyarrow level? Either way, we'll likely want to keep this in dask for a bit for backwards compatibility with older pandas/pyarrow.
Thanks @jcrist. I didn't give this a detailed review, but from a high level this LGTM. Like you said, there's still more work to do to fully support the `string[pyarrow]` dtype, but this looks like a clear improvement over the current situation.
You've added tests for the new dask.dataframe._pyarrow_compat module, which is great. Is there Dask user-code that this PR now enables? If so, can we add some corresponding tests?
There is not.
Part of #8842.