Skip to content

Patch around pandas ArrowStringArray pickling#9024

Merged
jrbourbeau merged 3 commits intodask:mainfrom
jcrist:patch-pandas-stringarray-pickle
May 5, 2022
Merged

Patch around pandas ArrowStringArray pickling#9024
jrbourbeau merged 3 commits intodask:mainfrom
jcrist:patch-pandas-stringarray-pickle

Conversation

@jcrist
Copy link
Copy Markdown
Member

@jcrist jcrist commented May 4, 2022

The (experimental) pandas string[pyarrow] dtype has some major
performance benefits that we'd like to experiment with in dask. However,
currently pyarrow.StringArray objects have a bug in their pickle
implementation where a small slice of the array still serializes the
full (potentially very large) backing buffers (see
https://issues.apache.org/jira/browse/ARROW-10739). Hopefully this is
fixed upstream in pyarrow at some point, but for now we patch around it
by overriding the pickling implementation for ArrowStringArray in
pandas. This implementation is efficient, resulting in zero-copy
serialization in most cases.

There is still more work to do to fully support the string[pyarrow]
dtype, but I think this PR can go in as is for now.

Part of #8842.

The (experimental) pandas `string[pyarrow]` dtype has some major
performance benefits that we'd like to experiment with in dask. However,
currently `pyarrow.StringArray` objects have a bug in their pickle
implementation where a small slice of the array still serializes the
full (potentially very large) backing buffers (see
https://issues.apache.org/jira/browse/ARROW-10739). Hopefully this is
fixed upstream in pyarrow at some point, but for now we patch around it
by overriding the pickling implementation for `ArrowStringArray` in
pandas. This implementation is efficient, resulting in zero-copy
serialization in most cases.

There is still more work to do to fully support the `string[pyarrow]`
dtype, but I think this PR can go in as is for now.
@jcrist
Copy link
Copy Markdown
Member Author

jcrist commented May 4, 2022

cc @jorisvandenbossche - would it be worthwhile to upstream this patch to pandas, or should the real fix come at the pyarrow level? Either way we'll likely want to keep this in dask for a bit for backwards compatibility with older pandas/pyarrow,

Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jcrist. I didn't give this a detailed review, but from a high level this LGTM. Like you said, there's still more work to do to fully support the string[pyarrow], but this looks like a clear improvement over the current situation

You've added tests for the new dask.dataframe._pyarrow_compat module, which is great. Is there Dask user-code that this PR now enables? If so, can we add some corresponding tests?

@jcrist
Copy link
Copy Markdown
Member Author

jcrist commented May 5, 2022

Is there Dask user-code that this PR now enables? If so, can we add some corresponding tests?

There is not.

Copy link
Copy Markdown
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jcrist

@jrbourbeau jrbourbeau merged commit c6e9cc0 into dask:main May 5, 2022
@jcrist jcrist deleted the patch-pandas-stringarray-pickle branch May 5, 2022 18:25
erayaslan pushed a commit to erayaslan/dask that referenced this pull request May 12, 2022
The (experimental) pandas `string[pyarrow]` dtype has some major
performance benefits that we'd like to experiment with in dask. However,
currently `pyarrow.StringArray` objects have a bug in their pickle
implementation where a small slice of the array still serializes the
full (potentially very large) backing buffers (see
https://issues.apache.org/jira/browse/ARROW-10739). Hopefully this is
fixed upstream in pyarrow at some point, but for now we patch around it
by overriding the pickling implementation for `ArrowStringArray` in
pandas. This implementation is efficient, resulting in zero-copy
serialization in most cases.

There is still more work to do to fully support the `string[pyarrow]`
dtype, but I think this PR can go in as is for now.
@jrbourbeau jrbourbeau mentioned this pull request Jun 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants