Patch around pandas ArrowStringArray pickling #9024
Conversation
The (experimental) pandas `string[pyarrow]` dtype has some major performance benefits that we'd like to experiment with in dask. However, currently `pyarrow.StringArray` objects have a bug in their pickle implementation where a small slice of the array still serializes the full (potentially very large) backing buffers (see https://issues.apache.org/jira/browse/ARROW-10739). Hopefully this is fixed upstream in pyarrow at some point, but for now we patch around it by overriding the pickling implementation for `ArrowStringArray` in pandas. This implementation is efficient, resulting in zero-copy serialization in most cases. There is still more work to do to fully support the `string[pyarrow]` dtype, but I think this PR can go in as is for now.
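A minimal sketch of the technique involved (not the actual dask implementation, and using a hypothetical `SlicedArray` stand-in rather than the real Arrow buffer layout): overriding `__reduce__` so that pickling a small slice serializes only the visible data rather than the full backing buffer.

```python
import pickle


class SlicedArray:
    """Stand-in for an array that is a view over a large shared buffer,
    analogous to the Arrow buffers behind ArrowStringArray."""

    def __init__(self, buffer: bytes, start: int, stop: int):
        self.buffer = buffer  # potentially very large backing buffer
        self.start = start
        self.stop = stop

    def __reduce__(self):
        # Copy out only the visible slice before pickling, so the
        # serialized payload is proportional to the slice length,
        # not to the size of the full backing buffer.
        visible = self.buffer[self.start:self.stop]
        return (SlicedArray, (visible, 0, len(visible)))


# A 10-element view over a ~1 MB buffer pickles to a tiny payload.
view = SlicedArray(b"x" * 1_000_000, 0, 10)
payload = pickle.dumps(view)
assert len(payload) < 1000
```

The real patch registers a custom reduction for pandas' `ArrowStringArray` so that pickle (and therefore dask's serialization machinery) picks it up automatically; the sliced-copy idea is the same.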
cc @jorisvandenbossche - would it be worthwhile to upstream this patch to pandas, or should the real fix come at the pyarrow level? Either way, we'll likely want to keep this in dask for a bit for backwards compatibility with older pandas/pyarrow.
Thanks @jcrist. I didn't give this a detailed review, but from a high level this LGTM. Like you said, there's still more work to do to fully support the `string[pyarrow]` dtype, but this looks like a clear improvement over the current situation.
You've added tests for the new dask.dataframe._pyarrow_compat module, which is great. Is there Dask user-code that this PR now enables? If so, can we add some corresponding tests?
There is not.
Part of #8842.