
Conversation

@robertnishihara
Contributor

Just posting this for discussion. See the preceding discussion on https://issues.apache.org/jira/browse/ARROW-1854.

I think the ideal way to solve this would actually be to improve our handling of lists, which should be possible given that pickle seems to outperform us by 6x according to the benchmarks in https://issues.apache.org/jira/browse/ARROW-1854.

Note that the implementation in this PR will not handle numpy arrays of user-defined classes because it will not fall back to cloudpickle when needed.

cc @pcmoritz @wesm

Member

This could be made even more efficient by converting this to a Buffer so it's sent in a separate sidecar (for zero copy until pickle.loads is called). I will play around with this and see what the performance is like.
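As a sketch of the zero-copy idea being described here (not the Arrow code path, and using a later stdlib mechanism as an analogy): pickle protocol 5 lets a large payload travel as a separate out-of-band sidecar buffer rather than being copied into the pickle byte stream.

```python
import pickle

import numpy as np

# Illustrative only: protocol 5's out-of-band buffers show the same design --
# keep the big payload in a separate sidecar buffer instead of packing the
# bytes inline (requires Python >= 3.8 and a numpy that supports protocol 5).
arr = np.arange(1_000_000, dtype=np.int64)

sidecars = []  # filled with pickle.PickleBuffer objects wrapping arr's memory
header = pickle.dumps(arr, protocol=5, buffer_callback=sidecars.append)

# The in-band header holds only metadata; the 8 MB of data went out-of-band.
assert len(header) < arr.nbytes

# Reassembly references the sidecar buffers, copying only when required.
restored = pickle.loads(header, buffers=sidecars)
assert np.array_equal(restored, arr)
```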

@wesm wesm changed the title [DO NOT MERGE] Use pickle to serialize numpy arrays of objects. ARROW-1854: [Python] Use pickle to serialize numpy arrays of objects. Nov 26, 2017
@wesm
Member

wesm commented Nov 26, 2017

I made some minor tweaks to send the pickle as a buffer rather than packing the bytes into the union:

>>> arr = np.array(['foo', 'bar', None] * 100000, dtype=object)

>>> %timeit serialized = pa.serialize(arr)
4.66 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
4.53 ms ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This seems pretty acceptable to me. Without this patch, on my machine we have:

>>> %timeit serialized = pa.serialize(arr)
24.1 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The savings become more significant when there are repeated Python objects, I presume.

How significant is the non-importable user-defined class issue?

@pcmoritz
Contributor

That's a very nice speedup!

However, I'd vote against merging this for now. It doesn't seem that crucial for numpy arrays of custom objects to be fast, and it is better to stay general. Also, I'd like to fall back to pickle as little as possible so we can transfer data between languages easily in the future.

If somebody runs into this being too slow, of course we should reinvestigate. There are some speedups possible even without falling back to pickle (like getting rid of the temporary copy of the list and using an iterator instead).
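A rough Python-level sketch of the saving being described (the real conversion happens in C++, so this is only an analogy): materializing `list(arr)` allocates a temporary list of object pointers up front, while iterating the array directly avoids that O(n) intermediate allocation.

```python
import sys

import numpy as np

# Hypothetical illustration of "temporary copy" vs "iterator" over an
# object-dtype array.
arr = np.array([str(i) for i in range(100_000)], dtype=object)

tmp = list(arr)                        # temporary copy: ~800 KB of pointers
total_copy = sum(len(s) for s in tmp)

total_iter = sum(len(s) for s in arr)  # no intermediate list

assert total_copy == total_iter
print(sys.getsizeof(tmp))  # cost of the temporary list alone
```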

@wesm
Member

wesm commented Nov 26, 2017

So the context is ARROW-1783 and ARROW-1784. The goal is to be able to send pandas objects over the wire, and these frequently (anytime you have strings, for example) have object arrays internally. So having object arrays going 3-6x slower than pickle is a significant issue.
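To make the pandas connection concrete (illustrative, numpy-only): a string column in a DataFrame of this era is backed by a `dtype=object` ndarray, i.e. an array of pointers to individual Python `str` objects rather than a flat character buffer.

```python
import numpy as np

# Sketch of what a pandas string column looks like internally: every element
# is a full Python object, so serializing it means walking Python objects one
# by one -- the generic path that loses 3-6x to pickle's tuned loop.
col = np.array(['foo', 'bar', None], dtype=object)

assert col.dtype == object
assert isinstance(col[0], str)
assert col[2] is None
```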

@pcmoritz
Contributor

Ok, that is reasonable. Let me see if I can speed this up without falling back to pickle.

@wesm
Member

wesm commented Nov 26, 2017

Probably the bigger issue is that Python objects are not being deduplicated like they are in pickle. I would at least like to have the option of using pickle for these arrays if we cannot get within 20% of the performance of pickle.
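The deduplication pickle gets for free can be shown directly: the pickler memoizes objects by identity, so repeated references to one string are written once plus cheap back-references.

```python
import pickle

# Demonstration of pickle's memoization. `ch` is a variable so the interpreter
# cannot constant-fold `ch * 1000` into a single shared string.
ch = 'x'
shared = ch * 1000
many_refs = [shared] * 100                   # one object, 100 references
distinct = [ch * 1000 for _ in range(100)]   # 100 equal but distinct objects

small = pickle.dumps(many_refs, protocol=pickle.HIGHEST_PROTOCOL)
big = pickle.dumps(distinct, protocol=pickle.HIGHEST_PROTOCOL)

# The deduplicated payload is roughly 1 KB vs roughly 100 KB.
assert len(small) < len(big) // 10
```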

@pcmoritz
Contributor

Yeah, I agree with this! Ideally we would also do the deduplication; this comes up from time to time, see also https://issues.apache.org/jira/browse/ARROW-1382.

@robertnishihara
Contributor Author

@wesm, you're right, the overhead comes from the fact that pyarrow.serialize doesn't handle duplication well. In the case where there is little or no duplication, pyarrow.serialize seems to outperform pickle.dumps. For example, see the following:

import numpy as np
import pickle
import pyarrow as pa

arr = np.array([str(i) for i in range(300000)], dtype=object)

%timeit pickle.dumps(arr)
41.9 ms ± 750 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pa.serialize(arr).to_buffer()
30.9 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Given that the current code works in all cases (as far as I can tell) and is more performant in some cases, I still prefer the current code.

However, if duplication is the common case for dataframes, then your optimization makes sense. In that case, would it be possible to move this code into the custom serializer for pandas dataframes instead of for numpy arrays? Or is that infeasible? If that is infeasible, I'd prefer to enable/disable this optimization with some sort of configuration flag because any reasonable cases that we can't handle will lead to complaints.

Longer term, maybe this will all be solved by handling duplication properly.
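The "enable/disable via a custom serialization context" idea can be sketched in plain Python. This is NOT pyarrow's actual API; the class and method names below are hypothetical, chosen only to show the shape of the design: a cloned context overrides the handler for one type, so the pickle fast path is opted into only where it is wanted.

```python
import json
import pickle

# Minimal sketch of an opt-in serialization context (hypothetical names).
class SerializationContext:
    def __init__(self, handlers=None):
        self._handlers = dict(handlers or {})

    def register(self, type_, serializer, deserializer):
        self._handlers[type_] = (serializer, deserializer)

    def clone(self):
        # A child context starts from the parent's handlers.
        return SerializationContext(self._handlers)

    def serialize(self, obj):
        serializer, _ = self._handlers[type(obj)]
        return type(obj), serializer(obj)

    def deserialize(self, tag, payload):
        _, deserializer = self._handlers[tag]
        return deserializer(payload)

# General-purpose context: a portable, language-neutral encoding.
default_ctx = SerializationContext()
default_ctx.register(list,
                     lambda l: json.dumps(l).encode(),
                     lambda b: json.loads(b))

# Opt-in context: same interface, but lists take the pickle fast path.
fast_ctx = default_ctx.clone()
fast_ctx.register(list, pickle.dumps, pickle.loads)

tag, payload = fast_ctx.serialize([1, 2, 3])
assert fast_ctx.deserialize(tag, payload) == [1, 2, 3]
```

This mirrors how the merged change ended up working: the default stays general, and a separate context opts specific types into pickle.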

@wesm
Member

wesm commented Nov 27, 2017

Do you have some object arrays where this might cause incompatibilities? I am fine with a flag to turn on this behavior; I will have to think about what to call it.

@pcmoritz
Contributor

pcmoritz commented Nov 27, 2017

Yeah, I came to the same conclusion. I'd like to implement deduplication at some point at the serialization level, but it certainly won't happen before the release.

Meanwhile, either a custom serialization context or a flag would be a good solution.

@robertnishihara
Contributor Author

I don't have any example arrays at the moment. However, it feels like the kind of thing that will come up.

A custom serialization context makes sense to me (or having the downstream application register a more performant but less general custom serializer/deserializer).

@wesm In the scenario you're working on, are these numpy arrays of objects only being created by the pandas custom serializers? Or are they coming from somewhere else? If this mostly arises from pandas, handling this in the custom pandas serializers might solve the problem.

@wesm
Member

wesm commented Nov 27, 2017

Right, the idea is that we're serializing the internal NumPy arrays in a pandas DataFrame. The downside is that Dask will have to bother with custom serializers, but it's not such a big deal (since it will need some custom handling of pyarrow.Buffer anyway).

robertnishihara and others added 4 commits November 27, 2017 16:52
…re memory-efficient transport

Change-Id: I751e5a21201a501a508873174a76355d06282089
…xt member that uses pickle for NumPy arrays with unsupported tensor types

Change-Id: Ia70c26954ff9ab3af435281bcafbf298c8c0cf28
Change-Id: Ibb0a48a8569d25ae887c267546c551beb74a8d0f
@wesm
Member

wesm commented Nov 27, 2017

OK, I added a pandas_serialization_context variable, if that sounds OK

@robertnishihara
Contributor Author

Looks good to me!

@pcmoritz
Contributor

+1 LGTM once the test failure is fixed

Change-Id: I45bbd9d67c5193e1d1fdc9a9ca59c61c103e5b65
@wesm wesm closed this in 0d6f5bf Nov 29, 2017
wesm added a commit that referenced this pull request Dec 1, 2017
Author: Wes McKinney <wes.mckinney@twosigma.com>
Author: Robert Nishihara <robertnishihara@gmail.com>

Closes #1360 from robertnishihara/numpyobject and squashes the following commits:

c37a0a0 [Wes McKinney] Fix flake
5191503 [Wes McKinney] Fix post rebase
43f2c80 [Wes McKinney] Add SerializationContext.clone method. Add pandas_serialization_context member that uses pickle for NumPy arrays with unsupported tensor types
c944023 [Wes McKinney] Use pickle.HIGHEST_PROTOCOL, convert to Buffer then memoryview for more memory-efficient transport
cf719c3 [Robert Nishihara] Use pickle to serialize numpy arrays of objects.
@robertnishihara robertnishihara deleted the numpyobject branch February 8, 2018 07:51
