
Conversation

@robertnishihara
Contributor

Just posting this for discussion. See the preceding discussion on https://issues.apache.org/jira/browse/ARROW-1854.

I think the ideal way to solve this would actually be to improve our handling of lists, which should be possible given that pickle seems to outperform us by 6x according to the benchmarks in https://issues.apache.org/jira/browse/ARROW-1854.

Note that the implementation in this PR will not handle numpy arrays of user-defined classes because it will not fall back to cloudpickle when needed.

cc @pcmoritz @wesm

Member

This could be made even more efficient by converting this to a Buffer so it's sent in a separate sidecar (for zero copy until pickle.loads is called). I will play around with this and see what the performance is like.
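As a sketch of the zero-copy idea being described here (not the Arrow code path, and using a later stdlib mechanism as an analogy): pickle protocol 5 lets a large payload travel as a separate out-of-band sidecar buffer rather than being copied into the pickle byte stream.

```python
import pickle

import numpy as np

# Illustrative only: protocol 5's out-of-band buffers show the same design --
# keep the big payload in a separate sidecar buffer instead of packing the
# bytes inline (requires Python >= 3.8 and a numpy that supports protocol 5).
arr = np.arange(1_000_000, dtype=np.int64)

sidecars = []  # filled with pickle.PickleBuffer objects wrapping arr's memory
header = pickle.dumps(arr, protocol=5, buffer_callback=sidecars.append)

# The in-band header holds only metadata; the 8 MB of data went out-of-band.
assert len(header) < arr.nbytes

# Reassembly references the sidecar buffers, copying only when required.
restored = pickle.loads(header, buffers=sidecars)
assert np.array_equal(restored, arr)
```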

@wesm wesm changed the title [DO NOT MERGE] Use pickle to serialize numpy arrays of objects. ARROW-1854: [Python] Use pickle to serialize numpy arrays of objects. Nov 26, 2017
@wesm
Member

wesm commented Nov 26, 2017

I made some minor tweaks to send the pickle as a buffer rather than packing the bytes into the union:

>>> arr = np.array(['foo', 'bar', None] * 100000, dtype=object)

>>> %timeit serialized = pa.serialize(arr)
4.66 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
4.53 ms ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This seems pretty acceptable to me. Without this patch, on my machine we have:

>>> %timeit serialized = pa.serialize(arr)
24.1 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The savings become more significant when there are repeated Python objects, I presume.

How significant is the non-importable user-defined class issue?

@pcmoritz
Contributor

That's a very nice speedup!

However, I'd vote against merging this for now. It doesn't seem that crucial for numpy arrays of custom objects to be fast, and it is better to stay general. Also, I'd like to fall back to pickle as little as possible so we can transfer data between languages easily in the future.

If somebody runs into this being too slow, of course we should reinvestigate. There are some speedups possible even without falling back to pickle (like getting rid of the temporary copy of the list and using an iterator instead).
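A rough Python-level sketch of the saving being described (the real conversion happens in C++, so this is only an analogy): materializing `list(arr)` allocates a temporary list of object pointers up front, while iterating the array directly avoids that O(n) intermediate allocation.

```python
import sys

import numpy as np

# Hypothetical illustration of "temporary copy" vs "iterator" over an
# object-dtype array.
arr = np.array([str(i) for i in range(100_000)], dtype=object)

tmp = list(arr)                        # temporary copy: ~800 KB of pointers
total_copy = sum(len(s) for s in tmp)

total_iter = sum(len(s) for s in arr)  # no intermediate list

assert total_copy == total_iter
print(sys.getsizeof(tmp))  # cost of the temporary list alone
```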

@wesm
Member

wesm commented Nov 26, 2017

So the context is ARROW-1783 and ARROW-1784. The goal is to be able to send pandas objects over the wire, and these frequently (anytime you have strings, for example) have object arrays internally. So having object arrays going 3-6x slower than pickle is a significant issue.
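To make the pandas connection concrete (illustrative, numpy-only): a string column in a DataFrame of this era is backed by a `dtype=object` ndarray, i.e. an array of pointers to individual Python `str` objects rather than a flat character buffer.

```python
import numpy as np

# Sketch of what a pandas string column looks like internally: every element
# is a full Python object, so serializing it means walking Python objects one
# by one -- the generic path that loses 3-6x to pickle's tuned loop.
col = np.array(['foo', 'bar', None], dtype=object)

assert col.dtype == object
assert isinstance(col[0], str)
assert col[2] is None
```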

@pcmoritz
Contributor

Ok, that is reasonable. Let me see if I can speed this up without falling back to pickle.

@wesm
Member

wesm commented Nov 26, 2017

Probably the bigger issue is that Python objects are not being deduplicated like they are in pickle. I would at least like to have the option of using pickle for these arrays if we cannot get within 20% of the performance of pickle.
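The deduplication pickle gets for free can be shown directly: the pickler memoizes objects by identity, so repeated references to one string are written once plus cheap back-references.

```python
import pickle

# Demonstration of pickle's memoization. `ch` is a variable so the interpreter
# cannot constant-fold `ch * 1000` into a single shared string.
ch = 'x'
shared = ch * 1000
many_refs = [shared] * 100                   # one object, 100 references
distinct = [ch * 1000 for _ in range(100)]   # 100 equal but distinct objects

small = pickle.dumps(many_refs, protocol=pickle.HIGHEST_PROTOCOL)
big = pickle.dumps(distinct, protocol=pickle.HIGHEST_PROTOCOL)

# The deduplicated payload is roughly 1 KB vs roughly 100 KB.
assert len(small) < len(big) // 10
```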

@pcmoritz
Contributor

Yeah, I agree with this! Ideally we would also do the deduplication; this comes up from time to time, see also https://issues.apache.org/jira/browse/ARROW-1382.

@robertnishihara
Contributor Author

@wesm, you're right, the overhead comes from the fact that pyarrow.serialize doesn't handle duplication well. In the case where there is little or no duplication, pyarrow.serialize seems to outperform pickle.dumps. For example, see the following:

import numpy as np
import pickle
import pyarrow as pa

arr = np.array([str(i) for i in range(300000)], dtype=object)

%timeit pickle.dumps(arr)
41.9 ms ± 750 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pa.serialize(arr).to_buffer()
30.9 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Given that the current code works in all cases (as far as I can tell) and is more performant in some cases, I still prefer the current code.

However, if duplication is the common case for dataframes, then your optimization makes sense. In that case, would it be possible to move this code into the custom serializer for pandas dataframes instead of for numpy arrays? Or is that infeasible? If that is infeasible, I'd prefer to enable/disable this optimization with some sort of configuration flag because any reasonable cases that we can't handle will lead to complaints.

Longer term, maybe this will all be solved by handling duplication properly.
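The "enable/disable via a custom serialization context" idea can be sketched in plain Python. This is NOT pyarrow's actual API; the class and method names below are hypothetical, chosen only to show the shape of the design: a cloned context overrides the handler for one type, so the pickle fast path is opted into only where it is wanted.

```python
import json
import pickle

# Minimal sketch of an opt-in serialization context (hypothetical names).
class SerializationContext:
    def __init__(self, handlers=None):
        self._handlers = dict(handlers or {})

    def register(self, type_, serializer, deserializer):
        self._handlers[type_] = (serializer, deserializer)

    def clone(self):
        # A child context starts from the parent's handlers.
        return SerializationContext(self._handlers)

    def serialize(self, obj):
        serializer, _ = self._handlers[type(obj)]
        return type(obj), serializer(obj)

    def deserialize(self, tag, payload):
        _, deserializer = self._handlers[tag]
        return deserializer(payload)

# General-purpose context: a portable, language-neutral encoding.
default_ctx = SerializationContext()
default_ctx.register(list,
                     lambda l: json.dumps(l).encode(),
                     lambda b: json.loads(b))

# Opt-in context: same interface, but lists take the pickle fast path.
fast_ctx = default_ctx.clone()
fast_ctx.register(list, pickle.dumps, pickle.loads)

tag, payload = fast_ctx.serialize([1, 2, 3])
assert fast_ctx.deserialize(tag, payload) == [1, 2, 3]
```

This mirrors how the merged change ended up working: the default stays general, and a separate context opts specific types into pickle.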

@wesm
Member

wesm commented Nov 27, 2017

Do you have some object arrays where this might cause incompatibilities? I am fine with a flag to turn on this behavior; I will have to think about what to call it.

@pcmoritz
Contributor

pcmoritz commented Nov 27, 2017

Yeah, I came to the same conclusion. I'd like to implement deduplication at some point at the serialization level, but it certainly won't happen before the release.

Meanwhile, either a custom serialization context or a flag would be a good solution.

@robertnishihara
Contributor Author

I don't have any example arrays at the moment. However, it feels like the kind of thing that will come up.

A custom serialization context makes sense to me (or having the downstream application register a more performant but less general custom serializer/deserializer).

@wesm In the scenario you're working on, are these numpy arrays of objects only being created by the pandas custom serializers? Or are they coming from somewhere else? If this mostly arises from pandas, handling this in the custom pandas serializers might solve the problem.

@wesm
Member

wesm commented Nov 27, 2017

Right, the idea is that we're serializing the internal NumPy arrays in a pandas DataFrame. The downside is that Dask will have to bother with custom serializers, but it's not such a big deal (since it will need some custom handling of pyarrow.Buffer anyway).

robertnishihara and others added 4 commits November 27, 2017 16:52
…re memory-efficient transport

Change-Id: I751e5a21201a501a508873174a76355d06282089
…xt member that uses pickle for NumPy arrays with unsupported tensor types

Change-Id: Ia70c26954ff9ab3af435281bcafbf298c8c0cf28
Change-Id: Ibb0a48a8569d25ae887c267546c551beb74a8d0f
@wesm
Member

wesm commented Nov 27, 2017

OK, I added a pandas_serialization_context variable, if that sounds OK

@robertnishihara
Contributor Author

Looks good to me!

@pcmoritz
Contributor

+1 LGTM once the test failure is fixed

Change-Id: I45bbd9d67c5193e1d1fdc9a9ca59c61c103e5b65
@wesm wesm closed this in 0d6f5bf Nov 29, 2017
wesm added a commit that referenced this pull request Dec 1, 2017
Author: Wes McKinney <wes.mckinney@twosigma.com>
Author: Robert Nishihara <robertnishihara@gmail.com>

Closes #1360 from robertnishihara/numpyobject and squashes the following commits:

c37a0a0 [Wes McKinney] Fix flake
5191503 [Wes McKinney] Fix post rebase
43f2c80 [Wes McKinney] Add SerializationContext.clone method. Add pandas_serialization_context member that uses pickle for NumPy arrays with unsupported tensor types
c944023 [Wes McKinney] Use pickle.HIGHEST_PROTOCOL, convert to Buffer then memoryview for more memory-efficient transport
cf719c3 [Robert Nishihara] Use pickle to serialize numpy arrays of objects.
@robertnishihara robertnishihara deleted the numpyobject branch February 8, 2018 07:51
