ARROW-1854: [Python] Use pickle to serialize numpy arrays of objects. #1360
Conversation
python/pyarrow/serialization.py
This could be made even more efficient by converting this to a Buffer so it's sent in a separate sidecar (for zero copy until pickle.loads is called). I will play around with this and see what the performance is like.
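For concreteness, here is a minimal sketch of that idea, under assumed helper names (the function names are hypothetical, not the ones in this PR); the point is that `pa.py_buffer` wraps the pickled bytes without copying them, and nothing is materialized until `pickle.loads` runs on the receiving side:

```python
import pickle
import pyarrow as pa

def serialize_object_array(arr):
    # Hypothetical helper: wrap the pickle payload in an Arrow buffer so it
    # travels as a separate sidecar instead of being packed into the union.
    # pa.py_buffer references the bytes object without copying it.
    return pa.py_buffer(pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL))

def deserialize_object_array(buf):
    # memoryview(buf) is zero-copy; the Python objects are only
    # materialized when pickle.loads runs.
    return pickle.loads(memoryview(buf))
```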
I made some minor tweaks to send the pickle as a buffer rather than packing the bytes into the union: […] This seems pretty acceptable to me. Without this patch we have on my machine: […] The savings become more significant when there are repeated Python objects, I presume. How significant is the non-importable user-defined class issue?
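To make the repeated-objects case concrete, a hedged benchmark sketch (no claims about the exact numbers; `pa.serialize` is the serialization API from this era of pyarrow):

```python
import pickle
import timeit

import numpy as np
import pyarrow as pa

# Ten distinct string objects, each referenced 30000 times. pickle's memo
# table serializes each distinct object once and emits back-references
# for the repeats.
vals = [str(i) * 10 for i in range(10)]
arr = np.array([vals[i % 10] for i in range(300000)], dtype=object)

print(timeit.timeit(lambda: pickle.dumps(arr, pickle.HIGHEST_PROTOCOL), number=10))
print(timeit.timeit(lambda: pa.serialize(arr).to_buffer(), number=10))
```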
That's a very nice speedup! However, I'd vote against merging this for now. It doesn't seem crucial for numpy arrays of custom objects to be fast, and it is better to stay general. I'd also like to fall back to pickle as little as possible so we can transfer data between languages easily in the future. If somebody runs into this being too slow, we should of course reinvestigate. Some speedups are possible even without falling back to pickle (like getting rid of the temporary copy of the list and using an iterator instead).
So the context is ARROW-1783 and ARROW-1784. The goal is to be able to send pandas objects over the wire, and these frequently have object arrays internally (any time you have strings, for example). So having object arrays be 3-6x slower than pickle is a significant issue.
OK, that is reasonable. Let me see if I can speed this up without falling back to pickle.
Probably the bigger issue is that Python objects are not being deduplicated like they are in pickle. I would at least like to have the option of using pickle for these arrays if we cannot get within 20% of the performance of pickle.
Yeah, I agree with this! Ideally we would also do the deduplication; this comes up from time to time, see also https://issues.apache.org/jira/browse/ARROW-1382.
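As a small illustration of the deduplication being discussed (plain stdlib, not Arrow-specific): pickle memoizes by object identity, so repeated references to one object cost almost nothing, while distinct-but-equal objects are each serialized in full:

```python
import pickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
repeated = [p] * 100000                          # one object, many references
distinct = [Point(1, 2) for _ in range(100000)]  # many equal objects

print(len(pickle.dumps(repeated)))  # small: one Point plus memo references
print(len(pickle.dumps(distinct)))  # much larger: every Point serialized
```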
@wesm, you're right, the overhead comes from the fact that […]:

```python
import numpy as np
import pickle
import pyarrow as pa

arr = np.array([str(i) for i in range(300000)], dtype=object)

%timeit pickle.dumps(arr)
41.9 ms ± 750 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pa.serialize(arr).to_buffer()
30.9 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Given that the current code works in all cases (as far as I can tell) and is more performant in some cases, I still prefer the current code. However, if duplication is the common case for dataframes, then your optimization makes sense. In that case, would it be possible to move this code into the custom serializer for pandas dataframes instead of for numpy arrays? Or is that infeasible? If that is infeasible, I'd prefer to enable/disable this optimization with some sort of configuration flag, because any reasonable cases that we can't handle will lead to complaints. Longer term, maybe this will all be solved by handling duplication properly.
Do you have some object arrays where this might cause incompatibilities? I am fine with a flag to turn on this behavior; I will have to think about what to call it.
Yeah, I came to the same conclusion. I'd like to implement deduplication at the serialization level at some point, but it certainly won't happen before the release. In the meantime, a custom serialization context or a flag are both good solutions.
I don't have any example arrays at the moment. However, it feels like the kind of thing that will come up. A custom serialization context makes sense to me (or having the downstream application register a more performant but less general custom serializer/deserializer). @wesm In the scenario you're working on, are these numpy arrays of objects only being created by the pandas custom serializers? Or are they coming from somewhere else? If this mostly arises from pandas, handling it in the custom pandas serializers might solve the problem.
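A sketch of the "downstream application registers its own serializer" option, using the SerializationContext API from this era of pyarrow (since removed); cloudpickle handles classes that plain pickle could not re-import on the receiving side:

```python
import cloudpickle
import pyarrow as pa

class MyClass:
    def __init__(self, payload):
        self.payload = payload

context = pa.SerializationContext()
# Register a less general but faster/workable fallback for a user-defined
# class: cloudpickle serializes the class definition itself, so the
# receiving side does not need to be able to import it.
context.register_type(
    MyClass, 'MyClass',
    custom_serializer=cloudpickle.dumps,
    custom_deserializer=cloudpickle.loads)

buf = pa.serialize(MyClass(42), context=context).to_buffer()
obj = pa.deserialize(buf, context=context)
```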
Right, the idea is that we're serializing the internal NumPy arrays in a pandas DataFrame. So the downside is that Dask will have to bother with custom serializers, but it's not such a big deal (since it will need to feature some custom handling of […]).
Use pickle.HIGHEST_PROTOCOL, convert to Buffer then memoryview for more memory-efficient transport. Change-Id: I751e5a21201a501a508873174a76355d06282089
Add SerializationContext.clone method. Add pandas_serialization_context member that uses pickle for NumPy arrays with unsupported tensor types. Change-Id: Ia70c26954ff9ab3af435281bcafbf298c8c0cf28
Change-Id: Ibb0a48a8569d25ae887c267546c551beb74a8d0f
OK, I added a pandas_serialization_context member (via a new SerializationContext.clone method) that uses pickle for NumPy arrays with unsupported tensor types.
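Usage would presumably look something like the following (hedged: the exact spelling of the pandas_serialization_context member comes from the commit message, and pa.serialize/pa.deserialize are the APIs of this era):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'strings': ['a', 'b', 'c'] * 100000})

# Under this context, object-dtype arrays with no Arrow tensor
# representation fall back to pickle instead of the generic list path.
context = pa.pandas_serialization_context
buf = pa.serialize(df, context=context).to_buffer()
df2 = pa.deserialize(buf, context=context)
```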
Looks good to me!
+1, LGTM once the test failure is fixed.
**Just posting this for discussion.** See the preceding discussion on https://issues.apache.org/jira/browse/ARROW-1854.

I think the ideal way to solve this would actually be to improve our handling of lists, which should be possible given that pickle seems to outperform us by 6x according to the benchmarks in https://issues.apache.org/jira/browse/ARROW-1854.

Note that the implementation in this PR will not handle numpy arrays of user-defined classes because it will not fall back to cloudpickle when needed.

cc @pcmoritz @wesm

Author: Wes McKinney <wes.mckinney@twosigma.com>
Author: Robert Nishihara <robertnishihara@gmail.com>

Closes #1360 from robertnishihara/numpyobject and squashes the following commits:

c37a0a0 [Wes McKinney] Fix flake
5191503 [Wes McKinney] Fix post rebase
43f2c80 [Wes McKinney] Add SerializationContext.clone method. Add pandas_serialization_context member that uses pickle for NumPy arrays with unsupported tensor types
c944023 [Wes McKinney] Use pickle.HIGHEST_PROTOCOL, convert to Buffer then memoryview for more memory-efficient transport
cf719c3 [Robert Nishihara] Use pickle to serialize numpy arrays of objects.