[SPARK-18161] [Python] Allow pickle to serialize >4 GB objects when possible (Python 3.4+) #15670
Conversation
cc @davies
python/pyspark/broadcast.py (Outdated)
```diff
 from tempfile import NamedTemporaryFile

 from pyspark.cloudpickle import print_exec
+from pyspark.broadcast import protocol
```
broadcast -> serializers
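In other words, assuming `protocol` ends up defined in pyspark/serializers.py as this PR proposes, the import in broadcast.py would presumably become:

```python
from pyspark.serializers import protocol
```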
```diff
 else:
     import pickle
-    protocol = 3
+    protocol = min(pickle.HIGHEST_PROTOCOL, 4)
```
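For context, a minimal sketch of the version-dependent protocol selection being discussed here (written as it would presumably sit next to the existing `sys.version` check in pyspark/serializers.py; an illustration of the idea, not the exact diff):

```python
import sys

if sys.version < '3':
    import cPickle as pickle
    protocol = 2
else:
    import pickle
    # Protocol 4 (Python 3.4+) adds 8-byte length opcodes, so single
    # objects larger than 4 GiB can be pickled. min() falls back to
    # protocol 3 on Python 3.0-3.3 and caps newer interpreters at 4,
    # which Pyrolite can still read.
    protocol = min(pickle.HIGHEST_PROTOCOL, 4)
```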
Should we only change the protocol for Broadcast? Otherwise, we should also make sure that Pyrolite can support protocol 4, which is used by Spark SQL.
I had only identified this as an issue for Broadcast and the serializer in the SparkContext (I wasn't using Spark SQL in my pyspark application when I came across this issue).
It looks like the highest supported version of Pickle (for writing) is Protocol 2 in Pyrolite (https://github.com/irmen/Pyrolite/blob/master/java/src/main/java/net/razorvine/pickle/Pickler.java#L36)
From the project README:
Pickle protocol version support: reading: 0,1,2,3,4; writing: 2.
Pyrolite can read all pickle protocol versions (0 to 4, so this includes
the latest additions made in Python 3.4).
Pyrolite always writes pickles in protocol version 2. There are no plans on
including protocol version 1 support. Protocols 3 and 4 contain some nice new
features which may eventually be utilized, but for now, only version 2 is used.
It sounds like this part of the change would be ok then since Pyrolite can read v4 just fine.
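For reference, the underlying limitation can be reproduced with the standard pickle module alone (the 5 GiB payload below is illustrative and needs correspondingly large memory, so treat this as a sketch rather than a quick smoke test):

```python
import pickle

big = bytes(5 * 1024 ** 3)  # a single zero-filled 5 GiB bytes object

try:
    # Protocols <= 3 store the length of a bytes object in a 4-byte
    # field, so anything over 4 GiB cannot be written.
    pickle.dumps(big, protocol=3)
except Exception as e:
    print("protocol 3 failed:", e)

# Protocol 4 adds the 8-byte BINBYTES8 opcode, so this succeeds.
data = pickle.dumps(big, protocol=4)
print("protocol 4 produced %d bytes" % len(data))
```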
Do we plan to move this forward? Do you have the time to update this, @singularperturbation?
Would you be ok with someone taking over this PR if you're busy?
I'm also interested in pushing this PR forward. Is there anything else that needs to be done to merge it besides rebasing and resolving conflicts?
So yeah, if anyone has the spoons to take this over, it should just be fixing the merge conflicts. Feel free to ping me on the PR.
I'm still interested in merging this PR. I made a fork from the PR author's repo, pulled the current Spark master branch, and rebased the PR author's branch onto it. Now I have a branch with all the fixes needed and no conflicts: https://github.com/inpefess/spark/tree/pickle_protocol_4 @holdenk, what should we do now?
Up next, make a PR to the Apache Spark repo with a title following the same format as this one, and then you can @-mention me for the review.
Jenkins ok to test. @inpefess, if you can update this PR to master, now is a great time to get this in, since the next release after 2.4 is going to be 3.0, so it's easier to change formats and such.
Test build #98097 has finished for PR 15670 at commit
#20691 is taking this over. Closing this.
What changes were proposed in this pull request?
Since pickle protocol 4 (available in Python 3.4+) allows serialization of objects larger than 4 GB, use pickle protocol 4 when available in the Broadcast and PickleSerializer classes.
How was this patch tested?
This was tested through the pyspark shell by serializing a large numpy array with and without the changes.
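As a rough sketch of what such a check could look like in the pyspark shell (`sc` is the shell's SparkContext; the array size here is made up for illustration and needs a driver with enough memory):

```python
import numpy as np

# Roughly 5.6 GB of float64 data; without protocol 4 the broadcast's
# pickling step fails once the payload exceeds 4 GiB.
arr = np.zeros(700 * 1000 * 1000, dtype=np.float64)

b = sc.broadcast(arr)  # exercises the Broadcast dump/pickle path
print(sc.parallelize(range(3)).map(lambda i: float(b.value[i])).collect())
```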
Author: Sloane Simmons
This contribution is my original work and I license the work to the project under the project's open source license.