Skip to content

Conversation

@singularperturbation
Copy link
Contributor

What changes were proposed in this pull request?

Since pickle protocol 4 (available in Python 3.4+) allows for serialization of > 4 GB errors, use pickle protocol 4 when available in the Broadcast and PickleSerializer classes.

How was this patch tested?

This was tested through pyspark shell and serializing a large numpy array with / without the changes.

Author: Sloane Simmons

This contribution is my original work and I license the work to the project under the project's open source license.

@singularperturbation singularperturbation changed the title [SPARK-18161] [Python] Allow use of pickle to serialize > 4 GB objects where available [SPARK-18161] [Python] Allow use of pickle to serialize >4 GB objects where available Oct 28, 2016
@singularperturbation singularperturbation changed the title [SPARK-18161] [Python] Allow use of pickle to serialize >4 GB objects where available [SPARK-18161] [Python] Allow pickle to serialize >4 GB objects when possible (Python 3.4+) Oct 28, 2016
@rxin
Copy link
Contributor

rxin commented Oct 31, 2016

cc @davies

from tempfile import NamedTemporaryFile

from pyspark.cloudpickle import print_exec
from pyspark.broadcast import protocol
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

broadcast -> serializers

else:
import pickle
protocol = 3
protocol = min(pickle.HIGHEST_PROTOCOL, 4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we only change the protocol for Broadcast? Otherwise we should also make sure that the Pyrolite can support protocol 4, which is used by Spark SQL.

Copy link
Contributor Author

@singularperturbation singularperturbation Oct 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had only identified this as an issue for Broadcast and the serializer in the SparkContext (I wasn't using Spark SQL in my pyspark application when I came across this issue).

It looks like the highest supported version of Pickle (for writing) is Protocol 2 in Pyrolite (https://github.com/irmen/Pyrolite/blob/master/java/src/main/java/net/razorvine/pickle/Pickler.java#L36)
From the project README:

Pickle protocol version support: reading: 0,1,2,3,4; writing: 2.
Pyrolite can read all pickle protocol versions  (0 to 4, so this includes
the latest additions made in Python 3.4).
Pyrolite always writes pickles in protocol version 2. There are no plans on 
including protocol version 1 support. Protocols 3 and 4 contain some nice new
features which may eventually be utilized, but for now, only version 2 is used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like this part of the change would be ok then since Pyrolite can read v4 just fine.

@holdenk
Copy link
Contributor

holdenk commented Oct 10, 2017

Do we plan to move this forward? Do you have the time to update this @singularperturbation ?

@holdenk
Copy link
Contributor

holdenk commented Nov 18, 2017

Would you be ok with someone taking over this PR if your busy?

@inpefess
Copy link

I'm also interested in pushing this PR forward. Is there anything else that is needed to be done to merge it except rebasing it and resolving conflicts?

@holdenk
Copy link
Contributor

holdenk commented Feb 28, 2018

So yeah if anyone has the spoons to take this over it should just be fixing the merge conflicts. Feel free to ping me on the PR.

@inpefess
Copy link

inpefess commented Feb 28, 2018

I'm still interested in merging this PR. I made a fork from the PR author's repo, pulled current Spark master branch and rebased the PR author's branch onto it. Now I have a branch with all fixed needed and no conflicts: https://github.com/inpefess/spark/tree/pickle_protocol_4

@holdenk, what should we do now?

@holdenk
Copy link
Contributor

holdenk commented Feb 28, 2018

Up next make a PR to the Apache spark repo with a title following the same format as this one and then you can at mention me for the review.

@inpefess
Copy link

@holdenk done: #20691

@holdenk
Copy link
Contributor

holdenk commented Oct 26, 2018

Jenkins ok to test. @inpefess if you can update this PR to master now is a great time to get this in since the next release after 2.4 is going to be 3 so it's easier to change formats and stuff.

@SparkQA
Copy link

SparkQA commented Oct 26, 2018

Test build #98097 has finished for PR 15670 at commit ce7893a.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

#20691 is being taking over. Closing this.

@singularperturbation singularperturbation deleted the pickle_protocol_4 branch February 9, 2019 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants