
I have been using pickle and was very happy, then I saw this article: Don't Pickle Your Data

Reading further, it seems the main drawbacks are:

  • Pickle is slower than JSON
  • Pickle is not secure against erroneous or maliciously constructed data
  • Pickle isn't human-readable, and it's Python-specific

I’ve switched to saving my data as JSON, but I wanted to know about best practice:

Given all these issues, when would you ever use pickle? What specific situations call for using it?

5 Comments
  • BTW, there are formats that are far more human-readable than JSON, and arguably easier to edit too. Both good old INI files and YAML come to mind. It's certainly better than an opaque binary stream, but human readability isn't a binary thing. Commented Feb 13, 2014 at 11:21
  • First downside I see for saving objects as JSON: you have to create your own serializers, and that takes some time. Plus, the speed of your JSON serialization might, in the end, be slower than a simple pickle. Though I agree on the security downside. Another point: why do you want to store an object and let it be editable? Couldn't that be unsafe? Commented Feb 13, 2014 at 12:11
  • Why use a hammer when you have a screwdriver? Why use a screwdriver when you have a hammer? It's all about choosing the right tool for the job at hand. Commented Feb 13, 2014 at 12:41
  • This is basically the same post as: stackoverflow.com/questions/8968884/…. If you are concerned about security, don't rely on pickle or JSON. Use a stronger authentication service -- something with an encryption key. Commented Feb 20, 2014 at 12:26
  • Given the extra work that pickle has to do (in comparison to e.g. oversimplified formats such as JSON) to make sure references to objects that are already represented are found, it is not slow at all. Whereas json throws an error on even extremely simple things like import json; d = [1]; d.append(d); json.dumps(d) Commented Feb 14, 2017 at 13:04

5 Answers


Pickle is unsafe because it constructs arbitrary Python objects by invoking arbitrary functions. However, this also gives it the power to serialize almost any Python object, without any boilerplate or even white-/black-listing (in the common case). That's very desirable for some use cases:
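The "invoking arbitrary functions" part can be seen in a tiny sketch (the Innocuous class is a made-up example): whatever call __reduce__ returns is executed during unpickling, in place of rebuilding the object.

```python
import pickle

class Innocuous:
    """Looks harmless, but controls its own unpickling."""
    def __reduce__(self):
        # pickle stores "call eval('40 + 2')" instead of the object's state
        return (eval, ("40 + 2",))

data = pickle.dumps(Innocuous())
result = pickle.loads(data)   # runs eval(...), not an Innocuous constructor
print(result)                 # → 42, not an Innocuous instance
```

Swap eval for something like os.system and the danger of loading untrusted pickles becomes obvious.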

  • Quick & easy serialization, for example for pausing and resuming a long-running but simple script. None of the concerns matter here, you just want to dump the program's state as-is and load it later.
  • Sending arbitrary Python data to other processes or computers, as in multiprocessing. The security concerns may apply (but mostly don't), the generality is absolutely necessary, and humans won't have to read it.
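The first bullet, pausing and resuming a script, is about as short as serialization gets; the checkpoint file name and the state contents here are made up for illustration:

```python
import os
import pickle
import tempfile

# Hypothetical state of a long-running script: any picklable objects will do
state = {"step": 41, "results": [3.14, 2.72], "seen": {"a", "b"}}

path = os.path.join(tempfile.gettempdir(), "checkpoint.pkl")
with open(path, "wb") as f:
    pickle.dump(state, f)       # dump the program's state as-is...

with open(path, "rb") as f:
    restored = pickle.load(f)   # ...and load it later to resume

print(restored == state)        # → True
```

Note the set in there: JSON would already need custom handling for it, pickle does not.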

In other cases, none of the drawbacks is quite enough to justify the work of mapping your stuff to JSON or another restrictive data model. Maybe you don't expect to need human readability/safety/cross-language compatibility or maybe you can do without. Remember, You Ain't Gonna Need It. Using JSON would be the right thing™ but right doesn't always equal good.

You'll notice that I completely ignored the "slow" downside. That's because it's partially misleading: Pickle is indeed slower for data that fits the JSON model (strings, numbers, arrays, maps) perfectly, but if your data's like that you should use JSON for other reasons anyway. If your data isn't like that (very likely), you also need to take into account the custom code you'll need to turn your objects into JSON data, and the custom code you'll need to turn JSON data back into your objects. It adds both engineering effort and run-time overhead, which must be quantified on a case-by-case basis.
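As a rough sketch of that mapping cost (the Point class and both helper functions are hypothetical, and a real codebase would need one such pair per type):

```python
import json

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# Custom code in: each class needs its own mapping to JSON-safe types
def point_to_json(p):
    return json.dumps({"x": p.x, "y": p.y})

# ...and custom code out: rebuild the object from plain dicts
def point_from_json(s):
    d = json.loads(s)
    return Point(d["x"], d["y"])

p2 = point_from_json(point_to_json(Point(1, 2)))
print(p2.x, p2.y)  # → 1 2
```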


3 Comments

Thanks for a great answer. Good to know what is the right thing™ even if it does not always == good
In multiprocessing, and in spark too. When working with RDDs, spark will serialize your user defined functions (passed to map, flatmap) using pickle because it can serialize almost any python object.
For us, clueless newbies, unaware of what "The Right Thing™" is about, ell.stackexchange.com/questions/17108/…

Pickle has the advantage of convenience -- it can serialize arbitrary object graphs with no extra work, and works on a pretty broad range of Python types. With that said, it would be unusual for me to use Pickle in new code. JSON is just a lot cleaner to work with.
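For a concrete sense of "arbitrary object graphs with no extra work": a self-referential list (the same example as in the question's comments) round-trips through pickle but is rejected by json.

```python
import json
import pickle

# A self-referential object graph: pickle's memo handles the cycle
d = [1]
d.append(d)

restored = pickle.loads(pickle.dumps(d))
print(restored[1] is restored)     # → True: the cycle survives the round trip

try:
    json.dumps(d)
except ValueError as e:
    print("json:", e)              # circular reference error
```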

Comments


I usually use neither pickle nor JSON, but MessagePack: it is both safe and fast, and produces serialized data of small size.

An additional advantage is the possibility of exchanging data with software written in other languages (which of course is also true of JSON).

3 Comments

JSON's biggest advantage IMHO is that it is both concise (unlike XML) and human-readable (unlike MessagePack). I'm not sure that the size saved by MessagePack is significant enough to negate those two benefits.
It isn't so much the size savings with MessagePack, but that you can encode things that JSON doesn't do well, like binary data.
MessagePack can't serialize sets, what a shame

I have tried several methods and found that using cPickle with the protocol argument of the dumps method set to the highest protocol, i.e. cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL), is the fastest dump method.

import msgpack
import json
import pickle
import timeit
import cPickle
import numpy as np

num_tests = 10

obj = np.random.normal(0.5, 1, [240, 320, 3])

command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle:  %f seconds" % result)

command = 'cPickle.dumps(obj)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle:   %f seconds" % result)


command = 'cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle highest:   %f seconds" % result)

command = 'json.dumps(obj.tolist())'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json:   %f seconds" % result)


command = 'msgpack.packb(obj.tolist())'
setup = 'from __main__ import msgpack, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("msgpack:   %f seconds" % result)

Output:

pickle         :   0.847938 seconds
cPickle        :   0.810384 seconds
cPickle highest:   0.004283 seconds
json           :   1.769215 seconds
msgpack        :   0.270886 seconds

So, I prefer cPickle with the highest dumping protocol in situations that require real time performance such as video streaming from a camera to a server.
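Note that cPickle exists only in Python 2; in Python 3 the pickle module uses the C implementation automatically, so a rough modern equivalent of this benchmark (stdlib only, with a plain list standing in for the numpy array, and timings that will of course vary by machine) is:

```python
import pickle
import timeit

obj = [float(i) for i in range(100_000)]   # stand-in for the numpy array above

t_default = timeit.timeit(lambda: pickle.dumps(obj), number=10)
t_highest = timeit.timeit(
    lambda: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL), number=10)

print(f"default protocol: {t_default:.3f}s, highest: {t_highest:.3f}s")
```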

1 Comment

Even though it might be that cPickle is the fastest, your tests do not show that. You only test a simple but big-ish numpy array, in which case cPickle probably turns it into a memcpy of some sort. That's a case when a serialization library isn't even needed. To adequately compare between the various methods, create a data structure with nested dicts, lists, strings, numbers, and perhaps some custom classes added to the mix.

On JSON vs. pickle security: JSON can only serialize unicode, int, float, NoneType, bool, list and dict. You can't use it if you want to serialize more advanced objects such as class instances. Note that for those kinds of objects, there is no hope of being language-agnostic anyway.
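A small sketch of that type limitation (the Job class is made up): json.dumps rejects a class instance outright, while pickle round-trips it directly.

```python
import json
import pickle

class Job:
    def __init__(self, name):
        self.name = name

# JSON handles only its basic types...
try:
    json.dumps(Job("build"))
except TypeError as e:
    print("json:", e)              # Job is not JSON serializable

# ...while pickle round-trips the class instance directly
restored = pickle.loads(pickle.dumps(Job("build")))
print(restored.name)               # → build
```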

Also, using cPickle instead of pickle partially addresses the speed issue.

1 Comment

I thought cPickle was quicker too, then I saw: stackoverflow.com/questions/16833124/…
