
I have been using pickle and was very happy, then I saw this article: Don't Pickle Your Data

Reading further, it seems the main drawbacks are:

  • Pickle is slower than JSON
  • Pickle is not secure against erroneous or maliciously constructed data
  • Pickle isn't human-readable, and it's Python-specific

I’ve switched to saving my data as JSON, but I wanted to know about best practice:

Given all these issues, when would you ever use pickle? What specific situations call for using it?

5 Comments
  • BTW, there are formats that are far more human-readable than JSON, and arguably easier to edit too. Both good old INI files and YAML come to mind. It's certainly better than an opaque binary stream, but human readability isn't a binary thing. Commented Feb 13, 2014 at 11:21
  • First downside I see for saving objects as JSON: you have to create your own serializers, and that takes some time. Plus, the speed of your JSON serialization might, in the end, be slower than a simple pickle. Though I agree on the security downside. Another point: why do you want to store an object and let it be editable? Couldn't that be unsafe? Commented Feb 13, 2014 at 12:11
  • Why use a hammer when you have a screwdriver? Why use a screwdriver when you have a hammer? It's all about choosing the right tool for the job at hand. Commented Feb 13, 2014 at 12:41
  • This is basically the same post as: stackoverflow.com/questions/8968884/…. If you are concerned about security, don't rely on pickle or JSON. Use a stronger authentication service -- something with an encryption key. Commented Feb 20, 2014 at 12:26
  • Given the extra work that pickle has to do (in comparison to e.g. oversimplified formats such as JSON) to make sure references to objects that are already represented are found, it is not slow at all. Whereas json throws an error on even extremely simple things like import json; d = [1]; d.append(d); json.dumps(d) Commented Feb 14, 2017 at 13:04

5 Answers


Pickle is unsafe because it constructs arbitrary Python objects by invoking arbitrary functions. However, this also gives it the power to serialize almost any Python object, without any boilerplate or even white-/black-listing (in the common case). That's very desirable for some use cases:
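The "invoking arbitrary functions" part can be seen in a tiny sketch (the Innocuous class is a made-up example): whatever call __reduce__ returns is executed during unpickling, in place of rebuilding the object.

```python
import pickle

class Innocuous:
    """Looks harmless, but controls its own unpickling."""
    def __reduce__(self):
        # pickle stores "call eval('40 + 2')" instead of the object's state
        return (eval, ("40 + 2",))

data = pickle.dumps(Innocuous())
result = pickle.loads(data)   # runs eval(...), not an Innocuous constructor
print(result)                 # → 42, not an Innocuous instance
```

Swap eval for something like os.system and the danger of loading untrusted pickles becomes obvious.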

  • Quick & easy serialization, for example for pausing and resuming a long-running but simple script. None of the concerns matter here, you just want to dump the program's state as-is and load it later.
  • Sending arbitrary Python data to other processes or computers, as in multiprocessing. The security concerns may apply (but mostly don't), the generality is absolutely necessary, and humans won't have to read it.
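The first bullet, pausing and resuming a script, is about as short as serialization gets; the checkpoint file name and the state contents here are made up for illustration:

```python
import os
import pickle
import tempfile

# Hypothetical state of a long-running script: any picklable objects will do
state = {"step": 41, "results": [3.14, 2.72], "seen": {"a", "b"}}

path = os.path.join(tempfile.gettempdir(), "checkpoint.pkl")
with open(path, "wb") as f:
    pickle.dump(state, f)       # dump the program's state as-is...

with open(path, "rb") as f:
    restored = pickle.load(f)   # ...and load it later to resume

print(restored == state)        # → True
```

Note the set in there: JSON would already need custom handling for it, pickle does not.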

In other cases, none of the drawbacks is quite enough to justify the work of mapping your stuff to JSON or another restrictive data model. Maybe you don't expect to need human readability/safety/cross-language compatibility or maybe you can do without. Remember, You Ain't Gonna Need It. Using JSON would be the right thing™ but right doesn't always equal good.

You'll notice that I completely ignored the "slow" downside. That's because it's partially misleading: Pickle is indeed slower for data that fits the JSON model (strings, numbers, arrays, maps) perfectly, but if your data's like that you should use JSON for other reasons anyway. If your data isn't like that (very likely), you also need to take into account the custom code you'll need to turn your objects into JSON data, and the custom code you'll need to turn JSON data back into your objects. It adds both engineering effort and run-time overhead, which must be quantified on a case-by-case basis.
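As a rough sketch of that mapping cost (the Point class and both helper functions are hypothetical, and a real codebase would need one such pair per type):

```python
import json

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# Custom code in: each class needs its own mapping to JSON-safe types
def point_to_json(p):
    return json.dumps({"x": p.x, "y": p.y})

# ...and custom code out: rebuild the object from plain dicts
def point_from_json(s):
    d = json.loads(s)
    return Point(d["x"], d["y"])

p2 = point_from_json(point_to_json(Point(1, 2)))
print(p2.x, p2.y)  # → 1 2
```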


3 Comments

Thanks for a great answer. Good to know what is the right thing™ even if it does not always == good
In multiprocessing, and in spark too. When working with RDDs, spark will serialize your user defined functions (passed to map, flatmap) using pickle because it can serialize almost any python object.
For us, clueless newbies, unaware of what "The Right Thing™" is about, ell.stackexchange.com/questions/17108/…

Pickle has the advantage of convenience -- it can serialize arbitrary object graphs with no extra work, and works on a pretty broad range of Python types. With that said, it would be unusual for me to use Pickle in new code. JSON is just a lot cleaner to work with.
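For a concrete sense of "arbitrary object graphs with no extra work": a self-referential list (the same example as in the question's comments) round-trips through pickle but is rejected by json.

```python
import json
import pickle

# A self-referential object graph: pickle's memo handles the cycle
d = [1]
d.append(d)

restored = pickle.loads(pickle.dumps(d))
print(restored[1] is restored)     # → True: the cycle survives the round trip

try:
    json.dumps(d)
except ValueError as e:
    print("json:", e)              # circular reference error
```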

Comments


I usually use neither pickle nor JSON, but MessagePack: it is both safe and fast, and produces serialized data of small size.

An additional advantage is the possibility of exchanging data with software written in other languages (which of course is also true of JSON).

3 Comments

JSON's biggest advantage IMHO is that it is both concise (unlike XML) and human-readable (unlike MessagePack). I'm not sure that the size saved by MessagePack is significant enough to negate those two benefits.
It isn't so much the size savings with MessagePack, but that you can encode things that JSON doesn't do well, like binary data.
MessagePack can't serialize sets, what a shame

I have tried several methods and found that using cPickle with the protocol argument of the dumps method set to the highest protocol, i.e. cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL), is the fastest dump method.

import msgpack
import json
import pickle
import timeit
import cPickle
import numpy as np

num_tests = 10

obj = np.random.normal(0.5, 1, [240, 320, 3])

command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle:  %f seconds" % result)

command = 'cPickle.dumps(obj)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle:   %f seconds" % result)


command = 'cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle highest:   %f seconds" % result)

command = 'json.dumps(obj.tolist())'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json:   %f seconds" % result)


command = 'msgpack.packb(obj.tolist())'
setup = 'from __main__ import msgpack, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("msgpack:   %f seconds" % result)

Output:

pickle         :   0.847938 seconds
cPickle        :   0.810384 seconds
cPickle highest:   0.004283 seconds
json           :   1.769215 seconds
msgpack        :   0.270886 seconds

So, I prefer cPickle with the highest dumping protocol in situations that require real time performance such as video streaming from a camera to a server.
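Note that cPickle exists only in Python 2; in Python 3 the pickle module uses the C implementation automatically, so a rough modern equivalent of this benchmark (stdlib only, with a plain list standing in for the numpy array, and timings that will of course vary by machine) is:

```python
import pickle
import timeit

obj = [float(i) for i in range(100_000)]   # stand-in for the numpy array above

t_default = timeit.timeit(lambda: pickle.dumps(obj), number=10)
t_highest = timeit.timeit(
    lambda: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL), number=10)

print(f"default protocol: {t_default:.3f}s, highest: {t_highest:.3f}s")
```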

1 Comment

Even though it might be that cPickle is the fastest, your tests do not show that. You only test a simple but big-ish numpy array, in which case cPickle probably turns it into a memcpy of some sort. That's a case when a serialization library isn't even needed. To adequately compare between the various methods, create a data structure with nested dicts, lists, strings, numbers, and perhaps some custom classes added to the mix.

On JSON vs. pickle security: JSON can only serialize unicode, int, float, NoneType, bool, list and dict. You can't use it if you want to serialize more advanced objects such as class instances. Note that for those kinds of objects, there is no hope of being language-agnostic anyway.
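A small sketch of that type limitation (the Job class is made up): json.dumps rejects a class instance outright, while pickle round-trips it directly.

```python
import json
import pickle

class Job:
    def __init__(self, name):
        self.name = name

# JSON handles only its basic types...
try:
    json.dumps(Job("build"))
except TypeError as e:
    print("json:", e)              # Job is not JSON serializable

# ...while pickle round-trips the class instance directly
restored = pickle.loads(pickle.dumps(Job("build")))
print(restored.name)               # → build
```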

Also, using cPickle instead of pickle partially addresses the speed issue.

1 Comment

I thought cPickle was quicker too, then I saw: stackoverflow.com/questions/16833124/…
