Progress on _pickle.c

July 24, 2012

Most of the stuff has been implemented in _pickle.c now (sets/frozensets, nested globals, removal of BINPUT, nested globals, BAIL_OUT). It will soon reach a state where it supports everything pickle.py does.


View progress

July 22, 2012

In order for it to be easier to keep track of my progress, I have created a text file
which will be updated periodically.

The text file is available here


pickle4-midterm-report

July 10, 2012

Since proper formatting of text in wordpress appears to be rocket science to me, I’ve written the midterm report in text format and is available here.

 


pickle4-midterm

July 10, 2012

I have made some final changes to pickle4 so it is ready for midterm.
Latest changes:

  • Fixed self-referential frozensets
  • Added BINGLOBAL_BIG opcode for pickling unusually large globals
  • frozenset bug fix and better testing for sets and other stuff
  • Other TODO additions/removals in the code

Pickle4 is intensively tested and all tests currently pass. The current state
has been tagged in the repository with tag pickle4-midterm.

I will now work on a detailed report pointing out exactly what pickle4 does
in addition to pickle3. I will finish this report today (GMT+3) and post a link here.


July 4, 2012

Some stuff I’ve done lately:

  • Removed all uses of BINPUT (with one exception), so more compact pickles are generated now. This might have introduced very subtle bugs creating obscure scenarios in which an object is memoized when pickling but not when unpickling. I have not been able to think of such a scenario so far.
  • NEWOBJ_KW opcode: like NEWOBJ but allows keyword arguments to be passed to __new__

Most of the stuff for the Python version of pickle is done. Here are a few things that I still need to do:

  • Possibly implement “postponed initialization” for frozensets’ items, allowing for self-referential frozensets (self-referential sets are already allowed)
  • More error checking throughout pickle, in order to have the right exceptions raised and with more helpful messages
  • Testing of the raised exceptions
  • All code written so far has tests, but think of more.
  • Minor TODO changes spread throughout the code.
  • Finally, make sure that all tests pass and that we haven’t broken anything in libraries that use pickle (such as multiprocessing)

I can then start working on the C version:

  • Fix the lack of __module__ and __func__ for certain type of builtin methods and send a bug report with a patch etc.
  • Fix builtin __reduce_ex__ implementation to return a proper tuple which allows for self-referential objects

June 26, 2012

Here’s a quick overview of what I’ve done lately:

  • Changed pickle to use EMPTY_FROZENSET/EMPTY_SET + UPDATE_SET/UNION_FROZENSET
  • Wrote on python-dev asking for a couple of suggestions
  • Pickling of bound methods (with a small issue on differentiating between some builtin functions and methods)
  • BAIL_OUT opcode for situations where pickling has failed (BAIL_OUT is for failure like STOP is for success)

SET vs. EMPTYSET

June 21, 2012

On the SET vs. EMPTY_SET issue earlier, apparently I have missed the obvious fact that starting with an EMPTY_SET and then filling it allows for self-referential sets, whereas SET doesn’t. It also makes the job easier in the eventuality that we’ll remove BINPUT completely.

However, there’s still one scenario in which difficulties arise in building objects with cycles, namely when reduce is involved.

Example:

>>> class MySet(set):
...  pass
...
>>> class B:
...  pass
...
>>> s=MySet()
>>> b=B()
>>> s.add(b)
>>> b.foo = s
>>> s.__reduce_ex__()
(<class '__main__.MySet'>, ([<__main__.B object at 0x028D3658>],), {})

Clearly, pickling and then unpickling this object would be troublesome, because
one of the parameters of REDUCE would refer to the object itself, which has yet to be created.
Because of the way in which reduce was designed, it is impossible to create an empty
object and then “fill it” with the information provided by reduce, like pickle is currently
doing for lists, dicts and other builtins.

One solution for this is to make __reduce_ex__(proto) have different semantics in v3 vs. v4. In v4, the implementer of __reduce_ex__ would have to provide a function which creates an empty object of the desired type in addition to a function which fills the newly-created object with the data. This could get messy and create backward-incompatibilities though.

This problem has been discussed at least in issue9269 and issue998998.


Progress so far

June 20, 2012

As I’ve been busy with school projects and finals in the first few weeks, I’ve made less progress than I’ve expected.

However, I have finished with that now, and here is a list of things that I have done so far and what I’m planing to do next.

My work so far has been mostly in Lib/test/pickletester.py, Lib/pickletools.py and Lib/pickle.py.

Here’s an overview of the changes/improvements in each file:

Lib/pickle.py

  • SEMISHORT_BINUNICODE, SHORT_BINUNICODE, BINUNICODE64 for better serialization of arbitrarily-sized str objects
  • BINBYTES64, SEMISHORT_BINBYTES, serving the same purpose for the bytes type
  • SET, FROZENSET and UNION_SET which allows native serialization of sets and frozensets
  • Serialization of nested classes and functions
  • BINGLOBAL – like GLOBAL, but uses binary format for storing the module and attr name (as opposed to newline-termination)
  • BINGLOBAL_COMMON – like BINGLOBAL, but works with commonly used module names for more space-efficient pickling and unpickling of their names

Lib/pickletools.py

  • Documentation and support for all of the opcodes above
  • A maxversion function which returns the maximum version amongst the opcodes used for some pickled data (useful for testing)

Lib/test/pickletester.py

  • test_v4_efficient_unicode: tests space-efficient pickling strings of arbitrary sizes in v3 and v4
  • test_v4_efficient_binbytes: tests space-efficient pickling bytes of arbitrary sizes in v3 and v4
  • test_v4_nested_classes: tests pickling of nested classes
  • test_v4_sets: tests pickling of sets in v3 vs. in v4
  • test_v4_sets_opcodes: tests that sets/frozensets opcodes work properly
  • test_v4_exceptions: tests some exception consistency (not yet passing, see below)
  • All of the tests above make sure the improvements work properly in v4 and that they did not affect the way data is pickled in v3 (such as by making sure that the v4 pickled data takes less size than the v3 one and that maxversion() returns the right values)

Here are a couple examples illustrating changes in v4 so far:
Pickling sets:

>>> len( pickle.dumps(set([frozenset([1,2,3]), 4,5])) )
73
>>> len( pickle.dumps(set([frozenset([1,2,3]), 4,5]),4) )
21
>>> pickletools.dis( pickle.dumps(set([frozenset([1,2,3]), 4,5]),4) )
0: \x80 PROTO      4
2: (    MARK
3: K        BININT1    4
5: (        MARK
6: K            BININT1    1
8: K            BININT1    2
10: K            BININT1    3
12: \x94         FROZENSET  (MARK at 5)
13: q        BINPUT     0
15: K        BININT1    5
17: \x93     SET        (MARK at 2)
18: q    BINPUT     1
20: .    STOP

Pickling nested classes/functions:

>>> class A:
...  class B:
...   def f():
...    print('hello!')
...
>>> pickletools.dis(pickle.dumps(A.B.f,4))
0: \x80 PROTO      4
2: \x92 BINGLOBAL_COMMON '0 A.B.f'
10: q    BINPUT     0
12: .    STOP
highest protocol among opcodes = 4
>>> pickle.loads(pickle.dumps(A.B.f,4))()
hello!
>>> len(pickle.dumps(A.B.f,4))
13

Here’s a list of things that have made me think and that will need further discussion (possibly with the mentor and python-devel):

  1. Do we really need to actually use BINPUT to memoize objects? If the pickler and unpickler would agree to memoize the same objects with the same uids, couldn’t we get rid of the BINPUT opcodes most of the time? It is very often the case that objects are BINPUT-ted that are never BINGET-ted.

    pickletools.optimize solves this problem by disassembling the pickled data and removing unused binputs, but that function is rarely used in practice. pickle.dumps doesn’t use it, although it could. Also, most of the time, pickling is done right into a stream, making using the optimizer impossible (the optimizer obviously can’t work on-the-fly, it needs all of the pickled data).

    There are, of course, some scenarios in which it is currently difficult for the unpickler to know when to memoize an object, such as when constructing a list. Since the list is constructed by doing EMPTY_LIST+APPENDS+..+APPENDS, the loader does not have the a priori knowledge of when the list is finished constructing and should be memoized (there’s no marker for “I’m done with this list”). A solution for this shouldn’t be too hard to find.

    Even if we don’t remove BINPUT, we could still use an opcode which does not provide the index. The pickler and unpickler could easily agree to use consecutive indices and BINPUT would be able to memoize objects without having to report the index.
  2. For sets and frozensets, I’m currently using the newly-created opcodes SET and FROZENSET, both of which construct a set from the top slice of stack (all the elements upto the stack’s MARK).This is similar to what LIST does. For longer list, APPENDS can be used, which appends to the list in the top of the stack. In a similar way, UNION_SET was added, which unions the set/frozenset in the top of the stack with the topmost stack slice and creates a new set/frozenset.

    However, EMPTY_LIST+APPENDS is now used in favor of LIST+APPENDS, which makes me think that I should take the same approach and use EMPTY_SET+UNION_SET and EMPTY_FROZENSET+UNION_SET. But is there really an advantage of this? Beside the fact that creating empty sets takes one opcode less (MARK+SET vs. EMPTY_SET), I don’t see what advantage such an approach would have. Creating non-empty sets is, in fact, one opcode less: MARK+…+SET+…+UNION_SET as opposed to EMPTY_LIST+MARK+…+UNION_SET+…+UNION_SET.

    I don’t know any implementation details about sets, but it would seem that constructing a data structure with given elements and then adding chunks to it leaves more potential for optimization than creating an empty one and then adding chunks to that (i.e. initializing the data structure with the initial chunk as opposed to creating an empty one).
  3. More consistent exception throwing. There are some situations (such as applying a LIST opcode when a MARK doesn’t exist), in which cases exceptions that reveal implementation-detail for no reason are raised. In those circumstances, PicklingError/UnpicklingError would be more appropiate and more consistent with the C implementation.

    This would be nice to add in picklev4, but how should we proceed exactly?

    Should we make this consistency take effect for all versions of pickle, or just picklev4? Would it be acceptable for self.loads(data) to raise different exceptions depending on whether data is serialized with v3 or v4 (i.e. consistent ones in v4 but not v3)? Or would it be more acceptable to simply change the exceptions that are currently being raised? Are users of pickle relying on certain exceptions being thrown? In which circumstances?
  4. Classes/functions of which modules are likely to be pickled? Those module names should be added to pickle.V4_COMMON_MODULES for a more space-efficient serialization of their names (their index in the list will be serialized instead of their name). Currently we have: __main__=0, builtins=1.
  5. Would an opcode that compresses unicode strings before pickling be of interest? When encountering an unicode string in pickling with v4, the pickler would first compress it (see http://unicode.org/faq/compression.html) and then pickle whichever variant is smaller (the compressed or the uncompressed one) using BINUNICODE_COMPRESSED vs. BINUNICODE.If this is too much overhead, an option for the user could eventually be added to disable this time-space tradeoff.
  6. Add support for serializing bound methods? This should be easily doable by serializing the pair (self, func), now that pickling functions that are part of classes is possible. Currently, a work-around for e.g. serializing a.f is doing dumps( (a, A.f), 4) and rebinding when loading.

If you are interested, you can monitor my progress at https://bitbucket.org/mstefanro/pickle4/.

The exact diffs are at https://bitbucket.org/mstefanro/pickle4/compare/..mirror/cpython.


Initial post

April 17, 2012

Should I get accepted, this blog will be used for keeping track of my work on pickle v4.

My first patch can be reached here, and a PoC here.

My original proposal is available here.


Design a site like this with WordPress.com
Get started