Pickling Python collections with non-built-in type keys and cycles
A presentation of various problems and an effective solution for pickling Python collections with non-built-in types as keys and cycles or self references. Applies to dictionaries, default dictionaries, ordered dictionaries, sets, and ordered sets, at all levels of the pickle and cPickle protocol, in Python 2 and 3.
The entirety of the implementation is available as an open source package python-restorable-collections.
In many object compositions in Python we effectively create pointer graphs because an object ‘contained’ in another object maintains a pointer back to its ‘container’; the typical example being a node in a tree with pointers to its children which in turn have pointers back to the parent node. Sometimes an object may even point to itself, such as when modeling relations in a system which allows self-relations. Further still, sometimes nominal types (defined through a class) are better suited as keys in dictionaries or entries in sets rather than having to use only built-in types corresponding to the objects.
The hash value of an instance object is necessary for the object to be used in any data structure which looks up objects based on hashes, including
set. Crucially, the hash value must never change after the object is used as a key, because Python’s dictionary implementation stores the hash value in a C structure behind the scenes and will not update (rehash) this structure should
__hash__ return different values during the lifetime of the object.
In our example below, the implementation of the
__hash__ method relies on the value of a
name field, however, during unpickling, the
__dict__ of the instance object - and hence the field name - will not be available until
__setstate__ has completed (either a custom version of this method defined in the class, or the built-in behavior which simply restores the instance object’s pickled state as the object’s
__dict__). Our fallback will be to return the default hash value for non-built-in objects, namely their
id. We cannot rely on protocol 2
__getnewargs__ either, because it will not have access to the state and is only used for supplying internal invariant data primarily useful for
When the unpickling algorithm needs to use an object as a key, therefore, it will invoke
__hash__ and internally store this value in the C structure implementing the lookup for the dictionary or set; usually, this will happen only after the instance object has already been restored via an invocation to
__setstate__ (or by default state restoration), because unpickling is bottom-up, in the sense that objects' states are restored prior to them being placed in other data structures.
However, there are a set of circumstances which make it impossible for the unpickling algorithm to restore the state prior to placing the object as a key in a dictionary or set: when the collection is an attribute of the very object being restored and the object itself is used as a key, or when two objects mutually form a reference cycle through their collection attributes. In these situations, the hash value (of at least one object) is needed prior to the collection attribute preparation.
Obviously this inability to unpickle certain heap configurations is not a flaw in the language specification nor a bug in its implementation, although it could be argued that pickling and unpickling should always restore the same heap shape if that heap was valid in the first place. External resources, such as sockets and files, by nature represent state and must be manually restored or recreated. However, Python objects not associated with the aforementioned resources are stateless in the light of the never-changing hash value requirement, and should be fair game for automatic restoration irrespective of their pointer graph.
Furthermore, discussion of other dictionary issues such as 1475692 and 1761028 makes it abundantly clear that there are certain performance characteristics of the C structures involved which necessitate maintaining these implementation decisions, and since they have been carried over to Python 3 they are here to stay with us for a long time.
One way of automatically solving this problem would be for the unpickling algorithm to maintain a look-aside table of allocated objects whose hash value was required prior to their state being restored, and then internally rehashing all C structures which contain them. Naturally there would be a performance hit, but it would be one-off during unpickling, and may well be preferable to refactoring code so it uses only built-in keys and needs to perform translation from the built-in key value to the object of interest. Designing or proposing an update to the pickle protocol is beyond the scope of this article, but I believe is worthwhile of future consideration.
Our example consists of a
World in which
Wizard instances cast
Spell instances on each other in epic battles of magic and so on. Some wizards may chose to also cast protective spells on themselves, which means that the caster and the target of the spell can be the same object. Cast spells are recorded with each wizard, along with the order in which they were cast, in the
spells dictionary which maps the spell’s target to the instance of the spell.
This creates the possibility for a wizard to contain a dictionary with herself as the key. Finally, note that in order for a wizard to be usable as a key the class must implement
__hash__ which relies on the hash of the name string with a fall-back to
Testing Without Pickling
The following execution works just fine and all tests pass. If most of these tests appear to be asserting the obvious, it is because they serve to demonstrate where the problem with unpickling lies:
Testing With Pickling
Now we pickle the world and extract the same object references from the unpickled version of the world. Several of the tests fail (marked with a highlight below), specifically all those which use
Merlin as a key, including when his object is obtained directly from the keys method of the dictionary. If you want to run the successful tests, then you’ll have to remove or comment out the ones which fail:
The third from last assertion on lines 49-50, which fails as below, is the crux of our problem. The last two assertions, which compare the identities and hashes of the objects, tell us that in fact both
u_wizard_merlin.spells.keys() are the same object. Since Merlin has cast a spell on himself, he has formed a (trivial) pointer cycle. Wizards.
Error Traceback (most recent call last): File "....py", line ..., in test_with_pickling u_wizard_merlin.spells[u_wizard_merlin.spells.keys()].target) KeyError: Merlin
Below I present one solution to this problem via means of automatically restoring dictionary or set keys after unpickling; for clarity I have not inherited directly from the built-in types (e.g.
set, and so on), rather, I have wrapped the underlying structures and provided interfaces conforming with the Collections Abstract Base Classes. Design issues such as 4712 and 26897 mean that the workarounds required for directly sub-classing
dict and maintaining pickle behavior make the outcome ridiculously fragile and error-prone.
We may be tempted to restore the contents of a collection in
__setattr__, however, this would not work: there may be keys in our collection which have not been restored by the unpickling yet, and their hash value may change after we perform the restoration. Therefore, we must perform the restoration after the unpickling algorithm has finished with all objects.
There are two ways of restoring collections after unpickling: either we keep track of which collections need restoration and restore them after unpickling, or we mark all such collections and intercept attribute access to their wrapped contents the first time they are accessed, restoring them as needed. I have chosen the latter strategy because it does not require a separate unpickling step nor a tracking structure, and because it is lazy in the sense that a collection will be restored only at the point of time it is used. Of course, the restoration logic can easily be extracted and invoked eagerly at any time after unpickling and prior to use.
This mix-in type is responsible for maintaining the restoration requirement marking of a collection, as well as intercepting access to the wrapped
_contents attribute and delegating to the concrete restoration method
_restore. This restoration method is expected to be handed the entirety of the information it needs from the
_restoration_data attribute, which must be made available by the subclass
__setstate__ method. Typically, the subclass
__getstate__ generates this information which is given to
__setstate__ during unpickling.
__setstate__ method, other than performing the default object state setting on
__dict__, marks the object as requiring restoration. Note that
__init__ uses the
False mark, because restoration is necessary only for unpickled objects.
If you inherit this class and override
__setstate__ then you will have to ensure that the
_requires_restoration marker is set to
True. Furthermore, note that
_restore is likely to access the object and the wrapped collection, so in order to avoid indefinite recursion we must set the marker to
False before delegating to
_restore. Finally, the
_restoration_data attribute is removed after restoration so no pointers remain to keys which may later be removed as part of normal collection use:
We wrap a normal
dict constructed by passing our arguments directly to it, inherit from
Restorable and implement the
__setstate__, and _restore methods, as well as provide an interface to the wrapped dictionary via means of
MutableMapping. Our restoration method simply updates the wrapped dictionary with the keys and values from the restoration data. Since unpickling is finished, the keys will now be placed in the dictionary’s C structure with their correct hash value:
Restorable Default Dictionary
Now that we have a restorable dictionary, we can extend it in order to support
defaultdict. We need only override the state-related methods, since we must now also persist the
Restorable Ordered Dictionary
Other than setting
_contents to be an
OrderedDict, we don’t need to do anything else because the
RestorableDict restoration follows the order of
iteritems, which ensures proper ordering in our case:
set, things are even simpler, as the
_restoration_data consist of all elements of the set, and are used to directly update the set in
We’ve worked around an inherent limitation of Python’s pickle protocol at the cost of more complexity as well as a performance hit. Specifically, we must now maintain restorable versions of collections, as well as accept a slow-down of unpickling and access to the wrapped collection.
In terms of performance, both slow-downs are a constant on top of their respective operations, and should not be prohibitive for small collections. If the collections involved are large and have many cyclic structures, then perhaps the question no longer is whether we can pickle effectively but whether we should be using pickled data or Python at all. Perhaps a dedicated graph database and query language are better suited.
With respect to maintenance complexity, it is highly likely that a lot of code will not accept these restorable collections if it makes any kind of type checking using
set instead of, respectively,
Set; this is a more general issue with the appropriate use of Python’s Collection Abstract Base Classes rather than our particular workaround.
Gratitude to Dr Anastasia Niarchou for her feedback.