Pickling Python collections with non-built-in type keys and cycles
A presentation of various problems and an effective solution for pickling Python collections with non-built-in types as keys and cycles or self references. Applies to dictionaries, default dictionaries, ordered dictionaries, sets, and ordered sets, at all levels of the pickle and cPickle protocol, in Python 2 and 3.
The entirety of the implementation is available as an open source package python-restorable-collections.
Motivation
In many object compositions in Python we effectively create pointer graphs because an object ‘contained’ in another object maintains a pointer back to its ‘container’; the typical example being a node in a tree with pointers to its children which in turn have pointers back to the parent node. Sometimes an object may even point to itself, such as when modeling relations in a system which allows self-relations. Further still, sometimes nominal types (defined through a class) are better suited as keys in dictionaries or entries in sets rather than having to use only built-in types corresponding to the objects.
Problem
The hash value of an instance object is necessary for the object to be used in any data structure which looks up objects based on hashes, including dict
, OrderedDict
, and set
. Crucially, the hash value must never change after the object is used as a key, because Python’s dictionary implementation stores the hash value in a C structure behind the scenes and will not update (rehash) this structure should __hash__
return different values during the lifetime of the object.
In our example below, the implementation of the __hash__
method relies on the value of a name
field, however, during unpickling, the __dict__
of the instance object - and hence the field name - will not be available until __setstate__
has completed (either a custom version of this method defined in the class, or the built-in behavior which simply restores the instance object’s pickled state as the object’s __dict__
). Our fallback will be to return the default hash value for non-built-in objects, namely their id
. We cannot rely on protocol 2 __getnewargs__
either, because it will not have access to the state and is only used for supplying internal invariant data primarily useful for __new__
.
When the unpickling algorithm needs to use an object as a key, therefore, it will invoke __hash__
and internally store this value in the C structure implementing the lookup for the dictionary or set; usually, this will happen only after the instance object has already been restored via an invocation to __setstate__
(or by default state restoration), because unpickling is bottom-up, in the sense that objects' states are restored prior to them being placed in other data structures.
However, there are a set of circumstances which make it impossible for the unpickling algorithm to restore the state prior to placing the object as a key in a dictionary or set: when the collection is an attribute of the very object being restored and the object itself is used as a key, or when two objects mutually form a reference cycle through their collection attributes. In these situations, the hash value (of at least one object) is needed prior to the collection attribute preparation.
Discussion
Obviously this inability to unpickle certain heap configurations is not a flaw in the language specification nor a bug in its implementation, although it could be argued that pickling and unpickling should always restore the same heap shape if that heap was valid in the first place. External resources, such as sockets and files, by nature represent state and must be manually restored or recreated. However, Python objects not associated with the aforementioned resources are stateless in the light of the never-changing hash value requirement, and should be fair game for automatic restoration irrespective of their pointer graph.
Furthermore, discussion of other dictionary issues such as 1475692 and 1761028 makes it abundantly clear that there are certain performance characteristics of the C structures involved which necessitate maintaining these implementation decisions, and since they have been carried over to Python 3 they are here to stay with us for a long time.
One way of automatically solving this problem would be for the unpickling algorithm to maintain a look-aside table of allocated objects whose hash value was required prior to their state being restored, and then internally rehashing all C structures which contain them. Naturally there would be a performance hit, but it would be one-off during unpickling, and may well be preferable to refactoring code so it uses only built-in keys and needs to perform translation from the built-in key value to the object of interest. Designing or proposing an update to the pickle protocol is beyond the scope of this article, but I believe is worthwhile of future consideration.
Example
Our example consists of a World
in which Wizard
instances cast Spell
instances on each other in epic battles of magic and so on. Some wizards may chose to also cast protective spells on themselves, which means that the caster and the target of the spell can be the same object. Cast spells are recorded with each wizard, along with the order in which they were cast, in the spells
dictionary which maps the spell’s target to the instance of the spell.
This creates the possibility for a wizard to contain a dictionary with herself as the key. Finally, note that in order for a wizard to be usable as a key the class must implement __hash__
which relies on the hash of the name string with a fall-back to id
:
|
|
Testing Without Pickling
The following execution works just fine and all tests pass. If most of these tests appear to be asserting the obvious, it is because they serve to demonstrate where the problem with unpickling lies:
|
|
Testing With Pickling
Now we pickle the world and extract the same object references from the unpickled version of the world. Several of the tests fail (marked with a highlight below), specifically all those which use Merlin
as a key, including when his object is obtained directly from the keys method of the dictionary. If you want to run the successful tests, then you’ll have to remove or comment out the ones which fail:
|
|
The third from last assertion on lines 49-50, which fails as below, is the crux of our problem. The last two assertions, which compare the identities and hashes of the objects, tell us that in fact both u_wizard_merlin
and u_wizard_merlin.spells.keys()[1]
are the same object. Since Merlin has cast a spell on himself, he has formed a (trivial) pointer cycle. Wizards.
Error
Traceback (most recent call last):
File "....py", line ..., in test_with_pickling
u_wizard_merlin.spells[u_wizard_merlin.spells.keys()[1]][0].target)
KeyError: Merlin
Solution
Below I present one solution to this problem via means of automatically restoring dictionary or set keys after unpickling; for clarity I have not inherited directly from the built-in types (e.g. dict
, set
, and so on), rather, I have wrapped the underlying structures and provided interfaces conforming with the Collections Abstract Base Classes. Design issues such as 4712 and 26897 mean that the workarounds required for directly sub-classing dict
and maintaining pickle behavior make the outcome ridiculously fragile and error-prone.
We may be tempted to restore the contents of a collection in __setattr__
, however, this would not work: there may be keys in our collection which have not been restored by the unpickling yet, and their hash value may change after we perform the restoration. Therefore, we must perform the restoration after the unpickling algorithm has finished with all objects.
There are two ways of restoring collections after unpickling: either we keep track of which collections need restoration and restore them after unpickling, or we mark all such collections and intercept attribute access to their wrapped contents the first time they are accessed, restoring them as needed. I have chosen the latter strategy because it does not require a separate unpickling step nor a tracking structure, and because it is lazy in the sense that a collection will be restored only at the point of time it is used. Of course, the restoration logic can easily be extracted and invoked eagerly at any time after unpickling and prior to use.
Abstract Restorable
This mix-in type is responsible for maintaining the restoration requirement marking of a collection, as well as intercepting access to the wrapped _contents
attribute and delegating to the concrete restoration method _restore
. This restoration method is expected to be handed the entirety of the information it needs from the _restoration_data
attribute, which must be made available by the subclass __setstate__
method. Typically, the subclass __getstate__
generates this information which is given to __setstate__
during unpickling.
The __setstate__
method, other than performing the default object state setting on __dict__
, marks the object as requiring restoration. Note that __init__
uses the False
mark, because restoration is necessary only for unpickled objects.
If you inherit this class and override __setstate__
then you will have to ensure that the _requires_restoration
marker is set to True
. Furthermore, note that _restore
is likely to access the object and the wrapped collection, so in order to avoid indefinite recursion we must set the marker to False
before delegating to _restore
. Finally, the _restoration_data
attribute is removed after restoration so no pointers remain to keys which may later be removed as part of normal collection use:
|
|
Restorable Dictionary
We wrap a normal dict
constructed by passing our arguments directly to it, inherit from Restorable
and implement the __getstate__
, __setstate__
, and _restore methods, as well as provide an interface to the wrapped dictionary via means of MutableMapping
. Our restoration method simply updates the wrapped dictionary with the keys and values from the restoration data. Since unpickling is finished, the keys will now be placed in the dictionary’s C structure with their correct hash value:
|
|
Restorable Default Dictionary
Now that we have a restorable dictionary, we can extend it in order to support defaultdict
. We need only override the state-related methods, since we must now also persist the default_factory
attribute:
|
|
Restorable Ordered Dictionary
Other than setting _contents
to be an OrderedDict
, we don’t need to do anything else because the RestorableDict
restoration follows the order of iteritems
, which ensures proper ordering in our case:
|
|
Restorable Set
Finally, for set
, things are even simpler, as the _restoration_data
consist of all elements of the set, and are used to directly update the set in _restore
:
|
|
Conclusion
We’ve worked around an inherent limitation of Python’s pickle protocol at the cost of more complexity as well as a performance hit. Specifically, we must now maintain restorable versions of collections, as well as accept a slow-down of unpickling and access to the wrapped collection.
In terms of performance, both slow-downs are a constant on top of their respective operations, and should not be prohibitive for small collections. If the collections involved are large and have many cyclic structures, then perhaps the question no longer is whether we can pickle effectively but whether we should be using pickled data or Python at all. Perhaps a dedicated graph database and query language are better suited.
With respect to maintenance complexity, it is highly likely that a lot of code will not accept these restorable collections if it makes any kind of type checking using dict
or set
instead of, respectively, Mapping
and Set
; this is a more general issue with the appropriate use of Python’s Collection Abstract Base Classes rather than our particular workaround.
Acknowledgments
Gratitude to Dr Anastasia Niarchou for her feedback.