Dr Alexis Petrounias

Information Systems Engineer

Europe

Pickling Python collections with non-built-in type keys and cycles

written on in categories Django Programming

A presentation of various problems and an effective solution for pickling Python collections with non-built-in types as keys and cycles or self references. Applies to dictionaries, default dictionaries, ordered dictionaries, sets, and ordered sets, at all levels of the pickle and cPickle protocol, in Python 2 and 3.

The entirety of the implementation is available as an open source package python-restorable-collections.

Motivation

In many object compositions in Python we effectively create pointer graphs because an object ‘contained’ in another object maintains a pointer back to its ‘container’; the typical example being a node in a tree with pointers to its children which in turn have pointers back to the parent node. Sometimes an object may even point to itself, such as when modeling relations in a system which allows self-relations. Further still, sometimes nominal types (defined through a class) are better suited as keys in dictionaries or entries in sets rather than having to use only built-in types corresponding to the objects.

Problem

The hash value of an instance object is necessary for the object to be used in any data structure which looks up objects based on hashes, including dict, OrderedDict, and set. Crucially, the hash value must never change after the object is used as a key, because Python’s dictionary implementation stores the hash value in a C structure behind the scenes and will not update (rehash) this structure should __hash__ return different values during the lifetime of the object.

In our example below, the implementation of the __hash__ method relies on the value of a name field, however, during unpickling, the __dict__ of the instance object - and hence the field name - will not be available until __setstate__ has completed (either a custom version of this method defined in the class, or the built-in behavior which simply restores the instance object’s pickled state as the object’s __dict__). Our fallback will be to return the default hash value for non-built-in objects, namely their id. We cannot rely on protocol 2 __getnewargs__ either, because it will not have access to the state and is only used for supplying internal invariant data primarily useful for __new__.

When the unpickling algorithm needs to use an object as a key, therefore, it will invoke __hash__ and internally store this value in the C structure implementing the lookup for the dictionary or set; usually, this will happen only after the instance object has already been restored via an invocation to __setstate__ (or by default state restoration), because unpickling is bottom-up, in the sense that objects' states are restored prior to them being placed in other data structures.

However, there are a set of circumstances which make it impossible for the unpickling algorithm to restore the state prior to placing the object as a key in a dictionary or set: when the collection is an attribute of the very object being restored and the object itself is used as a key, or when two objects mutually form a reference cycle through their collection attributes. In these situations, the hash value (of at least one object) is needed prior to the collection attribute preparation.

Discussion

Obviously this inability to unpickle certain heap configurations is not a flaw in the language specification nor a bug in its implementation, although it could be argued that pickling and unpickling should always restore the same heap shape if that heap was valid in the first place. External resources, such as sockets and files, by nature represent state and must be manually restored or recreated. However, Python objects not associated with the aforementioned resources are stateless in the light of the never-changing hash value requirement, and should be fair game for automatic restoration irrespective of their pointer graph.

Furthermore, discussion of other dictionary issues such as 1475692 and 1761028 makes it abundantly clear that there are certain performance characteristics of the C structures involved which necessitate maintaining these implementation decisions, and since they have been carried over to Python 3 they are here to stay with us for a long time.

One way of automatically solving this problem would be for the unpickling algorithm to maintain a look-aside table of allocated objects whose hash value was required prior to their state being restored, and then internally rehashing all C structures which contain them. Naturally there would be a performance hit, but it would be one-off during unpickling, and may well be preferable to refactoring code so it uses only built-in keys and needs to perform translation from the built-in key value to the object of interest. Designing or proposing an update to the pickle protocol is beyond the scope of this article, but I believe is worthwhile of future consideration.

Example

Our example consists of a World in which Wizard instances cast Spell instances on each other in epic battles of magic and so on. Some wizards may chose to also cast protective spells on themselves, which means that the caster and the target of the spell can be the same object. Cast spells are recorded with each wizard, along with the order in which they were cast, in the spells dictionary which maps the spell’s target to the instance of the spell.

This creates the possibility for a wizard to contain a dictionary with herself as the key. Finally, note that in order for a wizard to be usable as a key the class must implement __hash__ which relies on the hash of the name string with a fall-back to id:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
from collections import OrderedDict

class World(object):

    def __init__(self):
        super(World, self).__init__()
        self.wizards = []

class Wizard(object):

    def __init__(self, world, name):
        self.name = name
        self.spells = OrderedDict()
        world.wizards.append(self)

    def __hash__(self):
        return hash(self.name) if hasattr(self, 'name') else id(self)

class Spell(object):

    def __init__(self, caster, target, name):
        super(Spell, self).__init__()
        self.caster = caster
        self.target = target
        self.name = name
        if not target in caster.spells:
            caster.spells[target] = []
        caster.spells[target].append(self)

    def __repr__(self):
        return """<Spell {caster} : {target} : {name} ...>""".format(
            caster = self.caster, target = self.target, name = self.name)

Testing Without Pickling

The following execution works just fine and all tests pass. If most of these tests appear to be asserting the obvious, it is because they serve to demonstrate where the problem with unpickling lies:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
class MagicTestCase(TestCase):
    def test_without_pickling(self):
        world = World()
        wizard_merlin = Wizard(world, 'Merlin')
        wizard_morgana = Wizard(world, 'Morgana')
        spell_a = Spell(wizard_merlin, wizard_morgana, 'magic-missile')
        spell_b = Spell(wizard_merlin, wizard_merlin, 'stone-skin')
        spell_c = Spell(wizard_morgana, wizard_merlin, 'geas')

        self.assertEqual(wizard_merlin.spells[wizard_morgana][0], spell_a)
        self.assertEqual(wizard_merlin.spells[wizard_merlin][0], spell_b)
        self.assertEqual(wizard_morgana.spells[wizard_merlin][0], spell_c)

        # Merlin has cast Magic Missile on Morgana, and Stone Skin on himself
        self.assertEqual(wizard_merlin.spells[wizard_morgana][0].name,
            'magic-missile')
        self.assertEqual(wizard_merlin.spells[wizard_merlin][0].name,
            'stone-skin')

        # Morgana has cast Geas on Merlin
        self.assertEqual(wizard_morgana.spells[wizard_merlin][0].name, 'geas')

        # Merlin's first target was Morgana
        self.assertTrue(wizard_merlin.spells.keys()[0] in wizard_merlin.spells)
        self.assertEqual(wizard_merlin.spells.keys()[0], wizard_morgana)

        # Merlin's second target was himself
        self.assertTrue(wizard_merlin.spells.keys()[1] in wizard_merlin.spells)
        self.assertEqual(wizard_merlin.spells.keys()[1], wizard_merlin)

        # Morgana's first target was Merlin
        self.assertTrue(wizard_morgana.spells.keys()[0] in wizard_morgana.spells)
        self.assertEqual(wizard_morgana.spells.keys()[0], wizard_merlin)

        # Merlin's first spell cast with himself as target is in the dictionary,
        # first by looking up directly with Merlin's instance object...
        self.assertEqual(wizard_merlin,
            wizard_merlin.spells[wizard_merlin][0].target)

        # ...and then with the instance object directly from the dictionary keys
        self.assertEqual(wizard_merlin,
            wizard_merlin.spells[wizard_merlin.spells.keys()[1]][0].target)

        # Ensure Merlin's object is unique...
        self.assertEqual(id(wizard_merlin), id(wizard_merlin.spells.keys()[1]))

        # ...and consistently hashed
        self.assertEqual(hash(wizard_merlin),
            hash(wizard_merlin.spells.keys()[1]))

Testing With Pickling

Now we pickle the world and extract the same object references from the unpickled version of the world. Several of the tests fail (marked with a highlight below), specifically all those which use Merlin as a key, including when his object is obtained directly from the keys method of the dictionary. If you want to run the successful tests, then you’ll have to remove or comment out the ones which fail:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
class MagicTestCase(TestCase):
	def test_with_pickling(self):
        world = World()
        wizard_merlin = Wizard(world, 'Merlin')
        wizard_morgana = Wizard(world, 'Morgana')
        spell_a = Spell(wizard_merlin, wizard_morgana, 'magic-missile')
        spell_b = Spell(wizard_merlin, wizard_merlin, 'stone-skin')
        spell_c = Spell(wizard_morgana, wizard_merlin, 'geas')

        self.assertEqual(wizard_merlin.spells[wizard_morgana][0], spell_a)
        self.assertEqual(wizard_merlin.spells[wizard_merlin][0], spell_b)
        self.assertEqual(wizard_morgana.spells[wizard_merlin][0], spell_c)
        _world = pickle.dumps(world)
        u_world = pickle.loads(_world)
        u_wizard_merlin = u_world.wizards[0]
        u_wizard_morgana = u_world.wizards[1]

        # Merlin has cast Magic Missile on Morgana, and Stone Skin on himself
        self.assertEqual(u_wizard_merlin.spells[u_wizard_morgana][0].name,
            'magic-missile')
        self.assertEqual(u_wizard_merlin.spells[u_wizard_merlin][0].name,
            'stone-skin')

        # Morgana has cast Geas on Merlin
        self.assertEqual(u_wizard_morgana.spells[u_wizard_merlin][0].name,
            'geas')

        # Merlin's first target was Morgana
        self.assertTrue(
            u_wizard_merlin.spells.keys()[0] in u_wizard_merlin.spells)
        self.assertEqual(u_wizard_merlin.spells.keys()[0], u_wizard_morgana)

        # Merlin's second target was himself
        self.assertTrue(
            u_wizard_merlin.spells.keys()[1] in u_wizard_merlin.spells)
        self.assertEqual(u_wizard_merlin.spells.keys()[1], u_wizard_merlin)

        # Morgana's first target was Merlin
        self.assertTrue(
            u_wizard_morgana.spells.keys()[0] in u_wizard_morgana.spells)
        self.assertEqual(u_wizard_morgana.spells.keys()[0], u_wizard_merlin)

        # Merlin's first spell cast with himself as target is in the dictionary,
        # first by looking up directly with Merlin's instance object...
        self.assertEqual(u_wizard_merlin,
            u_wizard_merlin.spells[u_wizard_merlin][0].target)

        # ...and then with the instance object directly from the dictionary keys
        self.assertEqual(u_wizard_merlin,
            u_wizard_merlin.spells[u_wizard_merlin.spells.keys()[1]][0].target)

        # Ensure Merlin's object is unique...
        self.assertEqual(id(u_wizard_merlin),
            id(u_wizard_merlin.spells.keys()[1]))

        # ...and consistently hashed
        self.assertEqual(hash(u_wizard_merlin),
            hash(u_wizard_merlin.spells.keys()[1]))

The third from last assertion on lines 49-50, which fails as below, is the crux of our problem. The last two assertions, which compare the identities and hashes of the objects, tell us that in fact both u_wizard_merlin and u_wizard_merlin.spells.keys()[1] are the same object. Since Merlin has cast a spell on himself, he has formed a (trivial) pointer cycle. Wizards.

Error
Traceback (most recent call last):
  File "....py", line ..., in test_with_pickling
    u_wizard_merlin.spells[u_wizard_merlin.spells.keys()[1]][0].target)
KeyError: Merlin

Solution

Below I present one solution to this problem via means of automatically restoring dictionary or set keys after unpickling; for clarity I have not inherited directly from the built-in types (e.g. dict, set, and so on), rather, I have wrapped the underlying structures and provided interfaces conforming with the Collections Abstract Base Classes. Design issues such as 4712 and 26897 mean that the workarounds required for directly sub-classing dict and maintaining pickle behavior make the outcome ridiculously fragile and error-prone.

We may be tempted to restore the contents of a collection in __setattr__, however, this would not work: there may be keys in our collection which have not been restored by the unpickling yet, and their hash value may change after we perform the restoration. Therefore, we must perform the restoration after the unpickling algorithm has finished with all objects.

There are two ways of restoring collections after unpickling: either we keep track of which collections need restoration and restore them after unpickling, or we mark all such collections and intercept attribute access to their wrapped contents the first time they are accessed, restoring them as needed. I have chosen the latter strategy because it does not require a separate unpickling step nor a tracking structure, and because it is lazy in the sense that a collection will be restored only at the point of time it is used. Of course, the restoration logic can easily be extracted and invoked eagerly at any time after unpickling and prior to use.

Abstract Restorable

This mix-in type is responsible for maintaining the restoration requirement marking of a collection, as well as intercepting access to the wrapped _contents attribute and delegating to the concrete restoration method _restore. This restoration method is expected to be handed the entirety of the information it needs from the _restoration_data attribute, which must be made available by the subclass __setstate__ method. Typically, the subclass __getstate__ generates this information which is given to __setstate__ during unpickling.

The __setstate__ method, other than performing the default object state setting on __dict__, marks the object as requiring restoration. Note that __init__ uses the False mark, because restoration is necessary only for unpickled objects.

If you inherit this class and override __setstate__ then you will have to ensure that the _requires_restoration marker is set to True. Furthermore, note that _restore is likely to access the object and the wrapped collection, so in order to avoid indefinite recursion we must set the marker to False before delegating to _restore. Finally, the _restoration_data attribute is removed after restoration so no pointers remain to keys which may later be removed as part of normal collection use:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
class Restorable(object):

    def __init__(self):
        self._requires_restoration = False

    def __setstate__(self, state):
        self.__dict__ = state
        self._requires_restoration = True

    def __getattribute__(self, item):
        if item == '_contents':
            if self._requires_restoration:
                self._requires_restoration = False
                self._restore(self._restoration_data)
                self._restoration_data = None
        return object.__getattribute__(self, item)

    def _restore(self, restoration_data):
        raise NotImplementedError(
            "you must specify the _restore method with the Restorable type")

Restorable Dictionary

We wrap a normal dict constructed by passing our arguments directly to it, inherit from Restorable and implement the __getstate__, __setstate__, and _restore methods, as well as provide an interface to the wrapped dictionary via means of MutableMapping. Our restoration method simply updates the wrapped dictionary with the keys and values from the restoration data. Since unpickling is finished, the keys will now be placed in the dictionary’s C structure with their correct hash value:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
from collections import MutableMapping

class RestorableDict(MutableMapping, Restorable, object):

    def __init__(self, *args, **kwargs):
        self._contents = dict(*args, **kwargs)
        Restorable.__init__(self)

    def __getstate__(self):
        return [ (key, value) for key, value in self._contents.iteritems() ]

    def __setstate__(self, state):
        Restorable.__setstate__(self, {
            '_contents' : dict(),
            '_restoration_data' : state,
        })

    def _restore(self, restoration_data):
        for (key, value) in restoration_data:
            self._contents[key] = value

    def __getitem__(self, item):
        return self._contents[item]

    def __setitem__(self, key, value):
        self._contents[key] = value

    def __delitem__(self, key):
        del self._contents[key]

    def __iter__(self):
        return iter(self._contents)

    def __len__(self):
        return len(self._contents)

    def __repr__(self):
        return """RestorableDict{}""".format(repr(self._contents))

Restorable Default Dictionary

Now that we have a restorable dictionary, we can extend it in order to support defaultdict. We need only override the state-related methods, since we must now also persist the default_factory attribute:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
class RestorableDefaultDict(RestorableDict, object):

    def __init__(self, *args, **kwargs):
        self._contents = defaultdict(*args, **kwargs)
        Restorable.__init__(self)

    def __getstate__(self):
        return (self._contents.default_factory,
            [ (key, value) for key, value in self._contents.iteritems() ])

    def __setstate__(self, state):
        Restorable.__setstate__(self, {
            '_contents' : defaultdict(state[0]),
            '_restoration_data' : state[1],
        })

Restorable Ordered Dictionary

Other than setting _contents to be an OrderedDict, we don’t need to do anything else because the RestorableDict restoration follows the order of iteritems, which ensures proper ordering in our case:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
class RestorableOrderedDict(RestorableDict, object):

    def __init__(self, *args, **kwargs):
        self._contents = OrderedDict(*args, **kwargs)
        Restorable.__init__(self)

    def __setstate__(self, state):
        Restorable.__setstate__(self, {
            '_contents' : OrderedDict(),
            '_restoration_data' : state,
        })

Restorable Set

Finally, for set, things are even simpler, as the _restoration_data consist of all elements of the set, and are used to directly update the set in _restore:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
class RestorableSet(MutableSet, Restorable, object):

    def __init__(self, *args):
        self._contents = set(*args)
        Restorable.__init__(self)

    def __getstate__(self):
        return list(self._contents)

    def __setstate__(self, state):
        Restorable.__setstate__(self, {
            '_contents' : set(),
            '_restoration_data' : state,
        })

    def _restore(self, restoration_data):
        self._contents.update(restoration_data)

    def __contains__(self, x):
        return x in self._contents

    def __iter__(self):
        return iter(self._contents)

    def __len__(self):
        return len(self._contents)

    def add(self, value):
        return self._contents.add(value)

    def discard(self, value):
        return self._contents.discard(value)

    def __repr__(self):
        return """RestorableSet{}""".format(repr(self._contents))

Conclusion

We’ve worked around an inherent limitation of Python’s pickle protocol at the cost of more complexity as well as a performance hit. Specifically, we must now maintain restorable versions of collections, as well as accept a slow-down of unpickling and access to the wrapped collection.

In terms of performance, both slow-downs are a constant on top of their respective operations, and should not be prohibitive for small collections. If the collections involved are large and have many cyclic structures, then perhaps the question no longer is whether we can pickle effectively but whether we should be using pickled data or Python at all. Perhaps a dedicated graph database and query language are better suited.

With respect to maintenance complexity, it is highly likely that a lot of code will not accept these restorable collections if it makes any kind of type checking using dict or set instead of, respectively, Mapping and Set; this is a more general issue with the appropriate use of Python’s Collection Abstract Base Classes rather than our particular workaround.

Acknowledgments

Gratitude to Dr Anastasia Niarchou for her feedback.