Advanced Python: Pickle, Shelve. Pickling 'unpicklable' objects

Pickle:
+ The pickle module serializes Python objects into a byte stream, so they can be written to a file and read back later. More about pickle can be found in the Python docs.
+ Simple example: dump a dictionary into a file

import pickle

d = {'a': 1, 'b': 2}
# open the file in write-binary mode
f = open('data.pkl', 'wb')
# dump d into file f
pickle.dump(d, f)
f.close()

# read d back from the file
f = open('data.pkl', 'rb')
e = pickle.load(f)
f.close()
print(e)
# prints: {'a': 1, 'b': 2}

+ Another example: dump a list into a file

import pickle
# the list contains a list, a string and a number
some_data = [['a', 'list'], 'a string', 5]
with open('pickle_list', 'wb') as f:
    pickle.dump(some_data, f)
# reload the dumped list
with open('pickle_list', 'rb') as f:
    loaded_data = pickle.load(f)
print(loaded_data)
# passes silently: the loaded list equals the original
assert some_data == loaded_data

How it works:
+ When pickle dumps (serializes) an instance of an ordinary class, by default it stores the object’s __dict__ attribute, a dictionary mapping the attribute names on the object to their values. Before falling back to __dict__, pickle checks whether a __getstate__ method exists. If it does, pickle stores the value returned by that method instead of __dict__.
+ Some objects, however, are ‘unpicklable’, for example an open network socket, an open file, a running thread or a database connection. An object that stores one of these as an attribute cannot be pickled as is.
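Both points can be checked with a small sketch. The class names Point and Holder are hypothetical, and a threading.Lock is used here only as a convenient stand-in for the sockets, files and threads mentioned above.

```python
import pickle
import threading


class Point:
    """Plain class: by default pickle stores the instance __dict__."""
    def __init__(self, x, y):
        self.x = x
        self.y = y


class Holder:
    """Keeps a lock as an attribute; locks cannot be pickled."""
    def __init__(self):
        self.lock = threading.Lock()


# Round trip: the attributes come back from the stored __dict__.
restored = pickle.loads(pickle.dumps(Point(1, 2)))
restored_attrs = restored.__dict__

# Pickling fails as soon as pickle reaches the lock attribute.
try:
    pickle.dumps(Holder())
    holder_error = None
except (TypeError, pickle.PicklingError) as exc:
    holder_error = exc

print(restored_attrs)
print(type(holder_error).__name__)
```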

Example: Suppose you have a URL whose content is refreshed every hour. You implement this with a class called UpdatedURL, which has 4 attributes:
+ url: the url
+ contents: the content you get when you open the url
+ last_updated: the last time the contents were fetched
+ timer: a Timer object that schedules the next update
Objects of class UpdatedURL are unpicklable because of the timer attribute (a running thread). The solution is to remove it before pickling and to re-create it (by calling schedule() again) after unpickling.

from threading import Timer
import datetime
from urllib.request import urlopen

class UpdatedURL:
    def __init__(self, url):
        self.url = url
        self.contents = ''
        self.last_updated = None
        self.update()
    
    def update(self):
        self.contents = urlopen(self.url).read()
        self.last_updated = datetime.datetime.now()
        self.schedule()

    def schedule(self):
        self.timer = Timer(3600, self.update)
        self.timer.daemon = True  # daemon thread: does not keep the interpreter alive
        self.timer.start()

    # pickle calls this to get the state to store
    def __getstate__(self):
        new_state = self.__dict__.copy()
        if 'timer' in new_state:
            del new_state['timer']
        return new_state

    # while unpickling, restore the attributes and re-create the timer (call schedule())
    def __setstate__(self, data):
        self.__dict__.update(data)
        self.schedule()

+ The __setstate__ method can be implemented to customize unpickling. It receives the value returned by __getstate__, which here is a dictionary.
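The same remove-and-recreate pattern can be tried without a network connection. This is a minimal sketch: the Counter class is hypothetical, and a threading.Lock plays the role of the unpicklable timer attribute.

```python
import pickle
import threading


class Counter:
    def __init__(self, start=0):
        self.value = start
        self.lock = threading.Lock()   # unpicklable attribute

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop('lock', None)        # drop the lock before pickling
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)    # restore the picklable attributes
        self.lock = threading.Lock()   # re-create the lock after unpickling


original = Counter(5)
clone = pickle.loads(pickle.dumps(original))
print(clone.value)
print(clone.lock is not original.lock)
```

The copy in __getstate__ matters: deleting the key from self.__dict__ directly would also remove the attribute from the live object.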

Shelve:
+ shelve uses pickle to convert an object into a byte string and associates that byte string with a key.
+ Example: Suppose we have a class Person with three attributes: name, job, payment. We create three Person objects, bob, sue and tom, and we want to store them in a way that lets us get each one back by key. A shelf works like a small database with a dictionary interface: each object we write to the file is stored under its own key.

import shelve
# write (bob, sue and tom are assumed to be existing Person objects)
db = shelve.open('persondb') # creates the database file(s); the exact file name depends on the dbm backend
for obj in (bob, sue, tom):
    db[obj.name] = obj # associate key - obj
db.close() # flush all data into the file and close it

# read
db = shelve.open('persondb')
for key in db.keys(): # dictionary-like interface
    print(db[key])
db.close()

+ Under the hood, when instances are shelved or pickled, the underlying pickling system records both instance attributes and enough information to locate their class automatically when they are fetched.
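A self-contained sketch of this behaviour, under a few assumptions: the Person class below is a minimal guess at the class described above, the attribute values are made-up examples, and the shelf is created in a temporary directory so nothing is left on disk.

```python
import os
import pickle
import shelve
import tempfile


class Person:
    def __init__(self, name, job, payment):
        self.name = name
        self.job = job
        self.payment = payment


bob = Person('bob', 'dev', 100)   # example values, not from the original

# The pickled bytes carry the class name, so the class can be located on load.
raw = pickle.dumps(bob)
class_name_stored = b'Person' in raw

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'persondb')
    with shelve.open(path) as db:    # write
        db[bob.name] = bob
    with shelve.open(path) as db:    # read back by key
        fetched = db['bob']

print(class_name_stored, type(fetched).__name__, fetched.job)
```

Only the attributes and a reference to the class are stored, not the class code itself, so the Person class must still be importable when the shelf is read.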
