Serialization - storing and reloading objects

Archive - Originally posted on "The Horse's Mouth" - 2009-10-04 07:45:01 - Graham Ellis

In most of the languages we teach, data is held in memory on a "heap", with a "symbol table" holding the names of the variables, where they are stored, and what type of information they (currently) contain. When you write simple variables out to a file (or the screen), functions like print or puts (Tcl) format the data so that it's written out in a useful form. There is, however, something of a problem when you try to do that with an object. Unless it has been told, your computer doesn't know how to display or save an object in a way appropriate for its later reuse.

Saving an object to disc for later reloading

The very term "heap" should give you a clue about something. If data is stored in "a heap" then it's not going to be all neat and tidy, is it ;-) ? It will be held in a well defined structure that makes it accessible as necessary, but that structure will include all sorts of memory address pointers. That means that if you just store a variable's content it won't be practical for another program to read it back later, because:
• you won't have saved the memory locations as well as the data, so the restoring program won't know what points where
• even if you had saved the memory locations, it's almost certain that when you come to restore the object there will be something else at those memory locations, or they will be within the address space of a different program.

There is often a solution provided - built in to the language - which gives you a way of adding further information as you store your object so that you can restore it later. The generic term for this is serialisation - which means turning the object into a stream of bytes that can be stored in a serial fashion on a disc, or indeed sent down a serial connection such as a pipe to another process, or even over a network connection to another computer.
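As a minimal sketch of that idea in Python (assuming Python 3 and its standard pickle module; the variable names are made up for illustration), the code below turns a small structure into bytes and rebuilds it. Two names point at the same list, which is exactly the "what points where" information that a plain dump of the content would lose.

```python
import pickle

# Hypothetical data: both keys refer to the very same list in memory.
shared = [1, 2, 3]
data = {"first": shared, "second": shared}

# Serialise: the structure becomes a plain stream of bytes.
stream = pickle.dumps(data)
print(isinstance(stream, bytes))                # True

# Unserialise: an equivalent structure is rebuilt from the bytes alone.
restored = pickle.loads(stream)
print(restored == data)                         # True

# The link between the two names survives the round trip ...
print(restored["first"] is restored["second"])  # True

# ... but the restored list is a new object, not the original one.
print(restored["first"] is shared)              # False
```

The stream records how the pieces relate to each other rather than where they happened to live in memory, which is why it can be read back by a different process.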

In PHP, you can provide a method called __serialize in your class definition which defines how an object should be formatted into a stream suitable for transfer over a serial line (where the name came from) or writing to disk, and a method called __unserialize which reverses the process.

In Java, you state that your class implements Serializable, and that ensures that the extra information needed to rebuild the object is stored along with it. You also need to ensure that any classes which are used within the object are serialisable. There's an example class (source code) here and the code that reads and writes objects to file here.

In Python, the pickle, cPickle and marshal modules all provide ways of serialising and unserialising data. There's more on that in a previous article.
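A short sketch of saving to disc and reloading with pickle might look like this. It assumes Python 3, where the faster C implementation that used to be imported as cPickle is used by the pickle module automatically; the record contents and the file name are invented for the example.

```python
import os
import pickle
import tempfile

# Hypothetical record to be saved; any picklable structure would do.
record = {"name": "example", "scores": [1, 2, 3]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "saved.pickle")

    # Write the object out as a byte stream (note the binary mode).
    with open(path, "wb") as handle:
        pickle.dump(record, handle)

    # Later, perhaps in another program, read it back in.
    with open(path, "rb") as handle:
        reloaded = pickle.load(handle)

print(reloaded == record)   # True
```

Only load pickle files that come from a source you trust, since unpickling can run code chosen by whoever wrote the file.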

Cleaning up before you serialise

Where you have a complex object, it's likely that you'll have stored intermediate calculations within it. That's caching partly calculated results to avoid the need for repeated recalculation. But such intermediate data need not be stored, and may take up a lot of disc space, so you'll want to discard it before you save to disc or other serial stream.

In Java, you declare your variables as transient if you don't want to save their contents as part of the object.

In PHP, the __sleep method, which you can write yourself, performs a tidy up operation before the object is serialised, and the __wakeup method can be used to put things back in order on reload.
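In Python, a similar clean up can be sketched with the __getstate__ and __setstate__ hooks that pickle looks for. The Report class and its _cached_total attribute are hypothetical, made up to illustrate the idea of dropping a cache before saving and resetting it on reload.

```python
import pickle


class Report:
    """Hypothetical class that caches an intermediate result."""

    def __init__(self, values):
        self.values = values
        self._cached_total = None   # filled in on first use

    def total(self):
        # Calculate once, then reuse the cached result.
        if self._cached_total is None:
            self._cached_total = sum(self.values)
        return self._cached_total

    def __getstate__(self):
        # Called when pickling: copy the attributes and drop the cache.
        state = self.__dict__.copy()
        del state["_cached_total"]
        return state

    def __setstate__(self, state):
        # Called when unpickling: restore the data, start with an empty cache.
        self.__dict__.update(state)
        self._cached_total = None


report = Report([1, 2, 3])
report.total()                       # fills the cache in the original

restored = pickle.loads(pickle.dumps(report))
print(restored._cached_total)        # None
print(restored.total())              # 6
```

The cache costs nothing in the saved stream and is simply rebuilt the first time the restored object needs it.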