Daily Archive for August 15th, 2008

Ordinal Serialization

Serialization

Serialization is the process of turning an in-memory non-linear data structure into a sequence of bytes that can be used, at another place or another time, to construct the original in-memory data structure. It’s the equivalent of putting the data structure in a box before moving to a new house (or to keep it in the basement for later use).

Today, serialization is a solved problem. Most programming languages provide serialization as a standard library feature (PHP’s serialize, OCaml’s Marshal module) or as a common extension (C++’s boost::serialization). These packages solve countless problems encountered when serializing arbitrary data: handling collections, handling polymorphism, handling object graph joins or cycles, and even the low-level issues of binary serialization, such as endian-independent memory layouts. There is, as such, no reason to use your own serialization method without first trying out these methods.

Ordinal serialization

These methods, however, have the issue of being as general as possible. As such, they often generate serialized data that is too large when compared to the actual data being stored. This is of course perfectly normal: without prior knowledge of the data set, an algorithm cannot be optimized for either size or speed. For instance, if the programmer knows that the object graph is a tree (no joins or cycles), the serializer will still have to place additional data in the stream to tell the deserializer that no joins or cycles were encountered.

A such example of overhead applies to indexes. In general, indexes are numbers between zero and the size of a container, and they are used quite a fair lot in data structures. So, if the container has a size of 172, all indices fit in a single 8-bit byte, yet will be represented by most binary serialization systems as a four-byte or eight-byte value in memory, leaving it to a compression algorithm down the pipeline to compress the three unnecessary “zero” bytes away.

If ordinals are assumed to be close to zero with a higher probability, then a possible optimization is to pack as many ordinals in a single byte, then in the following byte, and so on. The memory representation thus uses, for each byte, 7 bits to represent the value and 1 bit to determine whether this is the last byte of the ordinal or if another byte exists. This method also has the advantage of eliminating any endianness issue, since it uses individual bytes.

The code

The piece of code below provides two functions for writing and reading an ordinal (represented as an unsigned number) from an iterator. The iterator’s value_type should be unsigned char (or be compatible with it). Possibilities would be an std::back_insert_iterator< std::vector<unsigned char> > or an std::ostream_iterator< unsigned char > over the appropriate output stream.

namespace ordinal
{
    template <typename It>
    void put(unsigned value, It &target)
    {
        while (value > 127)
        {
            *target++ = static_cast<unsigned char>(value % 128) + 128;
            value /= 128;
        } 

      *target++ = static_cast<unsigned char>(value);
    } 

    template <typename It>
    unsigned get(It &source)
    {
        unsigned char read;
        unsigned multiplier = 1;
        unsigned result = 0;
        while ((read = *source++) > 127)
        {
            result += (static_cast<unsigned>(read) - 128) * multiplier;
            multiplier *= 128;
        } 

        result += static_cast<unsigned>(read) * multiplier; 

        return result;
    }
}

This is a fairly common trick, encountered in countless file format specifications (take a peek at some of the binary formats for 3D models at wotsit and see how they represent endian-independent indices). The code above is hereby placed in the public domain, where allowed by law.



680 feed subscribers
(readers who polled a feed this week)