Designing Classes for Serialization (1)

Data held in memory must be saved to a persistent store if it is to outlive the program that created it. Simply dumping the raw bytes to disk is rarely the best way of achieving this, and is often not even practicable (think security issues, endianness, and 32-bit versus 64-bit type sizes).

The term “serialization” and its companion “deserialization” describe the conversion between an in-memory model and a (usually textual) external representation such as a disk file: serialization writes the model out, and deserialization reads it back in. As an example of the latter, consider a web browser which reads an HTML page (and its dependencies) from a network connection and converts (deserializes) it into a form which can be manipulated through the DOM (Document Object Model). In this article we’ll concentrate on serialization, which is often the simpler task.

The XML (eXtensible Mark-up Language) format was developed as a simplified subset of SGML, the standard on which HTML was also originally based, as a format for data transfer and storage. Modern applications such as LibreOffice use XML (packaged in a ZIP archive) as their primary saving format, as do more recent versions of Microsoft Word (.docx). An XML file consists of an optional header, the XML declaration, which can state the text encoding (often UTF-8), followed by a single root element which can have an arbitrary number of children. Element names are written between < and >, and an element can have a value, children and/or attributes. Elements with neither a value nor children need no closing tag and can end with /> instead. Here is a sample XML file:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<People>
	<Person name="Alice" />
	<Person name="Bob" />
	<Person name="Charlie" />
</People>

The root element in this case is <People> which contains three child elements, which are indented for easier human interpretation. Another way of writing Alice (with a single child element having a value, instead of using an attribute) would be:

  <Person>
    <Name>Alice</Name>
  </Person>

With the basics of the file format understood, we can turn our attention to class design. A class XMLElement has a name and a (possibly empty) value and needs to be able to contain multiple attributes, which are string-string pairs, and also reference multiple children. Here is the simplified outline:

class XMLElement {
  std::string name, value;
  std::vector<std::pair<std::string,std::string>> attributes;
  std::vector<std::unique_ptr<XMLElement>> children;
  // ... constructors and member functions
};

An XMLElement must always be constructed with a name, optionally followed by a list of attributes. Alternatively, a name and a value can be provided. A way of adding children to the XMLElement (transferring ownership with std::move) is also needed, as is a way of serializing the elements recursively.

A complete example program is shown here:

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <string_view>
#include <memory>
#include <utility>
#include <cctype>

class XMLElement;

class XMLElement {
    std::string name, value;
    std::vector<std::pair<std::string,std::string>> attributes;
    std::vector<std::unique_ptr<XMLElement>> children;
    static constexpr std::string_view indent_with = "  ";
public:
    XMLElement(std::string_view name, std::string_view value)
        : name{ name }, value{ value } {}
    XMLElement(
        std::string_view name,
        std::vector<std::pair<std::string,std::string>>&& attributes = {}
    ) : name{ name }, attributes{ std::move(attributes) } {}
    void addChild(std::unique_ptr<XMLElement> child) {
        children.emplace_back(std::move(child));
    }
    void serialize(std::ostream& os, int indent = 0) const {
        for (int i = 0; i != indent; ++i) {
            os << indent_with;
        }
        os << '<' << name;
        for (const auto& a : attributes) {
            os << ' ' << a.first << "=\"" << a.second << "\"";
        }
        if (children.empty()) {
            if (value.empty()) {
                os << " />\n";
            }
            else {
                os << '>' << value << "</" << name << ">\n";
            }
        }
        else {
            os << ">\n";
            for (const auto& c : children) {
                c->serialize(os, indent + 1);
            }
            for (int i = 0; i != indent; ++i) {
                os << indent_with;
            }
            os << "</" << name << ">\n";
        }
    }
};

int main() {
    const char *fields[] = { "Name", "Age", "Location" },
        *data[] = { "Alice", "25", "New York",
            "Bob", "30", "Los Angeles", 
            "Charlie", "35", "Detroit",
            ""
        };

    XMLElement use_attr("People");

    for (auto p = data; **p;) {
        std::vector<std::pair<std::string,std::string>> attrs;
        for (int f = 0; f != 3; ++f) {
            std::string field = fields[f];
            field.front() = static_cast<char>(std::tolower(static_cast<unsigned char>(field.front())));
            attrs.emplace_back(field, *p++);
        }
        use_attr.addChild(std::make_unique<XMLElement>("Person", std::move(attrs)));
    }
        
    use_attr.serialize(std::cout);
    
    XMLElement no_use_attr("People");
    
    for (auto p = data; **p;) {
        auto person = std::make_unique<XMLElement>("Person");
        for (int f = 0; f != 3; ++f) {
            person->addChild(std::make_unique<XMLElement>(fields[f], *p++));
        }
        no_use_attr.addChild(std::move(person));
    }
    
    no_use_attr.serialize(std::cout);
}

Line 10 of the listing is a forward declaration. It is not strictly required here, because a class's name is already visible inside its own definition and std::unique_ptr can be declared for a type that is still incomplete, but it is harmless and makes the self-reference in member children explicit. Line 18 is the constructor for an element with a value and no children or attributes, while line 20 is the constructor for an element with attribute(s) only, or just a name. Children can be added later with the function at line 24, but if they are added to an element with a value, this value will never be output. (A check for value.empty() could be added to addChild().) Because std::unique_ptr cannot be copied, ownership of each child has to be moved into its parent, which also has the performance advantage of avoiding copies of the elements themselves.
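The check suggested in parentheses above could look roughly like the following sketch. It uses a cut-down stand-in for the article's class, named GuardedElement purely for illustration, and the choice of throwing std::logic_error is an assumption; returning a bool or using an assertion would be equally reasonable.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical cut-down element: just enough state to show the guard.
class GuardedElement {
    std::string name, value;
    std::vector<std::unique_ptr<GuardedElement>> children;
public:
    explicit GuardedElement(std::string_view name, std::string_view value = {})
        : name{ name }, value{ value } {}

    // Refuse to attach a child to an element that already holds a value,
    // because serialize() would otherwise silently drop that value.
    void addChild(std::unique_ptr<GuardedElement> child) {
        if (!value.empty()) {
            throw std::logic_error("element '" + name + "' has a value; cannot add children");
        }
        children.emplace_back(std::move(child));
    }

    std::size_t childCount() const { return children.size(); }
};
```

With this guard in place, adding a child to an element constructed as GuardedElement("Name", "Alice") throws, while adding one to GuardedElement("People") succeeds.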

Lines 27-53 are the output function serialize(), which is the most complex part of the class. Indentation by two spaces per level of nesting is provided by lines 28-30 (and by lines 48-50 for the closing tag, if any). The name of the element is always output by line 31, followed by all the attributes (if any) at lines 32-34. If the element has no children, then either the value and closing tag are output by line 40 or, where there is no value, a closing /> is output by line 37.

Line 46 provides the recursion, adding an extra level of indentation, before outputting the closing tag at the correct indentation at lines 48-51.
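One thing serialize() does not do is escape its output: attribute and element values are written verbatim, so a value containing &, < or a double quote would produce a file that is not well-formed XML. A minimal sketch of an escaping helper is shown below. The name escapeXml is hypothetical and not part of the listing above; it is assumed that it would be applied to a.second and to value just before they are written.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: replace the five characters that XML treats
// specially with their predefined entity references.
std::string escapeXml(std::string_view text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}
```

Escaping all five characters everywhere is more than strictly necessary (an apostrophe is harmless inside a double-quoted attribute, for instance), but it keeps the helper simple and safe to use for both attributes and values.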

In the main program, two root elements named People are defined at lines 64 and 78. Both are populated from the same dataset, but in different ways: the first uses attributes only, while the second uses child elements with values. The output from using attributes is:

<People>
  <Person name="Alice" age="25" location="New York" />
  <Person name="Bob" age="30" location="Los Angeles" />
  <Person name="Charlie" age="35" location="Detroit" />
</People>

This is fairly compact, and the name-value pairs may look familiar to coders used to JSON. However, the more traditional (and often preferred) way of representing the same thing, using child elements with values, is:

<People>
  <Person>
    <Name>Alice</Name>
    <Age>25</Age>
    <Location>New York</Location>
  </Person>
  <Person>
    <Name>Bob</Name>
    <Age>30</Age>
    <Location>Los Angeles</Location>
  </Person>
  <Person>
    <Name>Charlie</Name>
    <Age>35</Age>
    <Location>Detroit</Location>
  </Person>
</People>

The element form is clearly more verbose, but because the repeated tag names compress well, the difference in file size becomes considerably smaller where compression is used.

This article has shown the basis of serialization using a single function added to a class. With this technique, implementing persistence becomes a near-trivial task, and the function can easily be used by client classes and code, too. Deserialization is usually a more complex subject, however, and we’ll look at this in the next article of this mini-series.
