Designing Classes for Serialization (2)

The complement of serialization is deserialization, which is typically more difficult because it requires character- or pattern-matching of the input as well as checking for erroneous input. To match a stream of XML, std::getline() was considered, but it is too inflexible in its handling of whitespace. Stream input to a std::string (with operator >>) was also considered, but this is unsuitable too, as breaking all input at whitespace is not what is needed.
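To make the two rejected options concrete, here is a small sketch of how each behaves on XML-like input. The helper names first_token(), up_to_delim() and stream_input_demo() are made up for this illustration and are not part of the XMLElement class.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: the first token produced by formatted input (operator >>).
std::string first_token(const std::string& input) {
    std::istringstream in{ input };
    std::string token;
    in >> token;                     // skips leading whitespace, stops at the next whitespace
    return token;
}

// Hypothetical helper: the text up to a delimiter, read with std::getline().
std::string up_to_delim(const std::string& input, char delim) {
    std::istringstream in{ input };
    std::string text;
    std::getline(in, text, delim);   // always discards delim and keeps all whitespace
    return text;
}

void stream_input_demo() {
    // operator >> cuts the opening tag in two at the space before the attribute
    assert(first_token("<item id=\"1\">") == "<item");
    // std::getline() cannot drop the whitespace before '<', and '<' itself is consumed
    assert(up_to_delim("  \n<item>", '<') == "  \n");
}
```

Neither function gives control over both whitespace handling and the fate of the delimiter, which is what the custom function described next provides.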

A suitable custom input function, read_until(), takes a reference to a std::istream, a character to break at (in our case always < or >) and two flags: one for whether to “eat” the break character (append it to the result rather than put it back), and the other for whether to skip whitespace. The definition (as a private static function of class XMLElement) looks like this:

class XMLElement {
// private member variables ...
    static std::string read_until(std::istream &is, char c, bool eat, bool skipws) {
        std::string str;
        auto ch = is.get();
        while (!is.eof() && (static_cast<char>(ch) != c)) {
            if (!skipws || ch > ' ') {
                str += static_cast<char>(ch);
            }
            ch = is.get();
        }
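        if (eat) {
            if (is && (!skipws || ch > ' ')) { // append nothing at end of input
                str += static_cast<char>(ch);
            }
        }
        else if (is) { // nothing to put back at end of input
            is.putback(static_cast<char>(ch));
        }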
        return str;
    }
// rest of class definition ...

Lines 6-11 are the main loop where characters are added to variable str. Checking for whitespace is performed based on parameter variable skipws. If parameter variable eat is true, the final character read is added to the string at line 14, otherwise it is put back into the input stream at line 18. Note that ch is an integer value so needs to be cast to char. Such flexibility with eat and skipws may seem like overkill, but we will use all four combinations of these booleans.
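To see the four flag combinations in action, the sketch below uses a free-function copy of read_until() so that it can be tried outside the class. The copy includes a guard (an assumption of this sketch) so that nothing is appended or put back once the stream has reached end of input. The sample input string and the function read_until_demo() are made up for illustration.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Free-function copy of read_until(), for experimenting outside XMLElement.
std::string read_until(std::istream& is, char c, bool eat, bool skipws) {
    std::string str;
    auto ch = is.get();
    while (!is.eof() && (static_cast<char>(ch) != c)) {
        if (!skipws || ch > ' ') {
            str += static_cast<char>(ch);
        }
        ch = is.get();
    }
    if (eat) {
        if (is && (!skipws || ch > ' ')) {   // guard: nothing to append at end of input
            str += static_cast<char>(ch);
        }
    }
    else if (is) {                           // guard: nothing to put back at end of input
        is.putback(static_cast<char>(ch));
    }
    return str;
}

void read_until_demo() {
    std::istringstream in{ "  <a b> text </a>" };
    assert(read_until(in, '<', false, true) == "");         // whitespace dropped, '<' put back
    assert(read_until(in, '>', true, false) == "<a b>");    // whole opening tag, '>' eaten
    assert(read_until(in, '<', false, false) == " text ");  // value with its whitespace kept
    assert(read_until(in, '<', true, true) == "<");         // just the eaten '<'
    assert(read_until(in, '>', true, false) == "/a>");      // rest of the closing tag
    assert(read_until(in, '<', true, true) == "");          // end of input: empty result
}
```

These calls follow the same order in which the constructor shown later in this article uses them.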

To handle erroneous input we need to be able to throw an exception. Here is a minimal class definition for an exception class able to be caught with catch(std::exception &e) { ... } which takes a std::string as the error message:

class XMLError : public std::exception {
    std::string str;
public:
    XMLError(std::string_view s) : str{ s } {}
    const char *what() const noexcept override { return str.c_str(); }
};

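Line 5 overrides the virtual what() member function inherited from std::exception, so the stored error message can be output in the catch block even when the exception is caught through a reference to the base class.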
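As a quick check that the message survives being caught through the base class, here is a minimal sketch. XMLError is repeated so the example stands alone; the helper caught_message() and the sample message text are invented for this illustration.

```cpp
#include <cassert>
#include <exception>
#include <string>
#include <string_view>

class XMLError : public std::exception {
    std::string str;
public:
    XMLError(std::string_view s) : str{ s } {}
    const char *what() const noexcept override { return str.c_str(); }
};

// Hypothetical helper: throws an XMLError and returns the text seen by a
// handler that only knows about std::exception.
std::string caught_message(const std::string& detail) {
    std::string message;
    try {
        throw XMLError("Bad input: " + detail);
    }
    catch (const std::exception& e) {   // caught via the base class
        message = e.what();             // virtual call reaches XMLError::what()
    }
    return message;
}

void xml_error_demo() {
    assert(caught_message("<1abc>") == "Bad input: <1abc>");
}
```

Because the message is copied into a std::string member, it remains valid for as long as the exception object does, even if the string passed to the constructor was a temporary.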

Now we are ready for the interesting part: writing a constructor for XMLElement which takes only a std::istream reference parameter and calls itself recursively in order to create the hierarchy:

class XMLElement {
// private members ...
public:
// other constructors ...
    XMLElement(std::istream& is) {
        std::string str;
        str = read_until(is, '<', false, true);
        if (str.empty()) {
            str = read_until(is, '>', true, false);
            const std::regex begin_elem{ R"(<([A-Za-z_][A-Za-z0-9_]*)((\n|\r|.)*)>)" },
                end_elem{ R"((/>)|(</([A-Za-z_][A-Za-z0-9_]*)>))" },
                attr{ R"*(^\s+([A-Za-z_][A-Za-z0-9_]*)="([^"]*)")*" }, end_attrs{ R"(\s*/?$)" };
            std::smatch matches;
            if (std::regex_match(str, matches, begin_elem)) {
                name = matches[1].str();
                str = matches[2].str();
                if (!str.empty()) {
                    std::smatch attrs;
                    while (std::regex_search(str, attrs, attr)) {
                        attributes.emplace_back(attrs[1].str(), attrs[2].str());
                        str = str.substr(attrs[0].str().size());
                    }
                    if (!std::regex_match(str, end_attrs)) {
                        throw XMLError("Expected attribute instead of: " + str);
                    }
                }
                if (!str.empty() && str.back() == '/') {
                    return;
                }
                str = read_until(is, '<', false, false);
                if (std::find_if(str.cbegin(), str.cend(), [](char c){ return c > ' '; }) != str.cend()) {
                    value = str;
                    str = read_until(is, '>', true, false);
                    if (std::regex_match(str, matches, end_elem)) {
                        if (matches[3].str() == name) {
                            return;
                        }
                    }
                }
                else {
                    for (;;) {
                        children.push_back(std::make_unique<XMLElement>(is));
                        str = read_until(is, '<', true, true);
                        if (str != "<") {
                            break;
                        }
                        char c;
                        is >> c;
                        if (c == '/') {
                            str = "</" + read_until(is, '>', true, false);
                            if (std::regex_match(str, matches, end_elem)) {
                                if (matches[3].str() == name) {
                                    return;
                                }
                            }
                            break;
                        }
                        is.putback(c);
                        is.putback('<');
                    }
                }
            }
        }
        throw XMLError("Bad input: " + str);
    }
// rest of class definition ...

In the code above, line 7 reads from the input stream up to (but not including) the opening <, which should be the first non-whitespace character. If the read is non-empty then we must have encountered garbage, so control flow falls through to the exception throw at line 64. Line 9 reads all characters up to and including the next >, which gives the opening tag. At line 14 this is matched against the begin_elem regex from line 10, which also accepts tags spanning multiple lines; the first sub-match is the tag name and the second (possibly empty) sub-match holds the attribute(s). Lines 19-22 add the attribute(s) (if any) matched by line 12’s attr regex to the attributes member array one by one, shrinking the string from the front. Line 23 checks there was no additional input other than an optional closing /, which if present causes a return from the constructor at line 28.
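The attribute loop can be tried in isolation. The following sketch lifts it into a stand-alone function; the name parse_attributes(), the use of std::pair for each attribute and the sample input are assumptions made for this illustration, since the real element type of the attributes member is not shown here.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-alone version of the attribute loop. Matched attributes
// are removed from the front of str, so on return str holds whatever is left.
std::vector<std::pair<std::string, std::string>> parse_attributes(std::string& str) {
    static const std::regex attr{ R"*(^\s+([A-Za-z_][A-Za-z0-9_]*)="([^"]*)")*" };
    std::vector<std::pair<std::string, std::string>> attributes;
    std::smatch attrs;
    while (std::regex_search(str, attrs, attr)) {
        attributes.emplace_back(attrs[1].str(), attrs[2].str());
        str = str.substr(attrs[0].length());   // shrink the string from the front
    }
    return attributes;
}

void attribute_demo() {
    std::string str{ R"( id="42" lang="en"/)" };
    const auto attributes = parse_attributes(str);
    assert(attributes.size() == 2);
    assert(attributes[0].first == "id" && attributes[0].second == "42");
    assert(attributes[1].first == "lang" && attributes[1].second == "en");
    assert(str == "/");                                         // only the closing / remains
    assert(std::regex_match(str, std::regex{ R"(\s*/?$)" }));   // accepted by end_attrs
}
```

Anything left in the string that the end_attrs pattern does not accept, such as an attribute with a missing quote, is what triggers the "Expected attribute" error in the constructor.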

Line 30 reads all characters including whitespace up to the next <, which would be the element’s value. If any non-whitespace character is found by line 31, line 32 assigns to member value, and looks for a correctly named closing tag at lines 33-35, returning from the constructor if found, or falling through if not found.

In the case where an opening tag is found instead of a value, lines 41-60 handle the recursion with a loop. Line 42 attempts to make an XMLElement sub-tree from the rest of the input stream, which is appended to member children. Line 43 reads up to and including the next <, falling through to the exception throw at line 64 if anything other than whitespace precedes it. Some slightly hacky look-ahead at lines 47-51 is then needed to determine whether a closing tag has been found. If its name matches the current element name then the constructor returns; otherwise control flow falls through to the exception throw at line 64. If the tag is not a closing tag, lines 58-59 undo the look-ahead by putting both characters back, ready for the next iteration of the loop.
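Putting back two characters in a row may fail on some stream buffers, so one possible alternative to the look-ahead is to inspect the next character with peek() instead of extracting it. This is a different technique from the one used in the constructor, sketched under the assumption that no whitespace appears between < and / (the original is >> c would skip such whitespace). The helper name at_closing_tag() is made up for this illustration.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical helper: call after '<' has been consumed. Reports whether the
// next character is '/', without removing it from the stream.
bool at_closing_tag(std::istream& is) {
    return is.peek() == '/';
}

void look_ahead_demo() {
    std::istringstream closing{ "</name>" }, opening{ "<child>" };

    closing.get();                      // consume '<'
    assert(at_closing_tag(closing));    // '/' is still in the stream

    opening.get();                      // consume '<'
    assert(!at_closing_tag(opening));
    opening.putback('<');               // a single putback restores the whole tag
    assert(opening.get() == '<');
}
```

With this approach only the < ever needs to be put back before the recursive constructor call reads the child element.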

The main() program can simply read from std::cin and output to std::cout with sample input redirected from the outputs of the program from the first part of this mini-series. The output should be identical to the input (apart from whitespace issues):

int main() {
    try {
        XMLElement xml_doc(std::cin);
        xml_doc.serialize(std::cout);
    }
    catch (const std::exception& e) {
        std::cerr << e.what() << '\n';
    }
}

That wraps up this article on deserialization. In the next article of this mini-series we’ll look at querying the in-memory model (the “XML DOM”) with overload(s) of operator []. Until then you may wish to experiment with sample inputs, including those with deliberate mistakes, in order to test this constructor function.
