A Modern C++ and Unicode primer

Without doubt, the near-universal acceptance of Unicode has been one of the success stories of globalized software in recent years. With its availability, internationalization (i18n) is no longer an esoteric topic, while localization (l10n) of software can, in many cases, be performed without recompiling, or even having access to, the source code.

The Unicode Standard (UCS-4) defines a code space of slightly over a million code points (U+0000 to U+10FFFF), which are usually written in hexadecimal (e.g. U+20AC for the Euro currency symbol). Encodings exist in 32-, 16- and eight-bit forms, the first two in both big- and little-endian byte orders; they are: UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE and UTF-8 (which, being byte-oriented, has no byte-order variants). The UTF-8 (Unicode Transformation Format – Eight Bit) encoding is probably the most common and can encode any Unicode code point (from either the 16-bit UCS-2 range or the full UCS-4 range) into a code sequence of between one and four bytes in length.

The C++20 Standard introduced a new type char8_t which is intended to hold a single element (code unit) of a UTF-8 code sequence. This removes the ambiguities that can arise from using plain char, whose signedness is implementation-defined; char8_t is a distinct type with the same size and representation as unsigned char (eight bits on all mainstream platforms), and should not be used to hold other forms of encoded or raw data. To complement this new type, there are new string (std::u8string), string literal (u8"...") and character literal (u8'.') entities available (std::u8string is simply the specialization std::basic_string<char8_t>).

The Unicode literal syntax of Modern C++ can be used to specify UTF-8 code sequences within string literals without the use of a UTF-8 compatible editor (although many coding environments do support UTF-8 for editing source code). The two forms are \uXXXX (exactly four hexadecimal digits) and \UXXXXXXXX (exactly eight hexadecimal digits). Because both are of fixed length, other text can follow directly without ambiguity to the compiler: u8"\u20AC0.50" is €0.50, since only the four digits 20AC belong to the \u escape.

In fact, UTF-8 strings need not contain any top-bit-set characters; such strings are plain ASCII, having all code points represented by a single code unit. (This is not the same thing as UTF-7, which is a separate and largely obsolete encoding.) The UTF-8 encoding of code points U+0000 to U+007F is (deliberately) identical to the 7-bit ASCII encoding. Also, some eight-bit values (0xC0, 0xC1 and 0xF5 to 0xFF) never appear as code units in a well-formed UTF-8 string, so a value such as 0xFF could usefully serve as an end-of-string marker, if required.

Unicode literals can be used within string literals, and these strings can then be manipulated or output to streams. Note however that the C++ Standard does not specify how Unicode string objects are written to the stream output objects std::cout/std::wcout; under modern Linux distributions your console probably uses a UTF-8 encoding by default, while under Windows it may be necessary to issue a chcp 65001 command to set the UTF-8 code page for a running console session. The following code embeds code point U+20AC at the beginning of two string literals:

std::string s1{ "\u20AC1.00" };        // "€1.00" in the execution character set
std::u8string s2{ u8"\u20AC2.00" };    // "€2.00", always UTF-8

std::cout << s1 << '\n';
std::cout << reinterpret_cast<const char*>(s2.c_str()) << '\n';

As shown above, s1 may not display correctly under Windows even with the correct code page selected (an ordinary string literal is encoded in the compiler's execution character set, which is not necessarily UTF-8), while for s2 a cast is necessary in order to avoid an error similar to “no match for operator<<”. It is possible to define a suitable global operator<< to perform the same reinterpret_cast for all u8string objects:

inline std::ostream& operator<<(std::ostream& os, const std::u8string& str) {
    return os.write(reinterpret_cast<const char*>(str.data()), str.size());
}

Another issue is manipulating a std::u8string where the elements have varying lengths (some languages mandate that such strings are immutable, or read-only, while C++ does not). Single code unit sequences encode 7 bits, two-unit sequences encode 11 bits, three-unit sequences encode 16 bits and four-unit sequences encode all 21 bits needed for UCS-4. It is possible to know the length of a well-formed UTF-8 code sequence from the value of the first code unit; a function to calculate the length in code points of a well-formed UTF-8 string, in the style of strlen() from the C Standard Library, is shown below:

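// Counts the code points in a null-terminated UTF-8 string.
// Returns -1 if a lead byte is obviously invalid; continuation bytes are
// skipped without being inspected.
int u8strlen(const char8_t *s) {
    int len{ 0 };
    while (*s) {
        if (*s < 0b1000'0000) { // 7-bit code unit
            ++len;
            ++s;
        }
        else if (*s < 0b1100'0000) { // continuation byte in this context is invalid
            return -1;
        }
        else if (*s < 0b1110'0000) { // lead byte of a two-unit sequence
            ++len;
            s += 2;
        }
        else if (*s < 0b1111'0000) { // lead byte of a three-unit sequence
            ++len;
            s += 3;
        }
        else if (*s < 0b1111'1000) { // lead byte of a four-unit sequence
            ++len;
            s += 4;
        }
        else { // out of range for code unit
            return -1;
        }
    }
    return len;
}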

The return value of -1 is for strings which are obviously ill-formed. However, this code does not check that every continuation byte has a valid value of the form 0b10xxxxxx, so a truncated sequence can cause it to skip past the terminating null; in production code you should use something more bullet-proof. Indexing into a std::u8string is similar: counting from the beginning (or from the end, if you are careful and only need a negative offset) is always necessary, and the return type should be char32_t to allow for all possible encoded values. The possibility of encountering a byte order mark (or BOM, code point U+FEFF, UTF-8 sequence EF BB BF) at the start of the string should also be considered.

Under Windows, the wchar_t type is considered to hold a single UTF-16 code unit (that is, either a complete code point from the UCS-2 range, or one half of a surrogate pair for a code point at or above U+10000). This means that text encoded as UTF-16 can be passed directly to std::wcout and other “wide character” stream objects (so long as a cast from char16_t* to wchar_t* is performed, as for char8_t above). Under Linux, the size of wchar_t is platform-defined but is typically 32 bits, in which case it is large enough to hold any UCS-4 value directly. Conversion functions to/from char32_t may be available, and may again be a 1:1 mapping; output to the console using the std::wcout family is possible but should be tested on the target platform.
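
The split of a code point into two UTF-16 code units can be sketched as follows. The function name append_utf16 is hypothetical, and the sketch assumes its input is a valid code point (not itself a surrogate, and no greater than U+10FFFF).

```cpp
#include <string>

// Hypothetical helper: appends the UTF-16 encoding of cp to out.
// Assumes cp is a valid code point; no validation is performed.
inline void append_utf16(std::u16string& out, char32_t cp) {
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);                    // one code unit
    } else {
        const char32_t v = cp - 0x10000;                     // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
}
```

For example, U+1F600 becomes the pair D83D DE00, which is why a single wchar_t on Windows cannot always hold a whole code point.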

Often Unicode text will be held in memory for processing or editing using char32_t, but outputting it to a console, GUI or disk file then involves conversion to char16_t/wchar_t or char8_t/char. You will almost certainly want to use a library rather than writing your own functions for such operations; the International Components for Unicode (ICU) website is a good place to start. (The Standard Library conversion facilities built on std::codecvt, namely std::wstring_convert and the contents of the <codecvt> header, have been deprecated since C++17, in part because of concerns over their handling of malformed multi-byte code sequences; C++20 additionally deprecated the char-based char16_t and char32_t specializations of std::codecvt in favor of char8_t-based ones.)

Published by cpptutor
