Additional Unicode support in C++

Unicode is so prevalent these days that it is difficult to imagine any modern programming language not supporting it. C++ has made a number of attempts to provide language and library support for Unicode encodings (UTF-8/16/32), including conversions, display and manipulation, some of which are deprecated or removed. This article attempts to summarize the current state of Unicode support in C++23, and best practices to use when working with Unicode strings.

Source file encoding

C++23 requires every implementation to accept source files encoded in UTF-8 (with or without a Byte-Order-Mark, or BOM); support for any other source encoding remains implementation-defined. The basic character set is a subset of 7-bit ASCII (not to be confused with UTF-7, which is a separate encoding), covering letters, digits, punctuation and some control characters (whitespace). All valid C++ code can be written using only this character set, with escape sequences standing in for anything else.

String literals [type]"string" (and also raw string literals [type]R"(string)") can contain any valid UTF-8 sequence for both char and char8_t string literals; this implies a need for compile-time conversion where wchar_t, char16_t or char32_t string literals are being created.
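As a rough illustration of that compile-time conversion, the sketch below counts the code units the compiler produces for the same character in each Unicode literal type. The helper name code_units is hypothetical (it is not a standard facility), and escapes are used instead of raw characters so the result does not depend on how the source file itself is encoded.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null character.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+20AC EURO SIGN lies inside the Basic Multilingual Plane.
static_assert(code_units(u8"\u20ac") == 3, "three UTF-8 code units");
static_assert(code_units(u"\u20ac") == 1, "one UTF-16 code unit");
static_assert(code_units(U"\u20ac") == 1, "one UTF-32 code unit");

// U+1F34D PINEAPPLE lies outside the Basic Multilingual Plane.
static_assert(code_units(u8"\U0001f34d") == 4, "four UTF-8 code units");
static_assert(code_units(u"\U0001f34d") == 2, "a UTF-16 surrogate pair");
static_assert(code_units(U"\U0001f34d") == 1, "one UTF-32 code unit");
```

Because every check is a static_assert, a compiler that encoded any of these literals differently would reject the file rather than fail at runtime.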

Character literals [type]'character' can optionally take the form of a (possibly multi-byte) UTF-8 sequence, or use an escape code:

Escape sequence	Meaning
`\123` `\o{123}`	Octal value ‘123’
`\xABC` `\x{ABC}`	Hexadecimal value ABC
`\u01fd` `\u{1fd}`	Unicode character U+1FD
`\U0001f34d`	Unicode character U+1F34D
`\N{LATIN SMALL LETTER A}`	Named Unicode character U+61

(Multi-character wchar_t character literals such as L'cd' are no longer supported in C++23.)
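The long-standing escape forms from the table can be checked against each other at compile time. The following is a small sketch using char32_t literals, where a numeric escape gives the code unit value directly; the constant names are illustrative only. The delimited forms (\o{...}, \x{...}, \u{...}) and \N{...} are left out because they need a compiler with C++23 support for them.

```cpp
// Four spellings of the same character, U+01FD, as a char32_t literal.
constexpr char32_t from_octal = U'\775';        // octal 775 == hexadecimal 1FD
constexpr char32_t from_hex   = U'\x1fd';       // hexadecimal escape
constexpr char32_t from_u     = U'\u01fd';      // four-digit Unicode escape
constexpr char32_t from_U     = U'\U000001fd';  // eight-digit Unicode escape

static_assert(from_octal == from_u, "octal and \\u forms agree");
static_assert(from_hex == from_u, "hexadecimal and \\u forms agree");
static_assert(from_U == from_u, "\\U and \\u forms agree");

// Code points above U+FFFF need the eight-digit form.
constexpr char32_t pineapple = U'\U0001f34d';
static_assert(pineapple == 0x1F34D, "value equals the code point");
```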

Unicode literals of the form \u20ac (€) or \U0001f34d (🍍) can be used within both char and char8_t literals ("..." and u8"..."); previously only char8_t literals could be used for Unicode characters outside of the character set for the source file.

Unicode identifiers

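Most Unicode letters and digits can be used in identifiers (variable names, class names, function or method names) in C++23. The permitted characters are those with the Unicode XID_Start and XID_Continue properties (Unicode Standard Annex #31), so symbols such as emoji are excluded.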

#include <iostream>

float π = 3.14f;

int main() {
    std::cout << π << '\n';
}

If the pre-processor can handle UTF-8 (and there is little reason why it shouldn’t), a “macro language” can be created to translate any C++ keyword. The following is an attempt to write the C++ “Hello, World!” program in Standardized Chinese (apologies if my use of web translation is incorrect in any way):

#include <iostream>

#define 整数      int
#define 主        main
#define 标准      std
#define 字符输出  cout

整数 主() {
    标准::字符输出 << "你好，世界！\n";
}

Character set conversions

Unfortunately there still appears to be no standardized, general-purpose way in C++ of converting between text encodings, other than the <codecvt> header, which was deprecated in C++17 and is removed in C++26. Third-party libraries such as the International Components for Unicode (ICU, recommended, see https://unicode-org.github.io/icu/download/) are therefore necessary.
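To give a feel for what such a conversion involves, here is a minimal hand-written sketch of one direction only, UTF-32 to UTF-8. The function names encode_utf8 and utf32_to_utf8 are hypothetical, and the error handling (silently skipping invalid code points) is an assumption made for brevity; it is not a substitute for a library such as ICU, which also covers validation policy, normalization and legacy encodings.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                          // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                  // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;                       // surrogates are not valid code points
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {              // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: convert a whole UTF-32 string, skipping invalid input.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        out += encode_utf8(cp);
    }
    return out;
}
```

The result is held in a std::string of char so that it can be written straight to std::cout on a UTF-8 terminal; a std::u8string would be the more strongly typed choice where char8_t is preferred.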

Text encoding support

Looking ahead to C++26, a new standard library header <text_encoding> will provide a type which can be queried to determine the current text encoding. The current encoding for literals can be checked at compile-time using code such as:

#include <text_encoding>
 
static_assert(std::text_encoding::literal() == std::text_encoding::UTF8);

The runtime text encoding and encoding for the default locale are also able to be queried:

#include <locale>
#include <text_encoding>
std::text_encoding env_encoding = std::text_encoding::environment();
std::text_encoding locale_encoding = std::locale("").encoding();

All available encodings are listed in the enum type std::text_encoding::id, a value of which is returned by the member function std::text_encoding::mib(). For more details see cppreference.com.

Conclusion

In summary, C++ is continuing to add to its Unicode support in the current and upcoming versions. From C++23 onwards every implementation must accept UTF-8 encoded source files, and creation of string and character literals is easy and consistent, with \u and \U escape codes and \N{...} for named Unicode characters. Under modern Linuxes outputting to std::cout usually implies a UTF-8 locale, while under Windows chcp 65001 in a command prompt selects UTF-8 as the input and output encoding for the session. (Windows also offers std::wcout for wchar_t text, which is UTF-16 encoded on that platform.) Character set conversions are more involved and require third-party libraries such as ICU.

Published by cpptutor
