Unicode is so prevalent these days that it is difficult to imagine any modern programming language not supporting it. C++ has made a number of attempts to provide language and library support for Unicode encodings (UTF-8/16/32), including conversions, display and manipulation, some of which are deprecated or removed. This article attempts to summarize the current state of Unicode support in C++23, and best practices to use when working with Unicode strings.
Source file encoding
C++23 requires every implementation to accept source files encoded in UTF-8 (with or without a Byte-Order-Mark, or BOM); support for any other source encoding remains implementation-defined. The basic character set is a subset of 7-bit ASCII, including some control characters (whitespace), and all valid C++ code can be written using only this character set, with escape codes such as \u20ac standing in for any other character.
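String literals [type]"string" (and also raw string literals [type]R"(string)") can contain any valid UTF-8 sequence. For char8_t string literals the result is always UTF-8, while for char string literals it is stored in the ordinary literal encoding (which is UTF-8 on most current toolchains). Creating wchar_t, char16_t or char32_t string literals requires a compile-time conversion to the wide, UTF-16 or UTF-32 encoding respectively.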
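The compile-time conversion described above can be observed by counting code units. The following is a minimal sketch rather than anything from the standard library: code_units is a hypothetical helper introduced here for illustration, and the characters are written as \u and \U escapes so that the result does not depend on how the file itself is saved.

```cpp
#include <cstddef>

// Hypothetical helper (not a standard API): number of code units in a
// string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00E9 (é): two code units in UTF-8, one in UTF-16, one in UTF-32.
static_assert(code_units(u8"\u00e9") == 2);
static_assert(code_units(u"\u00e9") == 1);
static_assert(code_units(U"\u00e9") == 1);

// U+1F34D (🍍) lies outside the Basic Multilingual Plane:
// four UTF-8 code units, a UTF-16 surrogate pair, one UTF-32 code unit.
static_assert(code_units(u8"\U0001f34d") == 4);
static_assert(code_units(u"\U0001f34d") == 2);
static_assert(code_units(U"\U0001f34d") == 1);

// The count for L"..." depends on the platform's wchar_t encoding,
// so it is deliberately not asserted here.
```

The same helper works whether u8"..." yields char (before C++20) or char8_t (C++20 onwards), since only the array length is inspected.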
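Character literals [type]'character' can optionally be written directly as a UTF-8 sequence in the source (which may occupy several bytes there, provided the character fits in a single code unit of the literal's type), or use an escape code: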
| Escape sequence | Meaning |
| \123 \o{123} | Octal value 123 |
| \xABC \x{ABC} | Hexadecimal value ABC |
| \u01fd \u{1fd} | Unicode character U+01FD |
| \U0001f34d | Unicode character U+1F34D |
| \N{LATIN SMALL LETTER A} | Named Unicode character U+0061 |
(Multicharacter wchar_t character literals such as L'cd' are no longer supported in C++23.)
Unicode escape codes of the form \u20ac (€) or \U0001f34d (🍍) can be used within both char and char8_t string literals ("..." and u8"..."); previously only char8_t literals could be used for Unicode characters outside of the character set for the source file.
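As a quick check of the escape forms in the table, the sketch below compares several spellings of the same character at compile time. The constant names euro and pineapple are illustrative only. The \N{...} lines are guarded by the __cpp_named_character_escapes feature-test macro, on the assumption that older compilers do not define it and will skip them.

```cpp
// Illustrative constants, named here for the example only.
constexpr char32_t euro = U'\u20ac';           // U+20AC EURO SIGN
constexpr char32_t pineapple = U'\U0001f34d';  // U+1F34D PINEAPPLE

// \u takes exactly four hex digits, \U exactly eight.
static_assert(euro == 0x20AC);
static_assert(pineapple == 0x1F34D);
static_assert(U'\u01fd' == 0x1FD);

// Octal and hexadecimal escapes give the numeric value of a code unit.
static_assert(U'\141' == U'a');
static_assert(U'\x61' == U'a');

#ifdef __cpp_named_character_escapes
// Named escapes (C++23) look the character up by its Unicode name.
static_assert(U'\N{LATIN SMALL LETTER A}' == U'a');
static_assert(U'\N{EURO SIGN}' == euro);
#endif
```

Using char32_t literals keeps the comparison independent of the ordinary literal encoding, since each character occupies exactly one UTF-32 code unit.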
Unicode identifiers
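Letters from just about any script can be used for identifiers (variable names, class names, function or method names) in C++23. The rules follow Unicode Standard Annex #31: an identifier must start with a character that has the XID_Start property (or an underscore) and continue with XID_Continue characters, so symbols and emoji are not permitted.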
#include <iostream>
float π = 3.14f;
int main() {
    std::cout << π << '\n';
}
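To make the boundary of what counts as an identifier more concrete, here is a small sketch assuming a compiler that follows the Unicode Standard Annex #31 rules adopted for C++23 (letters with the XID_Start property may begin an identifier; symbols and emoji may not). The names are arbitrary examples, and the rejected declarations are left as comments so that the block still compiles.

```cpp
// Greek letters and CJK ideographs are letters, so these are valid names.
constexpr double π = 3.14159265358979;
constexpr double τ = 2 * π;
constexpr int 数量 = 3;

// Symbols and emoji are not letters and are expected to be rejected:
// int € = 1;    // U+20AC is a currency symbol
// int 🍍 = 2;   // U+1F34D is an emoji

static_assert(τ > 6.28 && τ < 6.29);
```

Older compilers that predate these rules may accept a wider or narrower set of characters, so the commented-out lines are worth trying on the toolchain actually in use.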
If the pre-processor can handle UTF-8 (and there is little reason why it shouldn't), a "macro language" can be created to translate any C++ keyword. The following is an attempt to write the C++ "Hello, World!" program in Simplified Chinese (apologies if my use of web translation is incorrect in any way):
#include <iostream>
#define 整数 int
#define 主 main
#define 标准 std
#define 字符输出 cout
整数 主() {
    标准::字符输出 << "你好,世界!\n";
}
Character set conversions
Unfortunately, C++ still appears to have no standardized way of converting between text encodings, other than the <codecvt> header, which was deprecated in C++17 and is removed in C++26. Using a third-party library such as the International Components for Unicode (ICU, recommended; see https://unicode-org.github.io/icu/download/) is therefore necessary.
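To give a feel for what such a conversion involves, the following is a hand-written sketch of UTF-8 to UTF-32 decoding. It is not ICU and not a standard library facility: Decoded, decode_utf8 and utf8_to_utf32 are hypothetical names chosen for this example, and invalid input is replaced with U+FFFD one byte at a time, which is only one of several possible error policies. For production code a tested library remains the better choice.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result type: the decoded code point and the bytes consumed.
struct Decoded {
    char32_t code_point;  // U+FFFD if the input was invalid
    std::size_t length;   // 0 only for empty input, otherwise 1 to 4
};

// Decode the first code point of a UTF-8 byte sequence.
constexpr Decoded decode_utf8(std::string_view s) {
    constexpr Decoded error{U'\uFFFD', 1};
    if (s.empty()) return Decoded{U'\uFFFD', 0};

    const auto b0 = static_cast<unsigned char>(s[0]);
    if (b0 < 0x80) return Decoded{static_cast<char32_t>(b0), 1};  // ASCII

    std::size_t len = 0;
    char32_t cp = 0;
    char32_t smallest = 0;  // smallest value allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; smallest = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; smallest = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; smallest = 0x10000; }
    else return error;  // stray continuation byte or invalid lead byte

    if (s.size() < len) return error;  // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        const auto b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return error;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates and values beyond U+10FFFF.
    if (cp < smallest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return error;
    }
    return Decoded{cp, len};
}

// Convert a whole UTF-8 string, substituting U+FFFD for invalid bytes.
std::u32string utf8_to_utf32(std::string_view s) {
    std::u32string out;
    while (!s.empty()) {
        const Decoded d = decode_utf8(s);
        out.push_back(d.code_point);
        s.remove_prefix(d.length);
    }
    return out;
}
```

The input is taken as plain char bytes so that the sketch compiles unchanged before and after the introduction of char8_t; calling utf8_to_utf32("\xE2\x82\xAC") yields a one-element string holding U+20AC.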
Text encoding support
Looking ahead to C++26, a new standard library header <text_encoding> will provide a type which can be queried to determine the current text encoding. The current encoding for literals can be checked at compile-time using code such as:
#include <text_encoding>
static_assert(std::text_encoding::literal() == std::text_encoding::UTF8);
The runtime text encoding of the environment and the encoding for the default locale can also be queried:
std::text_encoding env_encoding = std::text_encoding::environment();
std::text_encoding locale_encoding = std::locale("").encoding();
All available encodings are listed in an enum type, a value from which is returned by member function std::text_encoding::mib(). For more details see cppreference.com.
Conclusion
In summary, C++ continues to extend its Unicode support in the current and upcoming versions. From C++23 onwards every implementation must accept UTF-8 encoded source files, and creating string and character literals is easy and consistent, with \u and \U escape codes and \N{...} for named Unicode characters. On modern Linux systems, output to std::cout usually goes through a UTF-8 locale, while on Windows running chcp 65001 in a command prompt selects UTF-8 as the input and output encoding for the session. (On Windows, std::wcout can also be used for UTF-16 encoded output.) Character set conversions are more involved and require third-party libraries such as ICU.
Be careful when using chcp 65001!
Better use wcout for windows.