I am not familiar with the UTF-8 encoding, but I know that it encodes
certain characters with up to four bytes. Is it safe to use UTF-8
encoded comments in C++ source files? For example, is there a remote
possibility that some multi-byte character, when interpreted
byte-by-byte, will contain */ and close the comment? Or is there
something else that can go wrong?
Formally: "Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary." That's the very first thing that happens, and the
standard places no restrictions on how this mapping occurs: an
implementation could legally just AND each byte with 0x7F, for
example. (Because this is "implementation-defined", the
implementation is required to document what it does. Good luck
finding such documentation.) An implementation could also simply
decide that bytes outside the basic source character set are
illegal, declare the source file corrupt, and reject it.
In practice, on a machine using an ASCII-compatible encoding for
the basic character set, there's probably no risk in comments.
UTF-8 was expressly designed so that every byte of a multi-byte
sequence has bit 7 (the high bit, 0x80) set, and thus can never be
mistaken for a single-byte character such as '*' or '/'.
Implementations just copy the bytes of a comment verbatim until
they encounter the end-of-line character or the sequence "*/"
(depending on the type of comment).