Is it safe to use UTF-8 in comments?

Szabolcs

I am not familiar with the UTF-8 encoding, but I know that it encodes
certain characters with up to four bytes. Is it safe to use UTF-8
encoded comments in C++ source files? For example, is there a remote
possibility that some multi-byte character, when interpreted
byte-by-byte, will contain */ and close the comment? Or is there
something else that can go wrong?
 
Charles Bailey

Szabolcs said:
I am not familiar with the UTF-8 encoding, but I know that it encodes
certain characters with up to four bytes. Is it safe to use UTF-8
encoded comments in C++ source files? For example, is there a remote
possibility that some multi-byte character, when interpreted
byte-by-byte, will contain */ and close the comment? Or is there
something else that can go wrong?

Due to the nature of the UTF-8 encoding, no byte of any multibyte
character will ever match any byte that is a valid single-byte
character: every byte of a multibyte sequence has its high bit set,
while all the single-byte characters are values below 0x80. In
particular, no part of a multibyte character can ever match the
two-byte, two-character sequence "*/".
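
A quick sketch to illustrate (the byte values are the standard UTF-8
encodings of é, U+00E9, and the euro sign, U+20AC):

#include <cassert>
#include <cstdio>

int main() {
    // 0xC3 0xA9 encodes é (U+00E9); 0xE2 0x82 0xAC encodes € (U+20AC).
    const unsigned char bytes[] = { 0xC3, 0xA9, 0xE2, 0x82, 0xAC };
    for (unsigned char b : bytes) {
        assert(b & 0x80);             // every byte has its high bit set,
        assert(b != '*' && b != '/'); // so none can be 0x2A '*' or 0x2F '/'
        std::printf("0x%02X\n", b);
    }
}

Every byte printed is 0x80 or above, so a byte-by-byte scan for "*/"
can never fire inside the sequence.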

I'm personally in favour of restricting C++ source files to US-ASCII
characters, so that compilers which treat source files as ISO-8859-1,
US-ASCII, UTF-8, or a wide range of other character encoding schemes
will all interpret them in exactly the same way.

For UTF-8 just in comments, it would be a picky compiler that
complained (unless of course it was expecting some EBCDIC-like
encoding, or another encoding not compatible with the ISO-636
invariant subset).
 
Victor Bazarov

Szabolcs said:
I am not familiar with the UTF-8 encoding, but I know that it encodes
certain characters with up to four bytes. Is it safe to use UTF-8
encoded comments in C++ source files? For example, is there a remote
possibility that some multi-byte character, when interpreted
byte-by-byte, will contain */ and close the comment? Or is there
something else that can go wrong?

Your reservations are valid, but you can rest easy. Comments can
contain any characters, and they are mapped during the very first
phase of translation into characters from the basic source character set.
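
For example (assuming a compiler that accepts UTF-8 source, as most
modern ones do), both comment styles here are harmless:

// Line comment with UTF-8: café, naïve, æøå
/* Block comment with UTF-8: the multibyte bytes all have their
   high bit set, so they can never form the ASCII pair "*" "/". */
int main() { return 0; }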

V
 
Charles Bailey

I wrote:
For UTF-8 just in comments, it would be a picky compiler that
complained (unless of course it was expecting some EBCDIC-like
encoding, or another encoding not compatible with the ISO-636
invariant subset).

That would of course be ISO-646.
 
Szabolcs

Victor said:
Your reservations are valid, but you can rest easy. Comments can
contain any characters, and they are mapped during the very first
phase of translation into characters from the basic source character set.

V

Thanks for all the replies!

Szabolcs
 
James Kanze

Szabolcs said:
I am not familiar with the UTF-8 encoding, but I know that it encodes
certain characters with up to four bytes. Is it safe to use UTF-8
encoded comments in C++ source files? For example, is there a remote
possibility that some multi-byte character, when interpreted
byte-by-byte, will contain */ and close the comment? Or is there
something else that can go wrong?

Formally: "Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary." That's the very first thing that happens, and the
standard places no restrictions on how this occurs: an
implementation could legally just AND each character with 0x7F,
for example. (Because this is "implementation defined", the
implementation is required to document what it does. Good luck
finding such documentation.) An implementation could also
simply decide that these are illegal characters, that the source
file is corrupt, and throw it out.
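
To make the theoretical danger concrete: under that hypothetical
AND-with-0x7F mapping, the three UTF-8 bytes of U+2AAF (a Unicode
math symbol) would come out as "b*/", which contains the comment
terminator. A small sketch:

#include <cstdio>

int main() {
    // U+2AAF encodes in UTF-8 as the bytes 0xE2 0xAA 0xAF.
    const unsigned char utf8[] = { 0xE2, 0xAA, 0xAF };
    for (unsigned char b : utf8)
        std::putchar(b & 0x7F);  // the hypothetical phase 1 mapping
    std::putchar('\n');          // prints: b*/
}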

In practice, on a machine using an ASCII-compatible encoding for
the basic character set, there's probably no risk in comments.
UTF-8 was designed expressly so that every byte in a multibyte
sequence has bit 7 set, and thus can never be mistaken for
a single-byte character, and implementations just copy the bytes
of comments until they encounter the end-of-line character or
the sequence "*/" (depending on the type of comment).
 
James Kanze

Victor said:
Your reservations are valid, but you can rest easy. Comments can
contain any characters, and they are mapped during the very first
phase of translation into characters from the basic source character set.

Yup. And the mapping can be anything the implementation wants.
Which, theoretically at least, is far from reassuring.
 
desktop

Charles said:
Due to the nature of the UTF-8 encoding, no byte of any multibyte
character will ever match any byte that is a valid single-byte
character: every byte of a multibyte sequence has its high bit set,
while all the single-byte characters are values below 0x80. In
particular, no part of a multibyte character can ever match the
two-byte, two-character sequence "*/".

I'm personally in favour of restricting C++ source files to US-ASCII
characters, so that compilers which treat source files as ISO-8859-1,
US-ASCII, UTF-8, or a wide range of other character encoding schemes
will all interpret them in exactly the same way.

For UTF-8 just in comments, it would be a picky compiler that
complained (unless of course it was expecting some EBCDIC-like
encoding, or another encoding not compatible with the ISO-636
invariant subset).

I started out writing my code in UTF-8, but when I try to include it
with listings in LaTeX I get an error on the characters æ, ø and å.
Maybe you don't use those characters, but I would still stick with
ISO-8859-1, which is what I am currently using instead.
 
Robert Bauck Hamar

desktop said:
I started out writing my code in UTF-8, but when I try to include it
with listings in LaTeX I get an error on the characters æ, ø and å.

Generally, I use only English in source code. It both limits the need
for non-ASCII characters and keeps the code readable for programmers
who don't speak Nordic languages, and it avoids an IMO ugly mix of
languages.
desktop said:
Maybe you don't use those characters, but I would still stick with
ISO-8859-1, which is what I am currently using instead.

That would also cause trouble with LaTeX if you run it with another
charset. LaTeX handles utf8 these days as well. The simple change from
\usepackage[latin1]{inputenc}
to
\usepackage[utf8]{inputenc}
works quite well on my computer. But it will cause problems with
latin1 files.
 
desktop

Robert said:
desktop said:
I started out writing my code in UTF-8, but when I try to include it
with listings in LaTeX I get an error on the characters æ, ø and å.

Generally, I use only English in source code. It both limits the need
for non-ASCII characters and keeps the code readable for programmers
who don't speak Nordic languages, and it avoids an IMO ugly mix of
languages.
desktop said:
Maybe you don't use those characters, but I would still stick with
ISO-8859-1, which is what I am currently using instead.

That would also cause trouble with LaTeX if you run it with another
charset. LaTeX handles utf8 these days as well. The simple change from
\usepackage[latin1]{inputenc}
to
\usepackage[utf8]{inputenc}
works quite well on my computer. But it will cause problems with
latin1 files.

I have tried giving utf8 as an option to inputenc and keeping my
source code and .tex file in UTF-8, but I still get an error when I
include the code with listings.

It seems that others also have this problem:


http://groups.google.dk/group/linux...:+Unicode+char+&rnum=8&hl=da#4db764476c6b5411

But it can partly be solved by setting extendedchars=false.
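
Something like this minimal preamble seems to work for me, at least
partly (example.cpp is just a placeholder name; non-ASCII characters
inside the listing may still not print correctly):

\documentclass{article}
\usepackage[utf8]{inputenc}   % read the .tex file itself as UTF-8
\usepackage{listings}
\lstset{extendedchars=false}  % keep listings away from bytes above 0x7F
\begin{document}
\lstinputlisting[language=C++]{example.cpp}
\end{document}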
 
