Is it safe to use UTF-8 in comments?

Szabolcs

I am not familiar with the UTF-8 encoding, but I know that it encodes
certain characters with up to four bytes. Is it safe to use UTF-8
encoded comments in C++ source files? For example, is there a remote
possibility that some multi-byte character, when interpreted
byte-by-byte, will contain */ and close the comment? Or is there
something else that can go wrong?
 
Charles Bailey

Szabolcs said:
I am not familiar with the UTF-8 encoding, but I know that it encodes
certain characters with up to four bytes. Is it safe to use UTF-8
encoded comments in C++ source files? For example, is there a remote
possibility that some multi-byte character, when interpreted
byte-by-byte, will contain */ and close the comment? Or is there
something else that can go wrong?

Due to the nature of the UTF-8 encoding, no byte of any multibyte
character will ever match any byte that is a valid single-byte
character: every byte of a multibyte sequence has its high bit set,
while all the single-byte characters are values below 0x80. In
particular, no part of a multibyte character can ever match the
two-byte, two-character sequence "*/".
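
A quick sketch to illustrate (the byte values are the standard UTF-8
encodings of é, U+00E9, and the euro sign, U+20AC):

#include <cassert>
#include <cstdio>

int main() {
    // 0xC3 0xA9 encodes é (U+00E9); 0xE2 0x82 0xAC encodes € (U+20AC).
    const unsigned char bytes[] = { 0xC3, 0xA9, 0xE2, 0x82, 0xAC };
    for (unsigned char b : bytes) {
        assert(b & 0x80);             // every byte has its high bit set,
        assert(b != '*' && b != '/'); // so none can be 0x2A '*' or 0x2F '/'
        std::printf("0x%02X\n", b);
    }
}

Every byte printed is 0x80 or above, so a byte-by-byte scan for "*/"
can never fire inside the sequence.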

I'm personally in favour of restricting C++ source files to US-ASCII
characters, so that compilers which treat source files as ISO-8859-1,
US-ASCII, UTF-8, or a wide range of other character encoding schemes
will all interpret them in exactly the same way.

For UTF-8 just in comments, it would be a picky compiler that
complained (unless of course it was expecting some EBCDIC-like
encoding, or another encoding not compatible with the ISO-636
invariant subset).
 
Victor Bazarov

Szabolcs said:
I am not familiar with the UTF-8 encoding, but I know that it encodes
certain characters with up to four bytes. Is it safe to use UTF-8
encoded comments in C++ source files? For example, is there a remote
possibility that some multi-byte character, when interpreted
byte-by-byte, will contain */ and close the comment? Or is there
something else that can go wrong?

Your reservations are valid, but you can rest easy. Comments can
contain any characters, and they are mapped during the very first
phase of translation into characters from the basic source character set.
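
For example (assuming a compiler that accepts UTF-8 source, as most
modern ones do), both comment styles here are harmless:

// Line comment with UTF-8: café, naïve, æøå
/* Block comment with UTF-8: the multibyte bytes all have their
   high bit set, so they can never form the ASCII pair "*" "/". */
int main() { return 0; }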

V
 
Charles Bailey

I wrote:
For UTF-8 just in comments, it would be a picky compiler that
complained (unless of course it was expecting some EBCDIC-like
encoding, or another encoding not compatible with the ISO-636
invariant subset).

That would of course be ISO-646.
 
Szabolcs

Victor said:
Your reservations are valid, but you can rest easy. Comments can
contain any characters, and they are mapped during the very first
phase of translation into characters from the basic source character set.

V

Thanks for all the replies!

Szabolcs
 
James Kanze

Szabolcs said:
I am not familiar with the UTF-8 encoding, but I know that it encodes
certain characters with up to four bytes. Is it safe to use UTF-8
encoded comments in C++ source files? For example, is there a remote
possibility that some multi-byte character, when interpreted
byte-by-byte, will contain */ and close the comment? Or is there
something else that can go wrong?

Formally: "Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary." That's the very first thing that happens, and the
standard places no restrictions on how this occurs: an
implementation could legally just AND each character with 0x7F,
for example. (Because this is "implementation defined", the
implementation is required to document what it does. Good luck
finding such documentation.) An implementation could also
simply decide that these are illegal characters, that the source
file is corrupt, and throw it out.
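
To make the theoretical danger concrete: under that hypothetical
AND-with-0x7F mapping, the three UTF-8 bytes of U+2AAF (a Unicode
math symbol) would come out as "b*/", which contains the comment
terminator. A small sketch:

#include <cstdio>

int main() {
    // U+2AAF encodes in UTF-8 as the bytes 0xE2 0xAA 0xAF.
    const unsigned char utf8[] = { 0xE2, 0xAA, 0xAF };
    for (unsigned char b : utf8)
        std::putchar(b & 0x7F);  // the hypothetical phase 1 mapping
    std::putchar('\n');          // prints: b*/
}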

In practice, on a machine using an ASCII-compatible encoding for
the basic character set, there's probably no risk in comments.
UTF-8 was designed expressly so that every byte in a multibyte
sequence has bit 7 set, and thus can never be mistaken for
a single-byte character, and implementations just copy the bytes
of comments until they encounter the end-of-line character or
the sequence "*/" (depending on the type of comment).
 
James Kanze

Victor said:
Your reservations are valid, but you can rest easy. Comments can
contain any characters, and they are mapped during the very first
phase of translation into characters from the basic source character set.

Yup. And the mapping can be anything the implementation wants.
Which, theoretically at least, is far from reassuring.
 
desktop

Charles said:
Due to the nature of the UTF-8 encoding, no byte of any multibyte
character will ever match any byte that is a valid single-byte
character: every byte of a multibyte sequence has its high bit set,
while all the single-byte characters are values below 0x80. In
particular, no part of a multibyte character can ever match the
two-byte, two-character sequence "*/".

I'm personally in favour of restricting C++ source files to US-ASCII
characters, so that compilers which treat source files as ISO-8859-1,
US-ASCII, UTF-8, or a wide range of other character encoding schemes
will all interpret them in exactly the same way.

For UTF-8 just in comments, it would be a picky compiler that
complained (unless of course it was expecting some EBCDIC-like
encoding, or another encoding not compatible with the ISO-636
invariant subset).

I started out writing my code in UTF-8, but when I try to include it
with listings in LaTeX I get an error on the characters æ, ø and å.
Maybe you don't use those characters, but I would still stick with
ISO-8859-1, which is what I am currently using instead.
 
Robert Bauck Hamar

desktop said:
I started out writing my code in UTF-8, but when I try to include it
with listings in LaTeX I get an error on the characters æ, ø and å.

Generally, I use only English in source code. It both limits the need
for non-ASCII characters and keeps the code readable for programmers
who don't speak Nordic languages, and it avoids an IMO ugly mix of
languages.
desktop said:
Maybe you don't use those characters, but I would still stick with
ISO-8859-1, which is what I am currently using instead.

That would also cause trouble with LaTeX if you run it with another
charset. LaTeX handles utf8 these days as well. The simple change from
\usepackage[latin1]{inputenc}
to
\usepackage[utf8]{inputenc}
works quite well on my computer. But it will cause problems with
latin1 files.
 
desktop

Robert said:
desktop said:
I started out writing my code in UTF-8, but when I try to include it
with listings in LaTeX I get an error on the characters æ, ø and å.

Generally, I use only English in source code. It both limits the need
for non-ASCII characters and keeps the code readable for programmers
who don't speak Nordic languages, and it avoids an IMO ugly mix of
languages.
desktop said:
Maybe you don't use those characters, but I would still stick with
ISO-8859-1, which is what I am currently using instead.

That would also cause trouble with LaTeX if you run it with another
charset. LaTeX handles utf8 these days as well. The simple change from
\usepackage[latin1]{inputenc}
to
\usepackage[utf8]{inputenc}
works quite well on my computer. But it will cause problems with
latin1 files.

I have tried giving utf8 as an option to inputenc and keeping my
source code and .tex file in UTF-8, but I still get an error when I
include the code with listings.

It seems that others also have this problem:


http://groups.google.dk/group/linux...:+Unicode+char+&rnum=8&hl=da#4db764476c6b5411

But it can partly be solved by setting extendedchars=false.
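
Something like this minimal preamble seems to work for me, at least
partly (example.cpp is just a placeholder name; non-ASCII characters
inside the listing may still not print correctly):

\documentclass{article}
\usepackage[utf8]{inputenc}   % read the .tex file itself as UTF-8
\usepackage{listings}
\lstset{extendedchars=false}  % keep listings away from bytes above 0x7F
\begin{document}
\lstinputlisting[language=C++]{example.cpp}
\end{document}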
 
