Keith Thompson said:
I'm thinking of UTF-8 rather than wide characters, mainly so that in the [...]

Siri Cruz said:
I would suggest sticking to Unicode and letting callers use iconv to
handle anything else. If wchar_t is Unicode, there's little problem
supporting both. Conversion between UTF8, UTF16, and Unicode is
straightforward. You could designate one (the most frequently used?)
as a base implementation and then do alternate versions that convert
to it, call the base, and convert back.
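A minimal sketch of that pattern, assuming a hypothetical base routine
frob_utf32() and using POSIX iconv for the conversions (error handling
is elided; "UTF-32LE" assumes a little-endian host, and encoding names
vary by platform):

    #include <iconv.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical base implementation: works on UTF-32 code units. */
    void frob_utf32(uint32_t *s, size_t n)
    {
        (void)s; (void)n;   /* ... the real work would happen here ... */
    }

    /* UTF-8 wrapper: convert in, call the base, convert back out. */
    int frob_utf8(char *in8, size_t inlen, char *out8, size_t outlen)
    {
        size_t cap = inlen * sizeof(uint32_t);  /* worst-case UTF-32 size */
        char *buf32 = malloc(cap);
        if (buf32 == NULL)
            return -1;

        iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
        char *src = in8, *dst = buf32;
        size_t srcleft = inlen, dstleft = cap;
        iconv(cd, &src, &srcleft, &dst, &dstleft);
        iconv_close(cd);

        size_t n = (cap - dstleft) / sizeof(uint32_t);
        frob_utf32((uint32_t *)buf32, n);

        cd = iconv_open("UTF-8", "UTF-32LE");
        src = buf32; dst = out8;
        srcleft = cap - dstleft; dstleft = outlen;
        iconv(cd, &src, &srcleft, &dst, &dstleft);
        iconv_close(cd);

        free(buf32);
        return 0;
    }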
Keith Thompson said:
It's not clear that you understand what "Unicode" means.
Unicode is a mapping between characters and "code points", which are
integer values. It does not by itself specify how those code points are
represented.
Siri Cruz said:
And there are something like 2^n (n=24?) code points, each assigned an
integer value from 0 to 2^n-1. So I use 'Unicode' to refer to a set and a
C representation that is isomorphic to the set of code points.
Keith Thompson said:
Unicode consists of 17 planes of 65536 code points each, for a total of
1,114,112 code points from 0x0 to 0x10FFFF. (I recall seeing a
statement on unicode.org that it will never exceed that upper bound.)
So 21 bits are more than enough to represent all code points.
The term Unicode by itself refers to the mapping between characters and
code points, *not* to any particular representation of that mapping.
It's an important distinction.
What exactly do you mean by "Conversion between UTF8, UTF16, and
Unicode"? If you're talking about a representation that uses a full 32
bits to represent each code point, that's called UTF-32 or UCS-4.
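In UTF-32, each code point occupies one 32-bit unit. A minimal sketch of
the corresponding validity check, adding one detail not mentioned above:
the surrogate range 0xD800..0xDFFF is reserved for UTF-16 and is excluded
from well-formed UTF-32 as well:

    #include <stdbool.h>
    #include <stdint.h>

    /* A 32-bit unit is a valid UTF-32 code unit iff it is at most
       0x10FFFF and is not a surrogate (0xD800..0xDFFF), a range
       Unicode reserves for the UTF-16 encoding form. */
    bool valid_utf32(uint32_t cp)
    {
        return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
    }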
Siri Cruz said:
UTF-8 and UTF-16 are maps from the Unicode set to strings of 8-bit or
16-bit naturals. They are not isomorphisms, because some 8-bit strings do
not map into Unicode.
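Concrete examples of that (going slightly beyond the post, but standard):
a lone UTF-8 continuation byte and an unpaired UTF-16 surrogate are both
strings of the right unit type that no Unicode string maps to:

    #include <stdio.h>

    int main(void)
    {
        /* 0x80 is a continuation byte (10xxxxxx); with no lead byte in
           front of it, the one-byte string { 0x80 } is not the UTF-8
           image of any Unicode string.  0xC0 0x80, an overlong encoding
           of U+0000, is likewise rejected by a conforming decoder. */
        const unsigned char not_utf8[] = { 0x80 };
        const unsigned char overlong[] = { 0xC0, 0x80 };

        /* The ill-formed UTF-16 analogue: an unpaired high surrogate. */
        const unsigned short not_utf16[] = { 0xD800 };

        printf("%zu %zu %zu\n",
               sizeof not_utf8, sizeof overlong, sizeof not_utf16);
        return 0;
    }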
Keith Thompson said:
A small quibble: the term "natural number" traditionally refers only to
*positive* integers; 0 is a valid Unicode code point. (The term
"natural number" is also sometimes used to refer to the non-negative
integers.)
Siri Cruz said:
But it could be whatever the C implementor regards as the natural wide
character set.
Keith Thompson said:
Sure.
I'm not sure that Microsoft's use of 16 bits for wchar_t is even
conforming. The C standard says that wchar_t "is an integer type whose
range of values can represent distinct codes for all members of the
largest extended character set specified among the supported locales".
16 bits covers UCS-2 (which can only represent code points from 0 to
65535), but UTF-16 encodes code points above 0xFFFF as pairs of 16-bit
units, so no single 16-bit value is a distinct code for those characters;
using wchar_t for UTF-16 arguably violates the C standard's requirements.
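To make that concrete, here is a minimal sketch of the surrogate-pair
encoding (U+1F600 is just an arbitrary supplementary-plane example):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* UTF-16 surrogate-pair encoding: subtract 0x10000, then split
           the remaining 20 bits across a high surrogate (0xD800 + top
           10 bits) and a low surrogate (0xDC00 + bottom 10 bits). */
        uint32_t cp = 0x1F600;
        uint32_t v  = cp - 0x10000;
        uint16_t hi = (uint16_t)(0xD800 + (v >> 10));
        uint16_t lo = (uint16_t)(0xDC00 + (v & 0x3FF));

        /* Prints "U+1F600 -> 0xD83D 0xDE00": two units, so a 16-bit
           wchar_t cannot hold a distinct code for this character. */
        printf("U+%04X -> 0x%04X 0x%04X\n",
               (unsigned)cp, (unsigned)hi, (unsigned)lo);
        return 0;
    }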