32-bit characters in Java string literals

Roedy Green · Dec 22, 2009

Let's say you wanted to include some 32-bit characters in Java String
literals.

I understand what the stream would look like in UTF-8 or an int[], but
what I am curious about is the cleanest way to create string literals
in a Java program containing such awkward characters.

Roedy Green · Dec 23, 2009

E.g., if you want to have a String literal with U+10C22 (that's
OLD TURKIC LETTER ORKHON EM; it somewhat looks like a fish),
then you first convert 0x10C22 to a surrogate pair:
1. subtract 0x10000: you get 0xC22
2. get the upper (u) and lower (l) 10 bits; you get u=0x3 and l=0x022
(i.e. (u << 10) + l == 0xC22)
3. the high surrogate is 0xD800 + u, the low surrogate is 0xDC00 + l.
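
For anyone who wants to check the arithmetic, here is a minimal sketch of the three steps above, compared against what Character.toChars returns. The class and method names (SurrogateSteps, high, low) are made up for illustration; only Character.toChars is a real JDK call.

```java
// Sketch only: the three steps from the post, checked against the JDK.
class SurrogateSteps {
    /** High (leading) surrogate for a code point above 0xFFFF. */
    static char high(int codePoint) {
        int offset = codePoint - 0x10000;           // step 1
        return (char) (0xD800 + (offset >>> 10));   // upper 10 bits, step 3
    }

    /** Low (trailing) surrogate for a code point above 0xFFFF. */
    static char low(int codePoint) {
        int offset = codePoint - 0x10000;           // step 1
        return (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits, step 3
    }

    public static void main(String[] args) {
        int cp = 0x10C22;
        char[] jdk = Character.toChars(cp);         // standard library conversion
        System.out.printf("manual: %04X %04X%n", (int) high(cp), (int) low(cp));
        System.out.printf("JDK:    %04X %04X%n", (int) jdk[0], (int) jdk[1]);
    }
}
```

Both lines should print D803 DC22, so the literal for U+10C22 would be written "\uD803\uDC22".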

That is what I was afraid of. I am doing that now to generate tables
of char entities and the equivalent hex and \u entities on various
pages of mindprod.com, e.g. http://mindprod.com/jgloss/html5.html
which shows the new HTML entities in HTML 5.

here is my code:

// theCharNumber is a code point above 0xffff
final int extract = theCharNumber - 0x10000;
final int high = ( extract >>> 10 ) + 0xd800;
final int low = ( extract & 0x3ff ) + 0xdc00;
// emit the quoted literal, e.g. "\ud803\udc22"
sb.append( "\"\\u" );
sb.append( StringTools.toLzHexString( high, 4 ) );
sb.append( "\\u" );
sb.append( StringTools.toLzHexString( low, 4 ) );
sb.append( "\"" );
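
A self-contained variant of the same idea, as a rough sketch. StringTools is not a JDK class, so this version assumes String.format with %04x is an acceptable stand-in for toLzHexString, and it lets Character.toChars do the splitting. EscapeWriter and toJavaLiteral are hypothetical names.

```java
// Sketch only: builds the source text of a quoted Java literal for one code point.
class EscapeWriter {
    static String toJavaLiteral(int codePoint) {
        StringBuilder sb = new StringBuilder("\"");
        // toChars yields one char for the BMP and a surrogate pair above it.
        for (char c : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(toJavaLiteral(0x10C22)); // supplementary: two escapes
        System.out.println(toJavaLiteral(0x00E9));  // BMP: one escape
    }
}
```

Because the loop handles both cases, the caller does not need to test whether the code point is above 0xffff first.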

I started to think about what would be needed to make this less
onerous.

1. an applet to convert hex to a surrogate pair.

2. allow \u12345 in string literals. However that would break
existing code. \u12345 currently means
"\u1234" + "5".

3. So you have to pick another letter, e.g. \c12345; for codepoint. It
needs a terminator, so that in future it could also handle \c123456;
I don't know what that might break.

4. Introduce 32-bit CodePoint string literals with extensible \u
mechanism. E.g. CString b = c"\u12345;Hello";

5. specify weird chars with named entities to make the code more
readable. Entities in String literals would be translated to binary
at compile time, so the entities would not exist at run-time. The
HTML 5 set would be greatly extended to give pretty well every Unicode
glyph a name.
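
To illustrate point 2, a small sketch showing that the escape consumes exactly four hex digits today, next to one runtime alternative that takes the code point as an int. EscapeWidth and its field names are invented for the example; StringBuilder.appendCodePoint is a standard JDK method.

```java
// Sketch only: how a five-digit escape is read today, and a runtime workaround.
class EscapeWidth {
    // Read as the single char U+1234 followed by the digit '5'.
    static final String FOUR_DIGIT = "\u12345";

    // Built at run time from the int code point, so no surrogate arithmetic.
    static final String SUPPLEMENTARY =
            new StringBuilder().appendCodePoint(0x12345).append("Hello").toString();

    public static void main(String[] args) {
        System.out.println(FOUR_DIGIT.length());       // 2 chars
        System.out.println(SUPPLEMENTARY.length());    // 7 chars: pair + "Hello"
        System.out.println(SUPPLEMENTARY.codePointCount(0, SUPPLEMENTARY.length())); // 6
    }
}
```

The runtime version is not a compile-time constant, which is the price of avoiding the hand-computed pair.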

P.S. I have been poking around in HTML 5. W3C did an odd thing. They
REDEFINED the entities &lang; and &rang; to different glyphs from HTML
4. I don't think they have ever done anything like that before. I
hope it was just an error. I have written the W3C asking if they
really meant to do that.

Roedy Green · Dec 23, 2009

I started to think about what would be needed to make this less
onerous.

If you had only a few, you could create a library of named constants for
them, and glue them together with compile-time concatenation. With
only a little cleverness, a compiler would avoid embedding constants
it did not use.
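
Roughly what such a library might look like, as a sketch. The class name Glyphs and the constant names are hypothetical; the surrogate pairs were worked out by hand for U+10C22 and U+1D121. Since the fields are static final Strings initialised from literals, concatenating them with other literals is a constant expression.

```java
// Sketch only: named constants for a few supplementary characters.
class Glyphs {
    static final String ORKHON_EM = "\uD803\uDC22"; // U+10C22
    static final String C_CLEF = "\uD834\uDD21";    // U+1D121

    // Folded into a single string constant at compile time.
    static final String LABEL = "fish: " + ORKHON_EM;

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(ORKHON_EM.codePointAt(0))); // 10c22
        System.out.println(Integer.toHexString(C_CLEF.codePointAt(0)));    // 1d121
        System.out.println(LABEL.length());                                // 8 chars
    }
}
```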

Is any OS, JVM, utility, browser etc. capable of rendering a code
point above 0xffff? I get the impression all we can do is embed them
in UTF-8 files.

Andreas Leitgeb · Dec 23, 2009

Thomas Pornin said:
The Unicode standard was originally designed as a fixed-width 16-bit
character encoding. It has since been changed to allow for characters
whose representation requires more than 16 bits. The range of legal
code points is now U+0000 to U+10FFFF.

I have problems understanding why the surrogate code points are counted
twice: once as their code points isolated and then again as the code-points
that are reached by an adjacent pair of them.

In my understanding that would make 0x10F7FF really legal codepoints, as
the surrogates wouldn't be legal as single code points, but only as pairs.

But then again, perhaps my own understanding of "legal code points" just
differs from some common definition.

Mayeul · Dec 23, 2009

Andreas said:
I have problems understanding why the surrogate code points are counted
twice: once as their code points isolated and then again as the code-points
that are reached by an adjacent pair of them.

It makes defining UTF-16 easy and less error-prone.

Yet I guess the range of legal codepoints is still U+0000 to
U+10FFFF, excluding the surrogates range in the middle.

Tom Anderson · Dec 23, 2009

I have problems understanding why the surrogate code points are counted
twice: once as their code points isolated and then again as the code-points
that are reached by an adjacent pair of them.

The range is a bound - all legal code points are inside it. It doesn't
mean that all numbers inside it are legal code points. There are plenty of
numbers which aren't mapped to any character, and so aren't legal code
points - the surrogates are just a special case of those. I reckon.

tom

Andreas Leitgeb · Dec 23, 2009

Tom Anderson said:
The range is a bound - all legal code points are inside it. It doesn't
mean that all numbers inside it are legal code points. There are plenty of
numbers which aren't mapped to any character, and so aren't legal code
points - the surrogates are just a special case of those. I reckon.

Thanks, that was my catch: I somehow mistakenly took "range" as implying
"all in the range" - and a codepoint with no char mapped to it wasn't
necessarily illegal in my mind, but single surrogate was.
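
A quick way to see the distinction, as a sketch: Character.isValidCodePoint only checks the U+0000 to U+10FFFF bound, so it accepts a lone surrogate value, and the surrogate block has to be excluded separately. CodePointRange, isScalarValue and countScalarValues are names invented here.

```java
// Sketch only: range bound versus code points that can stand alone.
class CodePointRange {
    /** True if cp is inside the Unicode range and outside the surrogate block. */
    static boolean isScalarValue(int cp) {
        return Character.isValidCodePoint(cp)
                && (cp < Character.MIN_SURROGATE || cp > Character.MAX_SURROGATE);
    }

    /** Counts every value in the range that is not a surrogate. */
    static int countScalarValues() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isScalarValue(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(Character.isValidCodePoint(0xD800)); // true: in range
        System.out.println(isScalarValue(0xD800));              // false: surrogate
        System.out.println(Integer.toHexString(countScalarValues()));
    }
}
```

If I have counted right, the last line prints 10f800, i.e. 0x110000 values in the range minus the 0x800 surrogates. Unassigned code points are still included in that count.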

Roedy Green · Dec 23, 2009

IIRC, C99 introduced \uXXXX and \UXXXXXXXX.

It would make sense to follow suit. Life is complicated enough already
for people who code in more than one language each day.

Owen Jacobson · Dec 24, 2009

If you had only a few, you could create a library of named constants for
them, and glue them together with compile-time concatenation. With
only a little cleverness, a compiler would avoid embedding constants
it did not use.

Is any OS, JVM, utility, browser etc. capable of rendering a code
point above 0xffff? I get the impression all we can do is embed them
in UTF-8 files.

OS X comes with fonts that contain glyphs for some (but not all)
characters above U+FFFF out of the box, and can render them anywhere
they appear. Their visibility in Swing apps depends heavily on the L&F;
if you don't force it, Java will default to the Aqua L&F and render
most things correctly.

Webapps, obviously, render nothing; they send encoded characters to
other things, which may render them. Safari, Chrome, and Firefox can
all render U+1D360 (COUNTING ROD UNIT DIGIT ONE).

In the interests of science, what characters do you see on the next line?

𐄀 𐅀 𐆐 𐌀 𐐀 𐑐 𝄡

This message is encoded as UTF-8, and those should be, in order,

Codepoint (UTF-8 representation) NAME
U+10100 (F0 90 84 80) AEGEAN WORD SEPARATOR LINE
U+10140 (F0 90 85 80) GREEK ACROPHONIC ATTIC ONE QUARTER
U+10190 (F0 90 86 90) ROMAN SEXTANS SIGN
U+10300 (F0 90 8C 80) OLD ITALIC LETTER A
U+10400 (F0 90 90 80) DESERET CAPITAL LETTER LONG I
U+10450 (F0 90 91 90) SHAVIAN LETTER PEEP
U+1D121 (F0 9D 84 A1) MUSICAL SYMBOL C CLEF

with spaces between.
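
The byte sequences in that table can be reproduced from Java with a few lines. This is only a sketch; Utf8Table and utf8Hex are hypothetical names, while Character.toChars and String.getBytes are standard JDK calls.

```java
// Sketch only: prints the UTF-8 bytes for each code point in the table.
class Utf8Table {
    /** Returns the UTF-8 encoding of one code point as space-separated hex. */
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(java.nio.charset.StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = {0x10100, 0x10140, 0x10190, 0x10300, 0x10400, 0x10450, 0x1D121};
        for (int cp : codePoints) {
            System.out.printf("U+%04X (%s)%n", cp, utf8Hex(cp));
        }
    }
}
```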

Cheers,
-o

neuneudr · Dec 24, 2009

Owen Jacobson said:
In the interests of science, what characters do you see on the next line?

𐄀 𐅀 𐆐 𐌀 𐐀 𐑐 𝄡

Debian Lenny / browser Iceweasel 3.0.6 (Firefox re-branded for true
freedom).

I see boxes with tiny hex codes in them, not corresponding to the
characters.

But then I can select them and paste them into an xterm, where I see only
'? ? ? ? ?'
thingies. Yet the file I pasted them into from the terminal (using cat >
aa.txt)
contains the correct characters, as shown by a hexdump:

$ hexdump aa.txt
0000000 90f0 8084 f020 8590 2080 90f0 9086 f020
0000010 8c90 2080 90f0 8090 f020 9190 2090 9df0
0000020 a184 000a
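
In case the dump looks scrambled: plain hexdump prints 16-bit words, and on a little-endian machine each pair of bytes comes out swapped, so F0 90 shows as 90f0. A rough Java sketch of that grouping follows; HexdumpWords and words are invented names, and the zero padding of an odd final byte is my reading of the last line of the dump.

```java
// Sketch only: formats bytes the way default hexdump shows them on little-endian hosts.
class HexdumpWords {
    static String words(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i += 2) {
            int lo = data[i] & 0xFF;
            // An odd trailing byte is paired with zero.
            int hi = (i + 1 < data.length) ? data[i + 1] & 0xFF : 0;
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x%02x", hi, lo)); // high byte printed first
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] first = {(byte) 0xF0, (byte) 0x90, (byte) 0x84, (byte) 0x80};
        byte[] tail = {(byte) 0x84, (byte) 0xA1, 0x0A};
        System.out.println(words(first)); // 90f0 8084
        System.out.println(words(tail));  // a184 000a
    }
}
```

Read with the bytes swapped back, the dump is F0 90 84 80 20 F0 90 85 80 and so on, which matches the UTF-8 sequences Owen listed.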

neuneudr · Dec 24, 2009

...
...(ASCII works everywhere...

This

Here we've got a mix of Windows, Linux and OS X
devs, so we use scripts called at (Ant) build time that
enforce that all .java files:

a) use a subset of ASCII in their names
b) contain only ASCII characters

You can't build an app with non-ASCII characters in our
.java files, and you certainly can't commit them.

It's in the guidelines.

Better safe than sorry
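
For illustration only, a check of this kind could be as small as the sketch below. This is not our actual build script; AsciiCheck and firstNonAscii are made-up names, and it only covers rule b), the file contents.

```java
// Sketch only: fails if any file named on the command line has a non-ASCII byte.
class AsciiCheck {
    /** Returns the offset of the first non-ASCII byte, or -1 if there is none. */
    static int firstNonAscii(byte[] content) {
        for (int i = 0; i < content.length; i++) {
            if (content[i] < 0) { // bytes 0x80..0xFF are negative in Java
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws java.io.IOException {
        for (String name : args) {
            byte[] content = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(name));
            int at = firstNonAscii(content);
            if (at >= 0) {
                System.err.println(name + ": non-ASCII byte at offset " + at);
                System.exit(1); // non-zero exit lets the build fail
            }
        }
    }
}
```

Anything that really needs a character outside ASCII then has to go in as a backslash-u escape, which is where the surrogate pairs discussed earlier in the thread come in.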

Tom Anderson · Dec 24, 2009

In the interests of science, what characters do you see on the next line?

? ? ? ? ? ? ?

Seven question marks.

Using Alpine 1.10 on Debian 5.0.3 accessed over OpenSSH 5.1p1 from iTerm
0.10 on OS X 10.4.11. Plus a few more layers i've forgotten, probably.
Easily enough for one of them to drop the unicode ball somewhere!

tom

markspace · Dec 24, 2009

Owen said:
In the interests of science, what characters do you see on the next line?

𐄀 𐅀 𐆐 𐌀 𐐀 𐑐 𝄡

6 question marks and a [1/4].

I bet this has more to do with the news server we're each using than our
client's OS or newsreader. Vista/Thunderbird here.

Roedy Green · Dec 28, 2009

In the interests of science, what characters do you see on the next line?

? ? ? ? ? ? ?

Using Agent with Windows 7 64 bit I just see ? marks.
