32-bit characters in Java string literals

Roedy Green

Let's say you wanted to include some 32-bit characters in Java String
literals.

I understand what the stream would look like in UTF-8 or an int[], but
what I am curious about is the cleanest way to create string literals
in a Java program containing such awkward characters.
 
Roedy Green

E.g., if you want to have a String literal with U+10C22 (that's
OLD TURKIC LETTER ORKHON EM; it somewhat looks like a fish),
then you first convert 0x10C22 to a surrogate pair:
1. subtract 0x10000: you get 0xC22
2. get the upper (u) and lower (l) 10 bits; you get u=0x3 and l=0x022
(i.e. (u << 10) + l == 0xC22)
3. the high surrogate is 0xD800 + u, the low surrogate is 0xDC00 + l.

That is what I was afraid of. I am doing that now to generate tables
of char entities and the equivalent hex and \u entities on various
pages of mindprod.com, e.g. http://mindprod.com/jgloss/html5.html
which shows the new HTML entities in HTML 5.

Here is my code:

final int extract = theCharNumber - 0x10000;
final int high = ( extract >>> 10 ) + 0xd800;
final int low = ( extract & 0x3ff ) + 0xdc00;
sb.append( "&quot;\\u" );
sb.append( StringTools.toLzHexString( high, 4 ) );
sb.append( "\\u" );
sb.append( StringTools.toLzHexString( low, 4 ) );
sb.append( "&quot;" );
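For comparison, the same arithmetic can be left to the JDK. This is only a minimal sketch, assuming Java 7 or later for Character.highSurrogate and Character.lowSurrogate. The class name SurrogateEscape and the method toEscape are made up for illustration, and String.format stands in for StringTools.toLzHexString.

```java
class SurrogateEscape {
    /** Returns the Java source escape (one or two 4-digit escapes) for a code point. */
    static String toEscape(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        if (Character.isBmpCodePoint(codePoint)) {
            // Fits in one 16-bit char, so a single escape is enough.
            return String.format("\\u%04x", codePoint);
        }
        // Same result as subtracting 0x10000 and splitting into two 10-bit halves.
        char high = Character.highSurrogate(codePoint);
        char low = Character.lowSurrogate(codePoint);
        return String.format("\\u%04x\\u%04x", (int) high, (int) low);
    }

    public static void main(String[] args) {
        System.out.println(toEscape(0x10C22)); // prints \ud803\udc22
    }
}
```

The doubled backslash matters: a single backslash followed by u would be treated as a Unicode escape by the compiler before the string literal is even parsed.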


I started to think about what would be needed to make this less
onerous.

1. an applet to convert hex to a surrogate pair.

2. allow \u12345 in string literals. However that would break
existing code. \u12345 currently means
"\u1234" + "5".

3. So you have to pick another letter: e.g. \c12345; for codepoint. It
needs a terminator, so that in future it could also handle \c123456;
I don't know what that might break.

4. Introduce 32-bit CodePoint string literals with extensible \u
mechanism. E.g. CString b = c"\u12345;Hello";

5. specify weird chars with named entities to make the code more
readable. Entities in String literals would be translated to binary
at compile time, so the entities would not exist at run-time. The
HTML 5 set would be greatly extended to give pretty well every Unicode
glyph a name.
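Until the language grows something like that, proposal 3 can only be approximated at run time. The following is a rough sketch, not a real Java feature: the \c...; syntax is the hypothetical one from the list above, and the class CodePointEscapes and its expand method are invented for the example. It also shows the current four-digit rule from item 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodePointEscapes {
    // Hypothetical syntax from proposal 3: backslash, c, 1-6 hex digits, semicolon.
    private static final Pattern ESCAPE = Pattern.compile("\\\\c([0-9A-Fa-f]{1,6});");

    /** Expands each hypothetical escape into the code point it names. */
    static String expand(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(s, last, m.start());
            // appendCodePoint throws IllegalArgumentException above U+10FFFF.
            sb.appendCodePoint(Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        return sb.append(s, last, s.length()).toString();
    }

    public static void main(String[] args) {
        // Today's rule: exactly four hex digits, so this is U+1234 followed by '5'.
        System.out.println("\u12345".length()); // prints 2
        String s = expand("\\c10C22;Hello");
        System.out.println(s.length()); // prints 7
        System.out.println(s.codePointCount(0, s.length())); // prints 6
    }
}
```

Unlike a compile-time escape, this costs a scan of the string at run time and the escapes are not checked by the compiler.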

P.S. I have been poking around in HTML 5. W3C did an odd thing. They
REDEFINED the entities &lang; and &rang; to different glyphs from HTML
4. I don't think they have ever done anything like that before. I
hope it was just an error. I have written the W3C asking if they
really meant to do that.
 
Roedy Green

I started to think about what would be needed to make this less
onerous.

If you had only a few, you could create a library of named constants for
them, and glue them together with compile time concatenation. With
only a little cleverness, a compiler would avoid embedding constants
it did not use.
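A sketch of what such a library might look like, with invented names (Glyphs, GlyphDemo, and the constant names are all assumptions). Because the fields are static final Strings initialised from literals, they are compile-time constants, so javac copies the value into each class that refers to it and folds the concatenation into a single literal.

```java
final class Glyphs {
    private Glyphs() {}

    // Each constant is the surrogate pair for one supplementary code point.
    static final String ORKHON_EM = "\uD803\uDC22";        // U+10C22
    static final String COUNTING_ROD_ONE = "\uD834\uDF60"; // U+1D360
    static final String C_CLEF = "\uD834\uDD21";           // U+1D121
}

class GlyphDemo {
    // Constant expression, folded to one string literal at compile time.
    static final String LABEL = "clef: " + Glyphs.C_CLEF;

    public static void main(String[] args) {
        // "clef: " is six chars long, so the pair starts at index 6.
        System.out.println(LABEL.codePointAt(6) == 0x1D121); // prints true
    }
}
```

The Glyphs class file itself still carries every constant, but a class that uses only C_CLEF gets only that one value in its own constant pool.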


Is any OS, JVM, utility, browser etc. capable of rendering a code
point above 0xffff? I get the impression all we can do is embed them
in UTF-8 files.
 
Andreas Leitgeb

Thomas Pornin said:
The Unicode standard was originally designed as a fixed-width 16-bit
character encoding. It has since been changed to allow for characters
whose representation requires more than 16 bits. The range of legal
code points is now U+0000 to U+10FFFF

I have problems understanding why the surrogate code points are counted
twice: once as their code points isolated and then again as the code-points
that are reached by an adjacent pair of them.

In my understanding that would make 0x10F800 really legal codepoints, as
the surrogates wouldn't be legal as single code points, but only as pairs.

But then again, perhaps my own understanding of "legal code points" just
differs from some common definition.
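The count itself is easy to check against the constants in java.lang.Character. A small sketch (the class name CodePointCount is made up): the full range holds 0x110000 values, the surrogate block holds 0x800, which leaves 0x10F800 values, i.e. enough to number from 0 through 0x10F7FF.

```java
class CodePointCount {
    /** Number of values in U+0000..U+10FFFF, surrogates included. */
    static int total() {
        return Character.MAX_CODE_POINT - Character.MIN_CODE_POINT + 1;
    }

    /** Number of surrogate values, U+D800..U+DFFF. */
    static int surrogates() {
        return Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1;
    }

    public static void main(String[] args) {
        System.out.println(total());                                     // prints 1114112
        System.out.println(surrogates());                                // prints 2048
        System.out.println(Integer.toHexString(total() - surrogates())); // prints 10f800
    }
}
```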
 
Mayeul

Andreas said:
I have problems understanding why the surrogate code points are counted
twice: once as their code points isolated and then again as the code-points
that are reached by an adjacent pair of them.

It makes defining UTF-16 easy and less error-prone.

Yet I guess the range of legal codepoints is still U+0000 to
U+10FFFF, excluding the surrogates range in the middle.
 
Tom Anderson

I have problems understanding why the surrogate code points are counted
twice: once as their code points isolated and then again as the code-points
that are reached by an adjacent pair of them.

The range is a bound - all legal code points are inside it. It doesn't
mean that all numbers inside it are legal code points. There are plenty of
numbers which aren't mapped to any character, and so aren't legal code
points - the surrogates are just a special case of those. I reckon.

tom
 
Andreas Leitgeb

Tom Anderson said:
The range is a bound - all legal code points are inside it. It doesn't
mean that all numbers inside it are legal code points. There are plenty of
numbers which aren't mapped to any character, and so aren't legal code
points - the surrogates are just a special case of those. I reckon.

Thanks, that was my catch: I somehow mistakenly took "range" as implying
"all in the range" - and a codepoint with no char mapped to it wasn't
necessarily illegal in my mind, but single surrogate was.
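For what it is worth, Java's own API takes the "range is only a bound" view. A small sketch (class and method names are invented): Character.isValidCodePoint checks nothing but the range, and String.codePointAt hands back an unpaired surrogate as if it were a code point of its own.

```java
class LoneSurrogate {
    /** True for any value in U+0000..U+10FFFF; surrogates are not excluded. */
    static boolean inRange(int cp) {
        return Character.isValidCodePoint(cp);
    }

    /** True for values in the surrogate block U+D800..U+DFFF. */
    static boolean isSurrogateValue(int cp) {
        return cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
    }

    public static void main(String[] args) {
        System.out.println(inRange(0xD803));          // prints true
        System.out.println(isSurrogateValue(0xD803)); // prints true
        System.out.println(inRange(0x110000));        // prints false

        // A string holding a single, unpaired high surrogate.
        String lone = String.valueOf((char) 0xD803);
        System.out.println(lone.codePointAt(0) == 0xD803);    // prints true

        // A properly paired high and low surrogate decode to one code point.
        String paired = new String(Character.toChars(0x10C22));
        System.out.println(paired.codePointAt(0) == 0x10C22); // prints true
    }
}
```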
 
Roedy Green

IIRC, C99 introduced \uXXXX and \UXXXXXXXX.

It would make sense to follow suit. Life is complicated enough already
for people who code in more than one language each day.
 
Owen Jacobson

If you had only a few, you could create a library of named constants for
them, and glue them together with compile time concatenation. With
only a little cleverness, a compiler would avoid embedding constants
it did not use.


Is any OS, JVM, utility, browser etc. capable of rendering a code
point above 0xffff? I get the impression all we can do is embed them
in UTF-8 files.

OS X comes with fonts that contain glyphs for some (but not all)
characters above U+FFFF out of the box, and can render them anywhere
they appear. Their visibility in Swing apps depends heavily on the L&F;
if you don't force it, Java will default to the Aqua L&F and render
most things correctly.

Webapps, obviously, render nothing; they send encoded characters to
other things, which may render them. Safari, Chrome, and Firefox can
all render U+1D360 (COUNTING ROD UNIT DIGIT ONE).

In the interests of science, what characters do you see on the next line?

𐄀 𐅀 𐆐 𐌀 𐐀 𐑐 𝄡

This message is encoded as UTF-8, and those should be, in order,

Codepoint (UTF-8 representation) NAME
U+10100 (F0 90 84 80) AEGEAN WORD SEPARATOR LINE
U+10140 (F0 90 85 80) GREEK ACROPHONIC ATTIC ONE QUARTER
U+10190 (F0 90 86 90) ROMAN SEXTANS SIGN
U+10300 (F0 90 8C 80) OLD ITALIC LETTER A
U+10400 (F0 90 90 80) DESERET CAPITAL LETTER LONG I
U+10450 (F0 90 91 90) SHAVIAN LETTER PEEP
U+1D121 (F0 9D 84 A1) MUSICAL SYMBOL C CLEF

with spaces between.
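Anyone who wants to double-check the byte columns can regenerate them from the code points. A minimal sketch, assuming Java 7 or later for StandardCharsets; the class name Utf8Bytes and the method utf8Hex are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    /** Returns the UTF-8 encoding of one code point as space-separated hex bytes. */
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to an unsigned value
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = {0x10100, 0x10140, 0x10190, 0x10300, 0x10400, 0x10450, 0x1D121};
        for (int cp : codePoints) {
            System.out.printf("U+%X (%s)%n", cp, utf8Hex(cp));
        }
    }
}
```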

Cheers,
-o
 
neuneudr

In the interests of science, what characters do you see on the next line?

𐄀 𐅀 𐆐 𐌀 𐐀 𐑐 𝄡

Debian Lenny / browser Iceweasel 3.0.6 (Firefox re-branded for true
freedom ;)
I see boxes with tiny hexcode in them not corresponding to the
characters.

But then I can select them and paste them into an xterm, where I see only
'? ? ? ? ?'
thingies, but the file I pasted them into from the terminal (using cat >
aa.txt)
contains the correct characters, as shown by a hexdump:

$ hexdump aa.txt
0000000 90f0 8084 f020 8590 2080 90f0 9086 f020
0000010 8c90 2080 90f0 8090 f020 9190 2090 9df0
0000020 a184 000a

:)
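The dump is a little hard to read because hexdump's default format prints 16-bit words, so on a little-endian machine each pair of bytes appears swapped. Here is a sketch that undoes the swap and decodes the result, assuming the machine was little-endian and the file was 35 bytes long (seven 4-byte characters, six spaces, one newline, with the final word zero-padded). The class HexdumpDecode and its methods are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class HexdumpDecode {
    /** Rebuilds the file bytes from little-endian 16-bit words and decodes them as UTF-8. */
    static String decode(String words, int byteCount) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String w : words.trim().split("\\s+")) {
            int v = Integer.parseInt(w, 16);
            out.write(v & 0xFF);         // the low byte comes first in the file
            out.write((v >>> 8) & 0xFF); // then the high byte
        }
        // byteCount drops the padding byte of the last word.
        return new String(out.toByteArray(), 0, byteCount, StandardCharsets.UTF_8);
    }

    /** Lists the code points in text, skipping spaces and the newline. */
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > ' ') {
                sb.append(String.format("U+%X ", cp));
            }
            i += Character.charCount(cp);
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String words = "90f0 8084 f020 8590 2080 90f0 9086 f020 "
                + "8c90 2080 90f0 8090 f020 9190 2090 9df0 "
                + "a184 000a";
        System.out.println(describe(decode(words, 35)));
        // prints U+10100 U+10140 U+10190 U+10300 U+10400 U+10450 U+1D121
    }
}
```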
 
neuneudr

...
...(ASCII works everywhere...

This

Here we've got a mix of Windows, Linux and OS X
devs so we're using scripts called at (Ant) build time that
enforce that all .java files:

a) use a subset of ASCII in their name
b) contain only ASCII characters

You can't build an app with non-ASCII characters in our
.java files and you certainly can't commit them :)

It's in the guidelines.

Better safe than sorry :)
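The content half of such a check is only a few lines. This is not the actual script described above, just a sketch of one way to do it in Java 7 or later; the class AsciiCheck and the method firstNonAscii are made-up names, and the files to check are assumed to arrive as command-line arguments.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class AsciiCheck {
    /** Returns the offset of the first non-ASCII byte, or -1 if every byte is ASCII. */
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] < 0) { // bytes 0x80..0xFF are negative in Java
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        boolean ok = true;
        for (String name : args) {
            int at = firstNonAscii(Files.readAllBytes(Paths.get(name)));
            if (at >= 0) {
                System.err.println(name + ": non-ASCII byte at offset " + at);
                ok = false;
            }
        }
        if (!ok) {
            System.exit(1); // a non-zero exit status lets the build fail
        }
    }
}
```

Characters outside ASCII then have to be written as Unicode escapes in the source, which is exactly where the surrogate-pair arithmetic comes in.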
 
Tom Anderson

In the interests of science, what characters do you see on the next line?

? ? ? ? ? ? ?

Seven question marks.

Using Alpine 1.10 on Debian 5.0.3 accessed over OpenSSH 5.1p1 from iTerm
0.10 on OS X 10.4.11. Plus a few more layers i've forgotten, probably.
Easily enough for one of them to drop the unicode ball somewhere!

tom
 
markspace

Owen said:
In the interests of science, what characters do you see on the next line?

𐄀 𐅀 𐆐 𐌀 𐐀 𐑐 𝄡

6 question marks and a [1/4].

I bet this has more to do with the news server we're each using than our
client's OS or newsreader. Vista/Thunderbird here.
 
