E.g., if you want to have a String literal with U+10C22 (that's
OLD TURKIC LETTER ORKHON EM; it somewhat looks like a fish),
then you first convert 0x10C22 to a surrogate pair:
1. subtract 0x10000: you get 0xC22
2. get the upper (u) and lower (l) 10 bits; you get u=0x3 and l=0x022
(i.e. (u << 10) + l == 0xC22)
3. the high surrogate is 0xD800 + u, the low surrogate is 0xDC00 + l.
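Spelled out in code, the three steps look like this (a minimal sketch;
the class name is hypothetical, and Character.toChars, in the JDK since
1.5, does the same conversion and is used here only as a cross-check):

public class SurrogateDemo
    {
    public static void main( String[] args )
        {
        final int codePoint = 0x10c22;              // OLD TURKIC LETTER ORKHON EM
        final int extract = codePoint - 0x10000;    // step 1: 0xc22
        final int u = extract >>> 10;               // step 2: upper 10 bits = 0x3
        final int l = extract & 0x3ff;              //         lower 10 bits = 0x022
        final char high = ( char ) ( 0xd800 + u );  // step 3: 0xd803
        final char low = ( char ) ( 0xdc00 + l );   //         0xdc22
        final char[] pair = Character.toChars( codePoint );
        // prints: d803 dc22 / d803 dc22
        System.out.printf( "%04x %04x / %04x %04x%n",
                ( int ) high, ( int ) low, ( int ) pair[ 0 ], ( int ) pair[ 1 ] );
        }
    }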
That is what I was afraid of. I am doing that now to generate tables
of char entities and the equivalent hex and \u entities on various
pages of mindprod.com, e.g.
http://mindprod.com/jgloss/html5.html
which shows the new HTML entities in HTML 5.
Here is my code:
final int extract = theCharNumber - 0x10000;       // remove the 0x10000 offset
final int high = ( extract >>> 10 ) + 0xd800;      // top 10 bits -> high surrogate
final int low = ( extract & 0x3ff ) + 0xdc00;      // bottom 10 bits -> low surrogate
sb.append( "\"\\u" );                               // opening quote, then \u
sb.append( StringTools.toLzHexString( high, 4 ) );
sb.append( "\\u" );
sb.append( StringTools.toLzHexString( low, 4 ) );
sb.append( "\"" );                                  // closing quote
I started to think about what would be needed to make this less
onerous.
1. An applet to convert a hex code point to a surrogate pair.
2. Allow \u12345 in string literals. However, that would break
existing code: \u12345 currently means
"\u1234" + "5" (see the sketch after this list).
3. So you have to pick another letter, e.g. \c12345; for code point. It
needs a terminator, so that in future it could also handle \c123456;
I don't know what that might break.
4. Introduce 32-bit CodePoint string literals with an extensible \u
mechanism, e.g. CString b = c"\u12345;Hello";
5. Specify weird chars with named entities to make the code more
readable. Entities in String literals would be translated to binary
at compile time, so the entities would not exist at run-time. The
HTML 5 set would be greatly extended to give pretty well every Unicode
glyph a name.
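Here is the sketch promised in point 2 (a hypothetical demo class; it
shows why \u12345 cannot today mean a five-digit escape):

public class EscapeDemo
    {
    public static void main( String[] args )
        {
        // The compiler translates \u1234 before it parses the literal,
        // so this string is U+1234 followed by the digit 5: two chars.
        final String s = "\u12345";
        System.out.println( s.length() );                            // 2
        System.out.println( Integer.toHexString( s.charAt( 0 ) ) );  // 1234
        }
    }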
P.S. I have been poking around in HTML 5. W3C did an odd thing. They
REDEFINED the entities &lang; and &rang; to different glyphs from HTML
4. I don't think they have ever done anything like that before. I
hope it was just an error. I have written the W3C asking if they
really meant to do that.