how do I expand a unicode string to its visual UTF8 representation?

Andrew · Aug 6, 2009

Hello,

I have an example program below that contains weird Icelandic
characters, and a copyright symbol, just for good measure. The code
expresses these as UTF8. They print exactly as you would want/expect
them to. So far so good. But what I want is to be able to go the other
way. I want to take a unicode string and recreate the escape sequences
for the funny international characters. For example, the single
character E-acute should be expanded to \u00C9 (6 characters). Any
ideas on how to do this please?

public class UTF8Test {
    public UTF8Test() {
    }

    public String getString() {
        StringBuilder builder = new StringBuilder();
        builder.append("Copyright \u00A9 2009\n");
        builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
        builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
        return builder.toString();
    }

    public static void main(String[] args) {
        UTF8Test test = new UTF8Test();
        System.out.println(test.getString());
    }
}

FWIW, the reason I want to do this is I need to write strings like
this to a sybase table where the column is of type varchar. We cannot
make it univarchar (don't ask). So I need to be able to write unicode
characters without using unicode chars! I thought by having them in
this expanded form java can convert them just like the program above
does.

Regards,

Andrew Marlow

Knute Johnson · Aug 6, 2009

public class UTF8Test {
    public UTF8Test() {
    }

    public void doit() {
        StringBuilder builder = new StringBuilder();
        builder.append("Copyright \u00A9 2009\n");
        builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
        builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
        String str = builder.toString();

        System.out.println(str);

        // getBytes() with no charset argument uses the platform default encoding
        byte[] buf = str.getBytes();
        for (byte b : buf)
            System.out.printf("\\u%04x", b);
    }

    public static void main(String[] args) {
        UTF8Test test = new UTF8Test();
        test.doit();
    }
}

C:\Documents and Settings\Knute Johnson>java UTF8Test
Copyright ⌐ 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
╔g get eti≡ gler ßn ■ess a≡ mei≡a mig
\u0043\u006f\u0070\u0079\u0072\u0069\u0067\u0068\u0074\u0020\u00a9\u0020\u0032\u
0030\u0030\u0039\u000a\u0048\u0065\u0072\u0065\u0020\u0069\u0073\u0020\u0074\u00
68\u0065\u0020\u0070\u0068\u0072\u0061\u0073\u0065\u0020\u0028\u0069\u006e\u0020
\u0049\u0063\u0065\u006c\u0061\u006e\u0064\u0069\u0063\u0029\u003a\u0020\u0049\u
0020\u0063\u0061\u006e\u0020\u0065\u0061\u0074\u0020\u0067\u006c\u0061\u0073\u00
73\u0020\u0061\u006e\u0064\u0020\u0069\u0074\u0020\u0064\u006f\u0065\u0073\u006e
\u0027\u0074\u0020\u0068\u0075\u0072\u0074\u0020\u006d\u0065\u000a\u00c9\u0067\u
0020\u0067\u0065\u0074\u0020\u0065\u0074\u0069\u00f0\u0020\u0067\u006c\u0065\u00
72\u0020\u00e1\u006e\u0020\u00fe\u0065\u0073\u0073\u0020\u0061\u00f0\u0020\u006d
\u0065\u0069\u00f0\u0061\u0020\u006d\u0069\u0067

Andrew · Aug 6, 2009

Well, thanks for the quick reply, but that hasn't quite worked, has it?
All the chars have come out as \uxxxx. I want the ones that are 7 bit
ASCII to come out as the normal printable char, i.e. I want the output
of doit to be:

Copyright \u00A9 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt
me
\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig

Arne Vajhøj · Aug 6, 2009

The specific question asked can be solved with something like:

public static String encode(String s) {
    StringBuffer sb = new StringBuffer("");
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if ((c >= 0) && (c <= 127)) {
            // 7-bit ASCII passes through unchanged
            sb.append(c);
        } else {
            // anything else becomes a backslash-u escape, zero-padded to four hex digits
            String hex = Integer.toHexString(c);
            sb.append("\\u" + "0000".substring(hex.length(), 4) + hex);
        }
    }
    return sb.toString();
}

But decoding it will also require some work, because the
unescaping in your code is done at compile time, not at runtime.
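
To illustrate the runtime decoding step mentioned above, here is a minimal sketch. The class name `UnicodeUnescape` and the method `decode` are made up for this example. It assumes every backslash-u is followed by exactly four valid hex digits and makes no attempt to handle a literal backslash in the data.

```java
class UnicodeUnescape {
    // Hypothetical helper: turns each six-character backslash-u escape
    // back into a single char at runtime.
    static String decode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            // An escape needs six characters: backslash, 'u', four hex digits.
            if (c == '\\' && i + 6 <= s.length() && s.charAt(i + 1) == 'u') {
                int codeUnit = Integer.parseInt(s.substring(i + 2, i + 6), 16);
                sb.append((char) codeUnit);
                i += 6;
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }
}
```

Malformed input (for example, non-hex digits after the `u`) would make `Integer.parseInt` throw a `NumberFormatException`; a real decoder would need to decide how to treat that case.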

And 1 code point -> 6 bytes is not a very efficient encoding.

Assuming your VARCHAR supports 0-255 then you should be able
to store your UTF-8 bytes as ISO-8859-1.

A bit messy but more efficient space wise and less code.
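
A rough sketch of that idea follows. The names `Latin1Carrier`, `pack` and `unpack` are invented for illustration, and `StandardCharsets` requires Java 7 or later. It relies on the same assumption stated above: the column and the driver must pass all 256 byte values through unchanged, including the 0x80-0x9F range.

```java
import java.nio.charset.StandardCharsets;

class Latin1Carrier {
    // Encode to UTF-8, then reinterpret each byte as one ISO-8859-1 character (0-255).
    static String pack(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverse step: recover the raw bytes and decode them as UTF-8.
    static String unpack(String stored) {
        byte[] raw = stored.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

The packed form is not human-readable for non-ASCII text (E-acute becomes the two characters 0xC3 0x89), but ASCII text is stored unchanged.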

Alternatively you could look at Quoted Printable but that
will also have overhead.

Arne

Mayeul · Aug 6, 2009

You might want to read up on UTF-8, as something like \u00C9 has absolutely
nothing to do with UTF-8. It is the Java escape notation, which lets you
represent a character by its Unicode code point in hexadecimal.
Nothing to do with UTF-8. A lot to do with UTF-16, though.

As a side note, please be aware that Java Strings are sequences of Java
char values. Char values are unsigned and 16-bit, which is not enough to
hold characters with a Unicode code point above U+FFFF. Such characters
are therefore encoded as a combination of two Java chars, in the same
way UTF-16 works.
This won't impact what you're trying to do though, since UTF-16 uses
surrogate characters that are still non-ASCII for characters above
U+FFFF. Their correct escape sequence is the horrible \uAAAA\uBBBB, the
escape sequences of the surrogates. Not addressing the issue at all will
automagically produce the desired results.
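
The surrogate behaviour described above can be checked with a small sketch. The class `SurrogateEscapeDemo` is hypothetical, and U+1D11E (a musical symbol outside the 16-bit range) is just an arbitrarily chosen test character.

```java
class SurrogateEscapeDemo {
    // Escapes char by char, with no special handling of surrogates.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One code point above U+FFFF is stored as two chars, so two escapes come out.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());   // 2
        System.out.println(escape(clef));
    }
}
```

Because each surrogate is itself outside the ASCII range, the naive loop emits one escape per surrogate, which matches the form a Java source file would use.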

As for how to encode to or decode from such a format, I don't know of
any direct way, but Knute and Arne showed it should be rather
straightforward.

FWIW, the reason I want to do this is I need to write strings like
this to a sybase table where the column is of type varchar. We cannot
make it univarchar (don't ask). So I need to be able to write unicode
characters without using unicode chars!

I recommend you store them encoded in UTF-7 or quoted-printable, then.
This will be more efficient and more standard than what you're trying to
do, and libraries will do it for you.

I thought by having them in
this expanded form java can convert them just like the program above
does.

As far as I know, you were wrong when thinking that.

Knute Johnson · Aug 6, 2009

Andrew said:
> Well, thanks for the quick reply, but that hasn't quite worked has it?
All the chars have come out as \uxxxx. I want the ones that are 7 bit
ASCII to come out as the normal printable char, i.e I want the
output of doit to be:

Copyright \u00A9 2009 Here is the phrase (in Icelandic): I can eat
glass and it doesn't hurt me \u00C9g get eti\u00F0 gler \u00E1n
\u00FEess a\u00F0 mei\u00F0a mig

Well I figured since you had a fairly sophisticated question and
appeared to have some knowledge of Java that you could figure out how to
use the 'if' statement yourself. Oh and just so you don't complain that
I used lower case hex, I fixed that too.

C:\Documents and Settings\Knute Johnson>java UTF8Test
Copyright ⌐ 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
╔g get eti≡ gler ßn ■ess a≡ mei≡a mig
Copyright \u00A9 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig

public class UTF8Test {
    public UTF8Test() {
    }

    public void doit() {
        StringBuilder builder = new StringBuilder();
        builder.append("Copyright \u00A9 2009\n");
        builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
        builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
        String str = builder.toString();

        System.out.println(str);

        // getBytes() with no charset argument uses the platform default encoding
        byte[] buf = str.getBytes();
        for (byte b : buf) {
            // high bit clear: 7-bit ASCII, print as is; otherwise print an escape
            if ((b & 0x80) == 0)
                System.out.print(new String(new byte[] { b }));
            else
                System.out.printf("\\u%04X", b);
        }
    }

    public static void main(String[] args) {
        UTF8Test test = new UTF8Test();
        test.doit();
    }
}

Roedy Green · Aug 6, 2009

I want to take a unicode string and recreate the escape sequences
for the funny international characters. For example, the single
character E-acute should be expanded to \u00C9 (6 characters). Any
ideas on how to do this please?

Another way of formulating your question is: how do I take some
16-bit Unicode data in RAM and write it out in an 8-bit Icelandic
encoding or possibly UTF-8 encoding?

See http://mindprod.com/applet/file.html

See http://mindprod.com/jgloss/encoding.html
to find the name of the possible Icelandic encodings.

See http://mindprod.com/applet/encodingrecogniser.html
to help you figure out which Icelandic encoding your sample is using.

P.S. none of these codes is "visual". Turning these codes to glyphs is
the job of the font. See
http://mindprod.com/jgloss/font.html
--
Roedy Green Canadian Mind Products
http://mindprod.com

"Let us pray it is not so, or if it is, that it will not become widely known."
~ Wife of the Bishop of Exeter on hearing of Darwin's theory of the common descent of humans and apes.

Andrew · Aug 6, 2009

You might want to read up on UTF-8, as something like \u00C9 has absolutely
nothing to do with UTF-8. It is the Java escape notation, which lets you
represent a character by its Unicode code point in hexadecimal.
Nothing to do with UTF-8. A lot to do with UTF-16, though.

Yes, ahem, you're right.

As for how to encode to or decode from such a format, I don't know of
any direct way, but Knute and Arne showed it should be rather
straightforward.

I am not sure about those solutions. Don't I need to convert the
internal representation to something specific first, like UTF8? Or is
there a formal definition of the internal representation where no
explicit encoding is given?

I recommend you store them encoded in UTF-7 or quoted-printable, then.
This will be more efficient and more standard than what you're trying to
do, and libraries will do it for you.

If I store the data in a varchar as this:

Copyright \u00A9 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt
me
\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig

then Java will do the work of conversion for me automatically.
That's why I need to move in the other direction first.

As far as I know, you were wrong when thinking that.

I think I am right. When the \uxxxx strings are in a file and I read
them in, printing gives the correct result. Therefore reading from a
varchar should also give the correct result.

Andrew · Aug 6, 2009

Another way of formulating your question is: how do I take some
16-bit Unicode data in RAM and write it out in an 8-bit Icelandic
encoding or possibly UTF-8 encoding?

No, that is not my question. Icelandic was just an example. The point
is the data contains international characters. I don't know what
language the text will be in and I don't care. I just need to be able
to write it to the database without losing information but I cannot
make the column univarchar (for reasons I won't go into here).

See http://mindprod.com/applet/encodingrecogniser.html
to help you figure out which Icelandic encoding your sample is using.

This is not the problem (but I appreciate the thought though....).

P.S. none of these codes is "visual". Turning these codes to glyphs is
the job of the font. See http://mindprod.com/jgloss/font.html

By visual I meant NOT binary. I.e. I do not want to get to the raw bit
pattern that represents E-acute, I want the single char that is
E-acute to be mapped to 6 bytes of the form \uxxxx that is the
equivalent.

-Andrew M.

Andrew · Aug 6, 2009

I do appreciate you trying to help but I'm afraid that code does not
do the job. When I run it, this is what I get:

Copyright \u00C2\u00A9 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt
me
\u00C3\u0089g get eti\u00C3\u00B0 gler \u00C3\u00A1n \u00C3\u00BEess a
\u00C3\u00

For example, the copyright symbol comes out as 00C2 when I expect
00A9. The E-acute comes out as 00C3 where I expect 00C9.

-Andrew Marlow

Roedy Green · Aug 6, 2009

By visual I meant NOT binary. I.e. I do not want to get to the raw bit
pattern that represents E-acute, I want the single char that is
E-acute to be mapped to 6 bytes of the form \uxxxx that is the
equivalent.

If you store international characters in a database, you can do any of
the following

1. hand the database 16 bit Unicode and leave it up to it to convert
them to some compact form.

2. hand the database UTF-8. Tell the database you are giving it UTF-8
or raw bytes.

3. hand the database some other national encoding. Tell the database
you are giving it that encoding or raw bytes.

The problem is data in files is not self-identifying. HTTP has headers
to let you know the encoding though.

The usual way to handle a mixture of languages is to store 16-bit
Unicode in the database.

You said a few things that suggest you may have missed some of the
basics about encodings. See http://mindprod.com/jgloss/encoding.html
to fill in the holes.

markspace · Aug 6, 2009

Andrew said:
No, that is not my question. Icelandic was just an example. The point
is the data contains international characters. I don't know what
language the text will be in and I don't care.

This is a big problem. If you don't know what the encoding is, you have
binary, not text. You have to decode the text into Java strings or you
aren't going to be able to do anything with them, really.

If you're just storing binary as a string (which you are), consider base
64 encoding. It's easy to do and will always work. You should be able
to find some source code to do this, it's not hard to roll your own either.
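
As a sketch of the base 64 suggestion: the JDK has shipped `java.util.Base64` since Java 8 (it was not available when this thread was written, so at the time a third-party or hand-rolled codec would have been needed). The class and method names `Base64Text`, `encode` and `decode` are invented here, and UTF-8 is assumed as the intermediate byte encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Text {
    // Text -> UTF-8 bytes -> pure-ASCII base 64 string that fits in a varchar.
    static String encode(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    // Reverse step: base 64 -> UTF-8 bytes -> text.
    static String decode(String stored) {
        return new String(Base64.getDecoder().decode(stored), StandardCharsets.UTF_8);
    }
}
```

The trade-off is that the stored value is no longer readable or searchable as text, and it grows by roughly a third compared with the UTF-8 byte count.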

If you must write your own \u encoder and decoder, don't forget that you
should probably encode the range from 0 to 31 as well as the range from
128 to 255. Plus you'll have to encode the \ char too, or reading
things back is going to be a pain.
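
A minimal sketch of an encoder that follows this advice is shown below. `SafeEscape` is a hypothetical name. Besides the ranges mentioned above, it also escapes DEL (127) and every char above 255, which is an extra choice made for this example.

```java
class SafeEscape {
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Escape control chars (0-31), everything from 127 upward,
            // and the backslash itself so that decoding stays unambiguous.
            if (c < 32 || c > 126 || c == '\\') {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Because a literal backslash is always written as an escape, every backslash in the stored text is guaranteed to start a six-character escape sequence, which keeps the matching decoder simple.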

I just need to be able
to write it to the database without losing information but I cannot
make the column univarchar (for reasons I won't go into here).

I don't know of any built in class that does this. You'll have to roll
your own, I think.

By visual I meant NOT binary. I.e. I do not want to get to the raw bit
pattern that represents E-acute, I want the single char that is
E-acute to be mapped to 6 bytes of the form \uxxxx that is the
equivalent.

You don't know "equivalent" unless you know what encoding you started
with, however.

Once you have the encoding, you can make a Java string, then do

byte[] binary = string.getBytes( "UTF-8" );

to encode the string into UTF-8 binary, but then you still have to store
the binary.

Just curious: what is driving the need for this "\u + UTF-8" encoding?
Is some other program reading the strings in this format? Or did you
just think it was a good idea and decide to encode these strings like
this on your own?

Knute Johnson · Aug 6, 2009

You saw it worked on my computer. So yours must be using a different
character set. You will have to adjust for that.
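
The difference between the two machines can be reproduced with a short sketch. `CharsetDemo` and `hex` are names made up for this example. It presumes the first machine defaulted to a single-byte Latin encoding and the second to UTF-8, which is consistent with the outputs posted but is not stated in the thread.

```java
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u00A9"; // the copyright sign, one char
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // A9
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // C2 A9
        // The char value itself does not depend on any byte encoding.
        System.out.println(Integer.toHexString(s.charAt(0)));             // a9
    }
}
```

Working on `char` values, or passing an explicit charset to `getBytes`, removes the dependence on the platform default.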

Tom Anderson · Aug 7, 2009

Alternatively you could look at Quoted Printable but that will also have
overhead.

Andrew, you should totally use quoted-printable (extended to 16- rather
than 8-bit values). Your unicode escape scheme is madness.

tom

Andrew · Aug 7, 2009

Andrew, you should totally use quoted-printable (extended to 16- rather
than 8-bit values). Your unicode escape scheme is madness.

tom

Er, why? I am only using the same escaping convention that java itself
uses. My example program shows the correct international text being
output when the java convention for escaping such characters is
employed.

Andrew · Aug 7, 2009

Knute Johnson said:
You saw it worked on my computer. So yours must be using a different
character set. You will have to adjust for that.

Indeed, this is what I suspected and this is part of my point.
Whatever solution I wind up with it needs to be platform-independent.

neuneudr · Aug 7, 2009

What Java (brokenly) uses internally to represent String
shouldn't concern you.

Java was conceived with Unicode 3.0 in mind, when there
were fewer than 65536 'codepoints'.

Remember that you're not *ever* forced to use the broken
'char' primitive, which does *not* represent a character
anymore since Unicode 3.1 came out.

Java 1.5's String codePointAt(int it) is the method that
correctly returns a character, and is commented as doing
just that in the (correct) Javadoc.

The (broken beyond repair) charAt(int i) method is only
there for backward compatibility and shall continue to
mislead programmers thinking it does actually return
a character. The Javadoc clearly states that it returns
a char.

I don't care if internally Java uses UCS-2 and broken
chars to represent Unicode strings or the color of
moonboots little faeries are wearing.

What is important is the abstraction the String class
is offering.

charAt is there for backward compatibility reason and
is as much broken as the char primitive (the whole concept
of primitives being disputable in an OO language anyway
btw).

codePointAt is the method to get characters.
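
For reference, iterating by code point rather than by char can look like the sketch below. `CodePointWalk` and `describe` are hypothetical names; `codePointAt` and `Character.charCount` are standard since Java 1.5.

```java
class CodePointWalk {
    // Lists every code point of the string in U+XXXX notation.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.append(String.format("U+%04X ", cp));
            // Advance by one char for the 16-bit range, two for a surrogate pair.
            i += Character.charCount(cp);
        }
        return sb.toString().trim();
    }
}
```

The index still counts chars, so the loop has to step by `Character.charCount(cp)` rather than by one.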

Now, if you want to have an ASCII Java source file containing
Unicode characters (for String or in comments), you'll have
to use the creative (but broken) \uXXXX escaping but this is
another Java weirdity that should not pollute
the DB you're using.

If you really need to escape your Unicode string in your DB
then at least don't pollute your DB with Java-specific
weirdities.

I 100% agree with Mayeul.

\uXXXX escaping has exactly *nothing* to do with UTF-8.

A Unicode character is a Unicode character and the broken
internal representation that Java uses to store Unicode
strings and the broken Java char primitive (and overall
broken primitive concept in an OO language) should be
of no concern to you.

The only thing that counts is the abstraction that the
String class is offering (dropping the broken methods
present for backward compatibility), not the internal
representation that the JVM is using.

Who's going to query that Sybase DB? Only your Java
app?

Are non-Java apps going to use that DB? How are they
going to deal with the escaping scheme you'll come up
with?

Reproducing in your DB the \uXXXX\uYYYY escaping is
IMHO definitely not the way to go.

Arne Vajhøj · Aug 7, 2009

Andrew said:
Er, why? I am only using the same escaping convention that java itself
uses.

Actually you are not.

You are doing runtime processing using that syntax.

Java uses that syntax at compile time.

That is a significant difference.
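
The compile-time versus runtime distinction can be seen in a tiny sketch (the class and field names are made up for illustration). In the first literal the compiler replaces the escape before the program ever runs; in the second, the doubled backslash keeps all six characters in the string, which is what text read from a file or a database column looks like to the program.

```java
class EscapeTimingDemo {
    // The compiler turns this escape into a single char.
    static final String COMPILED = "\u00C9";
    // Here the string really holds six chars: backslash, u, 0, 0, C, 9.
    static final String RUNTIME = "\\u00C9";

    public static void main(String[] args) {
        System.out.println(COMPILED.length()); // 1
        System.out.println(RUNTIME.length());  // 6
        System.out.println(COMPILED.equals(RUNTIME)); // false
    }
}
```

Nothing in the plain `String` API converts the second form into the first; that step has to be written explicitly or delegated to a library that documents such behaviour.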

Arne

Arne Vajhøj · Aug 7, 2009

\uXXXX escaping has exactly *nothing* to do with UTF-8.
Correct.

A Unicode character is a Unicode character and the broken
internal representation that Java uses to store Unicode
strings and the broken Java char primitive (and overall
broken primitive concept in an OO language) should be
of no concern to you.

Java uses the same concept as other widely used languages.

Are non-Java apps going to use that DB? How are they
going to deal with the escaping scheme you'll come up
with?

The exact same way Java would. Parse it.

If the other language is of C heritage, then the
code would almost be the same.

Arne

Arne Vajhøj · Aug 7, 2009

Tom said:
Andrew, you should totally use quoted-printable (extended to 16- rather
than 8-bit values).

I would suggest standard QP on UTF-8 encoding instead of a custom QP.
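
A deliberately simplified sketch of quoted-printable over UTF-8 bytes is given below. `SimpleQuotedPrintable` is a hypothetical class, and it leaves out parts of the full format (soft line breaks at 76 characters and the trailing-whitespace rules), so it illustrates the idea rather than providing a conforming encoder.

```java
import java.nio.charset.StandardCharsets;

class SimpleQuotedPrintable {
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            // Printable ASCII (except '=') and the space pass through unchanged.
            boolean literal = (v >= 33 && v <= 126 && v != '=') || v == ' ';
            if (literal) {
                sb.append((char) v);
            } else {
                // Everything else becomes '=' followed by two uppercase hex digits.
                sb.append(String.format("=%02X", v));
            }
        }
        return sb.toString();
    }
}
```

With this scheme a character such as E-acute costs six ASCII characters (two UTF-8 bytes at three characters each), the same as the backslash-u form, while ASCII text stays readable.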

Your unicode escape scheme is madness.

At least rather cumbersome.

Arne
