Slightly tricky string problem

  • Thread starter Dirk Bruere at NeoPax

Mike Schilling

Dirk said:
... which I'm having trouble getting my head around.

I have a String which is a single character, e.g. "a"
I need to convert it to a String which is the decimal representation
of the UTF8 ascii code, i.e. "97"

If you know it's a single ASCII character,

String s_a = "a";
String s_b = Integer.toString((int)s_a.charAt(0));

This could be generalized to non-ASCII characters or multiple
characters, if I knew what the desired result was.
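One possible generalization (my sketch, not Mike's exact suggestion, and assuming the desired result is the decimal code point of the first character): using codePointAt instead of charAt also survives characters outside the BMP.

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // codePointAt handles ASCII, the rest of the BMP, and
        // supplementary characters stored as surrogate pairs.
        String s = "a";
        System.out.println(Integer.toString(s.codePointAt(0))); // prints "97"
    }
}
```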
 

Dirk Bruere at NeoPax


Mayeul

Mark said:
I don't think this actually gives UTF-8, just Java's internal Unicode,
whatever that happens to be.

As indicated in java.lang.Character javadoc, a char value represents a
Unicode code point in the BMP.

So, for characters in the BMP, Java's Unicode is just plain expected
Unicode.

As for characters outside the BMP, you would need two Java chars to
represent them, in a UTF-16 way.

Conclusion: as long as we're speaking ASCII, the given method works.
Outside ASCII but still in the BMP, the given method will produce the
character's code point.
But one might wonder what "UTF-8 ascii code" is, and what to do with
non-ASCII characters, as they would be represented in more than one byte
in UTF-8.
 

charlesbos73

Dirk said:
... which I'm having trouble getting my head around.

I have a String which is a single character, e.g. "a"
I need to convert it to a String which is the decimal representation of
the UTF8 ascii code, i.e. "97"

What did you do to try to solve your problem?

As Mayeul pointed out, "UTF8 ascii code" [sic] doesn't mean anything.

ASCII is a code defining 128 entities, each usually represented
in 8 bits with the most significant bit set to 0. But in any
case "ASCII the character set" should not be mistaken for "ASCII
the encoding".

Same for Unicode.

Unicode defines many more entities (called code points).
The first 128 Unicode code points are the 128 ASCII entities.

UTF-8 is an encoding designed so that any byte with the most
significant bit set to 0 represents an ASCII entity.

So a UTF-8 encoded file containing only ASCII characters is
byte-for-byte identical to an ASCII encoded file.

But in your case, if you have a String [sic] you shouldn't
care about encoding details at all: whether it's UTF-8 or little
faeries wearing boots drawing your characters with magical powder
makes no difference.

Things get messy quickly in Java because when Java was created,
Unicode didn't define code points outside the BMP. So we end
up with a backward-compatible charAt(..) method that is broken
beyond repair, because it definitely does NOT give back the
character at index x when the String contains characters
outside the BMP.

All hope is not lost, that said, for we now have the codePointAt(..)
method, which works correctly for code points outside the BMP, as
shown in the example below:

@Test public void tests() {
    assertEquals( Integer.toString("\u0000".codePointAt(0)), "0" );
    // Java offers no easy way to source-code-encode, say, U+1040B (dec 66571)
    assertEquals( Integer.toString("\uD801\uDC0B".codePointAt(0)),
                  "66571" ); // 0x1040B (hex), 66571 (dec)
    assertEquals( Integer.toString("a".codePointAt(0)), "97" );
}

If you're curious as to how to do what Integer.toString(..) does
you can look at the source code for the Integer class.

Note that Integer.toString(int) works as expected on
entities outside the BMP:

Integer.toString("\uD801\uDC0B".codePointAt(0))

gives back the expected "66571" string.
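As a side note (my addition, not part of the original post), Character.toChars can build that surrogate pair from the code point, instead of spelling out \uD801\uDC0B by hand:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // Build the String for U+1040B from its code point rather than
        // hand-encoding the UTF-16 surrogate pair \uD801\uDC0B.
        String s = new String(Character.toChars(0x1040B));
        System.out.println(s.equals("\uD801\uDC0B"));           // prints "true"
        System.out.println(Integer.toString(s.codePointAt(0))); // prints "66571"
    }
}
```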

By now you can expect the "JLS-nazi bot" (it shall recognize
itself) to nitpick on grammatical mistakes and claim loudly
that Java is perfect and that having both a (broken) charAt(..)
method and codePointAt(..) is not a problem at all.

But as usual the "JLS-nazi bot"'s deranged ramblings shall be
sent to /dev/null without any consideration.
 

charlesbos73

As indicated in java.lang.Character javadoc, a char value represents a
Unicode code point in the BMP.

So, for characters in the BMP, Java's Unicode is just plain expected
Unicode.

As for characters outside the BMP, you would need two Java chars to
represent them, in a UTF-16 way.

Conclusion: as long as we're speaking ASCII, the given method works.
Outside ASCII but still in the BMP, the given method will produce the
character's code point.

I wholeheartedly agree with your post.

Minor remark: it only happens to produce the character's code point
in the BMP because it takes the first character of the string.
With charAt(1), or any index other than 0, it's not guaranteed
to work even in the BMP (because if, say, the first character
of the string is outside the BMP, charAt is broken).

But by simply replacing charAt with codePointAt, the method will
produce the character's code point even if it's outside the BMP.
(Note that codePointAt still takes a char index, so for a
'character' other than the first you'd first translate the
code-point index into a char index with offsetByCodePoints.)
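A short sketch of that edge case (with a test string of my own choosing): codePointAt takes a char index, not a code-point index, so offsetByCodePoints is needed to reach the nth code point of a string that may contain non-BMP characters.

```java
public class NthCodePointDemo {
    public static void main(String[] args) {
        // U+1040B (a surrogate pair, i.e. two chars) followed by 'a'.
        String s = new String(Character.toChars(0x1040B)) + "a";
        // Translate code-point index 1 into the matching char index (2 here).
        int charIndex = s.offsetByCodePoints(0, 1);
        System.out.println(s.codePointAt(charIndex)); // prints "97"
        // Passing the code-point index directly lands on the low surrogate:
        System.out.println(s.codePointAt(1));         // prints "56331" (0xDC0B)
    }
}
```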

But one might wonder what "UTF-8 ascii code" is, and what to do with
non-ASCII characters, as they would be represented in more than one byte
in UTF-8.

Exactly.
 

Mike Schilling

Mark said:
I don't think this actually gives UTF-8, just Java's internal
Unicode,
whatever that happens to be.

It's the same for ASCII characters (<=127), which is why I said, in
the part you clipped, that this works only for them.
 

Mayeul

Mark said:
I think the OP wants UTF-8, not the UTF-16 code point. I'm assuming his
request for "ASCII" was a misstatement. Outside of the first 127
characters, charAt(int) won't yield UTF-8.

If I had to guess I'd think the OP is confusing UTF-8 with Unicode, and
describes "the number associated with a character" when saying "UTF-8
ascii code". But that is a guess.

I would rather point out the fact that we don't actually know what the
OP meant.

I also wanted to point out that "whatever Java's internal Unicode is" is
actually plainly expected Unicode in the BMP. Therefore Mike's
suggestion *might* have been correct. Yours too, only the OP could
possibly know.
 

Mark Space

Mayeul said:
I also wanted to point out that "whatever Java's internal Unicode is" is
actually plainly expected Unicode in the BMP. Therefore Mike's


Well, I think Java's internal encoding used to be UCS-2 but is now
UTF-16. The two are different, and depending on exactly which JVM you
have, the encoding might be neither, I suppose. Just pointing out that
you really can't rely 100% on those internal codes.

This doesn't affect the first 128 code points of course, but I think
charAt(int) is too brittle unless you're certain of the source. Given
that the OP mixed the term "UTF-8" in there, I'd rather show him the
most robust method.
 

Dirk Bruere at NeoPax

Mayeul said:
If I had to guess I'd think the OP is confusing UTF-8 with Unicode, and
describes "the number associated with a character" when saying "UTF-8
ascii code". But that is a guess.

I would rather point out the fact that we don't actually know what the
OP meant.

What I meant is the values listed here
http://www.asciitable.com/

--
Dirk

http://www.transcendence.me.uk/ - Transcendence UK
http://www.theconsensus.org/ - A UK political party
http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff
 

Dirk Bruere at NeoPax

charlesbos73 said:
... which I'm having trouble getting my head around.

I have a String which is a single character, e.g. "a"
I need to convert it to a String which is the decimal representation of
the UTF8 ascii code, i.e. "97"

What did you do to try to solve your problem?

As Mayeul pointed out, "UTF8 ascii code" [sic] doesn't mean anything.

ASCII is a code defining 128 entities, which are usually represented
each on 8 bits, with the most significant bit set to 0. But in any
case "ASCII the characters" should not be mistaken with "ASCII the
encoding".

Same for Unicode.

Unicode defines much more entities (called codepoints).
The 128 first Unicode entities are the 128 ASCII entities.

UTF-8 is an encoding that has been created so that any byte
with the most significant bit set to 0 is an ASCII entity.

So an UTF-8 encoded file containing only ASCII characters shall
be the same as an ASCII encoded file.

But in your case, if you have a String [sic] you shouldn't
care at all about encoding details: UTF-8 or little faeries
wearing boots drawing you characters using magical powder has
no importance.

It matters when I have a protocol that interfaces with a machine that
only accepts ASCII-encoded strings. So UTF-8 seemed a good starting point.
[snip]

Thanks.
Right now my problem is lack of full definition of the protocol, so I'll
have to return to this later.

 

Mike Schilling

Mark said:
Well, I think Java's internal encoding used to be UCS-2 but is now
UTF-16. The two are different, and depending on exactly which JVM you
have, the encoding might be neither, I suppose. Just pointing out that
you really can't rely 100% on those internal codes.

Older versions of Java didn't support surrogates; current ones do. (I
don't know where the dividing line is.) If a code point is in the
BMP, its Java "char" value didn't change between the two. If a code
point is outside the BMP, it couldn't be represented by those older
versions of Java. In neither case did a preexisting value change.
 

Mark Space

Dirk said:
What I meant is the values listed here
http://www.asciitable.com/


What happens if the string contains characters that are outside that range?

// needs: import java.util.Arrays; getBytes("UTF-8") declares UnsupportedEncodingException
String s = "\u0080";
System.out.println((int) s.charAt(0));                    // the UTF-16 code unit, as an int
System.out.println(Arrays.toString(s.getBytes("UTF-8"))); // the UTF-8 bytes


run:
128
[-62, -128]
BUILD SUCCESSFUL (total time: 0 seconds)
 

Dirk Bruere at NeoPax

Mark said:
Dirk said:
What I meant is the values listed here
http://www.asciitable.com/


What happens if the string contains characters that are outside that range?

[snip]

I don't know - that's tomorrow's problem :-(

 

Mark Space

Dirk said:
I don't know - that's tomorrow's problem :-(

It doesn't have to be a problem. getBytes() works as well as charAt()
for ASCII values, and can return proper UTF-8 for other values. And
it's just as easy to implement, IMO.
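A minimal sketch of that getBytes() route, assuming (a guess on my part, since the protocol isn't fully defined yet) that the machine wants one decimal number per ASCII byte:

```java
import java.io.UnsupportedEncodingException;

public class AsciiDecimalDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "abc";
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes("US-ASCII")) { // one byte per ASCII character
            if (out.length() > 0) out.append(' ');
            out.append(b); // a byte appends as its decimal value
        }
        System.out.println(out); // prints "97 98 99"
    }
}
```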
 

Mark Space

Dirk said:
It matters when I have a protocol that interfaces with a machine that only
accepts ASCII-encoded strings. So UTF-8 seemed a good starting point.
Right now my problem is lack of full definition of the protocol, so I'll
have to return to this later.


And reading this, I think I should point out that there are a lot more
character encodings available to getBytes() besides UTF-8.

getBytes("ASCII");

will, I believe, reject any characters that are out of range for ASCII,
though I don't recall exactly how (does it throw an error? replace the
character with a "?"? That's what the Javadocs are for). Either way,
that sounds safer if you really don't want to deal with non-ASCII values.
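For what it's worth, the Charset overload of String.getBytes is documented to substitute the charset's default replacement byte ('?' for US-ASCII) rather than throw; to actually reject out-of-range characters, a CharsetEncoder configured to report errors can be used. A sketch:

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictAsciiDemo {
    public static void main(String[] args) {
        Charset ascii = Charset.forName("US-ASCII");

        // The Charset overload of getBytes replaces the unmappable
        // character with '?' (byte 63) instead of throwing.
        byte[] lenient = "\u0080".getBytes(ascii);
        System.out.println(lenient[0]); // prints "63"

        // A CharsetEncoder set to REPORT throws instead of substituting.
        try {
            ascii.newEncoder()
                 .onUnmappableCharacter(CodingErrorAction.REPORT)
                 .encode(CharBuffer.wrap("\u0080"));
            System.out.println("accepted");
        } catch (CharacterCodingException e) {
            System.out.println("rejected"); // this branch runs
        }
    }
}
```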
 

Dirk Bruere at NeoPax

Mark said:
It doesn't have to be a problem. getBytes() works as well as charAt()
for ASCII values, and can return proper utf-8 for other values. And
it's just as easy to implement, imo.

I'll read all the replies again a bit later and get to understand it
properly. Meanwhile, I have to do a bit of tedious but straightforward
coding to "show progress". Another deadline approaching.

 
