HELP: Unicode in Java 1.3.1 vs 1.4.2

modest

Hi All,

according to
http://java.sun.com/docs/books/tutorial/i18n/text/string.html:

"If a byte array contains non-Unicode text, you can convert the text to
Unicode with one of the String constructor methods. Conversely, you can
convert a String object into a byte array of non-Unicode characters
with the String.getBytes method. When invoking either of these methods,
you specify the encoding identifier as one of the parameters."

It works fine in Java 1.3.1

------------------------------------------------------------------
// Convert ASCII to Unicode
str_uni = new String(str_ascii.getBytes(), "ISO8859_8");

// Convert Unicode to ASCII
str_ascii = new String(str_uni.getBytes("ISO8859_8"));
------------------------------------------------------------------

In Java 1.4.2 it returns question marks only.

What is the difference and how it can be fixed?

I need the solution URGENTLY.

thanks,
L.
 
Chris Uppal

modest said:
// Convert ASCII to Unicode
str_uni = new String(str_ascii.getBytes(), "ISO8859_8");

// Convert Unicode to ASCII
str_ascii = new String(str_uni.getBytes("ISO8859_8"));

This seems confused. Character data that is encoded via some charset is -- by
definition -- a sequence of bytes, and so an appropriate representation in Java
is as a byte[] or similar. /Not/ as a java.lang.String, Strings /only/ hold
Unicode data.

Consider the following sequence of steps, based on the above:

String str_ascii = "hello";

At this point str_ascii is a Unicode string. That's fine. As it happens the data
it contains can be encoded as ASCII so that's fine too, but that does not mean
that it /is/ encoded as ASCII at this point.

Now we do:

String str_uni = new String(str_ascii.getBytes(), "ISO8859_8");

that's in two steps, the first:

str_ascii.getBytes()

returns a new byte[] with the characters encoded in the platform default
encoding -- whatever that might happen to be in your installation. Since we
don't know what the encoding is, it might just as well be gibberish. The next
step is to pass the byte[] to the String constructor:

String str_uni = new String(theByteArray, "ISO8859_8");

which will interpret the bytes as an ISO-8859-8 encoding, and create a new
Unicode string by decoding those bytes. However, there's no reason to suppose
that the bytes /were/ encoded in ISO-8859-8, so it's difficult to tell what the
results might be.

Now going the other way, we have similar problems. First we get the bytes:

str_uni.getBytes("ISO8859_8")

which will return a byte[] properly representing the contents of the String as
ISO-8859-8. (But if the String already contained garbage then the answer will
be garbage too). Then we do:

str_ascii = new String(theByteArray);

which creates a new String by interpreting the contents of the byte[] array as
if it were encoded in the platform default charset. Again we don't know what
that is, but if it's not ISO-8859-8 then it's likely to produce garbage --
because ISO-8859-8 is how the bytes /are/ encoded.
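
To make the mismatch concrete, here is a minimal sketch. The class name, method names and sample bytes are made up for illustration, and it assumes the JRE ships the ISO-8859-8 charset (it is not one of the charsets every Java platform must support). It also uses the Charset overloads of the String API from later JDKs; on 1.4 you would pass the charset name as a String instead.

```java
import java.nio.charset.Charset;

public class CharsetMismatchDemo {
    // The Hebrew letters alef, bet, gimel as ISO-8859-8 bytes.
    static final byte[] HEBREW_BYTES = { (byte) 0xE0, (byte) 0xE1, (byte) 0xE2 };

    // Decode with the charset the bytes really use.
    static String decodeCorrectly() {
        return new String(HEBREW_BYTES, Charset.forName("ISO-8859-8"));
    }

    // Decode with a different charset: no error is raised,
    // but the resulting characters are not the ones that were encoded.
    static String decodeWrongly() {
        return new String(HEBREW_BYTES, Charset.forName("ISO-8859-1"));
    }

    // Encode Hebrew text into a charset that has no Hebrew letters:
    // every unmappable character becomes the byte for '?'.
    static String encodeLossy() {
        Charset latin1 = Charset.forName("ISO-8859-1");
        byte[] lossy = decodeCorrectly().getBytes(latin1);
        return new String(lossy, latin1);
    }

    public static void main(String[] args) {
        // Compare against \\u escapes so the console encoding does not matter.
        System.out.println(decodeCorrectly().equals("\u05D0\u05D1\u05D2")); // true
        System.out.println(decodeWrongly().equals("\u00E0\u00E1\u00E2"));   // true
        System.out.println(encodeLossy());                                  // ???
    }
}
```

The last line is one plausible route to "question marks only": characters that the target charset cannot represent are silently replaced.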

In Java 1.4.2 it returns question marks only.

What is the difference

At a guess it's because with 1.3 you had a platform default charset that allowed
the above to work "by accident" (for instance if the platform default was
ISO-8859-8, then the above conversions would cancel out and do nothing, but
wouldn't create garbage), but the default on your 1.4 installation is
different.

and how it can be fixed?

Sort out what you are really trying to do, and do that. Your code snippets
don't do what you think they do. It will help if you /never/ put byte data
(such as String data encoded with some charset) into a String object -- you
will only confuse yourself. Also I'd suggest avoiding String<->byte[]
conversions that implicitly use the platform's default charset.
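
As a rough sketch of that last suggestion, one option is to route every conversion through two small helpers that always name the charset. The class name, the helper names and the choice of UTF-8 are assumptions for illustration only; use whatever encoding your data really has.

```java
import java.nio.charset.Charset;

public class ExplicitCharsetDemo {
    // Assumed external encoding; UTF-8 is guaranteed to exist on every JVM.
    static final Charset EXTERNAL = Charset.forName("UTF-8");

    // String (Unicode characters) -> bytes in the named encoding.
    static byte[] encode(String text) {
        return text.getBytes(EXTERNAL);
    }

    // Bytes in the named encoding -> String (Unicode characters).
    static String decode(byte[] data) {
        return new String(data, EXTERNAL);
    }

    public static void main(String[] args) {
        String original = "shalom \u05E9\u05DC\u05D5\u05DD";
        String roundTripped = decode(encode(original));
        // The platform default charset is never consulted, so this holds everywhere.
        System.out.println(original.equals(roundTripped)); // true
    }
}
```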

-- chris
 
modest

thanks, chris, I will try to fix it.

but how can I receive an ASCII string from byte[] taking in
consideration that new String(str_uni.getBytes("ISO8859_8"),
"ISO8859_8") will return Unicode? I have to convert Unicode string to
ASCII one.
 
John C. Bollinger

modest said:
according to
http://java.sun.com/docs/books/tutorial/i18n/text/string.html:

"If a byte array contains non-Unicode text, you can convert the text to
Unicode with one of the String constructor methods. Conversely, you can
convert a String object into a byte array of non-Unicode characters
with the String.getBytes method. When invoking either of these methods,
you specify the encoding identifier as one of the parameters."

It works fine in Java 1.3.1

------------------------------------------------------------------
// Convert ASCII to Unicode
str_uni = new String(str_ascii.getBytes(), "ISO8859_8");

// Convert Unicode to ASCII
str_ascii = new String(str_uni.getBytes("ISO8859_8"));
------------------------------------------------------------------

In Java 1.4.2 it returns question marks only.

What is the difference and how it can be fixed?

You are not using the canonical name of the charset, which is
"ISO-8859-8". Which charsets are available and how they are configured
depends on your Java installation. On my Sun JDK 1.4.2_05 installation,
the charset in question has no defined aliases and therefore can only be
referred to by its canonical name. I don't know why you are getting
anything at all in this case (you should get an
UnsupportedEncodingException if the charset name were unknown).

That said, your code is deeply flawed. If you have data in a Java
String then it is already Unicode, *that is a fundamental characteristic
of Java Strings*. It does not make sense to talk about changing the
encoding / charset of a String -- the concept just doesn't apply (and
the i18n tutorial you refer to doesn't suggest otherwise). If you have
taken a byte sequence and created a String from it without accounting
for the bytes' charset then you are already hosed. This may be your
real problem, and it has not changed from 1.3 to 1.4 (or 1.5).

In addition, it might be relevant to you that ASCII, Unicode, and all
the ISO-8859 nationalized charsets all assign the same codes to the
characters covered by ASCII. The UTF-8 charset for encoding Unicode is
produces encoded character codes for the ASCII characters that are the
same as the character codes themselves.
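
A quick way to check this for yourself is sketched below. The class and helper names are made up for illustration, and only charsets that every JVM must provide are used; it relies on String.getBytes(Charset), which arrived after 1.4.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class AsciiSubsetDemo {
    // Encode text in the named charset (throws an unchecked exception if unknown).
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String text = "hello";
        byte[] ascii = encode(text, "US-ASCII");

        // For pure ASCII text all three encodings yield the same bytes.
        System.out.println(Arrays.equals(ascii, encode(text, "ISO-8859-1"))); // true
        System.out.println(Arrays.equals(ascii, encode(text, "UTF-8")));      // true
        System.out.println(Arrays.toString(ascii)); // [104, 101, 108, 108, 111]
    }
}
```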
 
Chris Uppal

John said:
You are not using the canonical name of the charset, which is
"ISO-8859-8". Which charsets are available and how they are configured
depends on your Java installation. On my Sun JDK 1.4.2_05 installation,
the charset in question has no defined aliases and therefore can only be
referred to by its canonical name.

Well spotted.

-- chris
 
Chris Uppal

modest said:
but how can I receive an ASCII string from byte[] taking in
consideration that new String(str_uni.getBytes("ISO8859_8"),
"ISO8859_8") will return Unicode? I have to convert Unicode string to
ASCII one.

I think you need to be extremely careful in defining what you are trying to do.

Here are some of the questions that may affect you:

What do you mean by ASCII ? Properly speaking ASCII codes run in the range
0..127, and the first 32 of those are control codes rather than printable
characters. OTOH many people in practice tend to use ASCII to mean almost any
8-bit encoding that is compatible with real ASCII over the range 0..127 and
which "prints out OK" on their own machine. (That's how /I/ tend to use the
term anyway ;-) So that's your first question, and the answer will depend on
what you are going to do with the "ASCII" data (or where it came from). Does it
have to be transmitted over a 7-bit communication medium, for instance ? If so
then you probably are wanting /real/ ASCII. Or is it for display to a user, in
which case you are probably looking to translate it into the user's local
codepage, or perhaps into a default such as ISO-8859-1 ?

The second question is what do you want to do with characters in a Java String
that cannot be represented in the target code page ? You might want to throw
an exception, or to filter them out, or to replace them with '?' question
marks. Alternatively, you might be wanting to preserve the information at all
costs, in which case you might need to use an encoding like UTF-8 which "looks
like" ASCII in that it represents characters in the range 0...127 by the same
numbers (so pure ASCII text is unaffected) but represents characters outside
that range by longer sequences of two or more bytes. But that will be no use
unless the receiver of the data will be able to decode it.
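
Those choices map fairly directly onto the java.nio.charset API that arrived in 1.4. The following is only a sketch: the class and method names are invented, and wrapping the checked exception in an IllegalArgumentException is just one way to surface the failure.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class UnmappableDemo {
    // Encode to US-ASCII, applying the given policy to characters ASCII lacks:
    // REPORT throws, REPLACE substitutes '?', IGNORE drops the character.
    static byte[] encodeAscii(String text, CodingErrorAction policy) {
        CharsetEncoder encoder = Charset.forName("US-ASCII").newEncoder()
                .onMalformedInput(policy)
                .onUnmappableCharacter(policy);
        try {
            ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
            byte[] result = new byte[buffer.remaining()];
            buffer.get(result);
            return result;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("not representable in US-ASCII", e);
        }
    }

    public static void main(String[] args) {
        String text = "a\u05D0b"; // 'a', Hebrew alef, 'b'

        System.out.println(encodeAscii(text, CodingErrorAction.REPLACE).length); // 3
        System.out.println(encodeAscii(text, CodingErrorAction.IGNORE).length);  // 2
        // UTF-8 keeps the information: alef takes two bytes, so four in total.
        System.out.println(text.getBytes(Charset.forName("UTF-8")).length);      // 4
        try {
            encodeAscii(text, CodingErrorAction.REPORT);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```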

Lastly you have to decide how you want to handle your data. I've suggested
that it's best to use java.lang.Strings /only/ for character data that has been
decoded (so it's not in ASCII or ISO-8859-8 or UTF-8 or any other encoding --
just pure, encoding-free, Unicode data). If you do that then you can read in
binary data in any of the encodings (including ASCII) and convert that to
Unicode which you use for all your internal processing, and then (if necessary)
convert it to some encoding (perhaps ASCII again) before you write it out.

But it's possible that you don't want to work like that. You might want to
work with encoded data throughout (ASCII or whatever other encoding). In that
case I think you'd be better off sticking to a representation that was based on
byte[], rather than using Strings or you'll risk double-encoding or similar
encoding errors. If that makes development too hard (after all it's difficult
to read text expressed as an array of bytes) then you might be forced to use
Strings but with the restriction that no String is allowed to contain
characters that can't be represented in the encoding you want to use (e.g. only
ASCII characters, or only characters representable in ISO-8859-1). That risks
causing some confusion, and it's important that you get your terminology
correct. There is no such thing as an ASCII String, there are only Strings
(which contain Unicode). Some Strings only use characters that can be
represented as ASCII, but it doesn't help to call them ASCII Strings -- they
are still Unicode.
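
If you do go down that road, a check along these lines could enforce the restriction at the boundaries of your code. It is a minimal sketch with made-up class and method names; swap in another charset name if your target is not ASCII.

```java
import java.nio.charset.Charset;

public class AsciiOnlyCheck {
    // True if every character of the (Unicode) String can be encoded as US-ASCII.
    // The String itself is still Unicode; this only tests representability.
    static boolean isAsciiRepresentable(String text) {
        return Charset.forName("US-ASCII").newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(isAsciiRepresentable("hello"));        // true
        System.out.println(isAsciiRepresentable("hello \u05D0")); // false
    }
}
```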

I realise that I've not answered your actual question here. The fact is that I
can't -- you'll have to work out what you are trying to do first (and there's
no guarantee that I'll know the answers after you've worked it out, but I'd bet
that someone around here would).

-- chris
 
John C. Bollinger

modest said:
but how can I receive an ASCII string from byte[] taking in
consideration that new String(str_uni.getBytes("ISO8859_8"),
"ISO8859_8") will return Unicode? I have to convert Unicode string to
ASCII one.

You are still missing the point. It may help if you avoid thinking of
any particular sequence of bytes as a string -- from a Java perspective,
there is only a loose relationship between byte sequences and String
objects, at best.

You *must* also give up the misconception of an "ASCII string" as any
kind of relevant description of a String object. It might even help to
ignore the Unicode character of String data. What is important is that
Strings consist of *characters*, which are not at all the same thing as
bytes. Java characters come in only one flavor (Unicode, but that's
less important than the fact that there's only one). Once you get used
to this, it actually makes i18n considerations a lot easier to manage.

If you have a sequence of characters encoded into a byte sequence
according to some charset, then you can attempt to convert it into a
sequence of bytes encoded according to some different charset by means
of an intermediate String object. It would look something like this:

byte[] iso8859_8bytes;
[... get the bytes from somewhere ...]
String myString = new String(iso8859_8bytes, "ISO-8859-8");
byte[] asciiBytes = myString.getBytes("US-ASCII");

Note, however, that this will produce encoded '?' characters in place of
any encoded characters from the original byte sequence that are not
ASCII characters. In fact, for these particular two charsets, that will
be the *only* difference between the original byte sequence and the new one.
 
Thomas Fritsch

modest said:
thanks, chris, I will try to fix it.

but how can I receive an ASCII string from byte[] ...
Again: You cannot get an *ASCII* string from byte[],
just because there is no such thing as an ASCII string.
Strings are always Unicode internally (see Chris Uppal's answer).
What you can do is,
either make a String from an ASCII byte[] array:
byte asciiBytes[] = ...
String s = new String(asciiBytes, "ASCII");
or make an ASCII byte[] array from a String:
String s = ...
byte asciiBytes[] = s.getBytes("ASCII");
But I don't really know what your goal is.
... taking in
consideration that new String(str_uni.getBytes("ISO8859_8"),
"ISO8859_8") will return Unicode? I have to convert Unicode string to
ASCII one.
You can convert between Strings and ISO8859_8-encoded byte[] arrays in
the same way as described above. Simply use "ISO8859_8" instead of "ASCII".
 