Reading a unicode (UTF-16) file


Steve

Hi,

I have a text file with some French lines saved in the UTF-16 format.
Most of the letters can be represented with a single byte, but some
special accented letters take 16 bits. I'm trying to read this file from
a MIDlet, and the only way I can do this is:

InputStream is = this.getClass().getResourceAsStream("file.txt");
byte[] data = new byte[50000];
is.read(data);

I am trying to draw each line on the canvas using the Graphics drawString
method. However, I am having difficulty finding the 16-bit letters in the
text file and displaying them correctly. If I convert each byte to a char,
most letters come through except the 'special' ones, for which I think I
need to take the next byte in the stream and join it with the previous
one. Even then, the problem is that Unicode characters need to be written
like this: '\u0045'. Since I am storing each character of a line in a char
array, I can't seem to "join" two bytes and then add '\u' in front - it
fails and complains that this isn't a character but a string. I hope I
have expressed my problem clearly. This is what I'm doing:

char[] line = new char[150];
for (int i = 0; i < 150; i++) {
    line[i] = (char) data[lineOffset + i];
}

If I print line with Graphics.drawChars, it works but gives me weird
characters in place of the actual Unicode characters - for a single
Unicode character it gives me two characters.

In short, how can I tell from the byte being returned that I'm about to
hit a Unicode character, and how can I 'join' two bytes to represent a
single character? Any help would be most appreciated.

Thanks,
Steve
 

Roedy Green

I have a text file with some French lines saved in the UTF-16 format.
Most of the letters can be represented with a single byte, but some
special accented letters take 16 bits. I'm trying to read this file from
a MIDlet, and the only way I can do this is:

That sounds much more like UTF-8 than UTF-16.

I suggest reading http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/encoding.html
and
http://mindprod.com/applets/fileio.html on how to read encoded data.

If you have no way to read UTF-8 in a midlet, then you could roll your
own. I show the encoding code; follow the links and you should find a
description of the decoding algorithm.

I would find that strange, though. If midlets could read ANYTHING, I
would expect it to be UTF-8.
 

Roedy Green

If I print line with Graphics.drawChars, it works but gives me weird
characters in place of actual single unicode characters - for a single
unicode character it gives me two characters.

Separate your drawing problem from your file-decoding problem. Dump out
the decoded 16-bit chars in hex to make sure they are plausible. You
also need a font that supports those characters. See
http://mindprod.com/applets/fontshowerawt.html
for help finding a good font.
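
One way to do the hex dump suggested above is sketched here. The class and method names (HexDump, toHex) are made up for illustration, and the code sticks to StringBuffer and Integer.toHexString on the assumption that those are available on the target device as well as on desktop Java.

```java
final class HexDump {
    /** Formats each char as a four-digit hex code, separated by spaces. */
    static String toHex(char[] chars, int offset, int length) {
        StringBuffer sb = new StringBuffer();
        for (int i = offset; i < offset + length; i++) {
            if (i > offset) {
                sb.append(' ');
            }
            String hex = Integer.toHexString(chars[i]);
            // Left-pad with zeros so every code is four digits wide.
            for (int pad = hex.length(); pad < 4; pad++) {
                sb.append('0');
            }
            sb.append(hex);
        }
        return sb.toString();
    }
}
```

If a decoded French line shows values such as 00c3 followed by 00a9 where a single 00e9 was expected, the bytes were most likely UTF-8 that got widened one byte per char, so the decoding step is at fault rather than the drawing.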
 

Oliver Wong

In UTF-16, every character is encoded in at least 16 bits (two bytes);
characters outside the Basic Multilingual Plane take 32 bits. If most of
the Latin characters are encoded in 8 bits, with only a few accented
characters taking 16 bits, then you are probably working with UTF-8.
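
In case the file does turn out to be genuine UTF-16, here is a rough sketch of joining each pair of bytes into one char. The names (Utf16Decoder, decode) are hypothetical, and falling back to big-endian when there is no byte-order mark is an assumption that should be checked against how the file was actually saved.

```java
final class Utf16Decoder {
    /**
     * Decodes the first 'length' bytes of 'data' as UTF-16.
     * A leading byte-order mark (FE FF or FF FE) selects the byte order
     * and is skipped. Without one, big-endian is ASSUMED.
     */
    static char[] decode(byte[] data, int length) {
        boolean bigEndian = true;
        int start = 0;
        if (length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                start = 2;
            } else if (b0 == 0xFF && b1 == 0xFE) {
                bigEndian = false;
                start = 2;
            }
        }
        int count = (length - start) / 2;   // a trailing odd byte is ignored
        char[] out = new char[count];
        for (int i = 0; i < count; i++) {
            // Mask with 0xFF because Java bytes are signed.
            int first = data[start + 2 * i] & 0xFF;
            int second = data[start + 2 * i + 1] & 0xFF;
            out[i] = (char) (bigEndian ? (first << 8) | second
                                       : (second << 8) | first);
        }
        return out;
    }
}
```

The resulting array can be passed to Graphics.drawChars or wrapped with new String(chars). Nothing needs a '\u' prefix at run time: that notation exists only in source code, and a char already holds the 16-bit value.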

I'm not familiar with the Java ME API, but wouldn't there be a method to
automatically encode and decode for you?
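
For what it's worth, standard Java can do the decoding through InputStreamReader with a named encoding. The sketch below uses a hypothetical helper class (ReaderDecode) and runs on desktop Java; whether a given handset accepts the names "UTF-8" or "UTF-16" is an assumption to test on the device, since an unsupported name throws UnsupportedEncodingException.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

final class ReaderDecode {
    /** Reads the whole stream and decodes it with the named encoding. */
    static String readAll(InputStream in, String encoding) throws IOException {
        // Throws UnsupportedEncodingException if the platform lacks the encoding.
        Reader reader = new InputStreamReader(in, encoding);
        StringBuffer sb = new StringBuffer();
        char[] buf = new char[256];
        int n;
        // read() may return fewer chars than requested, so loop until end of stream.
        while ((n = reader.read(buf, 0, buf.length)) != -1) {
            sb.append(buf, 0, n);
        }
        reader.close();
        return sb.toString();
    }
}
```

With the resource from the original post this would be called as ReaderDecode.readAll(getClass().getResourceAsStream("file.txt"), "UTF-8"), which also avoids the fixed 50000-byte buffer.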

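As for knowing when you need to join 2 (or more) bytes together in
UTF-8, it's relatively easy. If the first bit of a byte is 0, that byte is
a complete character on its own (plain ASCII). If the first bit is 1, the
byte belongs to a multi-byte sequence: the lead byte starts with two or
more consecutive 1 bits, and the number of those leading 1 bits is the
total number of bytes to join. Every following byte in the sequence
starts with the bits 10.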
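
A hand-rolled decoder along those lines might look like the sketch below. The lead byte's high bits give the sequence length (0xxxxxxx is one byte, 110xxxxx two, 1110xxxx three, 11110xxx four), and each continuation byte (10xxxxxx) contributes six more bits. The class name Utf8Decoder is invented here, and the code assumes well-formed input: it does not reject stray continuation bytes, overlong forms, or truncated sequences.

```java
final class Utf8Decoder {
    /** Decodes well-formed UTF-8. Malformed input is NOT validated. */
    static String decode(byte[] data, int offset, int length) {
        StringBuffer sb = new StringBuffer();
        int i = offset;
        int end = offset + length;
        while (i < end) {
            int b = data[i++] & 0xFF;   // mask: Java bytes are signed
            int extra;                  // continuation bytes still to read
            int codePoint;
            if (b < 0x80) {             // 0xxxxxxx: plain ASCII
                extra = 0;
                codePoint = b;
            } else if (b < 0xE0) {      // 110xxxxx: two-byte sequence
                extra = 1;
                codePoint = b & 0x1F;
            } else if (b < 0xF0) {      // 1110xxxx: three-byte sequence
                extra = 2;
                codePoint = b & 0x0F;
            } else {                    // 11110xxx: four-byte sequence
                extra = 3;
                codePoint = b & 0x07;
            }
            for (int k = 0; k < extra && i < end; k++) {
                // Each continuation byte (10xxxxxx) adds six payload bits.
                codePoint = (codePoint << 6) | (data[i++] & 0x3F);
            }
            if (codePoint < 0x10000) {
                sb.append((char) codePoint);
            } else {
                // Beyond 16 bits: store as a UTF-16 surrogate pair.
                int v = codePoint - 0x10000;
                sb.append((char) (0xD800 | (v >> 10)));
                sb.append((char) (0xDC00 | (v & 0x3FF)));
            }
        }
        return sb.toString();
    }
}
```

For example, the two bytes C3 A9 decode to the single char 00E9 (e with acute accent): C3 contributes the bits 00011 and A9 contributes 101001.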

I recently gave an overview of how the UTF-8 format works in this thread:
http://groups.google.ca/group/comp.lang.java.programmer/msg/24e73683b6f33a81?hl=en&

- Oliver
 
