Reading a unicode (UTF-16) file

Steve · Sep 29, 2005

Hi,

I have a text file with some French lines saved in UTF-16 format.
Most of the characters can be represented with a single byte, but some
accented letters take 16 bits. I'm trying to read this file in from a
MIDlet, and so the only way I can do this is:

InputStream is = this.getClass().getResourceAsStream("file.txt");
byte[] data = new byte[50000];
is.read(data);

I am trying to draw each line on the canvas using the drawString
Graphics method. However, I am having difficulty finding 16-bit letters
from the text file and displaying them correctly. If I convert each byte
to a char, most letters pass through except the 'special' ones, for
which I think I need to take the next byte in the stream and join it
together with the previous one. Even then, the problem is that Unicode
characters need to be represented like the following: '\u0045'. Now
since I am storing each character of a single line in a char array, I
can't seem to "join" two bytes and then add '\u' in front - it fails and
complains that this isn't a character but a string. I hope I was able to
express my problem clearly. This is what I'm doing:

char[] line = new char[150];
for (int i=0;i<150;i++) {
    line[i] = (char)data[lineOffset+i];
}

If I print line with Graphics.drawChars, it works but gives me weird
characters in place of actual single unicode characters - for a single
unicode character it gives me two characters.

In short, how do I tell that I'm about to stumble on a Unicode character
by looking at the byte being returned, and how can I 'join' two bytes to
represent a single character? Any help would be most appreciated.

Thanks,
Steve

Roedy Green · Sep 29, 2005

Steve said:
I have a text file with some French lines saved in UTF-16 format.
Most of the characters can be represented with a single byte, but some
accented letters take 16 bits. I'm trying to read this file in from a
MIDlet, and so the only way I can do this is:

That sounds much more like UTF-8 than UTF-16.

I suggest reading http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/encoding.html
and
http://mindprod.com/applets/fileio.html on how to read encoded data.

If you have no way to read UTF-8 in a midlet, then you could roll your
own. I show the encoding code there. Follow the links and you should
find an algorithmic description of the decoding.

I would find that strange, though. If midlets could read ANYTHING, I
would expect it to be UTF-8.
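
For what it's worth, here is a minimal sketch of letting the library do the decoding instead of rolling your own. It assumes the runtime accepts the encoding name "UTF-8" in the InputStreamReader constructor, which is worth checking on the target handset. The class and method names (ResourceTextSketch, readAll) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

final class ResourceTextSketch {
    // Reads the whole stream, decoding its bytes with the named encoding.
    // Assumption: the runtime supports the encoding name that is passed in.
    static String readAll(InputStream in, String encoding) throws IOException {
        Reader reader = new InputStreamReader(in, encoding);
        StringBuffer text = new StringBuffer();
        char[] buffer = new char[256];
        try {
            int n;
            // read() may return fewer chars than requested, so loop until -1.
            while ((n = reader.read(buffer, 0, buffer.length)) != -1) {
                text.append(buffer, 0, n);
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the resource stream: "caf" followed by the two
        // UTF-8 bytes C3 A9 (e with acute accent).
        byte[] sample = { 'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9 };
        String s = readAll(new ByteArrayInputStream(sample), "UTF-8");
        System.out.println(s.length()); // 4 chars, not 5 bytes
    }
}
```

In the midlet the stream would come from this.getClass().getResourceAsStream("file.txt") rather than a byte array, and the resulting String can be handed to drawString as is.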

Roedy Green · Sep 29, 2005

Steve said:
If I print line with Graphics.drawChars, it works but gives me weird
characters in place of actual single unicode characters - for a single
unicode character it gives me two characters.

Separate your drawing problem from your problem of reading the encoded
file. Dump out the 16-bit decoded chars in hex to make sure they
are plausible. You also need a font that will support those
characters. See http://mindprod.com/applets/fontshowerawt.html
for help finding a good font.
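
A hex dump of that kind could look roughly like the sketch below. The names (HexDumpSketch, toHex) are invented for the example; it only relies on Integer.toHexString.

```java
final class HexDumpSketch {
    // Formats each char as a four-digit hex value, separated by spaces.
    static String toHex(char[] chars, int offset, int length) {
        StringBuffer out = new StringBuffer();
        for (int i = offset; i < offset + length; i++) {
            if (i > offset) {
                out.append(' ');
            }
            String hex = Integer.toHexString(chars[i]);
            // Left-pad with zeros so every value is four digits wide.
            for (int pad = hex.length(); pad < 4; pad++) {
                out.append('0');
            }
            out.append(hex);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char[] line = { 'c', 'a', 'f', '\u00e9' };
        System.out.println(toHex(line, 0, line.length)); // 0063 0061 0066 00e9
    }
}
```

A correctly decoded e-acute shows up as 00e9. If the dump shows something like ffc3 ffa9 instead, the bytes were cast to char one at a time: a byte with its top bit set is sign-extended by the cast, which would explain the two weird characters per accented letter.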

Oliver Wong · Sep 29, 2005

In UTF-16, every character is encoded in 16 bits or more (one or two
16-bit units). If most of the Latin characters are encoded in 8 bits,
with a few accented characters encoded in 16 bits, then you are probably
working with UTF-8.
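
If the file does turn out to be genuine UTF-16, joining two bytes into one char is just a shift and an OR; no '\u' text is involved, since that notation only exists in source code. Below is a rough sketch with invented names (Utf16Sketch, decode). It uses a leading byte order mark (FE FF or FF FE) to pick the byte order and assumes big-endian when there is none, which is a guess that may not match your file.

```java
final class Utf16Sketch {
    // Decodes the first `length` bytes of `data` as UTF-16 by joining each
    // pair of bytes into one 16-bit char.
    static String decode(byte[] data, int length) {
        boolean bigEndian = true; // assumption when no byte order mark is present
        int start = 0;
        if (length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                start = 2;            // big-endian mark, skip it
            } else if (b0 == 0xFF && b1 == 0xFE) {
                bigEndian = false;    // little-endian mark, skip it
                start = 2;
            }
        }
        StringBuffer text = new StringBuffer();
        for (int i = start; i + 1 < length; i += 2) {
            // Mask with 0xFF so negative bytes are not sign-extended.
            int first = data[i] & 0xFF;
            int second = data[i + 1] & 0xFF;
            int unit = bigEndian ? ((first << 8) | second) : ((second << 8) | first);
            text.append((char) unit);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Little-endian mark, then 'c' (63 00) and e-acute (E9 00).
        byte[] sample = { (byte) 0xFF, (byte) 0xFE, 0x63, 0x00, (byte) 0xE9, 0x00 };
        System.out.println(decode(sample, sample.length).length()); // 2
    }
}
```

Characters outside the 16-bit range arrive as two units (a surrogate pair), which is also how a Java String stores them, so the loop passes them through unchanged.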

I'm not familiar with the Java2ME API, but wouldn't there be a method to
automatically encode and decode for you?

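As for knowing when you need to join 2 (or more) bytes together in
UTF-8, it's relatively easy: if the first bit of a byte is 1, it is part
of a multi-byte sequence. In the first byte of such a sequence, the number
of bits consecutively set to 1 is the number of bytes you need to join
together (110xxxxx starts a 2-byte sequence, 1110xxxx a 3-byte one); each
following byte begins with the bits 10.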
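
To make the lead-byte rule concrete, here is a small hand-rolled decoder sketch. The names (Utf8Sketch, sequenceLength, decode) are made up, and it is deliberately incomplete: it does not verify that continuation bytes really start with 10, and it does not reject overlong encodings, so treat it as an illustration rather than a validating decoder.

```java
final class Utf8Sketch {
    // Total number of bytes in the sequence that starts with this byte,
    // or 0 if the byte cannot start a sequence (e.g. a 10xxxxxx continuation).
    static int sequenceLength(int lead) {
        lead &= 0xFF;
        if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0;
    }

    // Decodes the first `length` bytes of `data` as UTF-8.
    static String decode(byte[] data, int length) {
        StringBuffer text = new StringBuffer();
        int i = 0;
        while (i < length) {
            int n = sequenceLength(data[i]);
            if (n == 0 || i + n > length) {
                text.append('\uFFFD'); // replacement character for bad input
                i++;
                continue;
            }
            // Payload bits of the lead byte: 7 bits for ASCII, otherwise
            // whatever is left after the run of 1s and the 0 that ends it.
            int cp = (n == 1) ? (data[i] & 0x7F) : (data[i] & (0xFF >> (n + 1)));
            // Each continuation byte contributes its low 6 bits.
            for (int k = 1; k < n; k++) {
                cp = (cp << 6) | (data[i + k] & 0x3F);
            }
            i += n;
            if (cp >= 0x10000) {
                // Outside the 16-bit range: store as a surrogate pair.
                cp -= 0x10000;
                text.append((char) (0xD800 + (cp >> 10)));
                text.append((char) (0xDC00 + (cp & 0x3FF)));
            } else {
                text.append((char) cp);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] sample = { 'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9 };
        System.out.println(decode(sample, sample.length).length()); // 4
    }
}
```

For French text only the 1-byte and 2-byte cases matter in practice; the 3-byte and 4-byte branches are there for completeness.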

I recently gave an overview of how the UTF-8 format works in this thread:
http://groups.google.ca/group/comp.lang.java.programmer/msg/24e73683b6f33a81?hl=en&

- Oliver
