JNI byte array to string

static

Hi,

I use JNI to call a C function and get a record converted from MARC-8
to UNICODE and return the data in a byteArray. It works fine and the
byteArray is correctly populated. I can write it out to a file and
verify the data.

The problem is converting the byte array to a string. If I do

String n = new String(test);

Then about 6 of the characters get replaced with question marks. Is
there a way to retain all of the data from a byte array and convert it
to a String?

I also tried

String n = new String(test,"UTF-8");

and that didn't work. A few characters got replaced with question
marks.

Any ideas will be greatly appreciated.

Ashley
 
ak

I use JNI to call a C function and get a record converted from MARC-8
to UNICODE and return the data in a byteArray. It works fine and the
byteArray is correctly populated. I can write it out to a file and
verified the data.

The problem is converting the byte array to a string. If I do

String n = new String(test);

Then about 6 of the characters get replaced with question marks. Is
there a way to retain all of the data from a byte array and convert it
to a String?

I also tried

String n = new String(test,"UTF-8");

Don't create a String directly; read the data with DataInputStream#readUTF() instead.
 
static

I tried the following and am still getting some characters in the byte
array changed to question marks.

try {
    DataInputStream dis = new DataInputStream(
            new ByteArrayInputStream(unicode_byte_array));
    orig = dis.readLine();
} catch (IOException e) {
    //System.out.println(e);
}

Since readLine is deprecated, is there another way to read the data
from the byte array without changing it? readString will change it, and
since the data is already in UTF-8, doing a readUTF8 corrupts it by
translating something that is already in UTF-8.

Thanks in advance.

Ashley
 
Roedy Green

Since readLine is deprecated, is there another way to read the data
from the byte array and not change it. readString will change it and
since it is already in UTF-8, doing a readUTF8 corrupts the data by
translating something that is already in utf8.

There are many possible things you could be trying to do. You first
have to get clear on just what your data are.

1. 16-bit unicode
2. 8-bit chars in some encoding
3. binary data.
4. serialised objects.

Then you can ask the File I/O amanuensis to generate the necessary
code to read it.

See http://mindprod.com/fileio.html
 
Roedy Green

readUTF() doesn't create UTF; it reads data which is in UTF format.

UTF is not just unicode-8. It is a special binary format with counted
strings. It is not designed to be human readable.
 
Michael Borgwardt

Roedy said:
UTF is not just unicode-8. It is a special binary format with counted
strings. It is not designed to be human readable.

There's no such thing as "unicode-8", and UTF-8 is exactly as "human readable" as
ASCII (to which it is downwards-compatible) or any other text encoding.
The readUTF() method simply expects a sequence of UTF-8 encoded characters
prepended by two bytes specifying the length of the sequence.
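
The length prefix described above can be seen in a small round trip. This is only a sketch: the class name ReadUtfFramingDemo and its helper methods are made up for illustration, and the sample text "abc" is not from the thread. It shows that writeUTF() emits two length bytes before the character data, which is the framing readUTF() expects to find.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of the original poster's code.
class ReadUtfFramingDemo {
    // Encode a string with DataOutputStream.writeUTF: a 2-byte big-endian
    // length followed by the encoded characters.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Decode bytes that carry the 2-byte length prefix.
    static String readUtf(byte[] framed) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(framed))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] framed = writeUtf("abc");
        System.out.println(framed.length);    // 2 length bytes + 3 data bytes
        System.out.println(readUtf(framed));  // the original text comes back
    }
}
```

A byte array that holds plain UTF-8 text has no such prefix, so readUTF() would treat its first two bytes as a length.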
 
static

Guys, I tried readUTF(), but if I print out orig, the output doesn't
match the output from the unicode_byte_array. The whole string seems
to have shrunk compared with the byte array. I would like to print the
String and have the output match the byte array. I also tried writing the
data to a file and reading it with

InputStream ba = new FileInputStream("test");
DataInputStream dis = new DataInputStream(ba);
orig = dis.readUTF();

but when I print out orig, the output is different. I'll be glad to
mail you my data file which is about 1830 bytes for you to try.

Thanks so much for the input. Any other ideas?

Ashley
 
Roedy Green

There's no such thing as "unicode-8", and UTF-8 is exactly as "human readable" as
ASCII (to which it is downwards-compatible) or any other text encoding.
The readUTF() method simply expects a sequence of UTF-8 encoded characters
prepended by two bytes specifying the length of the sequence.

People try to use writeUTF to create human-readable files. The files
are not human-readable, because of the length fields.
 
ak

but when I print out orig, the output is different. I'll be glad to
mail you my data file which is about 1830 bytes for you to try.

Post an attachment, and don't forget to also post the original string.
 
Michael Borgwardt

static said:
guys I tried the readUTF() but if I print out orig, the output doesn't
match the output from the unicode_byte_array.

That's because your input doesn't contain the length fields that
readUTF() expects.

The method is not meant to be used for processing text files, rather for
processing text embedded in binary files.

Instead, use the Reader classes.
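
A minimal sketch of the Reader approach, assuming the byte array really does hold plain UTF-8 text with no length prefix. The class name Utf8ReaderDemo, the method decodeUtf8, and the sample Hebrew word are illustrative only and not taken from the poster's program.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; names are assumptions for illustration.
class Utf8ReaderDemo {
    // Decode raw UTF-8 bytes through a Reader with an explicit charset,
    // so the platform default encoding is never consulted.
    static String decodeUtf8(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                 new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three Hebrew letters (bet, alef, resh) written as escapes so the
        // source file encoding does not matter.
        byte[] raw = "\u05D1\u05D0\u05E8".getBytes(StandardCharsets.UTF_8);
        String text = decodeUtf8(raw);
        // Re-encoding and comparing bytes checks the String independently
        // of how the console renders it.
        System.out.println(
            Arrays.equals(text.getBytes(StandardCharsets.UTF_8), raw));
    }
}
```

Comparing the re-encoded bytes with the input is a way to tell whether the String itself lost data or whether only the printed output did, since System.out applies its own encoding when writing characters.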
 
static

Here's what my byte array contains

01830cam a22003734a 45000010009000000050017000090080041000269060045000679250042001129550123001540100017002770200015002940350026003090400024003350420008003590430012003670500025003792450191004042460052005952600068006473000041007155040066007566500045008226510057008676500064009247000026009887000026010148800203010408800085012438800040013288800032013689230030014009520026014301211021220020226122429.00
0717s2000 is a b 001 0 heb 
a7bcbccorignewd2encipf20gn-rlinjack0 aacquireb1 shelf
copyxdefault policy amb12 to RCCD 07/17/00; desc ye91 09-21-00; to
ye19 09-21-00 (Heidi Lerner); ye19 to sl 01-19-01; ye04 to BCCD
02-08-01 a 00377460  a9654484749 a(CStRLIN)DCLH00-B1877
aDLC-RcDLC-RdDLC-R apcc aa-is---00aNX573.7.A1bA15
2000106880-01a1900-2000 :bmeʾah shenot tarbut : ha-yetsirah
ha-ʻIvrit be-Erets-Yiśraʾel = hundred years of Hebrew
culture in Eretz Israel /c[ʻorkhim], Orah Aḥimeʾir,
Ḥayim Beʾer.30aHundred years of Hebrew culture in Eretz
Israel 6880-02aTel Aviv :bʻAm ʻoved :bYediʻot
aḥaronot,cc2000. a548 p. :bill. (some col.) ;c31 cm.
aIncludes bibliographical references (p. 512-517) and indexes.
0aArts, Israeliy20th centuryvChronology. 0aIsraelxIntellectual
lifey20th centuryvChronology. 0aPopular
culturezIsraelxHistoryy20th centuryvChronology.1
6880-03aAhimeir, Ora.1 6880-04aBeʾer,
Haim.106245-01/raמאה שנות
תרבות
:bהיצירה
העברית
בארץ־ישראל
= hundred years of Hebrew culture in Eretz Israel
/c[עורכים],
אורה
אחימאיר,
חיים באר.
6260-02/raתל אביב
:bעם עובד
:bידיעות
אחרונות,cc2000.1
6700-03/raאחימאיר,
אורה.1 6700-04/raבאר,
חיים. d20000430n12287s93005373
a02/19/02 T;11/07/01 T

A few of the Hebrew characters are getting replaced with question
marks when I do the readUTF. I hope no characters got
altered by copying and pasting here. Thanks for the help.

Ashley
 
static

Here's the String after I tried to convert the byte array to a String.
It ends up losing several characters.

01830cam a22003734a 45000010009000000050017000090080041000269060045000679250042001129550123001540100017002770200015002940350026003090400024003350420008003590430012003670500025003792450191004042460052005952600068006473000041007155040066007566500045008226510057008676500064009247000026009887000026010148800203010408800085012438800040013288800032013689230030014009520026014301211021220020226122429.00
0717s2000 is a b 001 0 heb 
a7bcbccorignewd2encipf20gn-rlinjack0 aacquireb1 shelf
copyxdefault policy amb12 to RCCD 07/17/00; desc ye91 09-21-00; to
ye19 09-21-00 (Heidi Lerner); ye19 to sl 01-19-01; ye04 to BCCD
02-08-01 a 00377460  a9654484749 a(CStRLIN)DCLH00-B1877
aDLC-RcDLC-RdDLC-R apcc aa-is---00aNX573.7.A1bA15
2000106880-01a1900-2000 :bmeʾah shenot tarbut : ha-yetsirah
ha-Ê»Ivrit be-Erets-YisÌ?raʾel = hundred years of Hebrew culture in
Eretz Israel /c[ʻorkhim], Orah Aḥimeʾir, Ḥayim
Beʾer.30aHundred years of Hebrew culture in Eretz Israel
6880-02aTel Aviv :bʻAm ʻoved :bYediʻot aḥaronot,cc2000.
a548 p. :bill. (some col.) ;c31 cm. aIncludes bibliographical
references (p. 512-517) and indexes. 0aArts, Israeliy20th
centuryvChronology. 0aIsraelxIntellectual lifey20th
centuryvChronology. 0aPopular culturezIsraelxHistoryy20th
centuryvChronology.1 6880-03aAhimeir, Ora.1 6880-04aBeʾer,
Haim.106245-01/raמ×?×" שנות תרבות :b×"יציר×"
×"עברית ב×?רץ־ישר×?ל = hundred years of Hebrew culture in
Eretz Israel /c[עורכי×?], ×?ור×" ×?חימ×?יר, ×—×™×™×?
ב×?ר. 6260-02/raתל ×?ביב :b×¢×? עוב×" :b×™×"יעות
×?חרונות,cc2000.1 6700-03/ra×?חימ×?יר, ×?ור×".1
6700-04/raב×?ר, ×—×™×™×?. d20000430n12287s93005373
a02/19/02 T;11/07/01 T
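
The "×" followed by a second character in the output above looks like what happens when UTF-8 bytes are decoded one byte at a time with a single-byte charset. That is a guess from the pasted text, not something confirmed in the thread. The sketch below reproduces the effect for a single Hebrew letter; the class name CharsetMismatchDemo is made up, and ISO-8859-1 stands in for whatever the poster's platform default charset actually was.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; ISO-8859-1 is an assumed stand-in for the
// unknown platform default charset on the poster's machine.
class CharsetMismatchDemo {
    static String decode(byte[] data, Charset cs) {
        return new String(data, cs);
    }

    public static void main(String[] args) {
        // Hebrew alef (U+05D0) is the two bytes D7 90 in UTF-8.
        byte[] utf8 = "\u05D0".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.length());               // one char: U+05D0
        System.out.println(wrong.length());               // one char per byte
        System.out.println(wrong.charAt(0) == '\u00D7');  // first byte became the multiplication sign

        // new String(byte[]) with no charset argument uses this default.
        System.out.println(Charset.defaultCharset());
    }
}
```

If this is what happened, passing the charset explicitly when decoding, and again when writing the String out, would avoid depending on the platform default.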
 
