bytes, chars, and strings, oh my!


David N. Welton

Hello,

I am writing a classloader for a project of mine, where the load portion
of the code takes a string as input. Since defineClass takes a byte[]
as an argument, my first instinct was to use String.getBytes, but then I
realized that that *encodes* the bytes into some sort of format (such as
UTF-8) possibly mangling them, whereas it's getChars that just gives you
back what the string contains with no fuss. Encoding discussions still
make my head spin just a bit, but what ended up working for me was this:

String sdata = argv[2].toString();
int len = sdata.length();
char[] chars = new char[len];
byte[] bytes = new byte[len];
sdata.getChars(0, len, chars, 0);
for (int i = 0; i < chars.length; i++) {
    bytes[i] = (byte) chars[i];
}

and then shipping those bytes off to defineClass, which works just fine.
I got the chars off the disk in the first place, byte by byte, so going
back from char to byte ought to be ok...right? Is there a cleaner way
of doing this, though? Data is read into the String like so:

fis = new BufferedInputStream(new FileInputStream(realfn));
int total;
int ch;
for (total = 0; (ch = fis.read()) != -1; total++) {
    data.append((char) ch);
}
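
As an aside, here is a contrived sketch of the mangling I was worried
about with getBytes (hypothetical values, and exception handling
omitted, but it shows the round-trip failure):

// A byte >= 0x80 does not survive a round trip through UTF-8.
byte[] original = { (byte) 0xCA, (byte) 0xFE }; // first half of the class file magic
String s = new String(original, "UTF-8");       // malformed UTF-8 -> replacement chars
byte[] back = s.getBytes("UTF-8");              // re-encode the replacement chars
System.out.println(back.length);                // prints 6, not 2 -- the bytes are gone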

Thank you for your time,
--
David N. Welton
- http://www.dedasys.com/davidw/

Linux, Open Source Consulting
- http://www.dedasys.com/
 

Thomas Fritsch

David N. Welton said:
I am writing a classloader for a project of mine, where the load portion
of the code takes a string as input. Since defineClass takes a byte[]
as an argument, my first instinct was to use String.getBytes, but then I
realized that that *encodes* the bytes into some sort of format (such as
UTF-8) possibly mangling them, whereas it's getChars that just gives you
back what the string contains with no fuss. Encoding discussions still
make my head spin just a bit, but what ended up working for me was this: [...]

fis = new BufferedInputStream(new FileInputStream(realfn));
int total;
int ch;
for (total = 0; (ch = fis.read()) != -1; total++) {
    data.append((char) ch);
}
As you already sensed, String and char[] are not meant for handling
byte[] data. It is also rather inefficient. And, as you noticed, you
have to carefully avoid any byte-char encoding so that your data
doesn't get corrupted along the way.
The InputStream class (and hence any subclass of it) has read methods not
only for reading a single byte, but also for reading larger chunks of bytes
efficiently. Especially useful for your task is the method

public int read(byte[] bytes, int offset, int length)

There is no need to mess around with char, String, StringBuffer...

My favorite pattern for reading an entire file into a byte[] is like this:

InputStream stream = new BufferedInputStream(new FileInputStream(fileName));
byte[] bytes = new byte[fileSize];
for (int totalLength = 0; totalLength < fileSize; /**/) {
    int n = stream.read(bytes, totalLength, fileSize - totalLength);
    if (n == -1) // EOF?
        throw new EOFException("oops, file is shorter than expected");
    totalLength += n;
}
stream.close();
// now bytes[] is ready to be fed into the ClassLoader by
// defineClass(bytes, 0, fileSize)
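
If it helps, here is the same pattern wrapped up as a self-contained
method (just a sketch; I'm getting fileSize from File.length(), and
putting the close() in a finally block, which you'd want anyway):

import java.io.*;

public static byte[] readWholeFile(String fileName) throws IOException {
    int fileSize = (int) new File(fileName).length(); // assumes the file fits in an int
    InputStream stream = new BufferedInputStream(new FileInputStream(fileName));
    try {
        byte[] bytes = new byte[fileSize];
        for (int totalLength = 0; totalLength < fileSize; /**/) {
            int n = stream.read(bytes, totalLength, fileSize - totalLength);
            if (n == -1) // EOF?
                throw new EOFException("oops, file is shorter than expected");
            totalLength += n;
        }
        return bytes;
    } finally {
        stream.close();
    }
}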
 

David N. Welton

Thomas said:
David N. Welton said:
I am writing a classloader for a project of mine, where the load portion
of the code takes a string as input. Since defineClass takes a byte[]
as an argument, my first instinct was to use String.getBytes, but then I
realized that that *encodes* the bytes into some sort of format (such as
UTF-8) possibly mangling them, whereas it's getChars that just gives you
back what the string contains with no fuss. Encoding discussions still
make my head spin just a bit, but what ended up working for me was this:
[...]

fis = new BufferedInputStream(new FileInputStream(realfn));
int total;
int ch;
for (total = 0; (ch = fis.read()) != -1; total++) {
    data.append((char) ch);
}

As you already sensed, String and char[] are not meant for handling
byte[] data. It is also rather inefficient. And, as you noticed, you
have to carefully avoid any byte-char encoding so that your data
doesn't get corrupted along the way.

Well, let me explain my goals a bit better, perhaps my mind is a bit
clearer at this time of day.

I'm working on a scripting language for Java, called Hecl
(www.hecl.org), and this problem came up while writing a class loader
for it (I suspect I'll have more questions about that :), because
obviously I needed to read data in a binary-clean way.

However, I would like, if possible, to have one method for "read a whole
file", so that normal scripts can use that to slurp up the text and use
it as a String. I don't mind doing a bit of extra processing for the
class loader, as it's not the common case, but I need to have a firm
grasp of what works and what doesn't.
The InputStream class (and hence any subclass of it) has read methods not
only for reading a single byte, but also for reading larger chunks of bytes
efficiently. Especially useful for your task is the method

public int read(byte[] bytes, int offset, int length)

There is no need to mess around with char, String, StringBuffer...

My favorite pattern for reading an entire file into a byte[] is like this:

InputStream stream = new BufferedInputStream(new FileInputStream(fileName));
byte[] bytes = new byte[fileSize];
for (int totalLength = 0; totalLength < fileSize; /**/) {
    int n = stream.read(bytes, totalLength, fileSize - totalLength);
    if (n == -1) // EOF?
        throw new EOFException("oops, file is shorter than expected");
    totalLength += n;
}
stream.close();
// now bytes[] is ready to be fed into the ClassLoader by
// defineClass(bytes, 0, fileSize)

Ok - then I could also use this to transform the bytes into a String by
doing new String(bytes, "some encoding, possibly the system one") for
regular text files, right? I had hoped the extra processing would fall
on the classloader side, which is the rarer case, but your code above
is indeed nice and efficient, and would eliminate the manual copying of
chars to bytes I was doing (is there a way around that one? it's ugly!).
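
Concretely, for the common text-file case I'm picturing something like
this (where readWholeFile stands for your pattern wrapped up in a
method; exception handling omitted):

byte[] bytes = readWholeFile(realfn); // binary-clean; can go straight to defineClass
String text = new String(bytes);      // for text files: decode with the default encoding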

Thank you,
--
David N. Welton
- http://www.dedasys.com/davidw/

Linux, Open Source Consulting
- http://www.dedasys.com/
 

Roedy Green

David N. Welton said:
I got the chars off the disk in the first place, byte by byte, so going
back from char to byte ought to be ok...right?

There is the name of the class, which is a String. And there are the
bytes forming the JVM bytecodes. They have nothing to do with char; any
use of char for them will just screw things up.

To understand encoding see http://mindprod.com/jgloss/encoding.html
I don't think encoding has anything to do with your problem though.

To read raw bytes see http://mindprod.com/applets/fileio.html
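
For example, DataInputStream.readFully gets you the raw bytes with no
chars anywhere in the picture (a sketch using your realfn variable,
exception handling omitted):

File file = new File(realfn);
DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream(file)));
byte[] bytes = new byte[(int) file.length()]; // assumes the file fits in an int
in.readFully(bytes); // fills the whole array or throws EOFException
in.close();
// bytes[] now holds the raw class file, ready for defineClass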
 

Chris Uppal

David said:
However, I would like, if possible, to have one method for "read a whole
file", so that normal scripts can use that to slurp up the text and use
it as a String.

I suspect that you are oversimplifying. In the world of Unicode, there is
simply no useful similarity between binary data and strings. If you try to
pretend that they are somehow "the same thing" (as was perfectly reasonable in
the old days of 7-bit ASCII, or even 8-bit near-ASCII -- if you didn't move
data between languages), then you will come an awful cropper.

The alternative would be to make /all/ of Hecl's "strings" be sequences of
8-bit bytes (and so represent them as byte[] arrays in your implementation).
But then Hecl wouldn't be able to inherit Java's Unicode handling[*], and you
would be pushing the whole problem of character-encoding/national language
support/Unicode handling off onto the users of Hecl.

-- chris

([*] which might not be such a bad thing, since Java's Unicode handling is
fundamentally fucked-up)
 

Thomas Fritsch

David said:
[...]
Ok - then I could also use this to transform the bytes into a String by
then doing new String(bytes, "some encoding, possibly the system one")
for regular text files, right?
That might or might not raise new problems, because the system default
encoding varies from system to system.

I had somewhat similar conceptual problems when I tried to interpret
PostScript files from Java. (PostScript is a language that doesn't
distinguish between byte and char, because it was invented back in the
1980s.)
My solution there was to choose the "ISO-8859-1" (aka ISO-Latin-1)
encoding. "ISO-8859-1" is essentially a no-encoding. Its byte->char
conversion is simply adding a zero high-byte. Its char->byte conversion
is dropping the zero high-byte, and treating all chars beyond '\u00FF'
as illegal (i.e. converting them to byte 63, which is '?').
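
In code, the round trip looks like this (a sketch; exception handling
omitted, and note it is only lossless in the byte -> char -> byte
direction, not for arbitrary chars going the other way):

byte[] raw = { (byte) 0xCA, (byte) 0xFE, 0x00, 0x7F, (byte) 0x80 };
String s = new String(raw, "ISO-8859-1"); // byte -> char: just adds a zero high-byte
byte[] back = s.getBytes("ISO-8859-1");   // char -> byte: drops the zero high-byte
System.out.println(java.util.Arrays.equals(raw, back)); // true, for any byte values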
 
