How to parse and manipulate a binary stream

topcat.nyc

Apologies in advance if my question is silly or trivial. I'm trying to
write a servlet that reads data from another source in byte[] form and,
having parsed this data stream and made a couple of modifications,
sends the modified data to an appropriate application whereby it can be
rendered in Excel or PDF format.

What I need to find out is how I can parse the incoming data, detect
certain string patterns in the data, and manipulate that information to
generate a new data stream.

TIA,
tc
 
Mark Jeffcoat

Not a trivial question, exactly, but a bit on the
vague side. It's difficult for me to tell what you're
having difficulty with.


I'm going to assume that you're stuck getting started.

First, you need to read binary data from a source. That's
exactly the job of an InputStream. It has a read() method
that lets you read directly into a byte array.
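
A minimal sketch of that read loop, assuming the whole input fits in memory. The class name StreamReading and the helper readFully are made up for illustration; only InputStream.read(byte[]) and ByteArrayOutputStream come from the standard library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class, not part of any library.
final class StreamReading {

    // Collects every byte from the stream until end-of-stream is reached.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int count;
        // read(byte[]) returns the number of bytes stored, or -1 at end of
        // stream. It may fill only part of the buffer, hence the loop.
        while ((count = in.read(buffer)) != -1) {
            collected.write(buffer, 0, count);
        }
        return collected.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // A ByteArrayInputStream stands in for the real data source here.
        InputStream source = new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5});
        System.out.println(readFully(source).length + " bytes read");
    }
}
```

In a servlet the source would be whatever stream the other system hands you; the loop itself does not change.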

To parse a portion of a byte[] as a String, you can
just use the constructors in the String class that take
an offset and a length.
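
For instance, something along these lines decodes a slice of the array. This is only a sketch: it assumes the embedded text is plain US-ASCII, and ByteSliceDecoding and decodeAscii are illustrative names. The String(byte[], int, int, Charset) constructor is the standard one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class ByteSliceDecoding {

    // Decodes 'length' bytes starting at 'offset' as US-ASCII text.
    // The charset is an assumption; use whatever encoding the data really has.
    static String decodeAscii(byte[] data, int offset, int length) {
        return new String(data, offset, length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Two header bytes, four text bytes, one trailing non-text byte.
        byte[] record = {0x00, 0x01, 'N', 'A', 'M', 'E', (byte) 0xFF};
        System.out.println(decodeAscii(record, 2, 4)); // prints NAME
    }
}
```

Always pass the charset explicitly; the constructors without one fall back to the platform default, which can differ between machines.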

To write the output, you need an OutputStream. You may want
to subclass it, and write a write() that asks for its next
byte of output from the object responsible for doing the
search-and-replace manipulation.
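
As a rough sketch of the search-and-replace step (done here on the whole buffer before writing, rather than inside an OutputStream subclass), one option is to round-trip the bytes through ISO-8859-1, which maps every byte value to exactly one char and back, so the non-text bytes survive untouched. ByteSearchReplace and replaceAll are names invented for this example, and it assumes both the target and the replacement contain only Latin-1 characters.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class ByteSearchReplace {

    // Replaces every occurrence of 'target' in the raw bytes.
    // Assumption: target and replacement are Latin-1 only; other characters
    // would be turned into '?' when encoding back to bytes.
    static byte[] replaceAll(byte[] data, String target, String replacement) {
        // ISO-8859-1 decodes bytes 0x00-0xFF to chars U+0000-U+00FF one to one.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        String edited = text.replace(target, replacement);
        return edited.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        byte[] input = {0x00, (byte) 0xFE, 'D', 'R', 'A', 'F', 'T', (byte) 0x80};
        byte[] edited = replaceAll(input, "DRAFT", "FINAL");

        // Any OutputStream works; a ByteArrayOutputStream stands in for the
        // servlet response stream in this sketch.
        OutputStream out = new ByteArrayOutputStream();
        out.write(edited);
        System.out.println(edited.length + " bytes written");
    }
}
```

This only helps when the text is stored uncompressed in a single-byte encoding. A replacement of a different length can also break formats that record byte offsets or lengths, as PDF does.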
 
topcat.nyc

Sorry, my question *was* very vaguely worded.

My problem is that the conversion of the input stream into
character/String data doesn't give me anything meaningful - not enough
to parse and manipulate, at any rate. I suppose what I'm wondering is
whether there's any reference material that describes how an encoded
input stream of data (be it for Excel or PDF) can be "translated" into
a String representation in order to do basic String manipulations, and
then re-encoded and passed on to the next application.
 
Matt Humphrey

Source data like Excel and GIFs don't have any natural string equivalent and
cannot be "parsed" in the sense of parsing strings. PDF is largely text but
may have some segments in binary--I don't know offhand how the binary parts
work. To "parse" true binary files you have to know the file structure.
You can go to http://www.wotsit.org/ to get information on file formats.
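
As a small illustration of "knowing the file structure", the first few bytes usually identify the format. The sketch below checks two well-known signatures: PDF files begin with the ASCII text "%PDF-", and legacy .xls workbooks are OLE2 compound files that begin with the bytes D0 CF 11 E0 A1 B1 1A E1. FormatSniffer and its return values are invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
final class FormatSniffer {

    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Signature of an OLE2 compound file, the container used by legacy .xls.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns "pdf", "ole2", or "unknown" based on the leading bytes only.
    static String sniff(byte[] data) {
        if (startsWith(data, PDF_MAGIC)) {
            return "pdf";
        }
        if (startsWith(data, OLE2_MAGIC)) {
            return "ole2";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(sample)); // prints pdf
    }
}
```

An OLE2 signature says only that the file is a compound document, not that it is specifically an Excel workbook; telling those apart means reading further into the structure.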

Matt Humphrey (e-mail address removed) http://www.iviz.com/
 
Mark Jeffcoat

Yeah, okay. I gave you a strategy that will work if you've
got some Strings in an encoding you already understand surrounded
by other miscellaneous bytes that you can ignore; if that's
not the case (which it surely can be, if the binary format
is trying to be clever with how it stores text), you have
a harder problem.

The first thing I'd do is run the Unix program "strings"
(which you can surely find for Windows, if you have to) on
some of the files you're interested in, and see if you're
in the happy case. (It sounds like you've already done
something like that, but a quick second opinion won't hurt.)
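
If "strings" isn't handy, a rough approximation is easy to sketch in Java: collect runs of printable ASCII bytes that are at least some minimum length (the real tool defaults to 4). PrintableRuns and extract are names made up for this example, and the sketch ignores the other encodings the real program can scan for.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
final class PrintableRuns {

    // Returns every run of printable ASCII (0x20-0x7E) of at least minLength.
    static List<String> extract(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            // Java bytes are signed, so values above 0x7F are negative
            // and fail the lower bound check.
            if (b >= 0x20 && b <= 0x7E) {
                current.append((char) b);
            } else {
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Flush a run that reaches the end of the data.
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        byte[] data = {0, 'H', 'e', 'l', 'l', 'o', 1, 'a', 'b', (byte) 0xFF,
                       'W', 'o', 'r', 'l', 'd'};
        System.out.println(extract(data, 4)); // prints [Hello, World]
    }
}
```

If the text you are looking for shows up in the output, you are in the happy case; if nothing recognizable appears, the format is probably compressing or otherwise encoding it.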

If not, you'll have to handle each format you want to
parse separately. I really like the POI library for handling
Excel documents in Java.

http://jakarta.apache.org/poi/


There is surely something similar for PDF, but I've
never had the need of it; your Google will be as
good as mine.
 
topcat.nyc

I figured out what the problem with the PDF data was. The binary stream
that I read in gives me PDF data in compressed form, which I discovered
after running a few tests on it. I downloaded a free tool, pdftk, to
help me uncompress the source data stream, perform my text
manipulations, and then recompress the modified data before passing
them on.
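
For anyone who would rather do this step inside the servlet: PDF content streams are commonly compressed with the Flate (zlib) filter, which java.util.zip can handle. The sketch below is not what pdftk does internally; it only shows inflating and deflating a single stream body that has already been cut out of the file. FlateRoundTrip and its methods are invented names, and locating the streams and updating their recorded lengths is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, not part of any library.
final class FlateRoundTrip {

    // Compresses the bytes in zlib format.
    static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(plain);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    // Decompresses zlib-format bytes; fails on truncated or foreign data.
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported stream");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end(); // release native resources
        }
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] plain = "BT (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] restored = inflate(deflate(plain));
        System.out.println(new String(restored, StandardCharsets.US_ASCII));
    }
}
```

The edit itself would happen between inflate and deflate, on the uncompressed bytes.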

Thanks for your help, guys! I really appreciate it.

- tc
 
