Rhino
A friend and I are having a friendly competition that is causing me some
conceptual confusion. I am hoping someone can help me clarify things a
little.
We are each writing programs to read an input file and count the number of
each distinct character in the file; he is writing his program in
Perl and I am writing mine in Java. The main output of the program will be a
simple list that says the program found so many of each character; we
want to report the letters of the alphabet as well as accented letters,
punctuation, and whitespace characters, including carriage returns and
linefeeds. We have two input files at the moment, a text file and an MP3
file. There is no money or serious rivalry involved; we are simply curious
about how each will look if properly written. We also wonder how the
performance will compare, although that is quite unimportant to both of us.
I have a couple of areas of confusion:
a. character streams vs. byte streams
b. the issue of encoding.
Since I'd like to be able to read any type of file in any language,
including text files, MP3s, and many others, should I always be treating the
input file as a character stream or do I need to somehow detect which ones
are best read as character streams and which are best read as byte streams?
If I need to treat the two types differently, how do I detect which type the
input file is? I would rather not rely on the user knowing whether a file
that he wants to give the program is best suited to being treated as a
character stream or a byte stream. I've read the conceptual information
about this in the Java Tutorial and find that it really doesn't address this
issue clearly.
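To make the byte-stream side of the question concrete, here is a minimal sketch of what I understand the "treat everything as bytes" approach to look like. It counts byte values (0 to 255), not characters, so it should work on any file, MP3 included, but an accented letter stored as several bytes would be counted as several separate values. The class and method names (ByteCounter, countBytes) are just my own placeholders.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class ByteCounter {
    // Counts how often each byte value (0-255) occurs in the stream.
    // No decoding happens here, so the file type and encoding do not matter.
    static long[] countBytes(InputStream in) throws IOException {
        long[] counts = new long[256];
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                // Java bytes are signed; mask to get an index in 0..255.
                counts[buf[i] & 0xFF]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of the input file.
        try (InputStream in = new BufferedInputStream(
                Files.newInputStream(Paths.get(args[0])))) {
            long[] counts = countBytes(in);
            for (int b = 0; b < 256; b++) {
                if (counts[b] > 0) {
                    System.out.printf("0x%02X: %d%n", b, counts[b]);
                }
            }
        }
    }
}
```

If this is right, the byte-stream version never needs to know what kind of file it was given; the open question is whether that output is what we actually want for text files.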
I'm also somewhat concerned about encoding. I honestly don't understand
exactly how encoding works and apologize if this is a dumb question but this
seemed like a good place to get someone to point me to a proper discussion
of this issue. Do I need to know how a file is encoded before I open it and
decide which kind of stream it is? Or is there some way to determine what
encoding the file is using by simply examining the file? Again, I want to be
able to read a file and count the characters without the provider of the
file having to tell me what encoding it uses since the provider, quite
likely, wouldn't know.
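For comparison, here is a rough sketch of the character-stream side as I currently understand it: decode the bytes with an explicitly chosen charset and count the resulting code points. It does not detect the encoding; it assumes one (UTF-8 in main) and fails loudly if the bytes are not valid in it. The names CharCounter and countCodePoints are my own, and reading the whole file into memory is a simplification.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

class CharCounter {
    // Decodes the bytes strictly with the given charset and counts each
    // Unicode code point. Throws CharacterCodingException if the bytes
    // are not valid in that charset, instead of silently substituting.
    static Map<Integer, Long> countCodePoints(byte[] data, Charset cs)
            throws CharacterCodingException {
        CharBuffer chars = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
        Map<Integer, Long> counts = new TreeMap<>();
        chars.toString().codePoints()
                .forEach(cp -> counts.merge(cp, 1L, Long::sum));
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of the input file.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        try {
            // Assumption: the file is UTF-8. Nothing here detects that.
            Map<Integer, Long> counts =
                    countCodePoints(data, StandardCharsets.UTF_8);
            counts.forEach((cp, n) ->
                    System.out.printf("U+%04X: %d%n", cp, n));
        } catch (CharacterCodingException e) {
            System.out.println("Not valid UTF-8; maybe binary or another encoding.");
        }
    }
}
```

What worries me is that the same bytes give different counts under different charsets: the two bytes 0xC3 0xA9 are one character (é) in UTF-8 but two characters in ISO-8859-1, and both decodings succeed. So a successful decode does not seem to prove the guess was right.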