Stream and Encoding Confusion

Rhino

A friend and I are having a friendly competition that is causing me some
conceptual confusion. I am hoping someone can help me clarify things a
little.

We are each writing programs to read an input file and count the number of
each distinct character in the file; he is writing his program in
Perl and I am writing mine in Java. The main output of the program will be a
simple list that says the program found so many of each character; we
want to report the letters of the alphabet as well as accented letters,
punctuation, and whitespace characters, including carriage returns and
linefeeds. We have two input files at the moment, a text file and an MP3
file. There is no money or serious rivalry involved; we are simply curious
about how each will look if properly written. We also wonder how the
performance will compare, although that is quite unimportant to both of us.

I have a couple of areas of confusion:
a. character streams vs. byte streams
b. the issue of encoding.

Since I'd like to be able to read any type of file in any language,
including text files, MP3s, and many others, should I always be treating the
input file as a character stream or do I need to somehow detect which ones
are best read as character streams and which are best read as byte streams?
If I need to treat the two types differently, how do I detect which type the
input file is? I would rather not rely on the user knowing whether a file
that he wants to give the program is best suited to being treated as a
character stream or a byte stream. I've read the conceptual information
about this in the Java Tutorial and find that it really doesn't address this
issue clearly.

I'm also somewhat concerned about encoding. I honestly don't understand
exactly how encoding works and apologize if this is a dumb question but this
seemed like a good place to get someone to point me to a proper discussion
of this issue. Do I need to know how a file is encoded before I open it and
decide which kind of stream it is? Or is there some way to determine what
encoding the file is using by simply examining the file? Again, I want to be
able to read a file and count the characters without the provider of the
file having to tell me what encoding it uses since the provider, quite
likely, wouldn't know.
 
Matt Humphrey

Rhino said:
A friend and I are having a friendly competition that is causing me some
conceptual confusion. I am hoping someone can help me clarify things a
little.

We are each writing programs to read an input file and count the number of
each distinct character in the file; he is writing his program in
Perl and I am writing mine in Java. The main output of the program will be
a simple list that says the program found so many of each character; we
want to report the letters of the alphabet as well as accented letters,
punctuation, and whitespace characters, including carriage returns and
linefeeds. We have two input files at the moment, a text file and an MP3
file. There is no money or serious rivalry involved; we are simply curious
about how each will look if properly written. We also wonder how the
performance will compare, although that is quite unimportant to both of
us.

I have a couple of areas of confusion:
a. character streams vs. byte streams
b. the issue of encoding.

Since I'd like to be able to read any type of file in any language,
including text files, MP3s, and many others, should I always be treating
the input file as a character stream or do I need to somehow detect which
ones are best read as character streams and which are best read as byte
streams? If I need to treat the two types differently, how do I detect
which type the input file is? I would rather not rely on the user knowing
whether a file that he wants to give the program is best suited to being
treated as a character stream or a byte stream. I've read the conceptual
information about this in the Java Tutorial and find that it really
doesn't address this issue clearly.

I'm also somewhat concerned about encoding. I honestly don't understand
exactly how encoding works and apologize if this is a dumb question but
this seemed like a good place to get someone to point me to a proper
discussion of this issue. Do I need to know how a file is encoded before I
open it and decide which kind of stream it is? Or is there some way to
determine what encoding the file is using by simply examining the file?
Again, I want to be able to read a file and count the characters without
the provider of the file having to tell me what encoding it uses since the
provider, quite likely, wouldn't know.

This issue was addressed not long ago here:

http://groups.google.com/group/comp...java.programmer&rnum=1&hl=en#08095f861a95f75a

You can find more about character encoding here

http://mindprod.com/jgloss/encoding.html

In summary, there is no way to perfectly distinguish between character and
non-character data and you must be able to distinguish them in order to use
the right kind of stream. All data is binary data. By convention (common
agreement) some binary patterns are used to represent text, characters
(including digits), numbers of all types (not digits), application data
structures, etc. In particular, the conventions for characters are given
names that identify the encoding--the mapping of byte values or code point
numbers to specific logical characters. You can always read binary data,
but you must know the encoding in order to make any sense out of it.
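To make that concrete, here is a minimal sketch (the class and method names are made up for illustration) that decodes the same two bytes under two different encodings. The bytes do not change; only the interpretation does, and so does the character count.

```java
import java.nio.charset.StandardCharsets;

class SameBytesDifferentText {
    // Interpret the bytes as ISO 8859-1: every byte becomes exactly one character.
    static String asLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Interpret the same bytes as UTF-8: several bytes may form one character.
    static String asUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
        byte[] data = { (byte) 0xC3, (byte) 0xA9 };
        System.out.println("UTF-8 characters:      " + asUtf8(data).length());
        System.out.println("ISO 8859-1 characters: " + asLatin1(data).length());
    }
}
```

Read as UTF-8 the data is one accented letter; read as ISO 8859-1 it is two unrelated characters. Neither answer is "wrong" until you know which encoding the writer used.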

What makes this problem troublesome is that the identity of the encoding is
not in the data itself. Well, it is for some (e.g. XML has an encoding
attribute and some kinds of application data files like GIF start with a
fixed 6-byte signature), but because it's not there for all of them there
is no reliable way to distinguish whether you have (for example) an MP3 file
or some other unknown type of data file. Often you're better off just
checking the file extension, although that won't tell you the text encoding.

But with some smart decisions, you can often guess reasonably (if
imperfectly) at the format. Check out the links above to see how this might
work.
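A minimal sketch of that kind of guessing follows. It is not the code behind the links above; the class name, the method names and the handful of signatures checked are choices made for this illustration. It only looks at the first few bytes, so a file without a recognized signature comes back as "unknown", and a crafted file can fool it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SignatureSniffer {
    // True if data begins with the given prefix bytes.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Best-effort guess from the leading bytes of a file; never authoritative.
    static String guess(byte[] head) {
        if (startsWith(head, ascii("GIF87a")) || startsWith(head, ascii("GIF89a"))) {
            return "GIF image";
        }
        if (startsWith(head, ascii("ID3"))) {
            return "MP3 with ID3v2 tag";
        }
        if (startsWith(head, new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF })) {
            return "text, UTF-8 (byte order mark)";
        }
        if (startsWith(head, new byte[] { (byte) 0xFE, (byte) 0xFF })) {
            return "text, UTF-16 big-endian (byte order mark)";
        }
        if (startsWith(head, new byte[] { (byte) 0xFF, (byte) 0xFE })) {
            return "text, UTF-16 little-endian (byte order mark)";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(guess(ascii("GIF89a and then image data")));
        System.out.println(guess(ascii("plain text with no signature")));
    }
}
```

Most plain text files carry no byte order mark at all, which is exactly why this approach cannot settle the text-encoding question.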

The bottom line is that you have to know the encoding in order to read the
file. To simplify your problem, you could, for example, limit yourselves to
one of the standard encodings (UTF-8, UTF-16) and work just with text. Or
designate one or two kinds of specific file types, e.g. MP3. No one has a
general-purpose interpreter that will give the correct answer to this
question for every file everywhere. And if they did I can make it give the
wrong answer by cooking up a data file for any new format that happens to
correspond to any existing format.

Cheers,
Matt Humphrey (e-mail address removed) http://www.iviz.com/
 
Chris Smith

Rhino said:
A friend and I are having a friendly competition that is causing me some
conceptual confusion. I am hoping someone can help me clarify things a
little.

We are each writing programs to read an input file and count the number of
each distinct character in the file; he is writing his program in
Perl and I am writing mine in Java.

So clearly, the first thing you need to do is discover which character
encoding your friend and you are agreeing to read. Some probable
answers include:

1. ASCII, which is "US-ASCII" as a Java encoding string. However, this
encoding has no accented characters.

2. ISO 8859-1, which amounts to just the lower 256 characters from the
Unicode character set, adopted as a trivial single-byte encoding. Easy
enough.

3. The platform default, which is what you get if you create an
InputStreamReader without an encoding parameter. This is a bad idea for
files, so one would hope that it's not the choice. But hey, the world
is an imperfect place.
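Once an encoding has been agreed, the Java side might look roughly like the sketch below. The class name and the command-line convention (file name, then encoding name) are assumptions for illustration. The important part is that the charset is passed to InputStreamReader explicitly instead of falling back on the platform default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

class CharCounter {
    // Counts each char (UTF-16 code unit) produced by decoding the stream
    // with the given charset. Characters outside the Basic Multilingual
    // Plane would show up as two surrogate chars in this simple version.
    static Map<Character, Integer> count(InputStream in, Charset charset) throws IOException {
        Map<Character, Integer> counts = new TreeMap<>();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                counts.merge((char) c, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java CharCounter input.txt ISO-8859-1
        Charset charset = Charset.forName(args[1]);
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (Map.Entry<Character, Integer> e : count(in, charset).entrySet()) {
                // Print the code point so CR, LF and other invisibles are readable.
                System.out.printf("U+%04X: %d%n", (int) e.getKey().charValue(), e.getValue());
            }
        }
    }
}
```

Running it twice on the same file with different encoding names will generally give different lists, which is the whole point of agreeing on the encoding first.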

Rhino said:
We have two input files at the moment, a text file and an MP3
file.

Note that it is inherently nonsensical to talk about counting the number
of characters, letters, accented letters, etc in an MP3 file. There is
simply no meaning to those words; sorta like me telling you that I'm
going to count the number of rocks that exist in the conceptual idea of
peace. The MP3 file contains only bytes, and they cannot be validly
interpreted as characters. The fact that this is part of your friendly
competition does not bode well.

You need to find out what is meant by this. Perhaps what is meant is
"if I were to pretend that the MP3 file were text, and open it using
<insert some editor>, how many accented characters would I see?" In
that case, you'd need to experiment with that editor and see what
encoding it assumes when opening a text file. If your encoding does not
cover all possible byte sequences (for example, UTF-8 will fail to
decode certain combinations of trailing bytes at the end of a string),
you should find out what the application is supposed to do.
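As a sketch of how that failure can be observed in Java (the class and method names here are invented for the example): a CharsetDecoder configured with CodingErrorAction.REPORT throws when the bytes are not valid in the chosen encoding. By contrast, new String(bytes, charset) and InputStreamReader quietly substitute a replacement character, so the problem is easy to miss.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecoder {
    // Returns true only if every byte sequence in data is valid in charset.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] complete = { (byte) 0xC3, (byte) 0xA9 }; // valid two-byte UTF-8 sequence
        byte[] truncated = { (byte) 0xC3 };             // lead byte with nothing after it
        System.out.println(decodesCleanly(complete, StandardCharsets.UTF_8));
        System.out.println(decodesCleanly(truncated, StandardCharsets.UTF_8));
        // ISO 8859-1 assigns a character to every byte value, so it never fails.
        System.out.println(decodesCleanly(truncated, StandardCharsets.ISO_8859_1));
    }
}
```

What the program should do when this returns false (skip, count as "invalid", abort) is one of the requirements that has to be agreed, not something the API can decide.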

I can almost guarantee, of course, that you'll discover that your friend
hasn't thought about any of these things. Nevertheless, you need to
know them in order to write the desired code.

Rhino said:
Since I'd like to be able to read any type of file in any language,
including text files, MP3s, and many others, should I always be treating the
input file as a character stream or do I need to somehow detect which ones
are best read as character streams and which are best read as byte streams?

Matt posted a link to a very simple piece of code for estimating a rough
score that correlates with the likelihood that a file is encoded in
ASCII (or, really, any ASCII superset). There's no real correct answer
here, though.
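For illustration only, and not the code Matt linked to: one very crude score of this kind is the fraction of bytes that are printable ASCII or common whitespace. The class name, the choice of which bytes count as "plausible", and any cutoff you might apply to the result are all assumptions of this sketch.

```java
import java.nio.charset.StandardCharsets;

class AsciiScore {
    // Fraction (0.0 to 1.0) of bytes that are printable ASCII, tab, CR or LF.
    // An empty input scores 0.0 by convention in this sketch.
    static double score(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int plausible = 0;
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean printable = v >= 0x20 && v <= 0x7E;
            boolean whitespace = v == '\t' || v == '\n' || v == '\r';
            if (printable || whitespace) {
                plausible++;
            }
        }
        return (double) plausible / data.length;
    }

    public static void main(String[] args) {
        System.out.println(score("Hello, world\r\n".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(score(new byte[] { 0x00, (byte) 0xFF, 'a', 'b' }));
    }
}
```

A high score only says the file is consistent with ASCII text; compressed or binary data can score high by accident, and text in other encodings can score low.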

Rhino said:
Do I need to know how a file is encoded before I open it and
decide which kind of stream it is?

Yes. Again, you could scan through the file and try to estimate, but
you'd need some complex statistical methods and results from simple
linguistics of various human languages, for example, to tell the
difference between the various ISO 8859 encodings. It won't be easy.
You are supposed to know in advance.
 
Chris Smith

Chris Smith said:
So clearly, the first thing you need to do is discover which character
encoding your friend and you are agreeing to read.

It's worth noting, in case it wasn't clear, that this is not an issue
with the Java programming language or API. It is the requirements, not
the implementation, that are unclear. If some other API fails to make
it clear that this decision must be made, then the fault lies in that
other API.
 
Chris Uppal

Rhino said:
We are each writing programs to read an input file and count the number of
each distinct character in the file; he is writing his program in
Perl and I am writing mine in Java. The main output of the program will be
a simple list that says the program found so many of each character; we
want to report the letters of the alphabet as well as accented letters,
punctuation, and whitespace characters, including carriage returns and
linefeeds. We have two input files at the moment, a text file and an MP3
file. There is no money or serious rivalry involved; we are simply curious
about how each will look if properly written.

Expanding a bit on the earlier replies...

I suggest you change the challenge a little. As it stands -- and as Chris
Smith has explained -- it simply isn't a coherent task. As such it can't have
a "properly written" solution in Java; the nearest you can get is a badly
designed program which doesn't do what it might look (to the naive) as if it's
doing.

Exactly the same problem applies to the Perl program. I don't know enough
about modern Perl to know whether it is even /possible/ to solve it correctly
in Perl. I'm pretty sure it was impossible when I last looked at Perl (and
shuddered and looked away again quick), but Perl has changed a lot since then.

As I said, I suggest you change the terms of the challenge. Maybe the
following (which /is/ well-posed) would suit you and your friend (or rival ;-).

0) Given a file, produce a list of how often each byte value occurs in it.
I.e. interpret it as binary. (You may have to agree in advance on whether you
treat bytes as signed or unsigned).

1) Given a file, /and/ an assumption about its character encoding, produce a
simple list [..etc...] Presumably the encoding would be specified on the
command-line along with the file name. I suggest that you make UTF-16 the
default (which may help you to keep the difference between the binary bytes in
the file, and their interpretation as characters, clear in your mind).

2) (For extra credit ;-) Given a file, attempt to guess what encoding it is
in, using whatever heuristics come to mind.
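Task 0 is small enough to sketch in full. In this version (names invented for the example) bytes are treated as unsigned, so the table has 256 slots indexed 0 to 255; Java's byte type is signed, hence the `& 0xFF` mask. It works identically on a text file and an MP3, because no decoding is involved.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class ByteHistogram {
    // Returns a 256-entry table: counts[v] is how often byte value v occurs.
    static long[] count(InputStream in) throws IOException {
        long[] counts = new long[256];
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            for (int i = 0; i < n; i++) {
                counts[buffer[i] & 0xFF]++; // mask to get the unsigned value
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java ByteHistogram somefile.mp3
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            long[] counts = count(in);
            for (int value = 0; value < counts.length; value++) {
                if (counts[value] > 0) {
                    System.out.printf("0x%02X: %d%n", value, counts[value]);
                }
            }
        }
    }
}
```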

-- chris
 
Rhino

Rhino said:
A friend and I are having a friendly competition that is causing me some
conceptual confusion. I am hoping someone can help me clarify things a
little.
[snip]

Thank you all for your very valuable and helpful replies to my question. I
concur completely with Chris Smith that this is entirely a problem of
defining the requirements of the program and does not demonstrate any
inadequacy with the Java language or the API.

My friend and I will discuss this and figure out how to make the problem
solvable. This really is a friendly challenge and neither of us is looking
to make this into a major consumer of our time. I expect that we will either
restrict the files to particular formats or insist that the type of file be
supplied as an input parameter.
 
Matt Humphrey

Chris Smith said:
Note that it is inherently nonsensical to talk about counting the number
of characters, letters, accented letters, etc in an MP3 file. There is
simply no meaning to those words; sorta like me telling you that I'm
going to count the number of rocks that exist in the conceptual idea of
peace. The MP3 file contains only bytes, and they cannot be validly
interpreted as characters. The fact that this is part of your friendly
competition does not bode well.

I certainly agree that it's nonsensical to interpret an arbitrary file (of
unknown type) as characters. Do you mean to say that MP3 contains no
character data at all? I would expect the title and artist strings to be in
there somewhere. Plenty of specific data structures (e.g. Excel, Word,
GIFs) have parts that can be legitimately decoded as character data. Of
course, to count those characters you would have to know in advance the file
structure, where to find the strings and what their encoding is, which is
where this whole thing started.

Cheers,
Matt Humphrey (e-mail address removed) http://www.iviz.com/
 
Chris Smith

Matt Humphrey said:
I certainly agree that it's nonsensical to interpret an arbitrary file (of
unknown type) as characters. Do you mean to say that MP3 contains no
character data at all? I would expect the title and artist strings to be in
there somewhere. Plenty of specific data structures (e.g. Excel, Word,
GIFs) have parts that can be legitimately decoded as character data. Of
course, to count those characters you would have to know in advance the file
structure, where to find the strings and what their encoding is, which is
where this whole thing started.

Oddly enough, I stayed up most of the night thinking about that
paragraph that I wrote! :) I no longer agree with it.

First and foremost, the meaning was intended to be more like: it is
inherently non-sensical to interpret an MP3 file as if the file itself
were a stream of characters. There is potentially some character data
in an MP3, though I don't personally know whether an MP3 includes the
title or author or not. That character data may well be compressed, of
course, so you are perhaps unlikely to get to it merely by reading the
file as if it were ASCII or something like that.

Going further, though, it's not necessarily *inherently* non-sensical to
do so. It's merely that the necessary character encodings to do so --
in a way that would be sensical -- would not reside within the standard
set recognized or implemented by the Java programming language. Reading
an MP3 file in ISO 8859-1 or the like is non-sensical, but only because
ANY encoding exhibits sense only when the software that created the file
was aware of the same encoding. I would be shocked to find that there's
any reasonable piece of knowledge that can be gained from knowing how
many characters belong to the set of accented letters in the ISO 8859-1
interpretation of an MP3 file. However, it is of course possible to
construct a meaningful textual representation of the data contained
within an MP3 file, and for at least certain straight-forward ways of
doing so (in fact, I tentatively believe until someone provides a good
reason to the contrary, for all ways of doing so), the result may be
reasonably described as a complex kind of character encoding.

In other words, I fear that I overstated the uniqueness of textual data.
In fact, there's nothing special about text at all; it's just yet
another in the infinite list of semantic interpretations that can be
assigned to any binary file. It just happens to be common enough that
Java provides implementations for some types of character encodings.

So, to be entirely clear, there is NOTHING special about text. However,
you probably don't want to read an MP3 file as if it were text; and if
you do, the Java standard API doesn't provide the tools to do so in
particularly useful ways.
 
Oliver Wong

Matt Humphrey said:
Chris Smith said:
Rhino said:
We are each writing programs to read an input file and count the number of
each distinct character in the file [...]
We have two input files at the moment, a text file and an MP3
file.

Note that it is inherently nonsensical to talk about counting the number
of characters, letters, accented letters, etc in an MP3 file. There is
simply no meaning to those words; sorta like me telling you that I'm
going to count the number of rocks that exist in the conceptual idea of
peace. The MP3 file contains only bytes, and they cannot be validly
interpreted as characters. The fact that this is part of your friendly
competition does not bode well.

I certainly agree that it's nonsensical to interpret an arbitrary file (of
unknown type) as characters. Do you mean to say that MP3 contains no
character data at all? I would expect the title and artist strings to be
in there somewhere. Plenty of specific data structures (e.g. Excel, Word,
GIFs) have parts that can be legitimately decoded as character data. Of
course, to count those characters you would have to know in advance the
file structure, where to find the strings and what their encoding is,
which is where this whole thing started.

My first interpretation of the requirements was that the entire mp3 file
be read in as a stream of bytes, and then somehow decoded into a sequence of
characters. It's certainly "do-able", but it's also nonsensical, as Chris
pointed out.

Of course, your interpretation, Matt, of decoding the ID3 data is
feasible as well. Rhino might want to get some clarification (or clarify for
us, if he already knows) on this point.

- Oliver
 
Dale King

Chris said:
Exactly the same problem applies to the Perl program. I don't know enough
about modern Perl to know whether it is even /possible/ to solve it correctly
in Perl. I'm pretty sure it was impossible when I last looked at Perl (and
shuddered and looked away again quick), but Perl has changed a lot since then.

I don't know for certain with Perl, but when I was doing a Linux
install this week I noticed that one of the packages installed was
perl-unicode which reminded me of this thread. So it would seem that
Perl has some form of support for Unicode.
 
