Windows nuisance character in .txt files

Dave Stallard

I ran into a problem last week with some Java code I have that reads in
a text file, parses it, and builds some data structures. It had worked
fine for two years, running on Win2K and earlier WinXP machines, but on
a new WinXP laptop it failed (in front of customers, naturally),
claiming that a well-formed input text file was ill-formed. The text
file had been created with Notepad.

I looked at the file with Emacs and found three initial characters that
were outside the normal ASCII set. When I put some logging into the
file parsing code (which reads the file as UTF-8), I found a single
initial character, Unicode code point 65,279. (Evidently, the UTF-8
reader saw the three bytes as this one Unicode character.) I repeated
this experiment many times with the same result. It is clear that
Windows was inserting this character, whatever it is, as a kind of
header into every text file it makes.

I had previously run into this problem with Notepad writing files out in
UTF-8, but never before when writing simple ASCII txt files. Has
anybody else seen this?

Dave
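
A quick way to see those leading bytes for yourself is to dump them in
hex. A rough sketch (the class name is invented); a UTF-8 BOM shows up
as ef bb bf:

    import java.io.FileInputStream;
    import java.io.IOException;

    public class ShowLeadingBytes {
        public static void main(String[] args) throws IOException {
            FileInputStream in = new FileInputStream(args[0]);
            // A UTF-8 BOM is the three bytes EF BB BF at the very start.
            for (int i = 0; i < 3; i++) {
                int b = in.read();
                if (b < 0) break; // file shorter than three bytes
                System.out.println(Integer.toHexString(b));
            }
            in.close();
        }
    }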
 
Steve Horsley

Dave said:
I looked at the file with Emacs and found three initial characters that
were outside the normal ASCII set. When I put some logging into the
file parsing code (which reads the file as UTF-8), I found a single
initial character, Unicode code point 65,279. [...] Has anybody else
seen this?

The character is a Unicode byte order mark (BOM). It is used to
indicate the byte order (big-endian or little-endian) of a text file. I
believe you should ignore this character in your application. I assumed
a Reader set for Unicode would discard it as non-text, but maybe it
doesn't.

Steve
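
For reference, the BOM is the character U+FEFF, and its byte signature
depends on the encoding:

    UTF-8       EF BB BF
    UTF-16BE    FE FF
    UTF-16LE    FF FE

In UTF-8 the byte order is fixed, so the BOM carries no ordering
information there; it just marks the file as UTF-8.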
 
Dave Stallard

Steve said:
The character is a Unicode byte order mark (BOM). It is used to
indicate the byte order (big-endian or little-endian) of a text file. I
believe you should ignore this character in your application. I assumed
a Reader set for Unicode would discard it as non-text, but maybe it
doesn't.

Steve,

Ah! Thanks. I looked up BOM and it (U+FEFF) is indeed my nuisance
char. The odd thing is that I am reading the file with:

new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))

and am still seeing the BOM as a character. Like you, I would have
thought that it would be ignored, but apparently it is not.

Dave
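
One possible workaround, as a minimal sketch (the class and method
names here are invented): wrap the reader construction and consume a
leading U+FEFF if one is present.

    import java.io.*;

    public class BomSkippingReader {
        // Returns a reader positioned just past a leading BOM, if any.
        public static Reader open(File file) throws IOException {
            BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"));
            r.mark(1);             // remember the start of the stream
            if (r.read() != 0xFEFF) {
                r.reset();         // first char was not a BOM: put it back
            }
            return r;
        }
    }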
 
Steve Jasper

Yeah, this has happened to me quite a few times as well. I've taken to
using the dos2unix utility in my Ant build scripts to strip out those
extra characters before using the file.

Perhaps you could try the same? I found a Windows Dos2Unix.exe that
will accomplish this on Windows for you as well.
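
If you would rather strip the mark yourself than depend on dos2unix,
something like this untested sketch copies a file while dropping a
leading UTF-8 BOM (class name and argument handling are my own
invention):

    import java.io.*;

    public class StripBom {
        public static void main(String[] args) throws IOException {
            InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
            OutputStream out = new BufferedOutputStream(new FileOutputStream(args[1]));
            in.mark(3);
            // If the first three bytes are not EF BB BF, rewind and keep them.
            if (!(in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF)) {
                in.reset();
            }
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
            out.close();
            in.close();
        }
    }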
 
Steve Horsley

Dave said:
Ah! Thanks. I looked up BOM and it (U+FEFF) is indeed my nuisance
char. [...] Like you, I would have thought that it would be ignored,
but apparently it is not.

That's a bummer. I use readLine() and startsWith(keyword) in several
places. That will break for line 1 of a config file. I need to
re-visit some code.

Steve
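
A low-tech fix for the readLine() case might be to strip the mark from
the first line before matching; U+FEFF is a single char in Java, so
something like this fragment (untested) would do it:

    String line = reader.readLine();
    if (line != null && line.startsWith("\uFEFF")) {
        line = line.substring(1); // drop the BOM before startsWith(keyword)
    }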
 
Dave Stallard

Steve said:
That's a bummer. I use readLine() and startsWith(keyword) in several
places. That will break for line 1 of a config file. I need to re-visit
some code.

Exactly what was happening to me. I'll be curious to see if you see the
problem also. Like I said, this was a newer XP Pro laptop - older ones
I've put this code on didn't have the problem.

Dave
 
Roedy Green

Dale King said:
Unfortunately Sun gives you very little help in this regard. What they
should do is provide a character encoding that looks for an initial
byte order mark and, if present, handles the rest of the input
according to the byte order mark. If no BOM is present it would revert
to the default for the given platform. Furthermore, this should be the
default encoding, particularly for things like javac.

It does seem strange that Java, being so thoroughly Unicode and so
thoroughly multiplatform, still wants only natively encoded 8-bit
source files that are not particularly portable.
 
Roedy Green

I wrote:
It does seem strange that Java, being so thoroughly Unicode and so
thoroughly multiplatform, still wants only natively encoded 8-bit
source files that are not particularly portable.

javac has a switch:

    -encoding <encoding>    Specify character encoding used by source files

The problem is that encodings are not self-identifying. javac needs
extra help to know what sort of encoding is being used; however, a BOM
in the source code is a pretty good hint as to what flavour of Unicode
is being used.

See http://mindprod.com/jgloss/encoding.html
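
For example, to compile a source file saved as UTF-8 (MyClass.java is
just a placeholder name):

    javac -encoding UTF-8 MyClass.java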
 
Roedy Green

The problem is that encodings are not self-identifying.

At some point a file is going to have to come with a descriptor object
that tells you what format it is. For a simple text file it would
specify the encoding.

For XML it would tell you the DTD.

The descriptor would possibly contain a digital signature of the file,
copyright info, etc., all in a standard format so it can be processed
by computer.

It would tell you what program created it or owns it, and who its
human owner is.

It might tell you the URL where the master copy resides, and a
globally unique file number/name for use in bulk file distribution from
decentralised sources.

The idea is that the descriptor would contain enough information, or
pointers to enough information, that you could always view a file and
usually even edit it with generic tools.
 
Sudsy

Roedy Green wrote:
The idea is that the descriptor would contain enough information, or
pointers to enough information, that you could always view a file and
usually even edit it with generic tools.

Hmmm...almost sounds like a resource fork and a data fork.
Now where have I seen that before? MacOS?
 
Dale King

Hello, Steve Jasper!
You said:
Yeah, this has happened to me quite a few times as well. I've taken to
using the dos2unix utility in my Ant build scripts to strip out those
extra characters before using the file.

Perhaps you could try the same? I found a Windows Dos2Unix.exe that
will accomplish this on Windows for you as well.

These are not nuisance characters. If the file is a text file in
some form of UTF encoding, it should have a BOM. Programs should
recognize the BOM and handle it appropriately. Even lowly Windoze
Notepad is smart enough to do that.

Trying to find ways to strip them from the file is not the way to
do it.

Unfortunately Sun gives you very little help in this regard. What they
should do is provide a character encoding that looks for an initial
byte order mark and, if present, handles the rest of the input
according to the byte order mark. If no BOM is present it would revert
to the default for the given platform. Furthermore, this should be the
default encoding, particularly for things like javac.

Perhaps I should just go ahead and create my own, since Sun refuses to
do so, and in 1.4 we can create charsets.
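
Short of a full Charset implementation, the sniffing idea can be
sketched with a PushbackInputStream: peek at the first bytes, pick the
matching encoding, and fall back to the platform default when no BOM is
found. All names below are invented for illustration.

    import java.io.*;

    public class BomSniffer {
        // Returns a Reader in the encoding named by a leading BOM,
        // or in the platform default encoding when no BOM is found.
        public static Reader open(InputStream raw) throws IOException {
            PushbackInputStream in = new PushbackInputStream(raw, 3);
            byte[] head = new byte[3];
            int n = in.read(head, 0, 3);
            if (n == 3 && (head[0] & 0xFF) == 0xEF
                       && (head[1] & 0xFF) == 0xBB
                       && (head[2] & 0xFF) == 0xBF) {
                // UTF-8 BOM already consumed
                return new InputStreamReader(in, "UTF-8");
            }
            if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
                if (n == 3) in.unread(head, 2, 1); // keep the byte after the BOM
                return new InputStreamReader(in, "UTF-16BE");
            }
            if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
                if (n == 3) in.unread(head, 2, 1);
                return new InputStreamReader(in, "UTF-16LE");
            }
            if (n > 0) in.unread(head, 0, n); // no BOM: push everything back
            return new InputStreamReader(in);
        }
    }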
 
Dale King

Hello, Roedy Green!

It baffles me that they can't get this one simple thing right.

You said:
javac has a switch:

    -encoding <encoding>    Specify character encoding used by source files

Which is pretty much useless unless every one of your source files uses
the same encoding. And that is only one tool. What about other tools
like javadoc or rmic, and other files like properties files?
You said:
The problem is that encodings are not self-identifying. javac needs
extra help to know what sort of encoding is being used; however, a BOM
in the source code is a pretty good hint as to what flavour of Unicode
is being used.

That's the point. A file with a BOM is more than a hint; it is
self-identifying. It's not as though those bytes at the beginning of a
Java source file could have any other valid meaning to the compiler.
 
