Windows nuisance character in .txt files

Dave Stallard

I ran into a problem last week with some Java code I have that reads in
a text file, parses it, and builds some data structures. It had worked
fine for two years, running on Win2K and earlier WinXP machines, but on
a new WinXP laptop it failed (in front of customers, naturally),
claiming that a well-formed input text file was ill-formed. The text
file had been created with Notepad.

I looked at the file with Emacs and found three initial characters that
were outside the normal ASCII set. When I put some logging into the
file parsing code (which reads the file as UTF-8), I found a single
initial character, Unicode code point 65,279. (Evidently, the UTF-8
reader saw the three bytes as this one Unicode character.) I repeated
this experiment many times with the same result. It is clear that
Windows was inserting this character, whatever it is, as a kind of
header into every text file it makes.

I had previously run into this problem with Notepad writing files out in
UTF-8, but never before when writing simple ASCII txt files. Has
anybody else seen this?

Dave
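
A quick way to see those leading bytes for yourself is to dump them in
hex. A rough sketch (the class name is invented); a UTF-8 BOM shows up
as ef bb bf:

    import java.io.FileInputStream;
    import java.io.IOException;

    public class ShowLeadingBytes {
        public static void main(String[] args) throws IOException {
            FileInputStream in = new FileInputStream(args[0]);
            // A UTF-8 BOM is the three bytes EF BB BF at the very start.
            for (int i = 0; i < 3; i++) {
                int b = in.read();
                if (b < 0) break; // file shorter than three bytes
                System.out.println(Integer.toHexString(b));
            }
            in.close();
        }
    }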
 
Steve Horsley

Dave said:
I looked at the file with Emacs and found three initial characters that
were outside the normal ASCII set. When I put some logging into the
file parsing code (which reads the file as UTF-8), I found a single
initial character, Unicode code point 65,279. [...] Has anybody else
seen this?

The character is a Unicode byte order mark (BOM). It is used to
indicate the byte order (big-endian or little-endian) of a text file. I
believe you should ignore this character in your application. I assumed
a Reader set for Unicode would discard it as non-text, but maybe it
doesn't.

Steve
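
For reference, the BOM is the character U+FEFF, and its byte signature
depends on the encoding:

    UTF-8       EF BB BF
    UTF-16BE    FE FF
    UTF-16LE    FF FE

In UTF-8 the byte order is fixed, so the BOM carries no ordering
information there; it just marks the file as UTF-8.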
 
Dave Stallard

Steve said:
The character is a Unicode byte order mark (BOM). It is used to
indicate the byte order (big-endian or little-endian) of a text file. I
believe you should ignore this character in your application. I assumed
a Reader set for Unicode would discard it as non-text, but maybe it
doesn't.

Steve,

Ah! Thanks. I looked up BOM and it (U+FEFF) is indeed my nuisance
char. The odd thing is that I am reading the file with:

new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))

and am still seeing the BOM as a character. Like you, I would have
thought that it would be ignored, but apparently it is not.

Dave
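
One possible workaround, as a minimal sketch (the class and method
names here are invented): wrap the reader construction and consume a
leading U+FEFF if one is present.

    import java.io.*;

    public class BomSkippingReader {
        // Returns a reader positioned just past a leading BOM, if any.
        public static Reader open(File file) throws IOException {
            BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"));
            r.mark(1);             // remember the start of the stream
            if (r.read() != 0xFEFF) {
                r.reset();         // first char was not a BOM: put it back
            }
            return r;
        }
    }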
 
Steve Jasper

Yeah, this has happened to me quite a few times as well. I've taken to
using the dos2unix utility in my Ant build scripts to strip out those
extra characters before using the file.

Perhaps you could try the same? I found a Windows Dos2Unix.exe that
will accomplish this on Windows for you as well.
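
If you would rather strip the mark yourself than depend on dos2unix,
something like this untested sketch copies a file while dropping a
leading UTF-8 BOM (class name and argument handling are my own
invention):

    import java.io.*;

    public class StripBom {
        public static void main(String[] args) throws IOException {
            InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
            OutputStream out = new BufferedOutputStream(new FileOutputStream(args[1]));
            in.mark(3);
            // If the first three bytes are not EF BB BF, rewind and keep them.
            if (!(in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF)) {
                in.reset();
            }
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
            out.close();
            in.close();
        }
    }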
 
Steve Horsley

Dave said:
Ah! Thanks. I looked up BOM and it (U+FEFF) is indeed my nuisance
char. [...] Like you, I would have thought that it would be ignored,
but apparently it is not.

That's a bummer. I use readLine() and startsWith(keyword) in several
places. That will break for line 1 of a config file. I need to
re-visit some code.

Steve
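
A low-tech fix for the readLine() case might be to strip the mark from
the first line before matching; U+FEFF is a single char in Java, so
something like this fragment (untested) would do it:

    String line = reader.readLine();
    if (line != null && line.startsWith("\uFEFF")) {
        line = line.substring(1); // drop the BOM before startsWith(keyword)
    }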
 
Dave Stallard

Steve said:
That's a bummer. I use readLine() and startsWith(keyword) in several
places. That will break for line 1 of a config file. I need to re-visit
some code.

Exactly what was happening to me. I'll be curious to see if you see the
problem also. Like I said, this was a newer XP Pro laptop - older ones
I've put this code on didn't have the problem.

Dave
 
Roedy Green

Dale King said:
Unfortunately Sun gives you very little help in this regard. What they
should do is provide a character encoding that looks for an initial
byte order mark and, if present, handles the rest of the input
according to the byte order mark. If no BOM is present it would revert
to the default for the given platform. Furthermore, this should be the
default encoding, particularly for things like javac.

It does seem strange that Java, being so thoroughly Unicode and so
thoroughly multiplatform, still wants only natively encoded 8-bit
source files that are not particularly portable.
 
Roedy Green

I wrote:
It does seem strange that Java, being so thoroughly Unicode and so
thoroughly multiplatform, still wants only natively encoded 8-bit
source files that are not particularly portable.

javac has a switch:

    -encoding <encoding>    Specify character encoding used by source files

The problem is that encodings are not self-identifying. javac needs
extra help to know what sort of encoding is being used; however, a BOM
in the source code is a pretty good hint as to what flavour of Unicode
is being used.

See http://mindprod.com/jgloss/encoding.html
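
For example, to compile a source file saved as UTF-8 (MyClass.java is
just a placeholder name):

    javac -encoding UTF-8 MyClass.java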
 
Roedy Green

The problem is that encodings are not self-identifying.

At some point a file is going to have to come with a descriptor object
that tells you what format it is. For a simple text file it would
specify the encoding.

For XML it would tell you the DTD.

The descriptor would possibly contain a digital signature of the file,
copyright info, etc., all in a standard format so it can be processed
by computer.

It would tell you what program created it or owns it, and who its
human owner is.

It might tell you the URL where the master copy resides, and a
globally unique file number/name for use in bulk file distribution from
decentralised sources.

The idea is that the descriptor would contain enough information, or
pointers to enough information, that you could always view a file and
usually even edit it with generic tools.
 
Sudsy

Roedy Green wrote:
The idea is that the descriptor would contain enough information, or
pointers to enough information, that you could always view a file and
usually even edit it with generic tools.

Hmmm...almost sounds like a resource fork and a data fork.
Now where have I seen that before? MacOS?
 
Dale King

Hello, Steve Jasper!
You said:
Yeah, this has happened to me quite a few times as well. I've taken to
using the dos2unix utility in my Ant build scripts to strip out those
extra characters before using the file.

Perhaps you could try the same? I found a Windows Dos2Unix.exe that
will accomplish this on Windows for you as well.

These are not nuisance characters. If the file is a text file in
some form of UTF encoding, it should have a BOM. Programs should
recognize the BOM and handle it appropriately. Even lowly Windoze
Notepad is smart enough to do that.

Trying to find ways to strip them from the file is not the way to
do it.

Unfortunately Sun gives you very little help in this regard. What they
should do is provide a character encoding that looks for an initial
byte order mark and, if present, handles the rest of the input
according to the byte order mark. If no BOM is present it would revert
to the default for the given platform. Furthermore, this should be the
default encoding, particularly for things like javac.

Perhaps I should just go ahead and create my own, since Sun refuses to
do so, and in 1.4 we can create charsets.
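
Short of a full Charset implementation, the sniffing idea can be
sketched with a PushbackInputStream: peek at the first bytes, pick the
matching encoding, and fall back to the platform default when no BOM is
found. All names below are invented for illustration.

    import java.io.*;

    public class BomSniffer {
        // Returns a Reader in the encoding named by a leading BOM,
        // or in the platform default encoding when no BOM is found.
        public static Reader open(InputStream raw) throws IOException {
            PushbackInputStream in = new PushbackInputStream(raw, 3);
            byte[] head = new byte[3];
            int n = in.read(head, 0, 3);
            if (n == 3 && (head[0] & 0xFF) == 0xEF
                       && (head[1] & 0xFF) == 0xBB
                       && (head[2] & 0xFF) == 0xBF) {
                // UTF-8 BOM already consumed
                return new InputStreamReader(in, "UTF-8");
            }
            if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
                if (n == 3) in.unread(head, 2, 1); // keep the byte after the BOM
                return new InputStreamReader(in, "UTF-16BE");
            }
            if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
                if (n == 3) in.unread(head, 2, 1);
                return new InputStreamReader(in, "UTF-16LE");
            }
            if (n > 0) in.unread(head, 0, n); // no BOM: push everything back
            return new InputStreamReader(in);
        }
    }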
 
Dale King

Hello, Roedy Green!

It baffles me that they can't get this one simple thing right.

You said:
javac has a switch:

    -encoding <encoding>    Specify character encoding used by source files

Which is pretty much useless unless every one of your source files uses
the same encoding. And that is only one tool. What about other tools
like javadoc or rmic, and other files like properties files?
You said:
The problem is that encodings are not self-identifying. javac needs
extra help to know what sort of encoding is being used; however, a BOM
in the source code is a pretty good hint as to what flavour of Unicode
is being used.

That's the point. A file with a BOM is more than a hint; it is
self-identifying. It's not as though those bytes at the beginning of a
Java source file could have any other valid meaning to the compiler.
 
