removing BOM prepended by codecs?

J. Bagg

I'm having trouble with the BOM that is now prepended to codecs files.
The files have to be read by java servlets which expect a clean file
without any BOM.

Is there a way to stop the BOM being written?

It is seriously messing up my work as the servlets do not expect it to
be there. I could delete it but that means another delay in retrieving
the data. My work is a bibliographic system and I'm writing a new search
engine in Python to replace an ancient one in C.

I'm working on Linux with a locale of en_GB.UTF8
 
Steven D'Aprano

I'm having trouble with the BOM that is now prepended to codecs files.
The files have to be read by java servlets which expect a clean file
without any BOM.

Is there a way to stop the BOM being written?

Of course there is :) but first we need to know how you are writing it
in the first place.

If you are dealing with existing files, which already contain a BOM, you
may need to open the files and re-save them without the BOM.
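
For the existing-files case, a minimal sketch of the re-save step might look like this. It assumes Python 3 and that the files are UTF-8 with a possible leading signature; `strip_utf8_bom` is a hypothetical helper name, not something from the standard library.

```python
import codecs


def strip_utf8_bom(path):
    """Rewrite the file at `path` without a leading UTF-8 signature.

    Returns True if a signature was found and removed, False otherwise.
    (Hypothetical helper, for illustration only.)
    """
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        # Write back everything after the three signature bytes
        with open(path, "wb") as f:
            f.write(raw[len(codecs.BOM_UTF8):])
        return True
    return False
```

This reads the whole file into memory, which is fine for small files but would need a chunked copy for very large ones.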

If you are dealing with temporary files you're creating programmatically,
it depends how you're creating them. My guess is that you're doing
something like this:

with open("some file", "w", encoding="UTF-16") as f:  # or UTF-32
    f.write(data)

or similar. Both the UTF-16 and UTF-32 codecs write BOMs. To avoid that,
you should use UTF-16-BE or UTF-16-LE (Big Endian or Little Endian), as
appropriate to your platform.
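
The difference is easy to see on a short string. This is only a sketch, assuming Python 3; the variable names are made up for the example.

```python
import codecs

text = "abc"

with_bom = text.encode("utf-16")   # generic codec: BOM, then native byte order
little = text.encode("utf-16-le")  # explicit byte order: no BOM
big = text.encode("utf-16-be")     # explicit byte order: no BOM

# The generic codec's output starts with one of the two UTF-16 BOMs
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(with_bom, little, big, has_bom)
```

Whichever byte order you pick, the reader on the other end has to be told about it, since there is no longer a mark in the file to say so.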

If you're getting a UTF-8 BOM, that's seriously weird. The standard UTF-8
codec doesn't write a BOM. (Strictly speaking, it's not a Byte Order
Mark, but a Signature.) Unless you're using encoding='UTF-8-sig', I can't
guess how you're getting a UTF-8 BOM.
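
For comparison, here is what the two UTF-8 codecs produce, again as a small sketch assuming Python 3:

```python
import codecs

text = "abc"

plain = text.encode("utf-8")       # no signature
signed = text.encode("utf-8-sig")  # three signature bytes, then the text

# Decoding with utf-8-sig drops the signature if present and
# also accepts data that has none.
decoded_signed = signed.decode("utf-8-sig")
decoded_plain = plain.decode("utf-8-sig")

# Decoding signed data with plain utf-8 keeps U+FEFF as the first character
leftover = signed.decode("utf-8")

print(plain, signed, repr(leftover))
```

So if a file has to be read that may or may not carry the signature, reading with 'utf-8-sig' and writing with 'utf-8' should give clean output either way.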

If you're doing something else, well, you'll have to explain what you're
doing before we can tell you how to stop doing it :)

I'm working on Linux with a locale of en_GB.UTF8

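The locale sets the default encoding used by the OS. Python's own default
string encoding is separate: ASCII in Python 2, UTF-8 in Python 3.
(Python 3's open() does fall back to the locale's encoding when you don't
pass encoding=, so it is safest to pass it explicitly.)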
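
To see what each of these reports on a given machine, something like the following should do. It is a sketch assuming Python 3; the values printed depend on the platform and environment.

```python
import locale
import sys

# Default used by str.encode() / bytes.decode() when no codec is named
default_str_encoding = sys.getdefaultencoding()

# Encoding derived from the locale settings; in Python 3 (outside UTF-8
# mode) this is what open() falls back to when no encoding= is passed
locale_encoding = locale.getpreferredencoding(False)

print(default_str_encoding, locale_encoding)
```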
 
wxjmfauth

On Tuesday, 24 September 2013 11:42:22 UTC+2, J. Bagg wrote:
I'm having trouble with the BOM that is now prepended to codecs files.
The files have to be read by java servlets which expect a clean file
without any BOM.

Is there a way to stop the BOM being written?

It is seriously messing up my work as the servlets do not expect it to
be there. I could delete it but that means another delay in retrieving
the data. My work is a bibliographic system and I'm writing a new search
engine in Python to replace an ancient one in C.

I'm working on Linux with a locale of en_GB.UTF8

--
Dr Janet Bagg
CSAC, Dept of Anthropology,
University of Kent, UK

---------

Some points.

- The coding of a text file does not matter. What
counts is knowing which coding was used.

- The *mark* (once the Unicode.org terminology in the FAQ) indicating
a unicode encoded raw text file is neither a byte order mark
nor a signature; it is an encoded code point, the encoded
U+FEFF, 'ZERO WIDTH NO-BREAK SPACE', code point. (Note, a
non-breaking space at the start of a text is nonsense.)

- When such a mark exists, it is always possible to work
100% safely. No possible error.

- When such a mark does not exist, in many cases
guessing is the only available solution.

These are facts.
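
To make the points above concrete, here is a small sketch, assuming Python 3. It shows the bytes U+FEFF turns into under each encoding and a check for those bytes at the start of some data. `sniff_bom` is a hypothetical helper name, and what it returns is only a hint about the encoding.

```python
import codecs

mark = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE

# The same code point, encoded five ways
encoded = {
    name: mark.encode(name)
    for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
}


def sniff_bom(raw):
    """Return the codec name suggested by a leading mark, or None.

    (Hypothetical helper, for illustration only.)
    """
    # UTF-32 is checked before UTF-16 because BOM_UTF32_LE begins
    # with the same two bytes as BOM_UTF16_LE.
    candidates = (
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, name in candidates:
        if raw.startswith(bom):
            return name
    return None
```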


Now to the question, should I use (put) such a mark,
esp. in utf-8? I would say the following:

It seems to me that one sees more and more marked utf-8 files.
(Windows is probably a reason.)

More importantly, more and more tools and software are
handling this utf-8 mark, or are being corrected to support it,
so it basically does not hurt too much. E.g. Python, golang 1.1
(was not the case in 1.0), LibreOffice, TeXWorks (supports it
now, was once not the case), the unicode TeX engines, ...

If I had to work in "archiving", I would seriously think
twice.

PS Unicode encodes characters on a per *script* ("alphabet")
basis, not per *language*.

jmf
 
Chris Angelico

- The *mark* (once the Unicode.org terminology in the FAQ) indicating
a unicode encoded raw text file is neither a byte order mark
nor a signature; it is an encoded code point, the encoded
U+FEFF, 'ZERO WIDTH NO-BREAK SPACE', code point. (Note, a
non-breaking space at the start of a text is nonsense.)

- When such a mark exists, it is always possible to work
100% safely. No possible error.

I have a file encoded in Latin-1 which begins with LATIN SMALL LETTER
Y WITH DIAERESIS followed by LATIN SMALL LETTER THORN. I also have a
file encoded in EBCDIC (okay, I don't really, but let's pretend) that
begins with the same bytes. But of course, when such a mark exists,
there is no possible error - of that there is no manner of doubt, no
possible, probable shadow of doubt, no possible doubt whatever.

("No possible doubt whatever.")
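
The Latin-1 half of that is easy to check. A short sketch, assuming Python 3, with a made-up byte string that reads sensibly both ways:

```python
import codecs
import unicodedata

# Two bytes that are a UTF-16 little-endian BOM, followed by "hi" in UTF-16-LE
raw = b"\xff\xfe" + b"h\x00i\x00"

as_latin1 = raw.decode("latin-1")  # every byte is a valid Latin-1 character
as_utf16 = raw.decode("utf-16")    # leading bytes consumed as a BOM

first_two = [unicodedata.name(ch) for ch in as_latin1[:2]]

print(first_two)
print(repr(as_utf16))
```

Both decodes succeed without error, so the leading bytes alone cannot settle which reading was intended.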

ChrisA
 
