removing BOM prepended by codecs?

J. Bagg · Sep 24, 2013

I'm having trouble with the BOM that is now prepended to codecs files.
The files have to be read by java servlets which expect a clean file
without any BOM.

Is there a way to stop the BOM being written?

It is seriously messing up my work as the servlets do not expect it to
be there. I could delete it but that means another delay in retrieving
the data. My work is a bibliographic system and I'm writing a new search
engine in Python to replace an ancient one in C.

I'm working on Linux with a locale of en_GB.UTF8

Steven D'Aprano · Sep 24, 2013

I'm having trouble with the BOM that is now prepended to codecs files.
The files have to be read by java servlets which expect a clean file
without any BOM.

Is there a way to stop the BOM being written?

Of course there is, but first we need to know how you are writing it
in the first place.

If you are dealing with existing files, which already contain a BOM, you
may need to open the files and re-save them without the BOM.
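
For that first case, a minimal sketch of re-saving a file without its mark might look like the following. It assumes the file is UTF-8, that it is small enough to read into memory, and the helper name `strip_utf8_bom` is made up for illustration.

```python
import codecs


def strip_utf8_bom(path):
    """Rewrite the file at `path` without a leading UTF-8 signature.

    Hypothetical helper: returns True if a signature was removed,
    False if the file did not start with one.
    """
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        # Write back everything after the three signature bytes.
        with open(path, "wb") as f:
            f.write(raw[len(codecs.BOM_UTF8):])
        return True
    return False
```

Working on bytes rather than decoded text means the rest of the file is left untouched.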

If you are dealing with temporary files you're creating programmatically,
it depends how you're creating them. My guess is that you're doing
something like this:

with open("some file", "w", encoding="UTF-16") as f:  # or UTF-32
    f.write(data)

or similar. Both the UTF-16 and UTF-32 codecs write BOMs. To avoid that,
you should use UTF-16-BE or UTF-16-LE (Big Endian or Little Endian), as
appropriate to your platform.
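
A quick way to see the difference is to encode a short string in memory. This is only an illustration; the sample text is arbitrary, and which byte order the Java side expects is something you would need to check.

```python
import codecs

text = "caf\u00e9"

with_bom = text.encode("utf-16")   # native byte order, BOM prepended
little = text.encode("utf-16-le")  # explicit little endian, no BOM
big = text.encode("utf-16-be")     # explicit big endian, no BOM

print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
print(little.startswith(codecs.BOM_UTF16_LE))                      # False
print(big.startswith(codecs.BOM_UTF16_BE))                         # False
```

The same encoding names can be passed to open(), for example open(path, "w", encoding="utf-16-le").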

If you're getting a UTF-8 BOM, that's seriously weird. The standard UTF-8
codec doesn't write a BOM. (Strictly speaking, it's not a Byte Order
Mark, but a Signature.) Unless you're using encoding='UTF-8-sig', I can't
guess how you're getting a UTF-8 BOM.
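
To check which of the two UTF-8 codecs is in play, something like this sketch can help (the sample text is arbitrary):

```python
import codecs

text = "r\u00e9sum\u00e9"

plain = text.encode("utf-8")       # no signature
signed = text.encode("utf-8-sig")  # signature EF BB BF prepended

print(plain.startswith(codecs.BOM_UTF8))  # False
print(signed == codecs.BOM_UTF8 + plain)  # True

# Decoding with utf-8-sig drops the signature if present and
# accepts its absence, so it is a tolerant choice for reading.
print(signed.decode("utf-8-sig") == text)  # True
print(plain.decode("utf-8-sig") == text)   # True

# Decoding with plain utf-8 keeps the mark as a leading U+FEFF.
print(signed.decode("utf-8")[0] == "\ufeff")  # True
```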

If you're doing something else, well, you'll have to explain what you're
doing before we can tell you how to stop doing it.

I'm working on Linux with a locale of en_GB.UTF8

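The locale does not change Python's default string encoding
(sys.getdefaultencoding()): Python 2 defaults to ASCII, Python 3 to
UTF-8. Note, though, that Python 3's open() called without an encoding
argument takes its default from the locale
(locale.getpreferredencoding()), so it is safest to pass the encoding
explicitly.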
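
To see which defaults are in effect on a given machine, and to sidestep them by naming the encoding explicitly, a small sketch along these lines can be used. The temporary path and sample text are placeholders, and the printed locale value will differ between systems.

```python
import locale
import os
import sys
import tempfile

# Default used for implicit str/bytes conversions.
print(sys.getdefaultencoding())

# Locale-derived encoding, which as far as I know is what open()
# falls back on for text files when no encoding is given.
print(locale.getpreferredencoding(False))

# Naming the encoding makes the output independent of both.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9")

with open(path, "rb") as f:
    raw = f.read()
print(raw)  # b'caf\xc3\xa9'
```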

wxjmfauth · Sep 24, 2013

On Tuesday, 24 September 2013 11:42:22 UTC+2, J. Bagg wrote:

I'm having trouble with the BOM that is now prepended to codecs files.
The files have to be read by java servlets which expect a clean file
without any BOM.

Is there a way to stop the BOM being written?

It is seriously messing up my work as the servlets do not expect it to
be there. I could delete it but that means another delay in retrieving
the data. My work is a bibliographic system and I'm writing a new search
engine in Python to replace an ancient one in C.

I'm working on Linux with a locale of en_GB.UTF8

--
Dr Janet Bagg
CSAC, Dept of Anthropology,
University of Kent, UK

---------

Some points.

- The coding of a text file does not matter. What
counts is the knowledge of the coding.

- The *mark* (the term once used in the Unicode.org FAQ) indicating
a Unicode-encoded raw text file is neither a byte order mark
nor a signature; it is an encoded code point, the encoded
U+FEFF, 'ZERO WIDTH NO-BREAK SPACE', code point. (Note, a
non-breaking space at the start of a text is nonsense.)

- When such a mark exists, it is always possible to work
100% safely. No possible error.

- When such a mark does not exist, in many cases
guessing is the only available solution.

These are facts.
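
As an illustration of working from the mark, here is a minimal sniffing sketch. The function name `sniff_bom` is invented for this example, and the result is only a hint: bytes that look like a mark can also be legitimate content in another encoding.

```python
import codecs

# Longest marks first: BOM_UTF32_LE begins with the same two bytes
# as BOM_UTF16_LE, so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw):
    """Return (encoding name, mark length), or (None, 0) if no mark."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller could then decode with `raw[n:].decode(name)` when a mark is found, and fall back to some agreed default otherwise.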

Now to the question: should one write such a mark,
especially in UTF-8? I would say the following:

It seems to me that one sees more and more marked UTF-8 files.
(Windows is probably a reason.)

More importantly, more and more tools and software are
handling this UTF-8 mark, or are being corrected to support it,
so it basically does not hurt too much. E.g. Python, golang 1.1
(was not the case in 1.0), LibreOffice, TeXWorks (which supports it
now, was once not the case), the Unicode TeX engines, ...

If I had to work in "archiving", I would seriously think
twice.

PS Unicode encodes characters on a per *script* ("alphabet")
basis, not per *language*.

jmf

Chris Angelico · Sep 24, 2013

- The *mark* (the term once used in the Unicode.org FAQ) indicating
a Unicode-encoded raw text file is neither a byte order mark
nor a signature; it is an encoded code point, the encoded
U+FEFF, 'ZERO WIDTH NO-BREAK SPACE', code point. (Note, a
non-breaking space at the start of a text is nonsense.)

- When such a mark exists, it is always possible to work
100% safely. No possible error.

I have a file encoded in Latin-1 which begins with LATIN SMALL LETTER
Y WITH DIAERESIS followed by LATIN SMALL LETTER THORN. I also have a
file encoded in EBCDIC (okay, I don't really, but let's pretend) that
begins with the same bytes. But of course, when such a mark exists,
there is no possible error - of that there is no manner of doubt, no
possible, probable shadow of doubt, no possible doubt whatever.

("No possible doubt whatever.")
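
The Latin-1 case can be reproduced in a few lines. This is just a sketch with an arbitrary two-letter payload; the point is that the same bytes decode without error either way.

```python
import codecs
import unicodedata

# UTF-16-LE mark followed by "hi" in UTF-16-LE.
raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

as_utf16 = raw.decode("utf-16")    # mark consumed, text is 'hi'
as_latin1 = raw.decode("latin-1")  # every byte is a valid character

print(repr(as_utf16))                  # 'hi'
print(unicodedata.name(as_latin1[0]))  # LATIN SMALL LETTER Y WITH DIAERESIS
print(unicodedata.name(as_latin1[1]))  # LATIN SMALL LETTER THORN
```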

ChrisA
