print UTF-8 file with BOM

davihigh · Dec 23, 2005

Hi Friends:

import codecs
fileObj = codecs.open( filename, "r", "utf-8" )
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
print u

It fails with this error:
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

I want to know how to read a UTF-8 file, convert it to the encoding of a
specified locale (defaulting to the current system locale), and print the
resulting string. I would also like the BOM header to be dropped automatically.

Rgds, David

davihigh · Dec 23, 2005

FYI, I just received the following nice example from a friend.

I have one more question on this: how should I write it if I want to specify
a locale other than the current one? For example, the program runs on a
Korean-locale system and tries to read a UTF-8 file that contains Chinese
characters.

-------------- The code is here --------------------
import codecs
def read_utf8_txt_file (filename):
    fileObj = codecs.open( filename, "r", "utf-8" )
    content = fileObj.read()
    fileObj.close()
    if content.startswith( u"\ufeff" ): # strip the BOM only if it is present
        content = content[1:]
    print content

Carsten Haese · Dec 23, 2005

2005/12/23 said:
Hi Kuan:

Thanks a lot! One more question here: how should I write it if I want to
specify a locale other than the current one?

For example, the program runs on a Korean-locale system and tries to read a
UTF-8 file that contains Chinese.

Use the encode method to translate the unicode object into whatever
encoding you want.

unicodeStr = ...
print unicodeStr.encode('big5')
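
A minimal sketch of the same idea, written in Python 3 syntax (an assumption; the code in this thread is Python 2), where `str` plays the role of the unicode object. The helper name `to_encoding` is hypothetical, and `errors="replace"` is just one possible policy for characters the target encoding cannot represent. It drops a leading U+FEFF if there is one, then encodes to whatever encoding is named:

```python
BOM = "\ufeff"  # U+FEFF, the character the UTF-8 signature decodes to


def to_encoding(text, encoding, errors="replace"):
    """Hypothetical helper: drop a leading BOM, then encode to `encoding`."""
    if text.startswith(BOM):
        text = text[len(BOM):]
    return text.encode(encoding, errors)


# Bytes as they would appear in a UTF-8 file that starts with a BOM.
raw = b"\xef\xbb\xbf" + "中文".encode("utf-8")
text = raw.decode("utf-8")  # the BOM survives decoding as "\ufeff"

# The target encoding is chosen explicitly, independent of the system locale.
big5_bytes = to_encoding(text, "big5")
gbk_bytes = to_encoding(text, "gbk")
```

The resulting byte strings could then be written to a binary stream; which encoding the terminal actually expects is outside what this sketch checks.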

Hope this helps,

Carsten.

John Bauman · Dec 23, 2005

UTF-8 shouldn't need a BOM, as it is designed for character streams, and
there is only one logical ordering of the bytes. Only UTF-16 and greater
should output a BOM, AFAIK.

Walter Dörwald · Dec 23, 2005

John said:
UTF-8 shouldn't need a BOM, as it is designed for character streams, and
there is only one logical ordering of the bytes. Only UTF-16 and greater
should output a BOM, AFAIK.

However, there's a pending patch (http://bugs.python.org/1177307) for a
new encoding named utf-8-sig, that would output a leading BOM on writing
and skip it on reading.
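
For readers on later Python releases: a codec with this name is available from Python 2.5 onward. The sketch below uses Python 3 syntax (an assumption; the thread itself is Python 2) and a throwaway temporary file with made-up contents to compare plain `utf-8` with `utf-8-sig`:

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file that starts with a BOM (illustrative data only).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + "中文".encode("utf-8"))

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# "utf-8-sig" skips the BOM on reading, if one is present.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

# On writing, "utf-8-sig" emits the BOM first.
encoded = "abc".encode("utf-8-sig")

os.remove(path)
```

Reading with `utf-8-sig` also works on files that have no BOM, so it can serve as a tolerant default when the input's origin is unknown.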

Bye,
Walter Dörwald

Martin v. Löwis · Dec 23, 2005

John said:
UTF-8 shouldn't need a BOM, as it is designed for character streams, and
there is only one logical ordering of the bytes. Only UTF-16 and greater
should output a BOM, AFAIK.

Yes and no. Yes, UTF-8 does not need a BOM to identify endianness. No,
usage of the BOM with UTF-8 is explicitly allowed in the Unicode specs
(so output of the BOM doesn't *have* to be restricted to UTF-16 and
greater), and the BOM has a well-defined meaning for UTF-8 (namely,
as the UTF-8 signature).
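
To make the distinction concrete, here is a small sketch in Python 3 syntax (an assumption; the thread's code is Python 2). It encodes U+FEFF in several forms: the UTF-16 results depend on byte order, while the UTF-8 result is always the same three bytes, which is why it can only act as a signature. The helper `has_utf8_signature` is a hypothetical name used for illustration.

```python
import codecs

# U+FEFF encoded in each form; only the UTF-16 variants differ by byte order.
bom_utf8 = "\ufeff".encode("utf-8")
bom_utf16_le = "\ufeff".encode("utf-16-le")
bom_utf16_be = "\ufeff".encode("utf-16-be")


def has_utf8_signature(data):
    """Hypothetical helper: True if `data` (bytes) starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)
```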

Regards,
Martin
