fun with unicode files

Thomas Heller · Aug 24, 2004

I want to use ConfigParser with both NT4-style .reg files, which are
ascii (or ansi?) files, and XP-style .reg files which seem to be UTF-16
encoded unicode-files (hope that's the correct terminology). [And yes, I
have read the warning in the manual that ConfigParser doesn't interpret
the value-type prefixes in the reg files]

Here's the start of the method I wrote to detect the encoding and read
the file:

def _parse_regfile(self, filename):
    ifi = open(filename, "r")
    import codecs, StringIO
    if ifi.read(2) in (codecs.BOM_LE, codecs.BOM_BE):
        ifi.close()
        ifi = codecs.open(filename, "r", "utf-16")

I wonder: do I really have to check for the BOM manually, or is there a
Python function which does that?
Continuing the code:

    # ConfigParser calls .readline(), but:
    # NotImplementedError: '.readline() is not implemented for UTF-16'
    # so we need to put the data into a StringIO instance.
    # Um, cStringIO doesn't handle unicode correctly, so we'll have
    # to use the slower StringIO
    ifi = StringIO.StringIO(ifi.read())
    ifi.readline() # skip the first two lines
    ifi.readline()
    c = ConfigParser()
    c.readfp(ifi)
    return c

Is there a better way to do this? Why doesn't the UTF-16 codec
implement readline()?

Thomas
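
For readers on Python 3 (an assumption; the post above is about Python 2.3), the readline() limitation no longer applies: a file opened in text mode with encoding="utf-16" can be read line by line, so the StringIO copy is unnecessary. A minimal sketch of the same idea follows. The function name parse_regfile, the header_lines parameter and the parser options are illustrative choices, not something taken from the post.

```python
import configparser


def parse_regfile(filename, encoding="utf-16", header_lines=2):
    """Parse a .reg file with configparser; values are left uninterpreted."""
    parser = configparser.ConfigParser(
        delimiters=("=",),    # assumption: do not split on ':' (common in paths)
        interpolation=None,   # assumption: '%' in values is plain data
        strict=False,         # assumption: tolerate repeated sections or keys
    )
    # A Python 3 text stream decodes UTF-16 incrementally (consuming the BOM)
    # and supports readline() and iteration.
    with open(filename, "r", encoding=encoding) as stream:
        # Skip the version header and the blank line that follows it.
        for _ in range(header_lines):
            stream.readline()
        parser.read_file(stream, source=filename)
    return parser
```

For an NT4-style file the caller would pass an explicit single-byte encoding instead, for example parse_regfile(name, encoding="cp1252"). As with the original method, value-type prefixes and the quotes around names and values are not interpreted.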

Roger Binns · Aug 24, 2004

Thomas said:
I wonder: do I really have to check for the BOM manually, or is there a
Python function which does that?

It should be part of the standard library IMHO.

Here is my own more complete implementation:

http://www.bitpim.org/pyxr/c/projects/bitpim/common.py.html#0286

Roger

Jason Diamond · Aug 24, 2004

Roger said:
Here is my own more complete implementation:

http://www.bitpim.org/pyxr/c/projects/bitpim/common.py.html#0286

When I do import encodings.utf_32 on Python 2.3.3, I get an ImportError.
What version of Python are you using where that works?

I can't find any documentation of the encodings module in the Python
Library Reference. Where can I read more about it?

-- Jason

Jason Diamond · Aug 24, 2004

Jason said:
When I do import encodings.utf_32 on Python 2.3.3, I get an ImportError.
What version of Python are you using where that works?

I can't find any documentation of the encodings module in the Python
Library Reference. Where can I read more about it?

Never mind. I continued reading your code and found the comment about
those modules not existing (yet).

-- Jason

Thomas Heller · Aug 24, 2004

Roger Binns said:
It should be part of the standard library IMHO.

Here is my own more complete implementation:

http://www.bitpim.org/pyxr/c/projects/bitpim/common.py.html#0286

Ah, thanks. It looks like I basically got it right, although your
solution is really more complete.

Thomas

Martin v. Löwis · Aug 24, 2004

Thomas said:
I wonder: do I really have to check for the BOM manually, or is there a
Python function which does that?

If it can also be ASCII (or ansi?), then yes, you need to manually check
for the BOM. This is because you need to make an explicit decision in
the fallback case - Python cannot know whether it is ASCII if it is
not UTF-16. For example, it might also be Latin-1 or UTF-8 if it is not
UTF-16, or, say, iso-2022-jp.

Regards,
Martin
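
The explicit fallback decision described above could be sketched as follows, assuming Python 3 (where the utf-32 and utf-8-sig codecs exist). The function name sniff_encoding and the BOM table are hypothetical and are not taken from Roger's implementation. The caller has to supply the fallback, because a missing BOM says nothing about whether the file is ASCII, Latin-1 or something else.

```python
import codecs

# Longer BOMs come first: the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_encoding(filename, fallback):
    """Return a codec name chosen from the file's BOM, or `fallback` if none."""
    with open(filename, "rb") as raw:
        head = raw.read(4)  # the longest BOM checked here is four bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            # These codec names strip the BOM themselves when decoding.
            return name
    # No BOM: the encoding cannot be known, so use the caller's explicit choice.
    return fallback
```

A call such as sniff_encoding("settings.reg", "cp1252") makes the fallback visible at the call site; the file name and the cp1252 choice are only examples.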
