fun with unicode files

Thomas Heller

I want to use ConfigParser with both NT4-style .reg files, which are
ASCII (or ANSI?) files, and XP-style .reg files, which seem to be UTF-16
encoded Unicode files (I hope that's the correct terminology). [And yes, I
have read the warning in the manual that ConfigParser doesn't interpret
the value-type prefixes in the reg files.]

Here's the start of the method I wrote to detect the encoding and read
the file:

def _parse_regfile(self, filename):
    ifi = open(filename, "r")
    import codecs, StringIO
    if ifi.read(2) in (codecs.BOM_LE, codecs.BOM_BE):
        ifi.close()
        ifi = codecs.open(filename, "r", "utf-16")

I wonder: do I really have to check for the BOM manually, or is there a
Python function which does that?
Continuing the code:

    # ConfigParser calls .readline(), but:
    # NotImplementedError: '.readline() is not implemented for UTF-16'
    # so we need to put the data into a StringIO instance.
    # Um, cStringIO doesn't handle unicode correctly, so we'll have
    # to use the slower StringIO
    ifi = StringIO.StringIO(ifi.read())
    ifi.readline() # skip the first two lines
    ifi.readline()
    c = ConfigParser()
    c.readfp(ifi)
    return c

Is there a better way to do this? Why doesn't the UTF-16 codec
implement readline()?

Thomas
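
For readers on a current Python 3, the "decode everything, then wrap it in a StringIO" step above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the thread: `configparser` and `io.StringIO` stand in for the Python 2 `ConfigParser` and `StringIO` modules, the function name `parse_reg_text` and the `header_lines` parameter are made up for the example, and the text is assumed to have been decoded to a `str` already.

```python
import configparser
import io


def parse_reg_text(text, header_lines=2):
    """Parse already-decoded .reg text (hypothetical helper for illustration)."""
    buf = io.StringIO(text)
    # Skip the version header and the blank line after it, as in the thread.
    for _ in range(header_lines):
        buf.readline()
    # RawConfigParser performs no '%' interpolation on values.
    parser = configparser.RawConfigParser()
    # Keep value names case-sensitive instead of lowercasing them.
    parser.optionxform = str
    parser.read_file(buf)
    return parser
```

As the original post notes, the parser does not interpret value-type prefixes; the double quotes around names and values also remain part of the parsed strings.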
 
Jason Diamond

Jason said:
When I do import encodings.utf_32 on Python 2.3.3, I get an ImportError.
What version of Python are you using where that works?

I can't find any documentation of the encodings module in the Python
Library Reference. Where can I read more about it?

Never mind. I continued reading your code and found the comment about
those modules not existing (yet).

-- Jason
 
Martin v. Löwis

Thomas said:
I wonder: do I really have to check for the BOM manually, or is there a
Python function which does that?

If it can also be ASCII (or ansi?), then yes, you need to manually check
for the BOM. This is because you need to make an explicit decision in
the fallback case - Python cannot know whether it is ASCII if it is
not UTF-16. For example, it might also be Latin-1 or UTF-8 if it is not
UTF-16, or, say, iso-2022-jp.

Regards,
Martin
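
A minimal sketch of the manual check described above: look at the first two bytes for a UTF-16 BOM and otherwise return an encoding the caller has chosen explicitly. The function name `detect_encoding` and the `cp1252` default are assumptions made for this example; the right fallback depends on where the files come from, which is exactly the decision Python cannot make for you.

```python
import codecs


def detect_encoding(filename, fallback="cp1252"):
    """Return "utf-16" if the file starts with a UTF-16 BOM, else `fallback`.

    Hypothetical helper: `fallback` is the caller's explicit guess for
    files without a BOM (ASCII, Latin-1, UTF-8, ...).
    """
    # Open in binary mode so the raw BOM bytes are visible.
    with open(filename, "rb") as f:
        head = f.read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        # The generic "utf-16" codec uses the BOM to pick the byte order.
        return "utf-16"
    return fallback
```

The returned name can then be passed to `codecs.open(filename, "r", encoding)` or used to decode the file's bytes directly.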
 
