Newbie question: Unicode hiccup on reading file i just wrote

Darcy

hi all, i have a newbie problem arising from writing-then-reading a
unicode file, and i can't work out what syntax i need to read it in.

the syntax i'm using now (just using quick hack tmp files):
BEGIN
f=codecs.open("tt.xml","r","utf8")
fwrap=codecs.EncodedFile(f,"ascii","utf8")
try:
    ss=u''
    ss=fwrap.read()
    print ss
    ## rrr=xml.dom.minidom.parseString(f.read()) # originally
finally:
    f.close()
END

barfs with this error:
BEGIN
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position 5092: ordinal not in range(128)
END

any ideas?


--
Context (if interested):
had a look at the blogger api, downloaded the "15 most recent posts"
into a miniDOM document, then decided to learn how to traverse the xml
object in python. getting annoyed with the time taken to reconnect each
time i played with a new syntax, i wrote the xml object to a file. that
barfed with a similar sort of encoding error. sure enough, there in the
debug coming back from blogger: "charset=utf-8". my python book said i
needed to switch from "open/print" to "codecs.open/write", so i did this:
BEGIN
# get xml doc (from blogger: atom format)
rrr=xml.dom.minidom.Document()
conn.request("GET","/atom/1234",None,headers)
response=conn.getresponse()
rrr=xml.dom.minidom.parseString(response.read())
print rrr

# dump to disk
import codecs
f=codecs.open("ttt.xml","w","utf8")
try:
    ## print >> f, rrr.toxml()
    f.write(rrr.toxml())
finally:
    f.close()
END

this works fine and the resulting file looks like good xml to the naked eye.
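For comparison, here is a rough sketch of another way to do the dump: ask minidom for encoded bytes via toxml(encoding="utf-8") and write them to a file opened in binary mode, so no codecs wrapper is involved. The sample feed text, the temp directory and the dump_document helper are made up for illustration; they are not part of the original script.

```python
import os
import tempfile
import xml.dom.minidom

# Hypothetical stand-in for the Blogger feed; \u2019 is the curly
# apostrophe named in the traceback.
SAMPLE = u'<feed><title>Darcy\u2019s posts</title></feed>'


def dump_document(doc, path):
    """Write a minidom document to disk as UTF-8 encoded bytes."""
    # With an explicit encoding, toxml() returns a byte string and puts
    # the encoding declaration into the XML header.
    data = doc.toxml(encoding="utf-8")
    handle = open(path, "wb")
    try:
        handle.write(data)
    finally:
        handle.close()
    return data


doc = xml.dom.minidom.parseString(SAMPLE.encode("utf-8"))
path = os.path.join(tempfile.mkdtemp(), "ttt.xml")
written = dump_document(doc, path)
```

Either way the file on disk holds UTF-8 bytes; this variant just keeps the encoding step in one place.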

oh and i have tried both "utf8" and "utf-8" as the en/decoding tokens --
no change.
ditto with explicitly initialising "ss" as unicode: same error as before
when it was not explicitly initialised at all.
 
Diez B. Roggisch

You're doing things three times over, and this time more is not better:

The


f=codecs.open("tt.xml","r","utf8")

gives you a file that will return unicode objects when reading. And

fwrap=codecs.EncodedFile(f,"ascii","utf8")

is meant to wrap a normal, non-encoding-aware file to make it an
encoding-aware one. The result is that reading from the former already
yields a unicode object, which is then handed to the second wrapper.
Stacking the second wrapper on top of the first gains you nothing.

And then you try to pass that unicode object of yours to minidom.
But guess what, the minidom parser expects a (byte) string, as it reads
the xml encoding header (assuming UTF-8 when there is none) and decodes
the contents itself. So the unicode object you pass in is first converted
to a byte string with the default 'ascii' codec, yielding the exception
you see.
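
To make that conversion step visible, here is a small sketch that does the ASCII encoding explicitly instead of letting Python 2 do it behind your back. The sample text and the try_ascii helper are invented for illustration; only the u'\u2019' character comes from the traceback.

```python
# A text string containing U+2019, the character named in the traceback.
# The surrounding words are made up for illustration.
text = u"Darcy\u2019s blog"


def try_ascii(value):
    """Return the ASCII bytes for value, or the error raised while encoding."""
    try:
        return value.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc


# ASCII has no slot for U+2019, so this comes back as the exception object.
result = try_ascii(text)

# UTF-8 can represent the character, so this encoding succeeds.
utf8_bytes = text.encode("utf-8")

print(repr(result))
print(repr(utf8_bytes))
```

The point is only that any route that pushes this text through the 'ascii' codec fails in the same way, whichever function triggers it.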

Just don't do any fancy encoding stuff at all, a simple

rrr=xml.dom.minidom.parseString(open("tt.xml").read())

should do.

Diez
 
Fredrik Lundh

Diez said:
Just don't do any fancy encoding stuff at all, a simple

rrr=xml.dom.minidom.parseString(open("tt.xml").read())

should do.

or

rrr = xml.dom.minidom.parse("tt.xml")
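
Put side by side, the two suggestions look roughly like the sketch below. It writes a tiny made-up sample file first so that it runs on its own; the sample content, the temp directory and the title_text helper are assumptions for illustration. The file is opened with "rb" so the parser gets raw bytes on both Python 2 and Python 3.

```python
import os
import tempfile
import xml.dom.minidom

# Hypothetical sample standing in for tt.xml.
SAMPLE = (u'<?xml version="1.0" encoding="utf-8"?>'
          u'<feed><title>Darcy\u2019s posts</title></feed>')

path = os.path.join(tempfile.mkdtemp(), "tt.xml")
with open(path, "wb") as handle:
    handle.write(SAMPLE.encode("utf-8"))

# Variant 1: read the raw bytes and let the parser do the decoding.
with open(path, "rb") as handle:
    from_bytes = xml.dom.minidom.parseString(handle.read())

# Variant 2: hand the file name to minidom and let it open the file.
from_path = xml.dom.minidom.parse(path)


def title_text(doc):
    """Return the text inside the first <title> element."""
    return doc.getElementsByTagName("title")[0].firstChild.data


titles = (title_text(from_bytes), title_text(from_path))
```

Both variants should give back the same text, curly apostrophe included, because in each case the parser sees undecoded bytes.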

</F>
 
Darcy

Fredrik said:
or
rrr = xml.dom.minidom.parse("tt.xml")

thanks a lot guys -- both approaches work a treat.

in particular: diez, thanks for explaining what was going on from
python's perspective
 
