Darcy
hi all, i have a newbie problem arising from writing-then-reading a
unicode file, and i can't work out what syntax i need to read it in.
the syntax i'm using now (just using quick hack tmp files):
BEGIN
f=codecs.open("tt.xml","r","utf8")
fwrap=codecs.EncodedFile(f,"ascii","utf8")
try:
    ss=u''
    ss=fwrap.read()
    print ss
    ## rrr=xml.dom.minidom.parseString(f.read()) # originally
finally:
    f.close()
END
barfs with this error:
BEGIN
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position 5092: ordinal not in range(128)
END
any ideas?
--
Context (if interested):
had a look at the blogger api, downloaded the "15 most recent posts"
into a miniDOM document, then decided to learn how to traverse the xml
object in python. getting annoyed with the time taken to reconnect each
time i played with a new syntax, i wrote the xml object to a file. that
barfed with a similar sort of encoding error. sure enough, there in the
debug coming back from blogger: "charset=utf-8". my python book said i
needed to switch from "open/print" to "codecs.open/write", so i did this:
BEGIN
# get xml doc (from blogger: atom format)
rrr=xml.dom.minidom.Document()
conn.request("GET","/atom/1234",None,headers)
response=conn.getresponse()
rrr=xml.dom.minidom.parseString(response.read())
print rrr
# dump to disk
import codecs
f=codecs.open("ttt.xml","w","utf8")
try:
    ## print >> f, rrr.toxml()
    f.write(rrr.toxml())
finally:
    f.close()
END
this works fine and the resulting file looks like good xml to the naked eye.
oh and i have tried both "utf8" and "utf-8" as the en/decoding tokens --
no change.
ditto with explicitly initialising "ss" as unicode: same error as before
when it was not explicitly initialised at all.
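a minimal sketch of one way to do the round trip, offered as an assumption rather than a confirmed fix: codecs.EncodedFile(f,"ascii","utf8") re-encodes whatever it reads into ascii, which would explain the error on u'\u2019'. so the sketch drops that wrapper, writes the bytes returned by toxml("utf-8"), and hands the raw bytes back to parseString. the sample feed string and the helper names dump_xml and load_xml are hypothetical stand-ins, not part of the blogger code above.

```python
import io
import os
import tempfile
import xml.dom.minidom


def dump_xml(doc, path):
    """Write a minidom document to disk as UTF-8 bytes (hypothetical helper)."""
    # toxml("utf-8") returns encoded bytes and adds a matching XML declaration
    with io.open(path, "wb") as f:
        f.write(doc.toxml("utf-8"))


def load_xml(path):
    """Parse the file from raw bytes and let the XML parser do the decoding."""
    with io.open(path, "rb") as f:
        return xml.dom.minidom.parseString(f.read())


# hypothetical stand-in for the blogger feed, containing a U+2019 apostrophe
sample = u"<feed><title>it\u2019s a test</title></feed>".encode("utf-8")
doc = xml.dom.minidom.parseString(sample)

path = os.path.join(tempfile.mkdtemp(), "ttt.xml")
dump_xml(doc, path)
reloaded = load_xml(path)

# text nodes come back as unicode; repr() avoids encoding it for the console
title = reloaded.getElementsByTagName("title")[0].firstChild.data
print(repr(title))
```

opening the file in binary mode on both sides leaves the decoding to the xml parser. printing the text itself to an ascii-only console can still raise the same kind of error, hence the repr().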