Newbie question: Unicode hiccup on reading file i just wrote

Darcy · Jan 30, 2006

hi all, i have a newbie problem arising from writing-then-reading a
unicode file, and i can't work out what syntax i need to read it in.

the syntax i'm using now (just using quick hack tmp files):
BEGIN
f=codecs.open("tt.xml","r","utf8")
fwrap=codecs.EncodedFile(f,"ascii","utf8")
try:
    ss=u''
    ss=fwrap.read()
    print ss
    ## rrr=xml.dom.minidom.parseString(f.read()) # originally
finally:
    f.close()
END

barfs with this error:
BEGIN
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position 5092: ordinal not in range(128)
END

any ideas?

--
Context (if interested):
had a look at the blogger api, downloaded the "15 most recent posts"
into a miniDOM document, then decided to learn how to traverse the xml
object in python. getting annoyed with the time taken to reconnect each
time i played with a new syntax, i wrote the xml object to a file. that
barfed with a similar sort of encoding error. sure enough, there in the
debug coming back from blogger: "charset=utf-8". my python book said i
needed to switch from "open/print" to "codecs.open/write", so i did this:
BEGIN
# get xml doc (from blogger: atom format)
rrr=xml.dom.minidom.Document()
conn.request("GET","/atom/1234",None,headers)
response=conn.getresponse()
rrr=xml.dom.minidom.parseString(response.read())
print rrr

# dump to disk
import codecs
f=codecs.open("ttt.xml","w","utf8")
try:
    ## print >> f, rrr.toxml()
    f.write(rrr.toxml())
finally:
    f.close()
END

this works fine and the resulting file looks like good xml to the naked eye.

oh and i have tried both "utf8" and "utf-8" as the en/decoding tokens --
no change.
ditto with explicitly initialising "ss" as unicode: same error as before
when it was not explicitly initialised at all.

Diez B. Roggisch · Jan 30, 2006

You're doing things triple-time, which this time is not even half as good:

The

f=codecs.open("tt.xml","r","utf8")

gives you a file that will return unicode objects when reading. And

fwrap=codecs.EncodedFile(f,"ascii","utf8")

will wrap a normal, non-encoding-aware file to become an encoding-aware
one. The result is that reading from the former already yields a
unicode object that is passed to the second wrapper. It will silently
pass the unicode object, but it's useless.

And then you try to pass that unicode object of yours to the minidom.
But guess what, the minidom parser expects a (byte) string, as it reads
the mandatory xml encoding header and will decode the contents itself.
So, the passed unicode object is converted to a string beforehand,
yielding the exception you see.
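
The conversion failure can be reproduced without any files. This is only a minimal sketch in Python 3 syntax (the code in this thread is Python 2, where the encode step described above happens implicitly), and the sample string is invented for illustration:

```python
# Hypothetical sample text containing U+2019 (right single quotation mark),
# the character named in the traceback.
text = "it\u2019s"

try:
    text.encode("ascii")           # ASCII only covers code points 0..127
    failed = False
    message = ""
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)

utf8_bytes = text.encode("utf-8")  # UTF-8 can represent the character
print(failed)
print(message)
print(utf8_bytes)
```

The point is that the text itself is fine; only the attempt to squeeze it through the ASCII codec fails.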

Just don't do any fancy encoding stuff at all, a simple

rrr=xml.dom.minidom.parseString(open("tt.xml").read())

should do.
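
As a rough illustration of that advice, here is a self-contained sketch in Python 3 syntax. It assumes a small made-up document written to a temporary directory as a stand-in for the real tt.xml, and it hands the parser raw bytes so that the parser does the decoding:

```python
import os
import tempfile
import xml.dom.minidom

# Hypothetical stand-in for tt.xml: a tiny UTF-8 document containing U+2019.
xml_text = ('<?xml version="1.0" encoding="utf-8"?>'
            '<entry><title>it\u2019s</title></entry>')

path = os.path.join(tempfile.mkdtemp(), "tt.xml")
with open(path, "wb") as f:
    f.write(xml_text.encode("utf-8"))

# Read the raw bytes and let the parser decode them itself,
# based on the encoding named in the XML declaration.
with open(path, "rb") as f:
    doc = xml.dom.minidom.parseString(f.read())

title = doc.getElementsByTagName("title")[0].firstChild.data
print(repr(title))
```

No codecs wrapper is involved at any point; the non-ASCII character comes back intact as text in the DOM.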

Diez

Fredrik Lundh · Jan 30, 2006

Diez said:
Just don't do any fancy encoding stuff at all, a simple

rrr=xml.dom.minidom.parseString(open("tt.xml").read())

should do.

or

rrr = xml.dom.minidom.parse("tt.xml")
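
A short sketch of the same idea in Python 3 syntax, assuming a made-up sample file in a temporary directory in place of the real tt.xml; `parse` takes the file name and opens the file itself:

```python
import os
import tempfile
import xml.dom.minidom

# Hypothetical sample file standing in for tt.xml.
xml_text = ('<?xml version="1.0" encoding="utf-8"?>'
            '<feed><title>caf\u00e9</title></feed>')
path = os.path.join(tempfile.mkdtemp(), "tt.xml")
with open(path, "wb") as f:
    f.write(xml_text.encode("utf-8"))

# parse() accepts a file name (or an open file object) and handles
# both the reading and the decoding.
doc = xml.dom.minidom.parse(path)
root_name = doc.documentElement.tagName
title = doc.getElementsByTagName("title")[0].firstChild.data
print(root_name, repr(title))
```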

</F>

Darcy · Jan 31, 2006

Fredrik said:
or
rrr = xml.dom.minidom.parse("tt.xml")

thanks a lot guys -- both approaches work a treat.

in particular: diez, thanks for explaining what was going on from
python's perspective
