Parsing XML with ElementTree (unicode problem?)

oren.tsur

(this question was also posted in the devshed python forum:
http://forums.devshed.com/python-pr...-with-elementtree-unicode-problem-461518.html
).
-----------------------------

(it's a bit longish but I hope I give all the information)

1. here is my problem: I'm trying to parse an XML file (saved locally)
using elementtree.parse but I get the following error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line
13, column 327
apparently, the problem is caused by the token 'Saunière' due to the
accented character 'è'.

the thing is that I'm sure that python (ElementTree module and parse()
function) can handle this type of encoding since I obtain my xml file
from the web by opening it with:

from elementtree import ElementTree
from urllib import urlopen
# the URL is split into adjacent string literals so it survives line wrapping
query = (r'http://ecs.amazonaws.com/onca/xml?'
         r'Service=AWSECommerceService&AWSAccessKeyId=189P5TE3VP7N9MN0G302'
         r'&Operation=ItemLookup&ItemId=1400079179&ResponseGroup=Reviews&ReviewPage=166')
root = ElementTree.parse(urlopen(query))

where query is a query to the AWS, and this specific query has the
'Saunière' in the response. (you could simply open the query with a
web browser and see the xml).

I create a local version of the XML file, containing only the tags
that are of interest. my file looks something like this (I replaced
some of the content with 'bla bla' string in order to make it fit
here):
<ReviewBatch>
<Review>
<ID>805</ID> <Rating>3</Rating>
<HelpfulVotes>5</HelpfulVotes> <TotalVotes>6</TotalVotes>
<Date>2004-04-03</Date>
<Summary>Not as good as Angels and Demons</Summary>
<Content>I found that this book was not as good and thrilling as
Angels and Demons. bla bla.</Content>
</Review>

<Review>
<ID>827</ID> <Rating>4</Rating>
<HelpfulVotes>2</HelpfulVotes> <TotalVotes>8</TotalVotes>
<Date>2004-04-01</Date>
<Summary>The Da Vinci Code, a master piece of words</Summary>
<Content>The Da Vinci Code by Dan Brown is a well-written bla bla. The
story starts out in Paris, France with a murder of Jacque Saunière,
the head curator at Le Louvre.bla bla </Content>
</Review>
</ReviewBatch>

BUT, then trying:

fIn = open(file, 'r')  # or even 'import codecs' and opening with
                       # fIn = codecs.open(file, encoding='utf-8')
tree = ElementTree.parse(fIn)

where file is the saved file, I get the error above
(xml.parsers.expat.ExpatError: not well-formed (invalid token): line
13, column 327). so what's the difference? how come parsing is fine
in the first case but fails in the second case? please advise.

2. there is another problem that might be similar: I get a similar
error if the content of the (locally saved) xml has special
characters such as '&', for example in 'angels & demons' (vs. 'angels
and demons'). is it the same problem? same solution?

thanks!
 
Richard Brodie

so what's the difference? how comes parsing is fine
in the first case but erroneous in the second case?

You may have guessed the encoding wrong. It probably
wasn't utf-8 to start with but iso8859-1 or similar.
What actual byte value is in the file?

2. there is another problem that might be similar I get a similar
error if the content of the (locally saved) xml have special
characters such as '&'

Either the originator of the XML has messed up, or whatever
you have done to save a local copy has mangled it.
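
One way to answer the byte-value question is to look at the raw bytes directly. The following is only a minimal sketch, written for current Python 3 rather than the Python 2 used in this thread; the helper name `show_non_ascii` and the sample strings are illustrative assumptions, not taken from the original posts. A UTF-8 'è' shows up as the two bytes 0xc3 0xa8, while an iso8859-1 'è' is the single byte 0xe8.

```python
def show_non_ascii(data):
    """Return (offset, hex value) for every byte outside the ASCII range."""
    return [(i, hex(b)) for i, b in enumerate(data) if b > 0x7F]


# Hypothetical sample data: the same name encoded two different ways.
utf8_bytes = "Saunière".encode("utf-8")
latin1_bytes = "Saunière".encode("iso-8859-1")

print(show_non_ascii(utf8_bytes))    # two bytes for the 'è'
print(show_non_ascii(latin1_bytes))  # a single byte for the 'è'
```

For a real file, the bytes would come from something like `open(path, 'rb').read()`, where `path` stands for the locally saved XML file.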
 
oren.tsur

You may have guessed the encoding wrong. It probably
wasn't utf-8 to start with but iso8859-1 or similar.
What actual byte value is in the file?

I tried it with different encodings and it didn't work. Anyways, I
would expect it to be utf-8 since the XML response to the amazon query
indicates a utf-8 (check it with
http://ecs.amazonaws.com/onca/xml?S...00079179&ResponseGroup=Reviews&ReviewPage=166

in your browser, the first line in the source is <?xml version="1.0"
encoding="UTF-8"?>)

but the thing is that the parser parses it all right from the web (the
amazon response) but fails to parse the locally saved file.

Either the originator of the XML has messed up, or whatever
you have done to save a local copy has mangled it.

I think I made a mess. I changed the '&' in the original response to
'and' because the parser failed to parse the '&' (in the locally saved
file) just like it failed with the French characters. Again, parsing
the original response was just fine.

Thanks again,

Oren
 
Stefan Behnel

I tried it with different encodings and it didn't work. Anyways, I
would expect it to be utf-8 since the XML response to the amazon query
indicates a utf-8 (check it with
http://ecs.amazonaws.com/onca/xml?S...00079179&ResponseGroup=Reviews&ReviewPage=166

in your browser, the first line in the source is <?xml version="1.0"
encoding="UTF-8"?>)

but the thing is that the parser parses it all right from the web (the
amazon response) but fails to parse the locally saved file.

Then how did you save it to a file? Using your browser? Maybe that messed it
up? Or did you edit it with an editor that doesn't understand UTF-8?

If you want to extract the interesting stuff programmatically, you can use
lxml.etree. It's ElementTree compatible, but it can parse right from HTTP URLs
and it supports XPath for selecting stuff.

http://codespeak.net/lxml/

Stefan
 
Marc 'BlackJack' Rintsch

but the thing is that the parser parses it all right from the web (the
amazon response) but fails to parse the locally saved file.

I've just used wget to fetch that URL and `ElementTree` parses that local
file without problems.

Maybe you should stop searching the explanation within Python or
`ElementTree` and accept having a broken XML file on your disk. :)

Have you checked the local XML file with something like `xmllint` or
another XML parser already?

Ciao,
Marc 'BlackJack' Rintsch
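
If `xmllint` is not at hand, a second opinion on well-formedness can also come from Python itself. This is a rough sketch for current Python 3 using the standard-library `xml.etree.ElementTree`; the function name `check_well_formed` and the two sample documents are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(data):
    """Return None if data parses, else a (line, column, message) tuple."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# Hypothetical samples: the second one contains a bare '&'.
good = b"<Review><Summary>Angels and Demons</Summary></Review>"
bad = b"<Review><Summary>Angels & Demons</Summary></Review>"

print(check_well_formed(good))
print(check_well_formed(bad))
```

The reported line and column point at the offending token, which is usually enough to find the problem in an editor.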
 
Steve Holden

Marc said:
I've just used wget to fetch that URL and `ElementTree` parses that local
file without problems.

Maybe you should stop searching the explanation within Python or
`ElementTree` and accept having a broken XML file on your disk. :)

Have you checked the local XML file with something like `xmllint` or
another XML parser already?

Ciao,
Marc 'BlackJack' Rintsch

You should also realise that your posting compromised the Access ID
embedded in the URL. If that was live it might be a good idea to replace it.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
--------------- Asciimercial ------------------
Get on the web: Blog, lens and tag the Internet
Many services currently offer free registration
----------- Thank You for Reading -------------
 
André

(this question was also posted in the devshed python forum:http://forums.devshed.com/python-programming-11/parsing-xml-with-elem...
).
-----------------------------

(it's a bit longish but I hope I give all the information)

1. here is my problem: I'm trying to parse an XML file (saved locally)
using elementtree.parse but I get the following error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line
13, column 327
apparently, the problem is caused by the token 'Saunière' due to the
apostrophe.

the thing is that I'm sure that python (ElementTree module and parse()
function) can handle this type of encoding since I obtain my xml file
from the web by opening it with:

from elementtree import ElementTree
from urllib import urlopen
# the URL is split into adjacent string literals so it survives line wrapping
query = (r'http://ecs.amazonaws.com/onca/xml?'
         r'Service=AWSECommerceService&AWSAccessKeyId=189P5TE3VP7N9MN0G302'
         r'&Operation=ItemLookup&ItemId=1400079179&ResponseGroup=Reviews&ReviewPage=166')
root = ElementTree.parse(urlopen(query))

How about trying
root = ElementTree.parse(urlopen(query), encoding ='utf-8')

André
 
oren.tsur

How about trying
root = ElementTree.parse(urlopen(query), encoding ='utf-8')

this specific thing is not working, however, parsing the url is not
problematic. the problem is that after parsing the xml at the url I
save some of the fields to a local file and the local file is not
being parsed properly due to the non-ascii characters Sauni\xc3\xa8re
(french name: Saunière).

an example of the file can be found in the first posting, you could
copy+paste+save it to your machine then try to parse it.

I'm quite new to xml and python so I guess there must be something
wrong or dumb in the way I save the file (maybe I miss some important
tags?) or in the way I re-open it but I can't find whats wrong.
 
Stefan Behnel

That doesn't work.

this specific thing is not working, however, parsing the url is not
problematic.

So you tried parsing the complete XML file and it works? Then it's the way you
stripped it down to the interesting parts that broke it. Not ElementTree's fault.

the problem is that after parsing the xml at the url I
save some of the fields to a local file and the local file is not
being parsed properly due to the non-ascii characters Sauni\xc3\xa8re
(french name: Saunière).

That looks like UTF-8 that was parsed as some single-byte encoding, such as
iso-8859-1. Check whether the file you saved retained the XML declaration.

I'm quite new to xml and python so I guess there must be something
wrong or dumb in the way I save the file (maybe I miss some important
tags?) or in the way I re-open it but I can't find whats wrong.

As I said, try to read the interesting portions of the XML file
programmatically (especially if you want to do it more than once), or use an
editor that supports UTF-8 and/or XML when you edit it (i.e.: use an editor).
Make sure the XML file is well-formed (use e.g. xmllint) when you save it.
Otherwise, no XML parser will accept it.

Stefan
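
The role of the XML declaration can be illustrated with a small sketch. It assumes current Python 3 and the standard-library `xml.etree.ElementTree`, and the sample `body` string is made up for the illustration. Without a declaration the parser assumes UTF-8, so iso-8859-1 bytes are rejected; the same bytes parse once the declaration names the encoding actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment containing a non-ASCII character.
body = "<Content>Jacque Saunière, the head curator</Content>"

# No declaration: the parser assumes UTF-8, so iso-8859-1 bytes fail.
latin1_no_decl = body.encode("iso-8859-1")
try:
    ET.fromstring(latin1_no_decl)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

# The same bytes, preceded by a declaration naming the real encoding.
latin1_with_decl = b'<?xml version="1.0" encoding="iso-8859-1"?>' + latin1_no_decl
text_with_declaration = ET.fromstring(latin1_with_decl).text

# UTF-8 bytes parse without any declaration.
text_utf8 = ET.fromstring(body.encode("utf-8")).text

print(parsed_without_declaration)
print(text_with_declaration)
print(text_utf8)
```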
 
oren.tsur

OK, I solved the problem but I still don't get what went wrong.
Solution - use tree builder in order to create the new xml file
(previously I was "manually" creating it).

I'm still curious so I'm adding a link to a short and very simple
script that gets an xml (containing non ascii chars) from the web and
saves some of the elements to 2 different local xml files - one is
created by XMLWriter and the other is created manually. you could see
that parsing of the first local file is OK while parsing of the
"manually" created xml file fails. obviously I'm doing something wrong
and I'd love to learn what.

the toy script:
http://staff.science.uva.nl/~otsur/code/xmlConversions.py

Thanks for all your help,

Oren
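
For readers without access to the linked script, here is a hedged sketch of the general idea of letting the library build and serialize the file. It is not the poster's script: it uses plain `Element`/`SubElement` from the Python 3 standard library instead of XMLWriter, and the review data is invented for illustration. The serializer takes care of escaping '&' and '<' and of encoding non-ASCII text.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical review text with '&', '<p>' and a non-ASCII character.
content = "Angels & Demons. <p>Jacque Saunière, the head curator."

batch = ET.Element("ReviewBatch")
review = ET.SubElement(batch, "Review")
ET.SubElement(review, "ID").text = "827"
ET.SubElement(review, "Content").text = content

# Serialize to bytes with an XML declaration. A real script would pass a
# filename to write() instead of an in-memory buffer.
buffer = io.BytesIO()
ET.ElementTree(batch).write(buffer, encoding="utf-8", xml_declaration=True)
xml_bytes = buffer.getvalue()

# Parsing the result gives back the original text unchanged.
round_trip = ET.fromstring(xml_bytes).find("Review/Content").text
print(round_trip == content)
```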
 
John Machin

OK, I solved the problem but I still don't get what went wrong.
Solution - use tree builder in order to create the new xml file
(previously I was "manually" creating it).

I'm still curious so I'm adding a link to a short and very simple
script that gets an xml (containing non ascii chars) from the web and
saves some of the elements to 2 different local xml files - one is
created by XMLWriter and the other is created manually. you could see
that parsing of the first local file is OK while parsing of the
"manually" created xml file fails. obviously I'm doing something wrong
and I'd love to learn what.

the toy script: http://staff.science.uva.nl/~otsur/code/xmlConversions.py

Simple file comparison:

File 1: ... Modern Church. &lt;p&gt;The book ...
File 2: ... Modern Church. <p>The book ...

Firefox:

XML Parsing Error: mismatched tag. Expected: </p>.
Location: file:///C:/junk/myDeVinciCode166_2.xml
Line Number 3, Column 1153:

<CONTENT>The...Church. <p>The...thrill.</CONTENT>
------------------------------------------^
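
The difference between the two files can be reproduced in a few lines. This is a minimal sketch for current Python 3; the `content` string is a made-up stand-in for the elided review text, and `parses` is a hypothetical helper. When text is written into an XML file by hand, `xml.sax.saxutils.escape` turns '&', '<' and '>' into entities, which is what the library-generated File 1 already contains.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical stand-in for the review text with an unclosed <p>.
content = "Modern Church. <p>The book is a thrill."

raw_xml = "<CONTENT>" + content + "</CONTENT>"              # like File 2
escaped_xml = "<CONTENT>" + escape(content) + "</CONTENT>"  # like File 1


def parses(xml_text):
    """Return True if xml_text is well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses(raw_xml))
print(parses(escaped_xml))
```

The same escaping covers the earlier '&' question: a bare '&' in hand-written XML has to become `&amp;`.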
 
oren.tsur

Simple file comparison:

File 1: ... Modern Church. &lt;p&gt;The book ...
File 2: ... Modern Church. <p>The book ...

Firefox:

XML Parsing Error: mismatched tag. Expected: </p>.
Location: file:///C:/junk/myDeVinciCode166_2.xml
Line Number 3, Column 1153:

<CONTENT>The...Church. <p>The...thrill.</CONTENT>
------------------------------------------^

yup, but why does this happen? On the script side I write the exact
same content strings, supposedly with the same encoding, so why is the
encoding different?
 
Stefan Behnel

yup, but why does this happen - on the script side - I write the exact
same strings, of content with supposedly, same encoding, so why the
encoding is different?

Read the mail. It's not the encoding, it's the "<p>" which does not get
through as a tag in the first file.

Stefan
 
oren.tsur

Read the mail. It's not the encoding, it's the "<p>" which does not get
through as a tag in the first file.

Stefan

thanks. I guess it was a dumb question after all. thanks again :)
 
