Parsing unicode (devanagari) text with xml.dom.minidom

rparimi

Hello,

I am trying to process an xml file that contains unicode characters
(see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
entire content of the website into an xml file. Using
xml.dom.minidom, I wrote a few lines of python code to parse out the
xml file, but am stuck with the following error:
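import xml.dom.minidom
dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
titles = dom.getElementsByTagName("title")
for title in titles:
...    print "childNode = ", title.childNodes
...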
childNode = [<DOM Text node "Sanskrit N...">]
childNode = [<DOM Text node "Sanskrit N...">]
childNode = []
childNode = []
childNode = [<DOM Text node "1-1-1">]
childNode = Traceback (most recent call last):
File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)
Python exited when it was trying to parse the following node:
<title>अनॠ</title>

The xml header tells me that the document is UTF-8:
<?xml version="1.0" encoding="UTF-8"?>

I am running python 2.5.1 on Mac OSX 10.5.6 and my locale settings are
as below:
$locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=


I googled around for similar errors, and tried using unicode but that
didn't help either:
foo = unicode(titles[5].childNodes)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)

I'm a novice with unicode, and am not sure how best to
handle the unicode text I'm dealing with (devanagari). Any
suggestions will be helpful.

Thanks
 
Stefan Behnel

rparimi said:
I am trying to process an xml file that contains unicode characters
(see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
entire content of the website into an xml file. Using
xml.dom.minidom, I wrote a few lines of python code to parse out the
xml file, but am stuck with the following error:
... print "childNode = ", title.childNodes
...
childNode = [<DOM Text node "Sanskrit N...">]
childNode = [<DOM Text node "Sanskrit N...">]
childNode = []
childNode = []
childNode = [<DOM Text node "1-1-1">]
childNode = Traceback (most recent call last):
File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)

That's because you are printing it out to your console, in which case you
need to make sure it's encoded properly for printing. repr() might also help.
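
As a minimal sketch of what encoding explicitly could look like (not the poster's actual code): the snippet below reads the text of each title node via `.data` instead of printing the node list itself, and encodes it to UTF-8 by hand. The inline XML string is a hypothetical stand-in for the WordPress export, `title_texts` is an illustrative helper name, and the Devanagari title is copied from the failing node in the original post.

```python
import xml.dom.minidom

# Hypothetical stand-in for the WordPress export file (assumption, not real data).
SAMPLE_XML = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<rss><title>1-1-1</title>'
    u'<title>\u0905\u0928\u0960</title>'
    u'<title></title></rss>'
).encode("utf-8")


def title_texts(dom):
    """Return the text content of every <title> element as unicode strings."""
    texts = []
    for title in dom.getElementsByTagName("title"):
        # Join the data of all text children; an empty element yields u"".
        text = u"".join(
            child.data
            for child in title.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        texts.append(text)
    return texts


dom = xml.dom.minidom.parseString(SAMPLE_XML)
texts = title_texts(dom)

# Encoding to ASCII fails for the Devanagari title, as in the traceback.
ascii_failed = False
try:
    texts[1].encode("ascii")
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 works for every title and gives byte strings
# that can be written to a UTF-8 terminal or file.
encoded = [text.encode("utf-8") for text in texts]
```

On Python 2, `print text.encode("utf-8")` on a UTF-8 terminal should then display the title. Printing `title.childNodes` directly goes through the repr of each node, which is presumably where the implicit ASCII conversion failed.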

Regarding minidom, you might be happier with the xml.etree package that
comes with Python 2.5 and later (it's also available for older versions).
It's a lot easier to use, more memory friendly and also much faster.

Stefan
 
Martin v. Löwis

Stefan said:
Regarding minidom, you might be happier with the xml.etree package that
comes with Python 2.5 and later (it's also available for older versions).
It's a lot easier to use, more memory friendly and also much faster.

OTOH, choice of XML library is completely irrelevant for the issue at
hand. If the OP is happy with minidom, we shouldn't talk him into using
something else.

Regards,
Martin
 
Stefan Behnel

Martin said:
OTOH, choice of XML library is completely irrelevant for the issue at
hand.

For the described problem, maybe. But certainly not for the application.
The background was parsing the XML dump of an entire web site, which I
would expect to be larger than what minidom is designed to handle
gracefully. Switching to cElementTree before major code gets written is
almost certainly a good idea here.
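
For comparison, a rough sketch of the same title extraction with ElementTree, using `iterparse` so that elements can be discarded as they are read. The sample document and the `iter_titles` name are made up for illustration. On Python 2.5 the import would be `xml.etree.cElementTree`; recent Python versions use the C accelerator behind `xml.etree.ElementTree` automatically.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the WordPress export file (assumption, not real data).
SAMPLE_XML = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<rss><title>1-1-1</title>'
    u'<title>\u0905\u0928\u0960</title>'
    u'<title></title></rss>'
).encode("utf-8")


def iter_titles(source):
    """Yield the text of each <title> element found while parsing source."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "title":
            # elem.text is None for an empty element.
            yield elem.text or u""
            # Drop the element's text and children once it has been consumed.
            elem.clear()


# iterparse accepts a filename or a file-like object opened in binary mode.
titles = list(iter_titles(io.BytesIO(SAMPLE_XML)))
```

Whether this is worth it depends on the size of the export; for a file that fits comfortably in memory, `ET.parse(...)` followed by a loop over the title elements would be simpler.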

Stefan
 
Martin v. Löwis

Stefan said:
For the described problem, maybe. But certainly not for the application.
The background was parsing the XML dump of an entire web site, which I
would expect to be larger than what minidom is designed to handle
gracefully. Switching to cElementTree before major code gets written is
almost certainly a good idea here.

I think minidom is designed to handle the very same documents that
cElementTree is designed to handle (namely, documents that fit into
memory).

Regards,
Martin
 
Stefan Behnel

Martin said:
I think minidom is designed to handle the very same documents that
cElementTree is designed to handle (namely, documents that fit into
memory).

I do not doubt that a machine running a cElementTree application can handle
exactly the same documents as a machine with, say, ten times as much memory
that runs a minidom application. However, when deciding which library to
choose for a new application, it does matter what hardware you want to use
it on. And if you can handle multiple times larger documents on the same
hardware, that might be as much of a reason to choose cElementTree as the
(likely) shorter and more readable code (which usually translates into
shorter development and debugging times) and the higher execution speed.
Honestly, I haven't seen a reason in a while why preferring minidom over
any of the ElementTree derivatives would be a good idea when starting a new
application.

Stefan
 
rparimi

rparimi said:
I am trying to process an xml file that contains unicode characters
(see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
entire content of the website into an xml file. Using
xml.dom.minidom,  I wrote a few lines of python code to parse out the
xml file, but am stuck with the following error:
import xml.dom.minidom
dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
titles = dom.getElementsByTagName("title")
for title in titles:
...    print "childNode = ", title.childNodes
...
childNode =  [<DOM Text node "Sanskrit N...">]
childNode =  [<DOM Text node "Sanskrit N...">]
childNode =  []
childNode =  []
childNode =  [<DOM Text node "1-1-1">]
childNode =  Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)

Stefan said:
That's because you are printing it out to your console, in which case you
need to make sure it's encoded properly for printing. repr() might also help.

Regarding minidom, you might be happier with the xml.etree package that
comes with Python 2.5 and later (it's also available for older versions).
It's a lot easier to use, more memory friendly and also much faster.

Stefan

Thanks for the reply. I didn't realize that printing to console was
causing the problem. I am now able to parse out the relevant portions
of my xml file. Will also look at the xml.etree module.

Regards
 
