Parsing unicode (devanagari) text with xml.dom.minidom

rparimi · Mar 8, 2009

Hello,

I am trying to process an xml file that contains unicode characters
(see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
entire content of the website into an xml file. Using
xml.dom.minidom, I wrote a few lines of python code to parse out the
xml file, but am stuck with the following error:
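>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
>>> titles = dom.getElementsByTagName("title")
>>> for title in titles:
...     print "childNode = ", title.childNodes
...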
childNode = [<DOM Text node "Sanskrit N...">]
childNode = [<DOM Text node "Sanskrit N...">]
childNode = []
childNode = []
childNode = [<DOM Text node "1-1-1">]
childNode = Traceback (most recent call last):
File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)
Python exited when it was trying to parse the following node:
<title>à¤…à¤¨à¥ </title>

The xml header tells me that the document is UTF-8:
<?xml version="1.0" encoding="UTF-8"?>

I am running Python 2.5.1 on Mac OS X 10.5.6 and my locale settings are
as below:
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

I googled around for similar errors, and tried using unicode but that
didn't help either:

foo = unicode(titles[5].childNodes)

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)

I'm a novice with unicode, and am not sure about how best to
handle the unicode text I'm dealing with (devanagari). Any
suggestions will be helpful.

Thanks

Stefan Behnel · Mar 8, 2009

rparimi said:
I am trying to process an xml file that contains unicode characters
(see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
entire content of the website into an xml file. Using
xml.dom.minidom, I wrote a few lines of python code to parse out the
xml file, but am stuck with the following error:
... print "childNode = ", title.childNodes
...
childNode = [<DOM Text node "Sanskrit N...">]
childNode = [<DOM Text node "Sanskrit N...">]
childNode = []
childNode = []
childNode = [<DOM Text node "1-1-1">]
childNode = Traceback (most recent call last):
File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)

That's because you are printing it out to your console, in which case you
need to make sure it's encoded properly for printing. repr() might also help.
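
A minimal sketch of the "encode it before printing" advice, not code from this thread. The `SAMPLE` document and the `node_text` helper are made up for illustration, and the sketch is written against Python 3's standard library rather than the Python 2.5 used above. The idea is to take the unicode string out of each text node and encode it explicitly, instead of printing the node list and letting the ascii codec be picked implicitly:

```python
import xml.dom.minidom

# Hypothetical stand-in for the WordPress export; the second title is
# three Devanagari characters written as escapes.
SAMPLE = (u'<?xml version="1.0" encoding="UTF-8"?>'
          u'<rss><title>Sanskrit Notes</title>'
          u'<title>\u0905\u0928\u0941</title></rss>').encode("utf-8")

dom = xml.dom.minidom.parseString(SAMPLE)
titles = dom.getElementsByTagName("title")


def node_text(element):
    """Join the data of all text-node children into one unicode string."""
    return u"".join(child.data for child in element.childNodes
                    if child.nodeType == child.TEXT_NODE)


for title in titles:
    text = node_text(title)
    # Explicit encoding, suitable for a terminal that expects UTF-8.
    utf8_bytes = text.encode("utf-8")
    # ASCII-only fallback: non-ASCII characters become \uXXXX escapes,
    # so this call does not raise UnicodeEncodeError.
    escaped = text.encode("ascii", "backslashreplace")
    print(repr(utf8_bytes))
    print(repr(escaped))
```

Printing the `repr()` of the bytes keeps the output ASCII-safe on any console; on a UTF-8 terminal the encoded bytes themselves could be written out instead.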

Regarding minidom, you might be happier with the xml.etree package that
comes with Python 2.5 and later (it's also available for older versions).
It's a lot easier to use, more memory friendly and also much faster.
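
For comparison, a rough sketch of the same title extraction with xml.etree. The `SAMPLE` document is a made-up stand-in for the export file, and the code assumes Python 3's `xml.etree.ElementTree`; it is only meant to show the shape of the API, not a drop-in replacement for the original script:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the WordPress export, including one empty title.
SAMPLE = (u'<?xml version="1.0" encoding="UTF-8"?>'
          u'<rss><title>Sanskrit Notes</title>'
          u'<title></title>'
          u'<title>\u0905\u0928\u0941</title></rss>').encode("utf-8")

root = ET.fromstring(SAMPLE)

# elem.text is None for an empty element, so fall back to an empty string.
title_texts = [elem.text or u"" for elem in root.findall(".//title")]

for text in title_texts:
    # Escape non-ASCII characters so the output is safe on any console.
    print(text.encode("ascii", "backslashreplace"))
```

Each title comes back as a plain unicode string, so the encoding question only comes up at the point where the text is printed or written out.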

Stefan

Martin v. Löwis · Mar 8, 2009

Stefan said:
Regarding minidom, you might be happier with the xml.etree package that
comes with Python 2.5 and later (it's also available for older versions).
It's a lot easier to use, more memory friendly and also much faster.

OTOH, choice of XML library is completely irrelevant for the issue at
hand. If the OP is happy with minidom, we shouldn't talk him into using
something else.

Regards,
Martin

Stefan Behnel · Mar 8, 2009

Martin said:
OTOH, choice of XML library is completely irrelevant for the issue at
hand.

For the described problem, maybe. But certainly not for the application.
The background was parsing the XML dump of an entire web site, which I
would expect to be larger than what minidom is designed to handle
gracefully. Switching to cElementTree before major code gets written is
almost certainly a good idea here.
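
As an illustration of the memory point, here is a sketch of incremental parsing with `iterparse`, which lets the program discard element content as it goes. The `SAMPLE` document and the `iter_titles` helper are hypothetical, and the sketch assumes Python 3, where `xml.etree.ElementTree` uses the C accelerator automatically and the separate `cElementTree` module no longer exists:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a site export with one title per item.
SAMPLE = (u'<?xml version="1.0" encoding="UTF-8"?>'
          u'<rss><channel><title>Sanskrit Notes</title>'
          u'<item><title>1-1-1</title></item>'
          u'<item><title>\u0905\u0928\u0941</title></item>'
          u'</channel></rss>').encode("utf-8")


def iter_titles(source):
    """Yield the text of every <title> element as it is completed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "title":
            yield elem.text or u""
            # Release the text held by this element; the emptied element
            # itself stays attached to its parent.
            elem.clear()


found = list(iter_titles(io.BytesIO(SAMPLE)))
for text in found:
    print(text.encode("ascii", "backslashreplace"))
```

For a real export, `source` would be the file name or an open file object instead of the in-memory `io.BytesIO` used here.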

Stefan

Martin v. Löwis · Mar 8, 2009

Stefan said:
For the described problem, maybe. But certainly not for the application.
The background was parsing the XML dump of an entire web site, which I
would expect to be larger than what minidom is designed to handle
gracefully. Switching to cElementTree before major code gets written is
almost certainly a good idea here.

I think minidom is designed to handle the very same documents that
cElementTree is designed to handle (namely, documents that fit into
memory).

Regards,
Martin

Stefan Behnel · Mar 8, 2009

Martin said:
I think minidom is designed to handle the very same documents that
cElementTree is designed to handle (namely, documents that fit into
memory).

I do not doubt that a machine running a cElementTree application can handle
exactly the same documents as a machine with, say, ten times as much memory
that runs a minidom application. However, when deciding which library to
choose for a new application, it does matter what hardware you want to use
it on. And if you can handle multiple times larger documents on the same
hardware, that might be as much of a reason to choose cElementTree as the
(likely) shorter and more readable code (which usually translates into
shorter development and debugging times) and the higher execution speed.
Honestly, I haven't seen a reason in a while why preferring minidom over
any of the ElementTree derivatives would be a good idea when starting a new
application.

Stefan

rparimi · Mar 8, 2009

Stefan said:
rparimi said:

I am trying to process an xml file that contains unicode characters
(see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
entire content of the website into an xml file. Using
xml.dom.minidom, I wrote a few lines of python code to parse out the
xml file, but am stuck with the following error:

>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
>>> titles = dom.getElementsByTagName("title")
>>> for title in titles:
...     print "childNode = ", title.childNodes
...
childNode = [<DOM Text node "Sanskrit N...">]
childNode = [<DOM Text node "Sanskrit N...">]
childNode = []
childNode = []
childNode = [<DOM Text node "1-1-1">]
childNode = Traceback (most recent call last):
File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)

That's because you are printing it out to your console, in which case you
need to make sure it's encoded properly for printing. repr() might also help.

Regarding minidom, you might be happier with the xml.etree package that
comes with Python 2.5 and later (it's also available for older versions).
It's a lot easier to use, more memory friendly and also much faster.

Stefan

Thanks for the reply. I didn't realize that printing to console was
causing the problem. I am now able to parse out the relevant portions
of my xml file. Will also look at the xml.etree module.

Regards
