rparimi
Hello,
I am trying to process an XML file that contains Unicode characters
(see http://vyakarnam.wordpress.com/). WordPress allows exporting the
entire content of the website into an XML file. Using
xml.dom.minidom, I wrote a few lines of Python code to parse the
XML file, but am stuck with the following error:
.... print "childNode = ", title.childNodes
....
childNode = [<DOM Text node "Sanskrit N...">]
childNode = [<DOM Text node "Sanskrit N...">]
childNode = []
childNode = []
childNode = [<DOM Text node "1-1-1">]
childNode = Traceback (most recent call last):
File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)
Python stopped with this error when it reached the following node:
<title>अनॠ</title>
The XML declaration tells me that the document is UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
I am running Python 2.5.1 on Mac OS X 10.5.6 and my locale settings are
as below:
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
I googled around for similar errors, and tried converting with
unicode(), but that didn't help either:
>>> foo = unicode(titles[5].childNodes)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)
I'm a novice with Unicode, and am not sure how best to
handle the Unicode text I'm dealing with (Devanagari). Any
suggestions will be helpful.
Thanks
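
For illustration, here is a minimal sketch of the situation described above. It rests on assumptions: SAMPLE is a made-up stand-in for the WordPress export (the real file is not shown), the Devanagari title in it is hypothetical, and title_texts and ascii_fails are helper names invented for this example. It reads the text out of each title node instead of printing the node list, and encodes it to UTF-8 explicitly rather than relying on the implicit ASCII conversion, which appears to be what raises the UnicodeEncodeError.

```python
# -*- coding: utf-8 -*-
from xml.dom import minidom

# Hypothetical stand-in for the WordPress export; the title text is made up.
SAMPLE = (u'<?xml version="1.0" encoding="UTF-8"?>'
          u'<rss><item><title>\u0938\u0902\u0938\u094d\u0915\u0943\u0924'
          u'</title></item></rss>')


def title_texts(xml_bytes):
    """Return the text of every <title> element as a Unicode string."""
    dom = minidom.parseString(xml_bytes)
    texts = []
    for title in dom.getElementsByTagName("title"):
        # Join the data of the text children rather than printing the
        # node list itself.
        texts.append(u"".join(child.data for child in title.childNodes
                              if child.nodeType == child.TEXT_NODE))
    return texts


def ascii_fails(text):
    """Return True if text cannot be encoded with the 'ascii' codec."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


# Parse the UTF-8 encoded bytes, then encode each title explicitly.
titles = title_texts(SAMPLE.encode("utf-8"))
encoded = [t.encode("utf-8") for t in titles]
```

Under these assumptions, writing the values in encoded to a UTF-8 terminal or file should avoid the implicit ASCII step; whether this is the right approach for the real export is something I have not verified.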