unicode and xml/xsl

Matt Price

Hello,

I'm a python (& xml, & unicode!) newbie working on an interface to a
bibliographic reference server (refdb); I'm running into some encoding
problems & am finding the plethora of tools a little confusing. Here
is the basic situation:

I connect to the server and receive an xml document whose content is a
bibliographic dataset. The document can be encoded in two ways:
ISO-8859-1 or unicode. My program simply takes the document and
passes it to an xsl stylesheet using libxslt & libxml2. Here's the
relevant code:

    # this is how I get the results & generate either a string or a
    # unicode string
    def getref(self, query=':ID:>0', cmd='getref ',
               reftype=default_reftype):
        cmd += ' ' + query
        self.send(cmd + self.CS_TERM)
        results = self.tread()
        if self.encoding == 'UNICODE':
            print ' decoding unicode string: '
            results = results.decode('utf-8', 'replace')
        return results


    # this is where I generate the html:
    def risx_to_html(self, risxSet, xsl=xsl_ss,
                     css=css_url, use_css=1):
        styledoc = libxml2.parseFile(xsl)
        style = libxslt.parseStylesheetDoc(styledoc)
        doc = libxml2.parseDoc(risxSet)
        result = style.applyStylesheet(doc, None)
        # style.saveResultToFilename("results.html", result, 0)
        htmlString = style.saveResultToString(result)
        style.freeStylesheet()
        doc.freeDoc()
        result.freeDoc()
        return htmlString

The server's default encoding is iso-8859-1, and since I mostly use
english-language references, this usually works fine; but occasionally
the server will pass me an entity like '&mu;' (for Greek letter mu).
This generates an error like this:

Entity: line 57: parser error : Entity 'mu' not defined
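
One workaround that might apply here, offered only as a sketch: rewrite HTML-style named entities as numeric character references before the document reaches the parser, since numeric references need no DTD. This assumes the server only emits entity names from the standard HTML table. The helper name numeric_entities is made up for illustration, and the code is written for a current Python (on Python 2.3 the same table should be available as htmlentitydefs.name2codepoint).

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser already knows; leave these alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# A named entity reference such as &mu; or &eacute;
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def numeric_entities(xml_text):
    """Replace HTML named entities (e.g. &mu;) with numeric references.

    Hypothetical helper, not part of refdb or libxml2. It does a plain
    text substitution, so it would also rewrite entities inside CDATA.
    """
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined or unknown names as-is
        return "&#%d;" % name2codepoint[name]

    return _ENTITY_RE.sub(repl, xml_text)
```

The other route would be to have the document declare the entities in a DTD, which keeps the text untouched but depends on what the server can be made to send.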

This error is not so bad, because the parsing continues nonetheless.
With unicode it's worse. In this case I get a different error
depending on how I set the system up:

with iso-8859-1 set as default encoding in sitecustomize.py:

    File "/home/matt/bibpython/refdbclient.py", line 268, in risx_to_html
      doc = libxml2.parseDoc(risxSet)
    File "/usr/lib/python2.3/site-packages/libxml2.py", line 1149, in parseDoc
      ret = libxml2mod.xmlParseDoc(cur)
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)

with utf-8 set as default encoding:

    File "/home/matt/bibpython/refdbclient.py", line 268, in risx_to_html
      doc = libxml2.parseDoc(risxSet)
    File "/usr/lib/python2.3/site-packages/libxml2.py", line 1149, in parseDoc
      ret = libxml2mod.xmlParseDoc(cur)
    TypeError: xmlParseDoc() argument 1 must be string without null bytes or None, not unicode
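
Both tracebacks suggest that parseDoc wants a byte string rather than a unicode object. A minimal sketch of one way to guarantee that is to encode to UTF-8 just before parsing and make any declared encoding agree. The helper name to_parser_bytes is hypothetical, the code is written for a current Python (where byte strings are the bytes type), and whether it cures the errors above is untested.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING_RE = re.compile(
    r'^(<\?xml[^>]*?encoding=)(["\'])[A-Za-z0-9._-]+\2'
)


def to_parser_bytes(document):
    """Return a byte string for a byte-oriented XML parser.

    Hypothetical helper: byte strings pass through untouched; text is
    encoded as UTF-8, and an encoding named in the XML declaration is
    rewritten to UTF-8 so the declaration matches the actual bytes.
    """
    if isinstance(document, bytes):
        return document
    document = _DECL_ENCODING_RE.sub(
        r'\g<1>\g<2>UTF-8\g<2>', document, count=1
    )
    return document.encode("utf-8")
```

In risx_to_html this would presumably be used as libxml2.parseDoc(to_parser_bytes(risxSet)). A simpler variant of the same idea would be to skip the decode step in getref and hand the raw UTF-8 bytes from the server straight to the parser.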

So I guess I have two questions:

(1) am I using the right python tools for this job? My excellent
python books unfortunately don't cover either unicode or xml in much
depth, so I'm a little uncertain as to whether I'm doing the right
thing.

(2) Is there a way to make libxml2 parse unicode documents? Do I need
to pass it more information alerting it that it's getting unicode?

Anyway, thanks very much for your help. Much appreciated,

Matt
 
