processing XHTML1.1 documents with xml.sax

webworldL · Aug 7, 2004

Has anybody had any luck processing XHTML 1.1 documents with xml.sax?
Whenever I try it, Python loads the W3C DTD referenced at the top of the
document, then fails saying that there's an error in the external DTD.
All I need to do is rip through a bunch of XHTML documents and extract
some data. Does anybody know a quick way to do this without SAX making
outgoing network connections and fussing with DTDs?

BTW, here is the code to reproduce the error, if anybody cares.
Below is a document, 'hello.html', produced by the W3C's Amaya:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1" />
<title>Hello World</title>
<meta name="generator" content="amaya 8.5, see
http://www.w3.org/Amaya/" />
</head>

<body>
<p>hello world!</p>
</body>
</html>

and the script:

import xml.sax
import xml.sax.handler

# A no-op ContentHandler is enough to trigger the DTD fetch and the error.
xml.sax.parse("hello.html", xml.sax.handler.ContentHandler())

Running it raises this error:

SAXParseException: http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-framework-1.mod:89:0: error in processing external entity reference

Uche Ogbuji · Aug 9, 2004

Ouch. I took a brief look at this and expat has a problem here. I
should note that there are few more hairy stress tests of DTD
conformance than XHTMLMOD (the basis of XHTML 1.1).

Using the most recent expat, 1.95.8, something weird happens:

[uogbuji@borgia xmlwf]$ xmlwf -p ~/foo.xhtml
/home/uogbuji/http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd: No such file or directory
/home/uogbuji/foo.xhtml:3:52: error in processing external entity reference

xmlwf is a little confused here: it treats the http:// system identifier
as a path relative to my home directory rather than as a URL. I tried as
much fiddling as I had time for, but I think there's little recourse but
for you to submit a bug report to the expat project:

http://sourceforge.net/tracker/?group_id=10127&atid=110127

In the meantime, change your DOCTYPE to use XHTML 1.0 (which *does* work
with expat) rather than 1.1.
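
If editing the DOCTYPE in every document is not practical, another option
may be to tell the SAX reader not to load external entities at all, so that
it never fetches the DTD. The following is only a sketch, assuming a current
Python 3 standard library with the default expat-based reader; the
TextCollector class and extract_text function are hypothetical names made up
for illustration, and the sample document is a trimmed copy of the
hello.html from the question. Entities that are defined only in the DTD
(for example &nbsp;) may not be expanded when the DTD is skipped.

```python
import io
import xml.sax
from xml.sax.handler import (
    ContentHandler,
    feature_external_ges,
    feature_external_pes,
)

# Trimmed copy of hello.html from the question, inlined so the sketch
# runs without any files on disk.
HELLO_XHTML = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Hello World</title>
</head>
<body>
<p>hello world!</p>
</body>
</html>
"""


class TextCollector(ContentHandler):
    """Hypothetical handler: gathers the text of each element named `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = []
        self._buffer = None  # None means "not inside a wanted element"

    def startElement(self, name, attrs):
        if name == self.wanted:
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.wanted and self._buffer is not None:
            self.found.append("".join(self._buffer))
            self._buffer = None


def extract_text(data, wanted):
    """Parse XHTML bytes without loading the external DTD."""
    parser = xml.sax.make_parser()
    # Ask the reader not to load external general or parameter entities,
    # which includes the external DTD subset named in the DOCTYPE.
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    handler = TextCollector(wanted)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.found


paragraphs = extract_text(HELLO_XHTML, "p")
titles = extract_text(HELLO_XHTML, "title")
print(paragraphs, titles)
```

For files on disk, the same parser object should accept a filename in place
of the BytesIO wrapper, e.g. parser.parse("hello.html").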

Good luck.

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com