webworldL
Has anybody had any luck processing XHTML 1.1 documents with xml.sax?
Whenever I try it, Python fetches the W3C DTD named in the DOCTYPE, then
crashes saying that there's an error in the external DTD.
All I need to do is rip through a bunch of XHTML documents and extract
some data. Does anybody know a quick way to do this without sax making
outgoing network connections and fussing with DTDs?
BTW, here is the code to reproduce the error, if anybody cares.
Below is a document, 'hello.html', produced by the W3C's Amaya:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1" />
<title>Hello World</title>
<meta name="generator" content="amaya 8.5, see
http://www.w3.org/Amaya/" />
</head>
<body>
<p>hello world!</p>
</body>
</html>
And the script:
import xml.sax
import xml.sax.handler

# Parse with a do-nothing handler; this alone is enough to trigger the error.
xml.sax.parse("hello.html", xml.sax.handler.ContentHandler())
Running it throws the following error:
SAXParseException:
http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-framework-1.mod:89:0:
error in processing external entity reference
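One possible way around this, as a minimal sketch rather than a definitive fix: build the parser explicitly and switch off the external general entities feature before parsing, so the reader should not try to fetch the DTD at all. This assumes the default expat-based reader. The names ParagraphCollector, parse_without_dtd, and SAMPLE are made up for illustration; SAMPLE is a trimmed in-memory copy of hello.html so the example runs without the file.

```python
import io
import xml.sax
import xml.sax.handler


class ParagraphCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler for illustration: gathers the text of each <p>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def startElement(self, name, attrs):
        if name == "p":
            self._in_p = True
            self.paragraphs.append("")

    def characters(self, content):
        # characters() may be called several times per text node, so append.
        if self._in_p:
            self.paragraphs[-1] += content

    def endElement(self, name):
        if name == "p":
            self._in_p = False


def parse_without_dtd(source, handler):
    """Parse a filename or file-like object without loading external entities."""
    parser = xml.sax.make_parser()
    # Assumption: the default expat-based reader, which accepts this feature.
    parser.setFeature(xml.sax.handler.feature_external_ges, False)
    parser.setContentHandler(handler)
    parser.parse(source)


# Trimmed in-memory copy of hello.html, DOCTYPE included.
SAMPLE = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Hello World</title>
</head>
<body>
<p>hello world!</p>
</body>
</html>
"""

collector = ParagraphCollector()
parse_without_dtd(io.BytesIO(SAMPLE), collector)
print(collector.paragraphs)
```

For files on disk, passing the filename should work the same way, e.g. parse_without_dtd("hello.html", collector). One caveat: with the DTD skipped, entities that are defined only in the DTD (such as &nbsp;) are not expanded, so documents that rely on them may need extra handling.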