erik_midtskogen
Hi Folks,
I'm writing a general-purpose HTML screen-scraping framework in Java
(scrape new web sites without writing new code, yada yada...), and I
want to use the JAXP DOM API along with XPath and XSLT for most of my
business logic. I actually hope to make this an open-source project if
I can ever get it to some reasonable level of usability.
My problem is that, since the slurry pumped out by most web sites bears
only the faintest resemblance to HTML--let alone XML--I need to use a
special-purpose SAX parser that is intentionally not fully
SAX-compliant (since it accepts malformed documents).
I already know how to set the system property for an arbitrary SAX
parser when programming to the SAX API (i.e. when calling
SAXParserFactory.newInstance()), and I also know how to specify an
arbitrary DocumentBuilderFactory when using DOM. So, how do I specify
the SAX parser that I want DOM to use "behind the scenes"?
My expectation was that the JAXP DOM implementation should be a client
of the JAXP SAX implementation. I could be wrong about this, though.
I'm looking at the code now, and although it's a bit hard to follow
(and my Eclipse debugger bugs out at just the wrong moment), it appears
as if the default JAXP DocumentBuilderFactory is hard-coded to use an
org.apache.xerces.parsers.XML11Configuration as a SAX parser. Weird.
I could be mistaken about this, but if it's true, then this is not my
idea of pluggability.
So here's where I am so far: I wrote a custom SAXParserFactory to
create an instance of my custom SAX parser, and I plugged it in and
tested it out using the SAX API and it worked just fine. But then when
I tried using the DOM API for my XPath/XSLT processing, specifying my
custom SAXParserFactory as before, I found that the JAXP DOM
implementation did not use the SAXParserFactory I had specified, and so,
obviously, it didn't use the SAX parser I wanted.
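To make the symptom concrete, here is a minimal sketch of what I'm seeing, not my actual code. It assumes a deliberately bogus, hypothetical factory class name (com.example.NoSuchSaxParserFactory) in place of my real SAXParserFactory, so that it is easy to tell which lookups consult the javax.xml.parsers.SAXParserFactory system property: the SAX lookup fails on the bogus name, while the DOM lookup carries on with its default parser as if nothing had been set.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FactoryLookupDemo {
    // System property that SAXParserFactory.newInstance() consults.
    static final String SAX_PROP = "javax.xml.parsers.SAXParserFactory";
    // Hypothetical class name that does not exist; a stand-in for a custom factory.
    static final String BOGUS = "com.example.NoSuchSaxParserFactory";

    // Returns true if the SAX lookup tried (and failed) to load the class
    // named by the system property, i.e. the property was honored.
    static boolean saxLookupHonorsProperty() {
        try {
            SAXParserFactory.newInstance();
            return false;
        } catch (FactoryConfigurationError e) {
            return true;
        }
    }

    // Parses a small document through the DOM API and returns the root
    // element name. The DOM factory is looked up independently of SAX_PROP.
    static String parseWithDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String previous = System.getProperty(SAX_PROP);
        System.setProperty(SAX_PROP, BOGUS);
        try {
            System.out.println("SAX lookup honored property: " + saxLookupHonorsProperty());
            System.out.println("DOM parsed root element: " + parseWithDom("<page/>"));
        } finally {
            // Restore whatever was configured before the experiment.
            if (previous == null) {
                System.clearProperty(SAX_PROP);
            } else {
                System.setProperty(SAX_PROP, previous);
            }
        }
    }
}
```

If the DOM implementation were a client of the JAXP SAX lookup, I would expect the DOM parse above to fail as well once the property points at a class that cannot be loaded.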
I could try building my own DocumentBuilderFactory, but that looks like
an awful lot of work just to plug in a SAX parser. Does anyone here
know of an easier way?
Many thanks in advance.