Rogan Dawes
Hi folks,
I am trying to build an "advanced" spider, with support for
JavaScript/DHTML links.
In order to do this, I'm trying to find a way of parsing an HTML page
into a DOM, while still allowing any JavaScript on the page to execute
if needed.
It seems that the best way to do this is to parse the HTML using a
SAX-based HTML parser (e.g. TagSoup), check each tag as it is processed
to see if it is script related, and if so, process the tag (e.g. source
the script from the provided URL, or evaluate the inline script) before
passing the SAX event on to a class that builds the DOM.
I imagine the following structure (a rough sketch of the wiring follows
the list):

1. Create a JavaScript interpreter (Rhino).

2. Create a new, empty W3C Document and pass it to Rhino, so that
script calls that reference "document" have something to work with.

3. Create a Reader from the HTML source, and use that as the input to
the TagSoup parser.

4. Register a SAX handler with the TagSoup parser that checks whether
each tag is script related and, if there is anything script related
that needs doing (e.g. sourcing the script in the Rhino interpreter),
does it, and then passes the event on to a SAX handler that actually
turns the event into the corresponding DOM element.

5. When the parser has finished, identify any "onload" events (or other
events that should be executed post-load), and use the Rhino
interpreter to execute them.
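
Here is a rough, unverified sketch of how I imagine the wiring, using
Rhino's Context API and TagSoup's SAX parser. ScriptFilter is my own
hypothetical handler (sketched after the next paragraph), and exposing
the raw Java Document via javaToJS only gives LiveConnect-style access
to the W3C API, not a browser-like document object, so things like
document.write would need extra plumbing:

    import java.io.Reader;
    import java.io.StringReader;

    import javax.xml.parsers.DocumentBuilderFactory;

    import org.ccil.cowan.tagsoup.Parser;
    import org.mozilla.javascript.Context;
    import org.mozilla.javascript.Scriptable;
    import org.mozilla.javascript.ScriptableObject;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class ScriptAwareSpider {

        public static void main(String[] args) throws Exception {
            // Enter a Rhino context and create the top-level scope.
            Context cx = Context.enter();
            try {
                Scriptable scope = cx.initStandardObjects();

                // An empty W3C Document, exposed to scripts as "document".
                Document document = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder().newDocument();
                ScriptableObject.putProperty(scope, "document",
                        Context.javaToJS(document, scope));

                // TagSoup's Parser is a SAX2 XMLReader; ScriptFilter
                // intercepts script-related events. The SAX-to-DOM builder
                // would be chained behind it via filter.setContentHandler()
                // -- which is exactly the missing piece I ask about below.
                XMLReader parser = new Parser();
                ScriptFilter filter = new ScriptFilter(cx, scope);
                parser.setContentHandler(filter);

                Reader html = new StringReader(
                        "<html><body onload='document.createElement(\"p\")'>"
                        + "<script>var x = 1;</script></body></html>");
                parser.parse(new InputSource(html));

                // Post-load: execute any deferred handlers such as onload.
                String onload = filter.getOnloadScript();
                if (onload != null)
                    cx.evaluateString(scope, onload, "onload", 1, null);
            } finally {
                Context.exit();
            }
        }
    }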
So, the missing bits are basically the logic that checks whether a tag
is script related, and the actual document builder that converts SAX
events into DOM elements.
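
To make the first missing bit concrete, here is roughly what I have in
mind for the script-checking handler. It is only a sketch: ScriptFilter
is my own name, the fetch(src) call stands in for WebScarab's HTTP
code, and error handling is omitted:

    import org.mozilla.javascript.Context;
    import org.mozilla.javascript.Scriptable;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.XMLFilterImpl;

    // Sits between TagSoup and the SAX-to-DOM builder. Script content
    // is collected and handed to Rhino; onload attributes are remembered
    // for post-parse execution. Events are forwarded downstream via the
    // ContentHandler registered with setContentHandler().
    public class ScriptFilter extends XMLFilterImpl {

        private final Context cx;
        private final Scriptable scope;

        private StringBuilder script; // non-null while inside <script>
        private String onload;

        public ScriptFilter(Context cx, Scriptable scope) {
            this.cx = cx;
            this.scope = scope;
        }

        public void startElement(String uri, String localName,
                String qName, Attributes atts) throws SAXException {
            if ("script".equalsIgnoreCase(localName)) {
                String src = atts.getValue("src");
                if (src != null) {
                    // fetch(src) is a placeholder for the spider's own
                    // HTTP code:
                    // cx.evaluateString(scope, fetch(src), src, 1, null);
                }
                script = new StringBuilder(); // collect inline script
            } else if ("body".equalsIgnoreCase(localName)) {
                onload = atts.getValue("onload"); // run post-load
            }
            super.startElement(uri, localName, qName, atts);
        }

        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (script != null)
                script.append(ch, start, length); // inline script body
            super.characters(ch, start, length);
        }

        public void endElement(String uri, String localName,
                String qName) throws SAXException {
            if ("script".equalsIgnoreCase(localName) && script != null) {
                cx.evaluateString(scope, script.toString(),
                        "inline", 1, null);
                script = null;
            }
            super.endElement(uri, localName, qName);
        }

        public String getOnloadScript() {
            return onload;
        }
    }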
Firstly, does this sound like a reasonable approach?
Secondly, does anyone know of any GPL-compatible implementations of a
SAX-to-DOM converter? Or is there some (hidden?) interface in JAXP that
supports this functionality?
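
The closest thing I have found in JAXP so far is the identity
transformer route: javax.xml.transform.sax.SAXTransformerFactory can
hand out a TransformerHandler that implements ContentHandler and writes
whatever events it receives into a DOMResult. Something like this looks
like it should work, though I have not verified it:

    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMResult;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;

    import org.w3c.dom.Document;
    import org.xml.sax.ContentHandler;

    public class SAXToDOM {
        // Returns a ContentHandler that appends everything it receives
        // to the given document, via JAXP's identity transformer. The
        // cast assumes the default TransformerFactory supports SAX
        // (true for Xalan and the JDK's built-in factory); strictly one
        // should check getFeature(SAXTransformerFactory.FEATURE) first.
        public static ContentHandler domBuilder(Document document)
                throws Exception {
            SAXTransformerFactory stf =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = stf.newTransformerHandler();
            handler.setResult(new DOMResult(document));
            return handler;
        }
    }

If that does work, filter.setContentHandler(domBuilder(document)) would
complete the chain. One thing I would still need to check is whether
the tree is built incrementally or only materialised at endDocument,
since inline scripts run mid-parse and would want to see the partial
document.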
Thanks
Rogan
P.S. This will eventually make its way into WebScarab, a GPL web
application security analysis tool. More info at
http://www.owasp.org/software/webscarab.html