Rogan Dawes
Hi folks,
I am trying to build an "advanced" spider, with support for
JavaScript/DHTML links.
In order to do this, I'm trying to find a way of parsing an HTML page
into a DOM, while still allowing any JavaScript on the page to execute
if needed.
It seems that the best way to do this is to parse the HTML using a
SAX-based HTML parser (e.g. TagSoup), check each tag as it is processed
to see if it is script related, and if so, process the tag (e.g. source
the script from the provided URL, or evaluate the inline script) before
passing the SAX event on to a class that builds the DOM.
I imagine the following structure (a rough sketch of the wiring follows
the list):

1. Create a JavaScript interpreter (Rhino).

2. Create a new, empty W3C Document and pass it to Rhino, so that
script calls that reference "document" have something to work with.

3. Create a Reader from the HTML source, and use that as the input to
the TagSoup parser.

4. Register a SAX handler with the TagSoup parser that checks whether
each tag is script related and, if there is anything script related
that needs doing (e.g. sourcing the script in the Rhino interpreter),
does it, and then passes the event on to a SAX handler that actually
turns the event into the corresponding DOM element.

5. When the parser has finished, identify any "onload" events (or other
events that should be executed post-load), and use the Rhino
interpreter to execute them.
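
Here is a rough, unverified sketch of how I imagine the wiring, using
Rhino's Context API and TagSoup's SAX parser. ScriptFilter is my own
hypothetical handler (sketched after the next paragraph), and exposing
the raw Java Document via javaToJS only gives LiveConnect-style access
to the W3C API, not a browser-like document object, so things like
document.write would need extra plumbing:

    import java.io.Reader;
    import java.io.StringReader;

    import javax.xml.parsers.DocumentBuilderFactory;

    import org.ccil.cowan.tagsoup.Parser;
    import org.mozilla.javascript.Context;
    import org.mozilla.javascript.Scriptable;
    import org.mozilla.javascript.ScriptableObject;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class ScriptAwareSpider {

        public static void main(String[] args) throws Exception {
            // Enter a Rhino context and create the top-level scope.
            Context cx = Context.enter();
            try {
                Scriptable scope = cx.initStandardObjects();

                // An empty W3C Document, exposed to scripts as "document".
                Document document = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder().newDocument();
                ScriptableObject.putProperty(scope, "document",
                        Context.javaToJS(document, scope));

                // TagSoup's Parser is a SAX2 XMLReader; ScriptFilter
                // intercepts script-related events. The SAX-to-DOM builder
                // would be chained behind it via filter.setContentHandler()
                // -- which is exactly the missing piece I ask about below.
                XMLReader parser = new Parser();
                ScriptFilter filter = new ScriptFilter(cx, scope);
                parser.setContentHandler(filter);

                Reader html = new StringReader(
                        "<html><body onload='document.createElement(\"p\")'>"
                        + "<script>var x = 1;</script></body></html>");
                parser.parse(new InputSource(html));

                // Post-load: execute any deferred handlers such as onload.
                String onload = filter.getOnloadScript();
                if (onload != null)
                    cx.evaluateString(scope, onload, "onload", 1, null);
            } finally {
                Context.exit();
            }
        }
    }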
So, the missing bits are basically the logic that checks whether a tag
is script related, and the actual document builder that converts SAX
events into DOM elements.
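
To make the first missing bit concrete, here is roughly what I have in
mind for the script-checking handler. It is only a sketch: ScriptFilter
is my own name, the fetch(src) call stands in for WebScarab's HTTP
code, and error handling is omitted:

    import org.mozilla.javascript.Context;
    import org.mozilla.javascript.Scriptable;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.XMLFilterImpl;

    // Sits between TagSoup and the SAX-to-DOM builder. Script content
    // is collected and handed to Rhino; onload attributes are remembered
    // for post-parse execution. Events are forwarded downstream via the
    // ContentHandler registered with setContentHandler().
    public class ScriptFilter extends XMLFilterImpl {

        private final Context cx;
        private final Scriptable scope;

        private StringBuilder script; // non-null while inside <script>
        private String onload;

        public ScriptFilter(Context cx, Scriptable scope) {
            this.cx = cx;
            this.scope = scope;
        }

        public void startElement(String uri, String localName,
                String qName, Attributes atts) throws SAXException {
            if ("script".equalsIgnoreCase(localName)) {
                String src = atts.getValue("src");
                if (src != null) {
                    // fetch(src) is a placeholder for the spider's own
                    // HTTP code:
                    // cx.evaluateString(scope, fetch(src), src, 1, null);
                }
                script = new StringBuilder(); // collect inline script
            } else if ("body".equalsIgnoreCase(localName)) {
                onload = atts.getValue("onload"); // run post-load
            }
            super.startElement(uri, localName, qName, atts);
        }

        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (script != null)
                script.append(ch, start, length); // inline script body
            super.characters(ch, start, length);
        }

        public void endElement(String uri, String localName,
                String qName) throws SAXException {
            if ("script".equalsIgnoreCase(localName) && script != null) {
                cx.evaluateString(scope, script.toString(),
                        "inline", 1, null);
                script = null;
            }
            super.endElement(uri, localName, qName);
        }

        public String getOnloadScript() {
            return onload;
        }
    }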
Firstly, does this sound like a reasonable approach?
Secondly, does anyone know of any GPL-compatible implementations of a
SAX-to-DOM converter? Or is there some (hidden?) interface in JAXP that
supports this functionality?
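
The closest thing I have found in JAXP so far is the identity
transformer route: javax.xml.transform.sax.SAXTransformerFactory can
hand out a TransformerHandler that implements ContentHandler and writes
whatever events it receives into a DOMResult. Something like this looks
like it should work, though I have not verified it:

    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMResult;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;

    import org.w3c.dom.Document;
    import org.xml.sax.ContentHandler;

    public class SAXToDOM {
        // Returns a ContentHandler that appends everything it receives
        // to the given document, via JAXP's identity transformer. The
        // cast assumes the default TransformerFactory supports SAX
        // (true for Xalan and the JDK's built-in factory); strictly one
        // should check getFeature(SAXTransformerFactory.FEATURE) first.
        public static ContentHandler domBuilder(Document document)
                throws Exception {
            SAXTransformerFactory stf =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = stf.newTransformerHandler();
            handler.setResult(new DOMResult(document));
            return handler;
        }
    }

If that does work, filter.setContentHandler(domBuilder(document)) would
complete the chain. One thing I would still need to check is whether
the tree is built incrementally or only materialised at endDocument,
since inline scripts run mid-parse and would want to see the partial
document.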
Thanks
Rogan
P.S. This will eventually make its way into WebScarab, a GPL web
application security analysis tool. More info at
http://www.owasp.org/software/webscarab.html