R
Rodrigo Meza
Hello Everyone
For a project I am working on, I need to retrieve links from html
documents. The easy part is to obtain 'plain' links like <A
HREF="http://site/path/document">, but when those links are
javascript'ized, the only robust solution needs to load the javascript
and dom document representation in the same way that browsers do. For
example, links in the form:
<A HREF="javascript:function_declared_before("arguments"));>
First I though that using spidermonkey (the mozilla javascript
interpreter) should be enough, but in that case, I dont have the
document structure elements (like document, window, document.history,
document.form.element, etc), so I tried parsing the document using a
library to build a tree representation of it, but that leads me to the
same problem again, that is, I have to represent all tree nodes as
javascript entities.
Anybody here have worked on a similar problem? What tools do you
think I should take a look?
Thanks in advance!
Rodrigo.
For a project I am working on, I need to retrieve links from html
documents. The easy part is to obtain 'plain' links like <A
HREF="http://site/path/document">, but when those links are
javascript'ized, the only robust solution needs to load the javascript
and dom document representation in the same way that browsers do. For
example, links in the form:
<A HREF="javascript:function_declared_before("arguments"));>
First I though that using spidermonkey (the mozilla javascript
interpreter) should be enough, but in that case, I dont have the
document structure elements (like document, window, document.history,
document.form.element, etc), so I tried parsing the document using a
library to build a tree representation of it, but that leads me to the
same problem again, that is, I have to represent all tree nodes as
javascript entities.
Anybody here have worked on a similar problem? What tools do you
think I should take a look?
Thanks in advance!
Rodrigo.