Tom N
I have some existing web scraping Java code which uses an old
sourceforge project called jacobie (now an orphan and only supports
IE6).
I'm looking for a recommended open-source replacement, one that is
popular and easy to use (popular means unlikely to become an orphan).
My existing code uses Jacobie to navigate the web, and then has ad-hoc
HTML parsing code in Java. Jacobie is a Java class library that uses
the jacob project to drive Internet Explorer. Jacobie is sparsely
documented and any attempts by me to understand or modify it become
bogged down in the inherently un-understandable Windows/IE interface -
yuk.
The web navigation required by my code is fairly straightforward and
will have to be rewritten for a new library (no big deal), but I don't
want to rewrite the parsing, so I need something that will allow easy
access to the raw HTML.
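To illustrate the split I have in mind, here is a minimal sketch. All
names in it (PageSource, CannedPageSource, extractTitle) are hypothetical
and not taken from Jacobie or any of the libraries mentioned; the title
extraction is only a stand-in for my real parsing code. The idea is that
only the PageSource implementation would need rewriting for a new
library, while the parsing keeps receiving a plain String of HTML.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScraperSketch {

    /** Hypothetical navigation layer: the only part that depends on the browser library. */
    interface PageSource {
        String rawHtml(String url);
    }

    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Stand-in for existing ad-hoc parsing code, which only ever sees the raw HTML. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    /** Canned pages, so the sketch runs without a browser or a network connection. */
    static class CannedPageSource implements PageSource {
        private final Map<String, String> pages = new HashMap<String, String>();

        CannedPageSource add(String url, String html) {
            pages.put(url, html);
            return this;
        }

        public String rawHtml(String url) {
            String html = pages.get(url);
            return html == null ? "" : html; // unknown URL: empty page
        }
    }

    public static void main(String[] args) {
        PageSource source = new CannedPageSource()
                .add("http://example.com/", "<html><head><title> Example </title></head></html>");
        // The parsing side neither knows nor cares which library fetched the page.
        System.out.println(extractTitle(source.rawHtml("http://example.com/")));
    }
}
```

A real implementation of PageSource would wrap whichever library ends up
being chosen, assuming it exposes the page source as a String.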
On the other hand, for the future, something that allows some sort of
scriptable parsing of the web content would be good.
I'd prefer to drive an actual browser rather than using a virtual
browser (or the option of both would be good).
Currently thinking that JWebUnit[1] looks like a good candidate.
JWebUnit drives HtmlUnit[2] ("GUI-Less browser for Java programs").
There is a work-in-progress to provide an HtmlUnit interface for
Selenium[3] (front end for multiple real browsers, Firefox and IE
included) - unclear how usable this currently is.
I also came across WebDriver[9], which sounds similar in concept. It
seems to support IE, Firefox and HtmlUnit. I get the feeling that it is
a less broadly supported project than JWebUnit.
I don't know if these approaches support any kind of scriptable parsing.
Perhaps that is a separate issue because I could easily use a completely
separate tool to parse the HTML once navigated and retrieved.
Having a quick read of the web page for Web-Harvest[4] suggests it may
be a good avenue for future parsing with less pain than current ad-hoc
code.
Any comments/suggestions?
Browsers:
Firefox seems to be the most likely host browser - I have no particular
need to use any specific browser (other than it being the latest
version of that browser).
Currently I am using IE6 with jacobie. Also have installed Firefox
(latest), Chrome (latest) and Opera (latest). Obviously, IE6 is getting
long in the tooth with IE8 out now. I have not upgraded to IE7 or IE8
because I am not sure whether jacobie will work with them. At some
stage, I'd like to move from Win XP to Windows 7 but I don't want to use
anything proprietary so moving to Windows 7 should not be an issue
(apart from Win7 likely not supporting IE6 or IE7).
Development tools: NetBeans 6.5
Platform: Windows XP.
[1] http://jwebunit.sourceforge.net/
[2] http://htmlunit.sourceforge.net/
[3] http://seleniumhq.org/projects/remote-control/
[4] http://web-harvest.sourceforge.net/
[5] http://simile.mit.edu/wiki/Solvent
[6] http://simile.mit.edu/wiki/Piggy_Bank
[7] http://en.wikipedia.org/wiki/Xquery
[8] http://en.wikipedia.org/wiki/Xpath
[9] http://code.google.com/p/webdriver/