Peter Szinek
This is long overdue (0.2.8 has been out for about a week already), but
anyway, here we go:
============
What's this?
============
scRUBYt! is a very easy to learn and use, yet powerful web scraping
framework based on Hpricot and Mechanize. Its purpose is to free you
from the drudgery of web page crawling, looking up HTML tags,
attributes, XPaths, form names and other typical low-level web scraping
woes by figuring these out from your examples, copy'n'pasted from the
web page.
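
To give a feel for the example-driven approach, here is a minimal,
hedged sketch (the URL, the pattern names and the example strings are
placeholders rather than a real site, and the DSL details are recalled
from the 0.2.x docs, so treat it as illustration only):

  require 'rubygems'
  require 'scrubyt'

  # Describe the records by example: scRUBYt! works out the XPaths
  # from the copy'n'pasted strings on its own.
  data = Scrubyt::Extractor.define do
    fetch 'http://www.example.com/search?q=ipod'
    record do
      item_name 'Apple iPod nano 4GB Silver'  # example copied from the page
      price     '$149.00'                     # ditto
    end
  end

  puts data.to_xml
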
=========
CHANGELOG
=========
[NEW] Download pattern: download the file pointed to by the
parent pattern
[NEW] Checking checkboxes (see the sketch after this list)
[NEW] Basic authentication support
[NEW] Default values for missing elements (basic version)
[NEW] Possibility to resolve relative paths against a custom URL
[NEW] First simple version of to_csv and to_hash
[NEW] Complete rewrite of the exporting system (Credit: Neelance)
[NEW] First version of smart regular expressions: they are constructed
from examples, just as the XPaths are (Credit: Neelance)
[NEW] Possibility to click the n-th link
[FIX] Clicking on links using scRUBYt!'s advanced example lookup (e.g.
you can use :begins_with etc.)
[NEW] Forcing writing the text of non-leaf nodes with :write_text => true
[NEW] Possibility to set a custom user agent; the default user agent is
now Konqueror
[FIX] Crawling to detail pages now works when leaving the
original site (Credit: Michael Mazour)
[FIX] The '//' problem: if the relative URL contained two
slashes, the fetching failed
[FIX] Removed the assumption that documents have a list of nested
elements (Credit: Rick Bradley)
[FIX] Crawling to detail pages also works if the parent pattern is
a string pattern
[FIX] Shortcut URL fixed again
[FIX] Regexp pattern fixed in case its parent was a string pattern
[FIX] Refactored the core classes; lots of bug fixes and stabilization
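
And a similarly hedged sketch tying a few of the new 0.2.8 options
together (the URL, the checkbox name and the example string are made
up, and the method and option names follow the changelog wording
rather than verified API, so check the docs before relying on them):

  data = Scrubyt::Extractor.define do
    fetch          'http://www.example.com/search'
    check_checkbox 'in_stock_only'   # [NEW] checking checkboxes
    submit
    record do
      # :write_text => true forces writing the text of this non-leaf node
      item 'Some example item', :write_text => true
    end
  end

  puts data.to_csv   # [NEW] first simple version of to_csv
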
=============
Misc comments
=============
As of 0.2.8, scRUBYt! depends on ParseTree and Ruby2Ruby; unfortunately,
ParseTree is not that trivial to set up on Windows. However, we are
currently working on a new project to solve this problem, and we are
making quite good progress, so I believe that by the next release,
0.3.0, this obstacle will be blown away. Until then, Windows users
should install scRUBYt! under Cygwin, get ParseTree working somehow, or
stay on 0.2.6 until we are ready with the Ruby bridge to ParseTree,
which will make installation on Windows possible without the need to
compile C code.
Please continue to report problems, discuss things or give any kind of
feedback on the scRUBYt! forum at
http://agora.scrubyt.org
Cheers,
Peter - on behalf of the scRUBYt! devel team
--
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.