Screen scraping via regex vs. htmltools (vs. REXML)

Dan Kohn

I've finally reimplemented the screen scraper I mentioned on
<http://groups.google.com/group/comp...bd4a9e48277/396cb7ea35eab14f#396cb7ea35eab14f>
using regexes and no external libraries. It is, as Daz suggested, many
times faster than REXML. My question is whether it would be smarter
(faster?, easier to code?) to use htmltools or HTMLTree::parser
instead.

Any other comments on ways to make the code faster, cleaner, and more
Ruby-like? Finally, can you please tell me why I can't get strip to
work if I swap the commenting on the last two lines inside table_clean
(the third gsub! and the commented-out e.strip)? It doesn't remove the
leading space in the second element of the last six lines of output.
By contrast, the gsub! does what I want.

Thanks very much in advance for any advice you can offer on which tools
to use.


# The program parses out all of the rows and then looks
# for the right kinds of cells inside. It constructs
# 2 two-dimensional arrays of the results.

require 'mechanize'
agent = WWW::Mechanize.new { |a| a.log = Logger.new(STDERR) }
page = agent.get('http://www.dankohn.com/uamileage.html').body

def table_clean(table)
  table.each { |row|
    row.each { |e|
      e.gsub!(/<.*?>|&nbsp;/m, "")
      e.gsub!(/\s+/, " ")
      e.gsub!(/(^\s|\s$)/, "")
      #~ e.strip
    }
  }
end

miletable = []
summarytable = []
row = /<tr>(.*?)<\/tr>/m
milecells = /
  <td.*?class="t4">(.*?)<\/td>\s*
  <td.*?class="t4">(.*?)<\/td>\s*
  <td.*?class="t4">(.*?)<\/td>\s*
  <td.*?>(.*?)<\/td>\s*
  <td.*?class="t4">(.*?)<\/td>
/mx
summarycells = /
  <td.*?class="t3".*?>(.*?)<\/td>\s*
  <td.*?class="t3".*?>(.*?)<\/td>
/mx
activitycells = /
  <td.*?class="t4".*?>(.*?)<\/td>\s*
  <td.*?colspan=("4"|4).*?>(.*?)<\/td>
/mx
page.scan(row) { |e|
  rowtext = e.to_s
  rowtext.scan(milecells) {
    miletable << [$1, $2, $3, $4, $5]
  }
  rowtext.scan(summarycells) {
    summarytable << [$1, $2]
  }
  rowtext.scan(activitycells) {
    summarytable << [$1, $3]
  }
}
table_clean(miletable)
table_clean(summarytable)
miletable.each { |e| print e.join(":"), "\n" }
summarytable.each { |e| print e.join(":"), "\n" }


- dan
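
On the strip question: a likely explanation, assuming the e.strip line is
uncommented exactly as written, is that String#strip returns a new string
and leaves the receiver untouched, so calling it inside the block and
discarding the result changes nothing. String#strip! modifies the string in
place, the way gsub! does. The sketch below shows the difference;
clean_table and the sample rows are hypothetical stand-ins, not the code
from the post.

```ruby
# String#strip returns a new string; the receiver is left unchanged.
s = " 1,234 miles "
s.strip            # result is discarded, s still has its spaces
unchanged = s.dup

# String#strip! modifies the receiver in place, like gsub! does.
s.strip!

# Hypothetical variant of table_clean that uses the in-place methods.
def clean_table(table)
  table.each do |row|
    row.each do |cell|
      cell.gsub!(/<.*?>|&nbsp;/m, "")  # drop tags and &nbsp; entities
      cell.gsub!(/\s+/, " ")           # collapse runs of whitespace
      cell.strip!                      # trim both ends in place
    end
  end
end

# Made-up sample data standing in for the scraped cells.
table = [["<b>Date</b>", " &nbsp;01/02/06\n"], ["  Miles ", "<i> 500 </i>"]]
clean_table(table)
table.each { |row| puts row.join(":") }
```

Note that strip! and gsub! return nil when they change nothing, so they are
best called for their side effect rather than chained.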
 
James Britt

Dan said:
I've finally reimplemented the screen scraper I mentioned on
<http://groups.google.com/group/comp...bd4a9e48277/396cb7ea35eab14f#396cb7ea35eab14f>
using regexes and no external libraries. It is, as Daz suggested, many
times faster than REXML. My question is whether it would be smarter
(faster?, easier to code?) to use htmltools or HTMLTree::parser
instead.

The code in your post seems to use Mechanize.
If you are using agent.get to fetch the HTML, then you've already parsed
the HTML using htmltools & REXML. You can register callback objects
that are invoked when the parsing process encounters matching nodes.
Mechanize does this automatically for certain nodes (form stuff, I
think), but you can use watch_for_set= {} to define a set of nodes to
watch for.

This is what I use to construct the product pages for rubystuff.com from
the multiple CafePress pages that contain the images, prices, and
product description. I tell Mechanize to watch for img, tr, and td
elements, and it constructs sets of custom objects of just the parts of
the source HTML matching certain criteria. Then I extract the data,
create RSS feeds, and turn those into a set of aggregated HTML pages.

What I like about this is that the parse process gives me business
objects, with (hopefully) self-explanatory behavior. For example, I can
ask one of these objects for 'product_id' or 'description'; the object
encapsulates the assorted XPath/regex code needed to get that from the
source HTML node, making the main part of the app easier to maintain.
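
The "business object" idea can be sketched without Mechanize at all. The
class below is hypothetical: ProductRow, the sample markup, and the regexes
are illustrative assumptions, not James's code and not part of any Mechanize
API. It only shows how an object can hide the extraction details behind
names like product_id and description.

```ruby
# Hypothetical wrapper around one row of source HTML. The markup and
# the patterns are made up for illustration.
class ProductRow
  def initialize(html)
    @html = html
  end

  # Numeric id taken from a link such as href="/item/12345".
  def product_id
    @html[%r{href="/item/(\d+)"}, 1]
  end

  # Text of the cell marked class="desc", with tags removed and
  # surrounding whitespace trimmed. Returns nil if no such cell exists.
  def description
    cell = @html[%r{<td class="desc">(.*?)</td>}m, 1]
    cell && cell.gsub(/<.*?>/m, "").gsub(/\s+/, " ").strip
  end
end

html = '<tr><td><a href="/item/12345">View</a></td>' \
       '<td class="desc"> Ruby <b>mug</b>, 11 oz </td></tr>'
row = ProductRow.new(html)
puts row.product_id
puts row.description
```

The calling code only sees product_id and description, so if the source
markup changes, only the two patterns inside the class need to be updated.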


James Britt

--

http://www.ruby-doc.org - Ruby Help & Documentation
http://www.artima.com/rubycs/ - Ruby Code & Style: Writers wanted
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
http://www.30secondrule.com - Building Better Tools
 
James Britt

Dan said:
Thanks for the response, James. My next question was actually about
debugging Mechanize
<http://groups.google.com/group/comp.lang.ruby/msg/04fc7473b08c16fc>.
Would you mind emailing me your scraping code, as I've been suffering
from a lack of examples to copy?

Also, are you sure Mechanize parses the whole page with get? It
doesn't wait for a find?

I don't think so, but I might be wrong. My code calls agent.get, then
goes right into looping over the collected nodes.

I'll see about putting my code together as an example.

As for debugging Mechanize, I've found it helpful to go to the lib
source and stick in some STDERR.puts calls to inspect request and
response data to be sure things are getting passed around as expected.

After that, unit tests are helpful.



James

James Britt

itsme213 said:
Any chance you could make that code available? Sounds like a useful example.

Is Mechanize also a good option for writing acceptance tests, compared to
Watir?

WATIR exposes the HTML DOM as seen by IE, which is not the raw HTML
source returned from the server (but perhaps someone more up on the
latest WATIR knows otherwise). Mechanize will get you the source HTML,
albeit sanitized for REXML parsing.

I find WATIR most useful for walking through a series of pages where
automated typing and clicking is essential. Pretty much every Web app
I've written in the last 9 months uses WATIR (plus my own custom DSL on
top of it) for functional testing. Major time saver.

I use Mechanize for data snarfing and occasional feed building.



James Britt



 
