Dan Kohn
I've finally reimplemented the screen scraper I mentioned on
<http://groups.google.com/group/comp...bd4a9e48277/396cb7ea35eab14f#396cb7ea35eab14f>
using regexes and no external libraries. It is, as Daz suggested, many
times faster than REXML. My question is whether it would be smarter
(faster? easier to code?) to use htmltools or HTMLTree::Parser
instead.

Any other comments on ways to make the code faster, cleaner, and more
Ruby-like? Finally, can you please tell me why I can't get strip to
work if I switch the commenting for lines 15 and 16 (the last gsub!
and the commented-out e.strip)? It doesn't remove the leading space in
the second element of the last 6 lines. By contrast, the gsub on
line 15 does what I want.

Thanks very much in advance for any advice you can offer on which tools
to use.
# The program parses out all of the rows and then looks
# for the right kinds of cells inside. It constructs
# 2 two-dimensional arrays of the results.

require 'mechanize'

agent = WWW::Mechanize.new { |a| a.log = Logger.new(STDERR) }
page = agent.get('http://www.dankohn.com/uamileage.html').body
def table_clean(table)
  table.each { |row|
    row.each { |e|
      e.gsub!(/<.*?>|&nbsp;/m,"")
      e.gsub!(/\s+/," ")
      e.gsub!(/(^\s|\s$)/,"")
      #~ e.strip
    }
  }
end
miletable = []
summarytable = []

row = /<tr>(.*?)<\/tr>/m

milecells = /
  <td.*?class="t4">(.*?)<\/td>\s*
  <td.*?class="t4">(.*?)<\/td>\s*
  <td.*?class="t4">(.*?)<\/td>\s*
  <td.*?>(.*?)<\/td>\s*
  <td.*?class="t4">(.*?)<\/td>
/mx

summarycells = /
  <td.*?class="t3".*?>(.*?)<\/td>\s*
  <td.*?class="t3".*?>(.*?)<\/td>
/mx

activitycells = /
  <td.*?class="t4".*?>(.*?)<\/td>\s*
  <td.*?colspan=("4"|4).*?>(.*?)<\/td>
/mx
page.scan(row) { |e|
  rowtext = e.to_s
  rowtext.scan(milecells) {
    miletable << [$1,$2,$3,$4,$5]
  }
  rowtext.scan(summarycells) {
    summarytable << [$1,$2]
  }
  rowtext.scan(activitycells) {
    summarytable << [$1,$3]
  }
}

table_clean(miletable)
table_clean(summarytable)

miletable.each {|e| print e.join(":"),"\n"}
summarytable.each {|e| print e.join(":"),"\n"}
- dan
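
On the strip question, one likely explanation, offered as a sketch and not a confirmed diagnosis of the script above: String#strip returns a new string and leaves the receiver untouched, while String#strip! modifies the receiver in place, the same way the gsub! calls do. Calling e.strip inside the each block and discarding the result would therefore change nothing. The helper name clean_cell and the sample string below are hypothetical, made up for illustration and not taken from the real page.

```ruby
# Sketch only: clean_cell is a hypothetical helper, and the sample
# string is invented rather than scraped from the real page.
def clean_cell(text)
  cleaned = text.gsub(/<.*?>|&nbsp;/m, "")  # copy with tags and &nbsp; entities removed
  cleaned.gsub!(/\s+/, " ")                 # collapse runs of whitespace to one space
  cleaned.strip!                            # in place; returns nil when nothing changed
  cleaned
end

sample = " <b>United</b>&nbsp;\n  Mileage  Plus "

not_in_place = sample.dup
not_in_place.strip       # result is discarded, receiver is unchanged

in_place = sample.dup
in_place.strip!          # receiver itself loses leading and trailing whitespace

puts not_in_place == sample       # true
puts in_place == sample.strip     # true
puts clean_cell(sample).inspect   # "United Mileage Plus"
```

Because strip! returns nil when there is nothing to remove, the sketch returns the variable and does not rely on the return value of the bang method.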