How to get data from an HTML table

Vikash Kumar

I want to store the values of a table in separate variables. I have the
following table structure:

<table width="579">
<tr class="even">
<td class width="65">&nbsp;Case5-04</td>
<td class width="130">10/11/2006 23:24:33</td>
<td class width="61">Case5-04</td>
<td class width="32">1005</td>
<td class width="59">Sell</td>
<td class width="36">1,000</td>
<td class width="34">ARP</td>
<td class width="52">$36.90</td>
</tr>
<tr class="odd">
<td class width="65">&nbsp;Case5-03</td>
<td class width="130">10/11/2006 23:20:07</td>
<td class width="61">Case5-03</a></td>
<td class width="32">1005</td>
<td class width="59">Buy</td>
<td class width="36">1,500</td>
<td class width="34">ARP</td>
<td class width="52">$36.70</td>
</tr>
<tr class="even">
<td class width="65">&nbsp;Case4-04</td>
<td class width="130">10/11/2006 05:28:54</td>
<td class width="61">Case4-04</a></td>
<td class width="32">1004</td>
<td class width="59">Sell</td>
<td class width="36">300</td>
<td class width="34">RIL</td>
<td class width="52">$490.00</td>
</tr>
<tr class="odd">
<td class width="65">&nbsp;Case4-03</td>
<td class width="130">10/11/2006 05:21:32</td>
<td class width="61">Case4-03</a></td>
<td class width="32">1004</td>
<td class width="59">Buy</td>
<td class width="36">200</td>
<td class width="34">RIL</td>
<td class width="52">$489.90</td>
</tr>
</table>

I want to store the values in variables so that I can compare records.
Please help me work out how to do this in Ruby.
 
Peter Szinek

One possible way:

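require 'hpricot'

# Field names are guesses; the meaning of the columns was not given.
Record = Struct.new("Record", :name, :date, :name_again, :some_num,
                    :buy_link, :some_num2, :letters, :price)
records = []

# doc is assumed to hold the HTML source as a String
doc = Hpricot(doc)
cells = doc/"/table/tr/td"

# eight cells per table row, one Record per row
cells.map { |elem| elem.inner_html }.each_slice(8) do |slice|
  records << Record.new(*slice)
end

# drop the leading "$"; the prices are still compared as strings
p records.sort_by { |record| record.price[1..-1] }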

Note that since I did not know the semantics of the table cells, some
fields of the Record struct have odd names, but you get the idea.


Also, I am not 100% sure whether the sort_by should instead be done on
prices converted with to_f (probably not, because of rounding problems,
but I wonder whether comparing them as strings can cause weird issues,
too).
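
To make the string-versus-number question concrete, here is a small sketch. The price list and the helper name price_to_f are made up for illustration; they are not taken from the table above or from Hpricot.

```ruby
# Hypothetical helper: strip "$" and thousands separators, then convert.
def price_to_f(price)
  price.delete("$,").to_f
end

# Made-up prices, chosen so that the two orderings differ.
prices = ["$36.90", "$490.00", "$100.00", "$1,250.50"]

# Comparing the text after "$" is lexicographic (character by character).
by_string = prices.sort_by { |price| price[1..-1] }

# Comparing the converted values is numeric.
by_number = prices.sort_by { |price| price_to_f(price) }

p by_string
p by_number
```

With prices of different lengths the string ordering puts "$100.00" ahead of "$36.90", so converting before sorting looks like the safer choice when the values only need to be ordered.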

HTH,
Peter

__
http://www.rubyrailways.com
 
Park Heesob

Hi,
In reply to Vikash Kumar's post of Mon, 27 Nov 2006 20:20:54 +0900:

Here is another way:

After saving the HTML table text to the file 'w.xml',
you can read the values like this:

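require 'rexml/document'
include REXML

# REXML only parses well-formed XML, so the stray </a> tags, the bare
# "class" attributes and the &nbsp; entities must be cleaned out of
# w.xml first.
doc = Document.new(File.read("w.xml"))
doc.elements.each("*/tr/td") { |e|
  puts e.text
}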


Regards,

Park Heesob

 
Peter Szinek

Hello,
Digression: when solving a problem like this, it is often much easier to
write a few lines of Ruby than to try to use a high-powered library to
accomplish it.

I don't see why that is an advantage here. The first solution in this thread:

-------------------------------------------------------------------
Record = Struct.new("Record", :name, :date, :name_again, :some_num,
                    :buy_link, :some_num2, :letters, :price)
records = []

cells = Hpricot(doc)/"/table/tr/td"

cells.map { |elem| elem.inner_html }.each_slice(8) do |slice|
  records << Record.new(*slice)
end

# drop the leading "$" before comparing
p records.sort_by { |record| record.price[1..-1] }
------------------------------------------------------------------

is shorter, does not care about malformed HTML and even does the sorting
which I believe was the main intention of the OP. So why not use a
high-powered library?

Disclaimer: that solution was actually mine, but I am not referring to it
because of this, but rather because I think that parsing all the cells
with a one-liner using a robust HTML parser is actually much better in
practice than using a basic set of regexps and then patching the results
they yield with ad-hoc rules (missing close tags etc.) looked up from 3
examples. I believe the above Hpricot-powered solution will work with
100 records, too (if the other 97 do not get *really* messed up - but
in that case the regexps will fail miserably too), whereas the
we-do-not-need-any-high-powered-library approach may need another 25
patches due to the other errors in the 100-record HTML...

I do not dispute that parsing the page with regexps and seeing what's
going on under the hood can provide a lot of experience, but I am really
sure that feeding a real-life page to an HTML parser is safer than using
the regexp approach.
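
As a rough illustration of the kind of error meant here, the sketch below feeds a made-up row with a missing closing </td> to the same cell pattern the regexp-based script in this thread uses. The row is invented for the example and does not come from the original table.

```ruby
# Made-up row in which the first cell's closing </td> is missing.
row = "<tr><td>Case5-04<td>1005</td></tr>"

# Same cell pattern as the regexp-based script in this thread.
cells = row.scan(%r{<td.*?>(.*?)</td>}im).flatten

p cells       # the two cells come back merged into one string
p cells.size
```

The pattern has no way of knowing that a new <td> implies the previous cell ended, which is exactly the sort of case that would need another ad-hoc patch.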

Of course if this question is just a theoretical one, and there won't be
100 (or more than 3) records, just these 3, then forget about this mail.

Cheers,
Peter

 
Vikash Kumar

#!/usr/bin/ruby -w

# Assumption: the path of the saved HTML file is passed as the first
# command-line argument.
sourcefilename = ARGV[0]
data = File.read(sourcefilename)

output = []

# one string per table row
html_rows = data.scan(%r{<tr.*?>(.*?)</tr>}im).flatten

html_rows.each do |row|
  # filter these undesired elements
  row.gsub!("&nbsp;", "")
  row.gsub!("</a>", "")   # gsub! so the change is kept
  cells = row.scan(%r{<td.*?>(.*?)</td>}im).flatten
  output << cells
end

# done collecting, now display

output.each do |row|
  line = row.join(",")
  puts line
end
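
Once the rows are collected as arrays of strings, one possible way to get at the original goal (comparing records) is sketched below. The struct name Trade and its field names are guesses at what the columns mean, and the rows are typed in by hand in the shape the script above would collect in output.

```ruby
# Hypothetical field names; the thread never says what the columns mean.
Trade = Struct.new(:id, :timestamp, :id_again, :account, :side,
                   :quantity, :symbol, :price)

# Sample rows in the shape the script above collects in `output`.
rows = [
  ["Case5-04", "10/11/2006 23:24:33", "Case5-04", "1005", "Sell", "1,000", "ARP", "$36.90"],
  ["Case5-03", "10/11/2006 23:20:07", "Case5-03", "1005", "Buy", "1,500", "ARP", "$36.70"],
  ["Case4-04", "10/11/2006 05:28:54", "Case4-04", "1004", "Sell", "300", "RIL", "$490.00"],
  ["Case4-03", "10/11/2006 05:21:32", "Case4-03", "1004", "Buy", "200", "RIL", "$489.90"]
]

# Convert the numeric-looking columns so they compare as numbers.
trades = rows.map do |cells|
  trade = Trade.new(*cells)
  trade.quantity = trade.quantity.delete(",").to_i
  trade.price = trade.price.delete("$,").to_f
  trade
end

# Compare records: pair up the Buy and Sell entries for each symbol.
trades.group_by { |t| t.symbol }.each do |symbol, group|
  buy  = group.find { |t| t.side == "Buy" }
  sell = group.find { |t| t.side == "Sell" }
  puts "#{symbol}: bought at #{buy.price}, sold at #{sell.price}"
end
```

The timestamp is left as a string on purpose, since it is not clear from the data whether 10/11/2006 is day/month or month/day.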

What would be the right solution if someone wants to get data from the
Yahoo site http://finance.yahoo.com/q?s=IBM and then display only some
values, such as Prev Close and Last Trade? Let's suppose we go to the
URL through:

require 'watir'
include Watir
require 'hpricot'
include Hpricot
ie=Watir::IE.new
ie.goto("http://finance.yahoo.com/q?s=IBM")

Now, what's next? Also, suppose we want to get all the values of a
table whose structure we don't know: what would be the correct
solution?
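
For the unknown-structure case, one possible starting point is sketched below: it collects every table in an HTML string as nested arrays, without assuming a column count. The helper name extract_tables and the sample markup are invented for this example (it is not the real Yahoo page), and, as argued earlier in the thread, a real HTML parser is likely to cope better with messy pages than these regexps.

```ruby
# Hypothetical helper: returns one array per <table>, each holding one
# array of cell strings per <tr>. Nested tables are not handled.
def extract_tables(html)
  html.scan(%r{<table\b[^>]*>(.*?)</table>}im).flatten.map do |table|
    table.scan(%r{<tr\b[^>]*>(.*?)</tr>}im).flatten.map do |row|
      row.scan(%r{<t[dh]\b[^>]*>(.*?)</t[dh]>}im).flatten.map do |cell|
        # drop inline tags and non-breaking spaces inside the cell
        cell.gsub(/<.*?>/m, "").gsub("&nbsp;", " ").strip
      end
    end
  end
end

# Made-up markup with placeholder values, only to exercise the helper.
html = <<HTML
<table><tr><th>Last Trade:</th><td><b>100.00</b></td></tr>
<tr><th>Prev Close:</th><td>99.50</td></tr></table>
HTML

tables = extract_tables(html)

# Rows with exactly two cells can be read as label => value pairs.
quote = tables.first.to_h
puts quote["Prev Close:"]
puts quote["Last Trade:"]
```

Picking out only some values then becomes a lookup by label, provided the labels on the real page are known.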
 
