Scraping <table> from a website

cskilbeck · Nov 19, 2007

Hi,

I need to extract everything between <table> and </table> on a website
(there's only one table on the page). So far I have:

require 'open-uri'
page = open('http://xxx.html').read
page.gsub!(/\n/,"")
page.gsub!(/\r/,"")
inner = page.scan(%r{.*<table.*>(.*)</table>.*}m)
print inner

but inner is empty - any ideas?

If I substitute line 2 with

page = '123<table>456</table>789'

I get inner = 456, which is correct.

Alex LeDonne · Nov 19, 2007

Untested, but try:

inner = page.scan(%r{.*<table[^>]*>(.*)</table>.*}m)

If you try page = '123<table><tr><td>456</td></tr></table>789', it
will fail again.

You only want the opening tag to match up to its own closing angle
bracket. What's happening is that the second .* is matching the contents
of the entire table, up to the closing angle bracket of the last tag
(probably </tr>) right before the </table>, and inner gets only the
leftover whitespace in between. So, inside the opening tag, only match
characters that are NOT a closing angle bracket.

-Alex
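
The greedy-matching behaviour described in this reply can be checked offline with a small sketch. The sample string, the constant SAMPLE, and the two method names are made up for illustration, and `[^>]*` is one reading of "characters that are NOT a closing angle bracket", not necessarily the exact pattern that was posted:

```ruby
# Hypothetical sample input for illustration; not the page from the original post.
SAMPLE = '123<table border="1"><tr><td>456</td></tr></table>789'

# Original pattern: the greedy `.*` inside `<table.*>` runs on to the last `>`
# before `</table>`, so the capture group is left with an empty string.
def greedy_capture(page)
  page.scan(%r{.*<table.*>(.*)</table>.*}m)
end

# Negated class: `[^>]*` cannot cross a `>`, so the opening tag ends at its own `>`.
def bounded_capture(page)
  page.scan(%r{.*<table[^>]*>(.*)</table>.*}m)
end

p greedy_capture(SAMPLE)   # prints [[""]]
p bounded_capture(SAMPLE)  # prints [["<tr><td>456</td></tr>"]]
```

Note that scan returns an array of capture arrays, which is why `print inner` shows nothing at all when the only capture is an empty string.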

Rolando Abarca · Nov 19, 2007

Use the right tool for the job:

require 'hpricot'
require 'open-uri'

doc = Hpricot(open('http://xxx.html'))
table = doc.at('table')
puts table.inner_html

(not tested)
regards,

William James · Nov 19, 2007

inner = page[ %r{<table.*?>(.*?)</table>}mi, 1]
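
A minimal sketch of how that one-liner behaves, assuming a made-up input string and a hypothetical wrapper method named table_inner (neither is from the thread). It shows the non-greedy quantifiers stopping at the first possible match, the `i` flag accepting upper-case tags, and the `m` flag letting `.` cross newlines:

```ruby
# Hypothetical helper name; it only wraps the one-liner above.
def table_inner(page)
  # String#[] with a regexp and an index returns that capture group,
  # or nil when the pattern does not match.
  page[%r{<table.*?>(.*?)</table>}mi, 1]
end

# Made-up sample with newlines and upper-case tags.
html = "<p>intro</p>\n<TABLE class=\"x\">\n<tr><td>456</td></tr>\n</TABLE>\n<p>outro</p>"

p table_inner(html)        # prints "\n<tr><td>456</td></tr>\n"
p table_inner("no table")  # prints nil
```

Because `m` makes `.` match newlines in Ruby, the two gsub! calls that strip \n and \r should not be needed with this pattern.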

cskilbeck · Nov 19, 2007

Thanks all for your help. Non-greedy matching is the key.

Thufir · Nov 20, 2007

require 'hpricot'
require 'open-uri'

doc = Hpricot(open('http://xxx.html'))
table = doc.at('table')
puts table.inner_html

Amazing -- I thought that the above would be a massive project, not what
appears to be pseudo-code! Not everything in Ruby is magically easy, but
the above is pretty good.

-Thufir
