Scan HTML

Tom Arra · Mar 1, 2008

I am new to Ruby scripting, so I am not sure if this is possible or
not. I want to make a script that will load a webpage and then search
through the HTML of that page until it hits a certain tag. Once it hits
that tag, it needs to grab all of the text between the tag and the
matching end tag. Is something like this possible?

Example
<html>
<body>
<h3>test</h3>
</body>
</html>

I want the script to return "test"

Gregory Seidman · Mar 1, 2008

You want the Hpricot gem.

require 'rubygems'
require 'hpricot'

html = <<EOF
<html>
<body>
<h3>test</h3>
</body>
</html>
EOF

doc = Hpricot(html)

puts (doc/'h3').first.inner_text

--Greg

William James · Mar 1, 2008

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[ %r{<title\s*>(.*?)</title\s*>}mi, 1 ]

If the tag can contain attributes, e.g.,
<title foo="bar">:

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[ %r{<title(?:\s*|\s+.*?)>(.*?)</title\s*>}mi, 1 ]
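
The two patterns above can be folded into one small helper so the tag name becomes a parameter. This is only a sketch: the method name `text_between` and the sample string are made up for illustration, and it keeps the limits of the regexp approach (it will not cope with nested tags of the same name).

```ruby
# Hypothetical helper: returns the text between the first <tag ...> and
# the next </tag>, or nil if the tag is not found.
def text_between(html, tag)
  name = Regexp.escape(tag)
  # (?:\s[^>]*)? allows optional attributes, /m lets . match newlines,
  # /i ignores case.
  html[%r{<#{name}(?:\s[^>]*)?>(.*?)</#{name}\s*>}mi, 1]
end

sample = %(<html>\n<body>\n<h3 class="x">test</h3>\n</body>\n</html>)
puts text_between(sample, 'h3')   # prints "test"
```

To use it on a live page, the string returned by `.get('/').body` could be passed in place of `sample`.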

William James · Mar 1, 2008

Gregory said:
You want the Hpricot gem.

No, he doesn't.

Marc Heiler · Mar 1, 2008

Gregory said:
You want the Hpricot gem.

Personally I agree on that, insofar as I think the simplest,
"default" Ruby solution is better than a specialized one. In this case I
think the better solution is Net::HTTP.

Todd Benson · Mar 1, 2008

William said:
No, he doesn't.

Same question, different people, same strict requirements. It sounds
a little like homework. In that case, I suppose some of the regexp
solutions provided will work (for this small use case).

I still think Florian said it best, though. Unless you can "stack",
you won't be able to correctly reveal the components inside a nested
language structure. I haven't looked into the theory, but I can
attest to the pain in the arse I've had trying to scrape with regular
expressions.

Todd
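
A small illustration of the nesting problem described above, using made-up strings rather than a real page. Neither the lazy nor the greedy form of the pattern can tell which closing tag belongs to which opening tag:

```ruby
nested = '<div>outer <div>inner</div> tail</div>'

# The non-greedy pattern stops at the first closing tag, so the match
# ends inside the outer element.
lazy = nested[%r{<div>(.*?)</div>}m, 1]
puts lazy     # prints "outer <div>inner"

# The greedy pattern runs to the last closing tag instead, which breaks
# as soon as the page holds two sibling elements.
siblings = '<div>one</div><div>two</div>'
greedy = siblings[%r{<div>(.*)</div>}m, 1]
puts greedy   # prints "one</div><div>two"
```

For a single, un-nested tag such as the h3 in the original question, the regexp answers in this thread are not affected by this.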

Tom Arra · Mar 1, 2008

So far I think this is closest to what I am looking for. I need to go to
a website that has server information and pull that out of the HTML,
then take that info and spit it back out to the user. If I am
understanding the code above, it at least does the first part, which I
had no clue how to do.

Tom Arra · Mar 1, 2008

Well I just tried it and it worked like a charm. My next thing is to
limit what it brings back.

Example
<h3>blah blah blah 7.0.0.3.4 blah blah blah</h3>

I want to pull just the 7.0.0.3.4 and none of the words. I am sure this
is going to take more regular expressions, but I never
really understood how to use them well.

William James · Mar 1, 2008

E:\>irb --prompt xmp
s = " <h3>blah blah 7.0.0.3.4 blah</h3>"
==>" <h3>blah blah 7.0.0.3.4 blah</h3>"
# Find a substring composed of numerals and dots that is
# at least 3 characters long.
s[ /[\d.]{3,}/ ]
==>"7.0.0.3.4"
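
One caveat with the pattern above, shown as a sketch on a made-up string: because it accepts any run of digits and dots, it can also pick up an ellipsis. A slightly stricter alternative (my own variation, not from the thread) asks for digits followed by one or more ".digits" groups:

```ruby
s = 'wait... <h3>blah blah 7.0.0.3.4 blah</h3>'

# The looser pattern accepts any run of digits and dots, so the
# ellipsis wins because it comes first.
puts s[/[\d.]{3,}/]          # prints "..."

# Requiring digits, then one or more ".digits" groups, skips it.
puts s[/\d+(?:\.\d+)+/]      # prints "7.0.0.3.4"
```

If the page never contains stray runs of dots, the shorter pattern is fine as it is.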

Tom Arra · Mar 1, 2008

You're really good at this stuff! One thing I noticed is that it works
perfectly for the bare domain, but as soon as I put a full URL into
the Net::HTTP.new command it starts to throw errors. Any ideas?

Tom Arra · Mar 1, 2008

Here's what I have so far:

#! /usr/bin/ruby
require 'net/http'

text = Net::HTTP.new('www.tomarra.com').get('/').body[
%r{<title\s*>(.*?)</title\s*>}mi, 1 ]
print "TomArra.com Title Tag: "
print text
print "\n"
s = "<h3>blah blah 7.0.0.4.3 blah blah</h3>"[ /[\d.]{3,}/ ]
print s

puts Net::HTTP.new('www.tomarra.com/credits.html').get('/').body[
%r{<center\s*>(.*?)</center\s*>}mi, 1 ]

And here is my output:
TomArra.com Title Tag: Welcome To TomArra.com
7.0.0.4.3
SocketError: getaddrinfo: nodename nor servname provided, or not known

method initialize in http.rb at line 564
method open in http.rb at line 564
method connect in http.rb at line 564
method timeout in timeout.rb at line 48
method timeout in timeout.rb at line 76
method connect in http.rb at line 564
method do_start in http.rb at line 557
method start in http.rb at line 546
method request in http.rb at line 1044
method get in http.rb at line 781
at top level in simple.rb at line 11
Program exited.

William James · Mar 1, 2008

Use the rest of the URL as the argument for ".get()":

require 'net/http'
puts Net::HTTP.new('www.newlisp.org').get('/index.cgi?Documentation').
body[ %r{<title(?:\s*|\s+.*?)>(.*?)</title\s*>}mi, 1 ]
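
If the address arrives as one full URL string, the standard `uri` library can do the splitting. A minimal sketch, using the address from the script earlier in the thread and making no network request:

```ruby
require 'uri'

# Example address taken from the script earlier in the thread.
uri = URI.parse('http://www.tomarra.com/credits.html')

puts uri.host          # prints "www.tomarra.com"
puts uri.port          # prints 80
puts uri.request_uri   # prints "/credits.html"

# These are the pieces Net::HTTP wants:
#   Net::HTTP.new(uri.host, uri.port).get(uri.request_uri)
```

`request_uri` includes the query string when there is one, so a URL like the newlisp.org one above would yield "/index.cgi?Documentation".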

Tom Arra · Mar 1, 2008

Works like a charm, thanks for all your help!

Tom Arra · Mar 3, 2008

One more little problem. I noticed that this net/http method automatically
uses port 80. The problem is that I need to reach a different port.
There has to be a way around this, right?

Tom Arra · Mar 3, 2008

Never mind, just figured it out:

puts Net::HTTP.new('<<Server Here>>',<<port # here>>)
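
A filled-in version of the line above might look like the following. The host 'example.com' and port 8080 are placeholders, not values from the thread; building the object does not open a connection, so this runs without touching the network:

```ruby
require 'net/http'

# 'example.com' and 8080 are placeholders, not values from the thread.
http = Net::HTTP.new('example.com', 8080)
puts http.address   # prints "example.com"
puts http.port      # prints 8080

# No connection is opened until a request is made, for example:
#   http.get('/')
```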
