Scan HTML

Tom Arra

I am new to Ruby scripting, so I am not sure whether this is possible.
I want to make a script that will load a webpage and then search
through the HTML of that page until it hits a certain tag. Once it hits
that tag, it needs to grab all of the text between the tag and the
matching end tag. Is something like this possible?

Example
<html>
<body>
<h3>test</h3>
</body>
</html>

I want the script to return "test"
 
Gregory Seidman

You want the Hpricot gem.

require 'rubygems'
require 'hpricot'

html = <<EOF
<html>
<body>
<h3>test</h3>
</body>
</html>
EOF

doc = Hpricot(html)

puts (doc/'h3').first.inner_text

--Greg
 
William James

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[ %r{<title\s*>(.*?)</title\s*>}mi, 1 ]

If the tag can contain attributes, e.g.,
<title foo="bar">:

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[ %r{<title(?:\s*|\s+.*?)>(.*?)</title\s*>}mi, 1 ]
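
The same pattern can be tried without a network connection by running it against a local string. The sketch below is only an illustration: the helper name first_tag_text and the class attribute in the sample HTML are assumptions made for the example, not something from the posts above.

```ruby
# Hypothetical helper (not from the thread): returns the text inside the
# first <tag ...>...</tag> pair, or nil when the tag is not present.
def first_tag_text(html, tag)
  name = Regexp.escape(tag)
  html[%r{<#{name}(?:\s*|\s+.*?)>(.*?)</#{name}\s*>}mi, 1]
end

html = <<~HTML
  <html>
  <body>
  <h3 class="x">test</h3>
  </body>
  </html>
HTML

puts first_tag_text(html, 'h3')   # prints "test"
p first_tag_text(html, 'title')   # prints nil, there is no <title> here
```

Like the one-liners above, this only finds the first occurrence and assumes the tag is not nested inside itself.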
 
Marc Heiler

Gregory said:
You want the Hpricot gem.

Personally I agree with that, insofar as I think the simplest,
"default" Ruby solution is better than a specialized one. In this case I
think the better solution is Net::HTTP.
 
Todd Benson

No, he doesn't.

Same question, different people, same strict requirements. It sounds
a little like homework. In that case, I suppose some of the regexp
solutions provided will work (for this small use case).

I still think Florian said it best, though. Unless you can "stack",
you won't be able to correctly reveal the components inside a nested
language structure. I haven't looked into the theory, but I can
attest to the pain in the arse I've had trying to scrape with regular
expressions.

Todd
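
To make the nesting point concrete, here is a small sketch. The sample string and the helper name outer_content are made up for illustration. It shows a lazy regexp stopping at the first closing tag, and then a scan that keeps a depth counter (a very small stand-in for a stack) to find the matching one. It is not a real HTML parser and ignores comments, self-closing tags and malformed markup.

```ruby
html = '<div>outer <div>inner</div> tail</div>'

# The lazy regexp stops at the first </div>, so the outer element is cut short.
puts html[%r{<div>(.*?)</div>}m, 1]   # prints "outer <div>inner"

# Hypothetical helper: count opening and closing tags so that the content of
# the outermost element ends at its own closing tag.
def outer_content(html, tag)
  depth = 0
  start = nil
  html.scan(%r{</?#{Regexp.escape(tag)}\b[^>]*>}i) do
    m = Regexp.last_match
    if m[0].start_with?('</')
      depth -= 1
      return html[start...m.begin(0)] if depth.zero? && start
    else
      start = m.end(0) if depth.zero?
      depth += 1
    end
  end
  nil
end

puts outer_content(html, 'div')       # prints "outer <div>inner</div> tail"
```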
 
Tom Arra

William said:
require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[ %r{<title\s*>(.*?)</title\s*>}mi, 1 ]

If the tag can contain attributes, e.g.,
<title foo="bar">:

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[ %r{<title(?:\s*|\s+.*?)>(.*?)</title\s*>}mi, 1 ]

So far I think this is closest to what I am looking for. I need to go to
a website that has server information and pull that out of the HTML,
then take that info and spit it back out to the user. If I am
understanding the code above, it at least does the first part, which I
had no clue how to do.
 
Tom Arra

Well, I just tried it and it worked like a charm. My next step is to
limit what it brings back.

Example
<h3>blah blah blah 7.0.0.3.4 blah blah blah</h3>

I want to pull out just the 7.0.0.3.4 and none of the words. I am sure this
will involve more regular expressions, but I never
really understood how to use them well.
 
William James

E:\>irb --prompt xmp
s = " <h3>blah blah 7.0.0.3.4 blah</h3>"
==>" <h3>blah blah 7.0.0.3.4 blah</h3>"
# Find a substring composed of numerals and dots that is
# at least 3 characters long.
s[ /[\d.]{3,}/ ]
==>"7.0.0.3.4"
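
One caveat with that pattern is that it accepts any run of digits and dots, so a run of three dots earlier in the text would match first. The sketch below compares it with a slightly stricter alternative that requires digits separated by single dots. The variable names and the second sample string are assumptions made up for the comparison.

```ruby
loose  = /[\d.]{3,}/        # the pattern from the irb session above
strict = /\d+(?:\.\d+)+/    # digits, then one or more ".digits" groups

a = "<h3>blah blah 7.0.0.3.4 blah</h3>"
b = "<h3>Wait... build 7.0.0.3.4 is out</h3>"

puts a[loose]    # prints "7.0.0.3.4"
puts b[loose]    # prints "..." because the ellipsis comes first
puts b[strict]   # prints "7.0.0.3.4"
```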
 
Tom Arra

You're really good at this stuff! One thing I noticed is that it works
perfectly for the bare domain, but as soon as I put a full URL into
the Net::HTTP.new call it starts to throw errors. Any ideas?
 
Tom Arra

Here's what I have so far:

#! /usr/bin/ruby
require 'net/http'

text = Net::HTTP.new('www.tomarra.com').get('/').body[
%r{<title\s*>(.*?)</title\s*>}mi, 1 ]
print "TomArra.com Title Tag: "
print text
print "\n"
s = "<h3>blah blah 7.0.0.4.3 blah blah</h3>"[ /[\d.]{3,}/ ]
print s

puts Net::HTTP.new('www.tomarra.com/credits.html').get('/').body[
%r{<center\s*>(.*?)</center\s*>}mi, 1 ]

and here is my output
TomArra.com Title Tag: Welcome To TomArra.com
7.0.0.4.3
SocketError: getaddrinfo: nodename nor servname provided, or not known

method initialize in http.rb at line 564
method open in http.rb at line 564
method connect in http.rb at line 564
method timeout in timeout.rb at line 48
method timeout in timeout.rb at line 76
method connect in http.rb at line 564
method do_start in http.rb at line 557
method start in http.rb at line 546
method request in http.rb at line 1044
method get in http.rb at line 781
at top level in simple.rb at line 11
Program exited.
 
William James

Use the rest of the URL as the argument for ".get()":

require 'net/http'
puts Net::HTTP.new('www.newlisp.org').get('/index.cgi?Documentation').
body[ %r{<title(?:\s*|\s+.*?)>(.*?)</title\s*>}mi, 1 ]
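
If the script starts from a full URL, the standard uri library can do the splitting. This is only a sketch: split_url and fetch_body are hypothetical helper names, and fetch_body is defined but not called here, so nothing touches the network.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: break a full URL into the host and port that
# Net::HTTP.new expects and the path (plus query) that #get expects.
def split_url(url)
  uri = URI.parse(url)
  [uri.host, uri.port, uri.request_uri]
end

# Hypothetical helper: fetch the page body for a full URL.
def fetch_body(url)
  host, port, path = split_url(url)
  Net::HTTP.new(host, port).get(path).body
end

p split_url('http://www.tomarra.com/credits.html')
# prints ["www.tomarra.com", 80, "/credits.html"]
p split_url('http://www.newlisp.org/index.cgi?Documentation')
# prints ["www.newlisp.org", 80, "/index.cgi?Documentation"]
```

Passing only the host name to Net::HTTP.new avoids the getaddrinfo error, which comes from treating 'www.tomarra.com/credits.html' as a host name.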
 
Tom Arra

One more little problem. I noticed that this net/http method automatically
uses port 80. The problem is that I need to reach a different port.
There has to be a way around this, right?
 
Tom Arra

Never mind, I just figured it out:

puts Net::HTTP.new('<<Server Here>>',<<port # here>>)
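
A filled-in version of that call might look like the sketch below. The host example.com and port 8080 are placeholders, not the real server from this thread. No request is sent, since Net::HTTP.new only builds the object and connects later, when a request method such as get is called.

```ruby
require 'net/http'

# The second argument to Net::HTTP.new is the port; it defaults to 80.
default_http = Net::HTTP.new('example.com')
custom_http  = Net::HTTP.new('example.com', 8080)

puts default_http.port   # prints 80
puts custom_http.port    # prints 8080

# A request would then go to the custom port, for example:
#   custom_http.get('/').body
```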
 
