How to view the Net::HTTP get response for a web spider tutorial

anne001

I am trying to run the Ruby web spider tutorial from Linux Magazine:
http://www.linux-magazine.com/issue/51/Ruby_Web_Spiders.pdf

Apparently the information has moved on livejournal.com. The tutorial
says:
"you can get a wealth of information from the http response object
returned by the net http get function"

If I use puts data, I get:
The document has moved <A HREF="http://anne.livejournal.com/profile">here</A>.<P

If I use puts resp, I get:
`puts': stack level too deep (SystemStackError)

Is the tutorial referring to the data part of the response only?

Here is a cgi script on livejournal
#!/usr/bin/perl
use LWP::Simple;
print "Content-type: text/html\n\n";
print get('http://www.livejournal.com/customview.cgi' .
'?username=username&styleid=101');
http://www.livejournal.com/developer/embedding.bml?method=cgi

Here is the program I am working with
-------------------------------------------------------->
require 'net/http'
h=Net::HTTP.new('www.livejournal.com', 80)
friend_arr = []
person= ARGV[0]

resp, data =
h.get("http://www.livejournal.com/userinfo.bml?user=#{person}",nil)
print "Friend list for #{person}\n"

#puts resp
puts data

data.split("\n").each do |line|
  line.split(",").each do |token|
    if token =~ /userinfo.bml\?user=([^'&]*)\'/
      friend_arr.push $1
      print "#$1\n"
    end
  end
end

print "\n"

friend_arr.each do |friend|
print "Parsing #{friend}'s. journal for #{person}'s. comments...\n";
f=File.new("#{person}_#{friend}.txt","w")
f.puts 'ruby parse_journal.rb #{friend}#{person}'
f.close

end
 
anne001

PS I understand what data was telling me, and now I have the data:

resp, data = h.get("http://#{person}.livejournal.com/profile",nil)

What is resp for, and why does puts resp give me "stack level too deep"?
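
A note on what resp holds: it is a Net::HTTPResponse object carrying the
status code, the status message and the headers, while data is only the
page text. The sketch below builds a redirect response by hand, so it
needs no network access; the 301 status and the helper name
describe_response are assumptions made for illustration, standing in for
whatever LiveJournal actually sent. On current Ruby versions get returns
only this response object and the text is read with resp.body. The
"stack level too deep" error is probably caused by old versions of the
response class defining to_ary so that it returned the response itself
plus the body, which sends puts into endless recursion; that is my
understanding, not something the tutorial states.

```ruby
require 'net/http'

# Summarize the parts of a Net::HTTPResponse that a spider usually needs.
# describe_response is a hypothetical helper, not part of net/http.
def describe_response(resp)
  summary = "#{resp.code} #{resp.message}"
  # Redirect responses carry the new address in the Location header.
  if resp.is_a?(Net::HTTPRedirection)
    summary += " -> #{resp['location']}"
  end
  summary
end

# Hand-built stand-in for the "document has moved" reply (assumed to be a 301).
resp = Net::HTTPMovedPermanently.new('1.1', '301', 'Moved Permanently')
resp['location'] = 'http://anne.livejournal.com/profile'

puts describe_response(resp)
```

With a real request the same fields are available on the object that
h.get returns, so printing resp.code and resp['location'] is a safer way
to look at the response than puts resp.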
 
anne001

I could also use a little help with the following questions:

1. Where can I find out what the particular "regular expressions" used
here do?

2. f.puts 'ruby parse_journal.rb #{friend} #{person}'
Single quotes do not work: the text above is printed verbatim to the
file.
f.puts "ruby parse_journal.rb #{friend} #{person}"
Double quotes do not seem to call parse_journal.rb either, although
friend and person are replaced before the text is written to the file.

How can I call parse_journal from parse_user? What am I doing wrong?

3. How do I call a program with two arguments?
When I try to call the second program manually, it does not know
watch_for:
ruby parse_journal.rb "hozro" "leif"

person= ARGV[0]
watch_for= ARGV[1]
puts person
puts warch_for
undefined local variable or method `warch_for' for main:Object
(NameError)
 
rmagick

anne001 said:
I could also use a little help with the following questions:


2. f.puts 'ruby parse_journal.rb #{friend} #{person}'
single quotes do not work, it prints the above text verbatim in the file
f.puts "ruby parse_journal.rb #{friend} #{person}"
double quotes do not seem to call parse_journal.rb either,
although it does replace friend and person before writing the text to
the file.

Try backquotes: `ruby parse_journal.rb....`

how can I call parse_journal from parse_user? what am I doing wrong?

3. How to call a program with two arguments?
when I try to call the second program manually it does not know
watch_for

ruby parse_journal.rb "hozro" "leif"

person= ARGV[0]
watch_for= ARGV[1]
puts person
puts warch_for
undefined local variable or method `warch_for' for main:Object
(NameError)

Looks like you misspelled 'watch_for' as 'warch_for'.
 
Robert Klemme

anne001 said:
I could also use a little help with the following questions:

1. Where can I find what the particular "regular expressions" used here
do

I don't see a regexp here. Do you want general information? Does this
help?

http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html#UJ

2. f.puts 'ruby parse_journal.rb #{friend} #{person}'
single quotes do not work, it prints the above text verbatim in the file
f.puts "ruby parse_journal.rb #{friend} #{person}"
double quotes do not seem to call parse_journal.rb either,
although it does replace friend and person before writing the text to
the file.

how can I call parse_journal from parse_user? what am I doing wrong?

You need backticks or %x{}:

f.puts `ruby parse_journal.rb #{friend} #{person}`
f.puts %x[ruby parse_journal.rb #{friend} #{person}]

If friend and person can contain whitespace you probably need to add
single quotes around #{friend} and #{person}.
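
The difference between the three quoting forms can be sketched as
follows. The values and the printf command are made up for illustration,
and a POSIX shell is assumed. Shellwords.escape comes from Ruby's
standard library and is one alternative to adding single quotes by hand
when an argument may contain whitespace.

```ruby
require 'shellwords'

friend = 'meep'
person = 'anne smith'   # made-up value containing a space

# Single quotes: no interpolation, the text stays literal.
literal = 'ruby parse_journal.rb #{friend} #{person}'

# Double quotes: interpolation only; the result is still just a string.
interpolated = "ruby parse_journal.rb #{friend} #{person}"

# Backticks (or %x[]): interpolate, run the command in a shell and
# return whatever the command wrote to standard output.
# Shellwords.escape keeps "anne smith" together as one argument.
output = `printf '%s|%s' #{Shellwords.escape(friend)} #{Shellwords.escape(person)}`

puts literal
puts interpolated
puts output
```

Note that the backtick form returns the command's output as a string
instead of printing it, so nothing appears on screen unless the result
is passed to puts.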
3. How to call a program with two arguments?
when I try to call the second program manually it does not know
watch_for

ruby parse_journal.rb "hozro" "leif"

person= ARGV[0]
watch_for= ARGV[1]
puts person
puts warch_for
undefined local variable or method `warch_for' for main:Object
(NameError)

This is just a typo. :)

Kind regards

robert
 
anne001

Thank you for your response. This Linux Magazine tutorial seems to be
old; a lot of it no longer works.

0. I think Net::HTTP get only returns one object, and I think that is
why I get an error when I try to look at resp:
h=Net::HTTP.new('www.livejournal.com', 80)
resp, data = h.get("http://#{person}.livejournal.com/profile",nil)

1. I do have the book and also sam's ruby, but I did not find this:
([^'&]*)
in this expression: /userinfo.bml\?user=([^'&]*)\'/
The parentheses mean treat it as a group and ^ means from the start,
but I can't decipher the rest.
He also uses ([^<]*) and ([^"]*).

somehow the expression is linked to an argument $1
if token =~ /http:\/\/([^'&]*).livejournal.com\/profile/
friend_arr.push $1

in another example the expression is linked to two arguments $1 and $2
if line =~
/<ahref='(http"\/\/www.livejournal.com\/users\/[^\/]*\/[^']*thread[^']*)'>.*<\/td><\/tr><tr><td>(.*)<\
p style='margin:/ and logging
url = $1
comment = $2

3. Oops, just a typo. I guess spaces between arguments are OK then.
 
Robert Klemme

anne001 said:
Thank you for your response. This linux tutorial seems to be old, lots
of it does not work

0. I think net http get only returns one object, I think that is why I
am getting an error when I try to look at resp
h=Net::HTTP.new('www.livejournal.com', 80)
resp, data = h.get("http://#{person}.livejournal.com/profile",nil)

1. I do have the book and also sam's ruby, but I did not find this:
([^'&]*)
in this expression: /userinfo.bml\?user=([^'&]*)\'/
the parenthesis means treat it as a group, ^ means from start ... but I
can't decipher the rest.
he also uses ([^<]*), ([^"]*),

Caret doesn't mean "from start" here. You probably overlooked the
brackets. [^'&] is a character class and here "^" means "not", i.e. all
characters that are not ' and not &. Basically this regexp extracts the
value of HTTP parameter "user" from a URL or part of a URL. The
backslash before the single quote is meaningless IMHO.
somehow the expression is linked to an argument $1
if token =~ /http:\/\/([^'&]*).livejournal.com\/profile/
friend_arr.push $1

Yes, $1 receives the value of the group.
in another example the expression is linked to two arguments $1 and $2
if line =~
/<ahref='(http"\/\/www.livejournal.com\/users\/[^\/]*\/[^']*thread[^']*)'>.*<\/td><\/tr><tr><td>(.*)<\
p style='margin:/ and logging
url = $1
comment = $2

There are two groups, $1 receives the value of group 1 and $2 .... (left
as exercise :)).
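
A small sketch of the character class and the numbered groups. The two
HTML fragments are made up in the style of the pages discussed in this
thread, and the second pattern is a simplified illustration, not the
tutorial's regexp.

```ruby
# Made-up fragment in the style of a profile page link.
token = "<a href='http://www.livejournal.com/userinfo.bml?user=meep'>meep</a>"

# [^'&]* matches a run of characters that are neither ' nor &.
# The parentheses capture that run; after a successful =~ it is in $1.
if token =~ /userinfo\.bml\?user=([^'&]*)'/
  user = $1
end

# Two groups fill $1 and $2 in left-to-right order.
link = "<a href='http://meep.livejournal.com/1328508.html'>Ha ha ha!</a>"
if link =~ /href='([^']*)'>([^<]*)</
  url = $1
  text = $2
end

puts user
puts url
puts text
```

The same reading applies to ([^<]*) and ([^"]*): "everything up to the
next <" and "everything up to the next double quote".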
3. Oops, just a typo, I guess spaces between arguments are OK then.

Yupp. In fact in the shell spaces *separate* arguments. :)

Kind regards

robert
 
anne001

Thanks I was able to figure out the () once the [] part became clear
to me.

1. I need some help running parse_journal.rb from parse_user.rb:
---------------------------------------------------------------------------------------------
When I run the parse_journal.rb program:
ruby parse_journal.rb "meep" "mizalaina"
I get:

meep
mizalaina
http://meep.livejournal.com/1328508.html?thread=2564732#t2564732
Ha ha ha! Awesome quote.

the first two output lines echo the arguments, the program starts with
---------------------->
require 'net/http'

person= ARGV[0]
watch_for= ARGV[1]
puts person
puts watch_for
....

but when I try to execute parse_office.rb from parse_user.rb I don't
get anything
friend_arr.each do |friend|
  if friend == "meep"
    print "Parsing #{friend}'s. journal for #{person}'s. comments...\n";
    f=File.new("#{person}_#{friend}.txt","w")
    f.puts %x[ruby parse_journal.rb #{friend} #{person}]
    f.close
  end
end

i.e. I only get
Parsing meep's. journal for mizalaina's. comments...

The puts calls in parse_journal.rb are at the start of the file and
should execute right away.
What is the problem? Why do the puts calls not execute, and how do I
debug this?

2. Where can I find information on making a Ruby bot meet LiveJournal's
bot requirements:
a) web crawlers should make at most 5 gets per second,
b) cache the results of the bot's requests,
c) send a well-formed user agent that includes a contact email address
for the bot maintainer.

thank you for your help!
 
Robert Klemme

anne001 said:
Thanks I was able to figure out the () once the [] part became clear
to me.

1. I need some help running parse_journal.rb from parse_user.rb:
---------------------------------------------------------------------------------------------
When I run the parse_journal.rb program:
ruby parse_journal.rb "meep" "mizalaina"
I get:

meep
mizalaina
http://meep.livejournal.com/1328508.html?thread=2564732#t2564732
Ha ha ha! Awesome quote.

the first two output lines echo the arguments, the program starts with
---------------------->
require 'net/http'

person= ARGV[0]
watch_for= ARGV[1]
puts person
puts watch_for
...

but when I try to execute parse_office.rb from parse_user.rb I don't

Did you mean "parse_journal.rb" instead of "parse_office.rb"?
get anything
friend_arr.each do |friend|
  if friend == "meep"
    print "Parsing #{friend}'s. journal for #{person}'s. comments...\n";
    f=File.new("#{person}_#{friend}.txt","w")
    f.puts %x[ruby parse_journal.rb #{friend} #{person}]
    f.close
  end
end

Btw, better use the block form for files:

File.open("#{person}_#{friend}.txt","w") do |f|
  f.puts %x[ruby parse_journal.rb #{friend} #{person}]
end

ie I only get
Parsing meep's. journal for mizalaina's. comments...

the puts in the parse_journal.rb are at the start of the file and
should execute right away.
What is the problem, why do the puts not execute, how do I debug this?

2. Where can I find information for ruby to help me meet the journal's
bot requirements:
a) webcrawlers should only make 5 get per second,

The easiest would be to issue 5 gets and then sleep for a second. That
way you're definitely on the safe side although you sacrifice a bit of
performance.
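
The "5 gets, then sleep" idea could look roughly like this. Throttle is
a hypothetical helper name, and the sleeper parameter exists only so the
demo can record pauses instead of really waiting; in the spider the
block would contain the h.get call.

```ruby
# Hypothetical helper: allow `limit` calls, then pause `pause` seconds
# before the next batch starts.
class Throttle
  def initialize(limit: 5, pause: 1, sleeper: ->(seconds) { sleep(seconds) })
    @limit = limit
    @pause = pause
    @sleeper = sleeper   # injectable so the pause can be stubbed out
    @count = 0
  end

  # Runs the block, pausing first if the current batch is already full.
  def call
    if @count == @limit
      @sleeper.call(@pause)
      @count = 0
    end
    @count += 1
    yield
  end
end

# Demo with a stub instead of a real sleep, and arithmetic instead of h.get.
pauses = []
throttle = Throttle.new(sleeper: ->(seconds) { pauses << seconds })
results = (1..12).map { |i| throttle.call { i * 2 } }

puts "requests: #{results.size}, pauses: #{pauses.size}"
```

This is deliberately conservative: it never sends more than 5 requests
between pauses, at the cost of waiting even when the requests themselves
were slow.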
b) cache the results of the bot's requests

Depends on what your bot's requests are. A simple means is to just
marshal them into a file or use PStore.
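
A minimal PStore sketch, assuming the cache is keyed by URL and stores
the page text. cached_fetch is a hypothetical helper, and the block
passed to it stands in for the real HTTP get so that the example runs
without network access.

```ruby
require 'pstore'
require 'tmpdir'

# Hypothetical helper: return the cached page for url, or run the block
# (which would normally do the HTTP get) and remember its result.
def cached_fetch(store, url)
  store.transaction do
    store[url] ||= yield
  end
end

fetches = 0
pages = []

Dir.mktmpdir do |dir|
  store = PStore.new(File.join(dir, 'pages.pstore'))

  2.times do
    page = cached_fetch(store, 'http://meep.livejournal.com/profile') do
      fetches += 1                    # counts how often the "request" ran
      '<html>profile page</html>'     # made-up page text
    end
    pages << page
  end
end

puts pages
puts "real fetches: #{fetches}"
```

The second lookup is answered from the file, so the block runs only
once. A real spider would keep the store in a permanent location instead
of a temporary directory.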
c) well formed user agent which includes contact email address for
the bot maintainer.

No idea. Maybe there's some constant that has to be changed. Did you
look into net/http.rb?
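
For what it is worth, net/http accepts request headers as a hash, so no
constant needs to be changed. The bot name and contact address below are
made up, and the exact format LiveJournal expects is an assumption; only
the header mechanism itself is shown.

```ruby
require 'net/http'

# Made-up bot name and contact address; replace with real ones.
user_agent = 'FriendSpider/0.1 (+mailto:maintainer@example.com)'

# Headers are given as a hash when the request object is built.
request = Net::HTTP::Get.new('/profile', 'User-Agent' => user_agent)

puts request['User-Agent']

# Sending it would look like this (not run here, it needs network access):
#   Net::HTTP.start('meep.livejournal.com', 80) { |http| http.request(request) }
# The short form h.get(path, { 'User-Agent' => user_agent }) takes the same hash.
```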

Kind regards

robert
 
anne001

Thank you, I will use the block form then.

I figured out the output problem. The output of the puts calls in
parse_journal.rb is captured by %x and passed to f.puts, so if I want to
see the output, I have to look in the file!!!
Took me a while!

Thank you for your suggestions: save all http gets to corresponding
files, and before doing a get, check whether the file is already on the
local server. And waiting after every 5 gets should be easy enough.
Since LiveJournal no longer provides comment info in its RSS feed, you
have to open every blog entry. It is already so slow as it is that I did
not put these changes in. Here are the two programs in case someone is
interested in this Linux Magazine tutorial.

parse_user.rb
-------------------->
require 'net/http'
h=Net::HTTP.new('www.livejournal.com', 80)
friend_arr = []
person= ARGV[0]

# On Ruby 1.9 and later, get returns only the response object;
# the page text is resp.body.
resp = h.get("http://#{person}.livejournal.com/profile",nil)
data = resp.body

print "Friend list for #{person}\n"

data.split("\n").each do |line|
  line.split(",").each do |token|
    if token =~ /http:\/\/([^'&]*).livejournal.com\/profile/
      friend_arr.push $1
      print "#$1\n"
    end
  end
end

print "\n"

friend_arr.each do |friend|
  if friend != person
    print "Parsing #{friend}'s. journal for #{person}'s. comments...\n";
    File.open("#{person}_#{friend}.txt","w") { |f| f.puts %x[ruby parse_journal.rb #{friend} #{person}] }
  end
end

parse_journal.rb
------------------------>
require 'net/http'

person= ARGV[0]
watch_for= ARGV[1]

h=Net::HTTP.new('www.livejournal.com',80)
ar =[];

# On Ruby 1.9 and later, get returns only the response object;
# the page text is its body.
data = h.get("http://#{person}.livejournal.com/data/rss", nil).body

# collect the numeric ids of the entries listed in the feed
data.split("\n").each do |line|
  if line =~ /#{person}.livejournal.com\/([0-9]*).html/
    ar.push $1.to_i
  end
end

# to keep the ids that occur exactly once
# url_array = ar.uniq.find_all {|x| ar.find_all {|y| y == x }.size == 1 }
# to keep one copy of each id that occurs more than once
url_array = ar.uniq.find_all {|x| ar.find_all {|y| y == x }.size > 1 }
#url_array=[ 1328508, 13268887]

url_array.each { |urlid|
  data = h.get("http://#{person}.livejournal.com/#{urlid}.html", nil).body

  # lkforp == 1: looking for a profile link of watch_for
  # lkforp == 2: profile found, looking for the comment that follows it
  lkforp=1
  data.split("\n").each do |line|
    line.split(",").each do |token|
      if (lkforp ==1)
        if token =~ /http:\/\/([^'&]*).livejournal.com\/profile/
          if( $1 == watch_for)
            lkforp=2
          end
        end
      else
        if token =~ /(http:\/\/#{person}.livejournal.com\/#{urlid}.html[^']thread=[^']*)\'(.*)<\/td><\/tr>\<tr><td>(.*)<p style=\'margin/
          url = $1
          comment = $3
          comment.gsub!('<br />',"\n")
          comment.gsub!(/<\/*[^>]*>/,'')
          print "#{url}\n#{comment}\n"
          lkforp=1
        end
      end
    end
  end
}
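
A side note on the url_array line in parse_journal.rb: on Ruby 2.7 or
later the same "ids that occur more than once" selection can be written
with Array#tally. The ids below are made up; this is only a sketch of an
equivalent form, not a required change.

```ruby
# Made-up post ids; in parse_journal.rb they come from the RSS feed.
ar = [1328508, 1326887, 1328508, 1330001, 1326887, 1328508]

# Idiom from the program above (rescans the array for every id).
repeated_original = ar.uniq.find_all { |x| ar.find_all { |y| y == x }.size > 1 }

# Same selection with a single counting pass; Array#tally needs Ruby 2.7+.
repeated_tally = ar.tally.select { |_id, count| count > 1 }.keys

p repeated_original
p repeated_tally
```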
 
