How can I follow links in my website

Danny

I would like to browse a page in one of my websites and get info to populate
a database. But each page will have a NEXT and PREVIOUS link that takes you
to another page.

I need something to look at one page and save it to a file on the HD, then
follow the NEXT link and go to the next page, and do the same thing, and so
on.

Can this be done?
 
Eric Bohlman

Danny said:
I would like to browse a page in one of my websites and get info to
populate a database. But each page will have a NEXT and PREVIOUS link
that takes you to another page.

I need something to look at one page and save it to a file on the HD,
then follow the NEXT link and go to the next page, and do the same
thing, and so on.

Can this be done?

Yep: LWP::Simple and HTML::LinkExtor together ought to do the trick.
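
For illustration, here is a minimal sketch of how those two modules might be combined: LWP::Simple fetches the page and HTML::LinkExtor pulls out the links. The helper name extract_links is hypothetical, and the URL is taken from the command line rather than from anything in this thread.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::LinkExtor;

# Hypothetical helper: return the absolute href of every <a> tag in $html.
# $base is the URL the page came from; it is used to resolve relative links.
sub extract_links {
    my ($html, $base) = @_;
    my @found;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            # With a base URL the values are URI objects, so stringify them.
            push @found, "$attr{href}" if $tag eq 'a' && defined $attr{href};
        },
        $base,
    );
    $parser->parse($html);
    $parser->eof;
    return @found;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $url  = shift @ARGV;
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    print "$_\n" for extract_links($html, $url);
}
```

Passing the page URL as the second argument to HTML::LinkExtor->new makes relative links such as /page2.html come back as full URLs, which is what the next fetch needs.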
 
John Bokma

Danny said:
I would like to browse a page in one of my websites and get info to populate
a database. But each page will have a NEXT and PREVIOUS link that takes you
to another page.

I need something to look at one page and save it to a file on the HD, then
follow the NEXT link and go to the next page, and do the same thing, and so
on.

Can this be done?

Yes.

Check lwpcook (the LWP cookbook) and HTML::Parser, for example. It's possible to
skip the parser and use just a regexp, if you know what you are doing :-D.
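
As a rough sketch of the regexp-only route, assuming the wanted links contain the text "nextpage" (as described later in this thread) and that href values are always quoted. The function name next_link_by_regexp is hypothetical, and a pattern like this is easy to break with unusual markup, which is why a real parser is usually the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first quoted href containing "nextpage",
# exactly as written in the page (not made absolute), or undef if none.
sub next_link_by_regexp {
    my ($html) = @_;
    return $1
        if $html =~ /<a\b[^>]*\bhref\s*=\s*["']([^"']*nextpage[^"']*)["']/i;
    return;
}
```

The result may be a relative URL, so it would still need to be resolved against the page address (for example with the URI module) before fetching it.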
 
Danny

John Bokma said:
Yes.

Check lwpcook (the LWP cookbook) and HTML::Parser, for example. It's possible to
skip the parser and use just a regexp, if you know what you are doing :-D.


Thanks for your responses.
I have a sample that works, in that it gets a webpage, prints the contents
of the page to a text file, and then prints all the links on the page.
Now I just want to follow the links on that page that have "nextpage" in
them (this goes to the next category page), and so on. I also want to save
each page to a text file like page1.txt, page2.txt, etc.

This script works, but I am not sure where to put the loops. I am still
learning.

How can I do this?
I would appreciate your help.
Thanks again
Danny

-------
use strict;
use warnings;
use CGI;
use LWP::UserAgent;
use HTML::LinkExtor;

my $url = 'http://www.website.com';
my $co  = CGI->new;
print $co->header;

# Fetch the page once and reuse the response for saving and link extraction.
my $user_agent = LWP::UserAgent->new;
my $response   = $user_agent->get($url);
die "Could not fetch $url: ", $response->status_line, "\n"
    unless $response->is_success;
my $html = $response->content;

# Save the page via the public accessor rather than the private {_content} slot.
open my $fh, '>', 'file.txt' or die "Cannot open file.txt: $!";
print {$fh} $html;
close $fh;

my $link_extor = HTML::LinkExtor->new(\&handle_links);
$link_extor->parse($html);
$link_extor->eof;

# Called by HTML::LinkExtor once for each tag that carries link attributes.
sub handle_links
{
    my ($tag, %links) = @_;
    return unless $tag eq 'a' && defined $links{href};
    # I assume I put a test here for the NEXT link and then this gets
    # loaded as above in the request statement?
    print "This is a link: $links{href}.\n";
}
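
One possible way to arrange the loop asked about above, as a sketch only: fetch a page, save it as pageN.txt, find the first link containing "nextpage", and repeat until there is no such link. The names find_next_link and crawl, the 100-page cap, and the choice to write the files as UTF-8 are assumptions made for this example, not something from the thread.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;

# Hypothetical helper: first <a href> containing "nextpage", made absolute
# against $base, or undef if the page has no such link.
sub find_next_link {
    my ($html, $base) = @_;
    my $next;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            return unless $tag eq 'a' && defined $attr{href};
            $next = "$attr{href}" if !defined $next && $attr{href} =~ /nextpage/;
        },
        $base,
    );
    $parser->parse($html);
    $parser->eof;
    return $next;
}

# Hypothetical driver: save page1.txt, page2.txt, ... and return the page count.
# Stops at a dead end, a URL already visited, or after $max_pages pages.
sub crawl {
    my ($start, $max_pages) = @_;
    my $ua = LWP::UserAgent->new;
    my ($url, $count, %seen) = ($start, 0);
    while (defined $url && !$seen{$url}++ && $count < $max_pages) {
        my $response = $ua->get($url);
        die "GET $url failed: ", $response->status_line, "\n"
            unless $response->is_success;
        my $html = $response->decoded_content;
        die "Could not decode $url\n" unless defined $html;
        $count++;
        open my $fh, '>:encoding(UTF-8)', "page$count.txt"
            or die "Cannot write page$count.txt: $!";
        print {$fh} $html;
        close $fh;
        $url = find_next_link($html, $response->base);
    }
    return $count;
}

# Only crawl when a start URL is given on the command line.
crawl($ARGV[0], 100) if @ARGV;
```

The loop lives outside the link callback: the callback only records the next URL, and the while loop decides whether to fetch it. The %seen hash and the page cap are there so that a NEXT link pointing back to an earlier page cannot make the script run forever.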
 
