Hal Vaughan
I'm exploring LWP and trying to write a program that will pull down some web
pages. When I read one page, I use regular expressions to find the links
for other pages I want to download. Sometimes the links are relative
(like /cgi/link.pl or subdir/newfile.html) instead of including a domain
name. I don't see anything in the docs about keeping context, such as the
current domain, from one connection to the next.
Is there any module out there for keeping track of domains and handling
relative URLs?
I thought about writing a program to look for them, but it seems rather hard
to distinguish if a string is a domain name (I'd look for periods, but
can't be sure it'll include a .com, .gov, or anything else unless I check
all TLDs), and some URLs might not have a slash (if it's a domain name
only, or just a file in the same directory), so I can't think of a way to
be sure a string includes a domain and full path or is a relative URL
(other than trying to load it and checking the error message).
I would think there's a module or something to help handle this either by
tracking links used OR by easily determining if a link is absolute or
relative.
Thanks!
Hal
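
For reference, one way to handle this is a minimal sketch using the URI module, which is installed alongside LWP. It resolves each link against the URL of the page it came from, so no domain-name guessing is needed. The base URL, the sample links, and the helper names absolutize and is_absolute are made up for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical base: the URL of the page the links were scraped from.
my $base = 'http://www.example.com/docs/index.html';

# Resolve a link against the base URL. A link that is already
# absolute comes back unchanged (apart from canonicalization).
sub absolutize {
    my ($link, $base_url) = @_;
    return URI->new_abs($link, $base_url)->canonical->as_string;
}

# A link is absolute when it carries its own scheme (http:, ftp:, ...).
sub is_absolute {
    my ($link) = @_;
    return defined(URI->new($link)->scheme) ? 1 : 0;
}

for my $link ('/cgi/link.pl', 'subdir/newfile.html',
              'http://other.example.org/a.html') {
    printf "%-35s %s\n", $link, absolutize($link, $base);
}
```

A string with no scheme, such as a bare www.example.com, is treated as a relative path here, which matches how a browser would read the same href. When the page was fetched with LWP, the response object's base method should supply the base URL to resolve against.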