Read all available pages on a website


Brad Tilley

Is there a way to make urllib or urllib2 read all of the pages on a Web
site? For example, say I wanted to read each page of www.python.org into
separate strings (a string for each page). The problem is that I don't
know how many pages are at www.python.org. How can I handle this?

Thanks,

Brad
 

Tim Roberts

Brad Tilley said:
Is there a way to make urllib or urllib2 read all of the pages on a Web
site? For example, say I wanted to read each page of www.python.org into
separate strings (a string for each page). The problem is that I don't
know how many pages are at www.python.org. How can I handle this?

You have to parse the HTML to pull out all the links and images and fetch
them, one by one. sgmllib can help with the parsing. You can multithread
this, if performance is an issue.
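
Just to sketch that link-extraction step (untested, and written against
today's standard library, where html.parser and urllib.request have taken
over from sgmllib and urllib2; the idea is the same either way):

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        # Remember the href of every <a> tag seen while parsing.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def page_links(url):
        # Fetch one page and return its outgoing links as absolute URLs.
        html = urlopen(url).read().decode("utf-8", errors="replace")
        parser = LinkCollector()
        parser.feed(html)
        return [urljoin(url, link) for link in parser.links]

    print(page_links("http://www.python.org/"))

The names LinkCollector and page_links are just made up for the example; a
real crawler would also want to dedupe these URLs and put them on a queue.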

By the way, there are many web sites for which this sort of behavior is not
welcome.
 

Leif K-Brooks

Tim said:
By the way, there are many web sites for which this sort of behavior is not
welcome.

Any site that didn't want to be crawled would most likely use a
robots.txt file, so you could check that before doing the crawl.
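
For what it's worth, the standard library already speaks that protocol (the
robotparser module in 2.x, urllib.robotparser in 3.x). A minimal, untested
sketch of checking a URL before fetching it:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://www.python.org/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    url = "http://www.python.org/doc/"
    # "*" means "rules that apply to any user agent"
    if rp.can_fetch("*", url):
        print("allowed to fetch", url)
    else:
        print("robots.txt disallows", url)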
 

Alex Martelli

Leif K-Brooks said:
Any site that didn't want to be crawled would most likely use a
robots.txt file, so you could check that before doing the crawl.

Python's Tools/webchecker/ directory has just the code you need for all
of this. The directory is part of the Python source distribution, but
it's all pure Python code, so, if your distribution is binary and omits
that directory, just download the Python source distribution, unpack it,
and there you are.
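
If memory serves, you just point webchecker at a root URL from the command
line, something like this (the exact options may differ between versions):

    python Tools/webchecker/webchecker.py http://www.python.org/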


Alex
 

Carlos Ribeiro

Brad,

Just to clarify something other posters have said. Automatic crawling
of websites is unwelcome primarily because of performance concerns.
It may also be regarded by some webmasters as a kind of abuse, because
the crawler generates hits and copies material for unknown reasons,
while never seeing any ads or generating revenue. Some sites even go to
the extent of blocking access from your IP, or your entire IP range,
when they detect this type of behavior. Because of this, there is a
very simple protocol involving a file called "robots.txt". Whenever
your robot first enters a site, it must check this file and follow the
instructions there; it will tell you what you may do on that website.

There are also a few other catches that you need to be aware of. First,
some sites don't have links pointing to all their pages, so it's never
possible to be completely sure you have read *all* pages. Also, some
sites have links embedded in scripts. It's not a recommended practice,
but it's common on some sites, and it may cause you problems. And
finally, there are situations where your robot may get stuck in an
"infinite site": some sites generate pages dynamically, and your robot
may end up fetching page after page and never get out of the site. So,
if you want a generic solution to crawl any site you desire, you have
to watch out for these issues.
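
To make that last point concrete, here is an untested sketch of how you
might cap a crawl so it cannot wander forever, keep it on one host, and
store each page as a separate string (the class and function names are
made up for the example, and it uses the modern urllib.request and
html.parser modules):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        # Gather href values from <a> tags.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def crawl(start_url, max_pages=200):
        # Breadth-first crawl, confined to one host and capped at max_pages
        # so that dynamically generated "infinite sites" cannot trap us.
        host = urlparse(start_url).netloc
        pages = {}                       # url -> page source, one string per page
        queue = deque([start_url])
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if url in pages or urlparse(url).netloc != host:
                continue
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue                 # unreachable or malformed URL; skip it
            pages[url] = html
            parser = LinkCollector()
            parser.feed(html)
            queue.extend(urljoin(url, link) for link in parser.links)
        return pages

    pages = crawl("http://www.python.org/", max_pages=50)
    print(len(pages), "pages fetched")

A real crawler would also honor robots.txt and strip #fragment parts before
comparing URLs, but the page cap is the important safety valve here.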


Best regards,


--
Carlos Ribeiro
Consultoria em Projetos
blog: http://rascunhosrotos.blogspot.com
blog: http://pythonnotes.blogspot.com
mail: (e-mail address removed)
mail: (e-mail address removed)
 

Michael Foord

Brad Tilley said:
Is there a way to make urllib or urllib2 read all of the pages on a Web
site? For example, say I wanted to read each page of www.python.org into
separate strings (a string for each page). The problem is that I don't
know how many pages are at www.python.org. How can I handle this?

Thanks,

Brad

I can highly recommend the BeautifulSoup parser for helping you to
extract all the links - it should make it a doddle. (You want to check
that you only follow links that are on www.python.org, of course - the
standard library urlparse should help with that.)
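
A quick, untested sketch of that (BeautifulSoup is distributed as the bs4
package these days; same_site_links is just a name made up for the example):

    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    from bs4 import BeautifulSoup   # pip install beautifulsoup4

    def same_site_links(url, allowed_host="www.python.org"):
        # Return the page's links, keeping only those on allowed_host.
        html = urlopen(url).read()
        soup = BeautifulSoup(html, "html.parser")
        links = set()
        for anchor in soup.find_all("a", href=True):
            absolute = urljoin(url, anchor["href"])
            if urlparse(absolute).netloc == allowed_host:
                links.add(absolute)
        return sorted(links)

    print(same_site_links("http://www.python.org/"))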

Regards,


Fuzzy

http://www.voidspace.org.uk/atlantibots/pythonutils.html
 

Brad Tilley

Alex said:
Python's Tools/webchecker/ directory has just the code you need for all
of this. The directory is part of the Python source distribution, but
it's all pure Python code, so, if your distribution is binary and omits
that directory, just download the Python source distribution, unpack it,
and there you are.


Alex

Thank you, this is ideal.
 
