Read all available pages on a website


Brad Tilley

Is there a way to make urllib or urllib2 read all of the pages on a Web
site? For example, say I wanted to read each page of www.python.org into
separate strings (a string for each page). The problem is that I don't
know how many pages are at www.python.org. How can I handle this?

Thanks,

Brad
 

Tim Roberts

Brad Tilley said:
Is there a way to make urllib or urllib2 read all of the pages on a Web
site? For example, say I wanted to read each page of www.python.org into
separate strings (a string for each page). The problem is that I don't
know how many pages are at www.python.org. How can I handle this?

You have to parse the HTML to pull out all the links and images and fetch
them, one by one. sgmllib can help with the parsing. You can multithread
this, if performance is an issue.
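
Just to sketch that link-extraction step (untested, and written against
today's standard library, where html.parser and urllib.request have taken
over from sgmllib and urllib2; the idea is the same either way):

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        # Remember the href of every <a> tag seen while parsing.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def page_links(url):
        # Fetch one page and return its outgoing links as absolute URLs.
        html = urlopen(url).read().decode("utf-8", errors="replace")
        parser = LinkCollector()
        parser.feed(html)
        return [urljoin(url, link) for link in parser.links]

    print(page_links("http://www.python.org/"))

The names LinkCollector and page_links are just made up for the example; a
real crawler would also want to dedupe these URLs and put them on a queue.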

By the way, there are many web sites for which this sort of behavior is not
welcome.
 

Leif K-Brooks

Tim said:
By the way, there are many web sites for which this sort of behavior is not
welcome.

Any site that didn't want to be crawled would most likely use a
robots.txt file, so you could check that before doing the crawl.
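
For what it's worth, the standard library already speaks that protocol (the
robotparser module in 2.x, urllib.robotparser in 3.x). A minimal, untested
sketch of checking a URL before fetching it:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://www.python.org/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    url = "http://www.python.org/doc/"
    # "*" means "rules that apply to any user agent"
    if rp.can_fetch("*", url):
        print("allowed to fetch", url)
    else:
        print("robots.txt disallows", url)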
 

Alex Martelli

Leif K-Brooks said:
Any site that didn't want to be crawled would most likely use a
robots.txt file, so you could check that before doing the crawl.

Python's Tools/webchecker/ directory has just the code you need for all
of this. The directory is part of the Python source distribution, but
it's all pure Python code, so, if your distribution is binary and omits
that directory, just download the Python source distribution, unpack it,
and there you are.
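
If memory serves, you just point webchecker at a root URL from the command
line, something like this (the exact options may differ between versions):

    python Tools/webchecker/webchecker.py http://www.python.org/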


Alex
 

Carlos Ribeiro

Brad,

Just to clarify something other posters have said. Automatic crawling
of websites is unwelcome primarily because of performance concerns.
It may also be regarded by some webmasters as a kind of abuse, because
the crawler generates hits and copies material for unknown reasons,
while never seeing any ads or generating revenue. Some sites even go to
the extent of blocking access from your IP, or your entire IP range,
when they detect this type of behavior. Because of this, there is a
very simple protocol involving a file called "robots.txt". Whenever
your robot first enters a site, it must check this file and follow the
instructions there; it will tell you what you may do on that website.

There are also a few other catches that you need to be aware of. First,
some sites don't have links pointing to all their pages, so it's never
possible to be completely sure you have read *all* pages. Also, some
sites have links embedded in scripts. It's not a recommended practice,
but it's common on some sites, and it may cause you problems. And
finally, there are situations where your robot may get stuck in an
"infinite site": some sites generate pages dynamically, and your robot
may end up fetching page after page and never get out of the site. So,
if you want a generic solution to crawl any site you desire, you have
to watch out for these issues.
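
To make that last point concrete, here is an untested sketch of how you
might cap a crawl so it cannot wander forever, keep it on one host, and
store each page as a separate string (the class and function names are
made up for the example, and it uses the modern urllib.request and
html.parser modules):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        # Gather href values from <a> tags.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def crawl(start_url, max_pages=200):
        # Breadth-first crawl, confined to one host and capped at max_pages
        # so that dynamically generated "infinite sites" cannot trap us.
        host = urlparse(start_url).netloc
        pages = {}                       # url -> page source, one string per page
        queue = deque([start_url])
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if url in pages or urlparse(url).netloc != host:
                continue
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue                 # unreachable or malformed URL; skip it
            pages[url] = html
            parser = LinkCollector()
            parser.feed(html)
            queue.extend(urljoin(url, link) for link in parser.links)
        return pages

    pages = crawl("http://www.python.org/", max_pages=50)
    print(len(pages), "pages fetched")

A real crawler would also honor robots.txt and strip #fragment parts before
comparing URLs, but the page cap is the important safety valve here.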


Best regards,


--
Carlos Ribeiro
Consultoria em Projetos
blog: http://rascunhosrotos.blogspot.com
blog: http://pythonnotes.blogspot.com
mail: (e-mail address removed)
mail: (e-mail address removed)
 

Michael Foord

Brad Tilley said:
Is there a way to make urllib or urllib2 read all of the pages on a Web
site? For example, say I wanted to read each page of www.python.org into
separate strings (a string for each page). The problem is that I don't
know how many pages are at www.python.org. How can I handle this?

Thanks,

Brad

I can highly recommend the BeautifulSoup parser for helping you to
extract all the links - it should make it a doddle. (You want to check
that you only follow links that are on www.python.org, of course - the
standard library urlparse should help with that.)
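
A quick, untested sketch of that (BeautifulSoup is distributed as the bs4
package these days; same_site_links is just a name made up for the example):

    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    from bs4 import BeautifulSoup   # pip install beautifulsoup4

    def same_site_links(url, allowed_host="www.python.org"):
        # Return the page's links, keeping only those on allowed_host.
        html = urlopen(url).read()
        soup = BeautifulSoup(html, "html.parser")
        links = set()
        for anchor in soup.find_all("a", href=True):
            absolute = urljoin(url, anchor["href"])
            if urlparse(absolute).netloc == allowed_host:
                links.add(absolute)
        return sorted(links)

    print(same_site_links("http://www.python.org/"))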

Regards,


Fuzzy

http://www.voidspace.org.uk/atlantibots/pythonutils.html
 

Brad Tilley

Alex said:
Python's Tools/webchecker/ directory has just the code you need for all
of this. The directory is part of the Python source distribution, but
it's all pure Python code, so, if your distribution is binary and omits
that directory, just download the Python source distribution, unpack it,
and there you are.


Alex

Thank you, this is ideal.
 
