Is Python good for web crawlers?

Tempo

I was wondering if Python is a good language to build a web crawler
with? For example, to construct a program that will routinely search x
number of sites to check the availability of a product. Or to search
for news articles containing the word 'XYZ'. These are just random
ideas to try to explain my question a bit further. If you have an
opinion about this, please let me know, because I am very interested to
hear what you have to say. Thanks.
 
Andrew Gwozdziewycz

Tempo said:
I was wondering if Python is a good language to build a web crawler
with? For example, to construct a program that will routinely search x
number of sites to check the availability of a product. Or to search
for news articles containing the word 'XYZ'. These are just random
ideas to try to explain my question a bit further. If you have an
opinion about this, please let me know, because I am very interested to
hear what you have to say. Thanks.

Google supplies a basic web crawler as a Google Desktop plugin called
Kongulo (http://sourceforge.net/projects/goog-kongulo/), which is
written in Python. I would think Python would be perfect for this sort
of application. Your bottleneck is always going to be downloading the
page.
 
Tempo

Why do you say that the bottleneck of the crawler will always be
downloading the page? Is it because there isn't already a module to do
this and I will have to start from scratch? Or a bandwidth issue?
 
Diez B. Roggisch

Tempo said:
Why do you say that the bottleneck of the crawler will always be
downloading the page? Is it because there isn't already a module to do
this and I will have to start from scratch? Or a bandwidth issue?

Because of bandwidth - not necessarily yours directly, but the maximum flow
between your uplink and the site in question. It will always take anywhere
from a fraction of a second up to several seconds until the data is there -
in that time, lots of Python code can run.
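
If most of the time is spent waiting on the network, one common response is to overlap the waits by fetching several pages at once. The following is only a rough sketch of that idea, assuming Python 3 and the standard library; the names fetch, fetch_all and max_workers are made up for illustration and nobody in this thread proposed this exact code.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch all URLs concurrently.

    Returns a dict mapping each URL to its bytes, or to the exception
    that was raised while fetching it.
    """
    def safe_fetch(url):
        try:
            return fetch_one(url)
        except Exception as exc:  # one bad site should not stop the crawl
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

Threads are a reasonable fit here because each worker spends nearly all of its time blocked on the network rather than running Python code. The fetch_one parameter exists only so the download step can be swapped out, for example for a stub when trying the function without a network connection.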

Diez
 
Tempo

Does a web crawler have to download an entire page if it only needs to
check if the product is in stock on a page? Or if it just needs to
search for one match of a certain word on a page?
 
Tim Parkin

Tempo said:
Does a web crawler have to download an entire page if it only needs to
check if the product is in stock on a page? Or if it just needs to
search for one match of a certain word on a page?

Typically you would download the whole HTML file and then perform any
analysis on it. It is possible to scan the stream of characters as
they come back from the server, but on average this would only halve
the download time (presuming the item you want is a single byte long
and equally likely to appear anywhere in the HTML). In reality, unless
the pages you are requesting are very large (200k+) or your bandwidth
is very expensive (in time and/or capacity), it is probably easier
to just download the whole file.
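
For the streaming alternative mentioned above, a minimal sketch might look like the following. It assumes Python 3 and any file-like object with a read() method (the object returned by urllib.request.urlopen is one); stream_contains and chunk_size are illustrative names, not part of any library.

```python
import io


def stream_contains(stream, needle, chunk_size=8192):
    """Return True as soon as `needle` (bytes) appears in `stream`.

    Reading stops at the first match, so the rest of the page is never
    pulled from the stream. Returns False if the stream ends first.
    """
    if not needle:
        return True
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return False
        window = tail + chunk
        if needle in window:
            return True
        # Keep the last len(needle) - 1 bytes so that a match split
        # across two chunks is still found on the next pass.
        tail = window[-(len(needle) - 1):] if len(needle) > 1 else b""


# A BytesIO object stands in for a network response here.
fake_response = io.BytesIO(b"<html><body>Product XYZ: in stock</body></html>")
found = stream_contains(fake_response, b"in stock")
```

Whether the early exit saves anything in practice depends on page size and on how much the operating system and server have already buffered, which is consistent with the advice above that downloading the whole file is usually simpler.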

I would recommend that you use BeautifulSoup to parse badly formatted
HTML documents (which is most of the web). Google 'beautiful soup' and
you should find it easily.

Tim Parkin
 
Tempo

I took your advice and got a copy of BeautifulSoup, but I am having
trouble installing the module. Any advice? I noticed that I can't just
put it into the 'lib' directory of Python to install it.
 
Tim Parkin

Tempo said:
I took your advice and got a copy of BeautifulSoup, but I am having
trouble installing the module. Any advice? I noticed that I just can't
put it into the 'lib' directory of python to install it.

Just save the file in the same directory as your project, then you should
be able to use the sample code.

Tim Parkin
 
Paul Rubin

Tempo said:
I was wondering if Python is a good language to build a web crawler
with? For example, to construct a program that will routinely search x
number of sites to check the availability of a product. Or to search
for news articles containing the word 'XYZ'. These are just random
ideas to try to explain my question a bit further.

I've written a few of these in Python. The language itself is fine
for this. The built-in libraries do most of what you'd hope, though
they have room for improvement. Generally I use
urllib.urlopen(url).read() (urllib.request.urlopen in Python 3) to get
the whole HTML page as a string, then process it from there. I just
look for the substrings I'm interested in, making no attempt to
actually parse the HTML into a DOM or anything like that.
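
As a rough illustration of the approach described above (fetch the page as one string, then look for substrings), here is a small sketch. It assumes Python 3, where the call lives in urllib.request; the function names fetch_text, mentions and sites_mentioning are invented for this example.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Download a page and decode it to a string."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def mentions(page_text, word):
    """Case-insensitive substring test; no HTML parsing involved."""
    return word.lower() in page_text.lower()


def sites_mentioning(urls, word, fetch=fetch_text):
    """Return the URLs whose page text contains `word`."""
    return [url for url in urls if mentions(fetch(url), word)]
```

Note that a plain substring test also matches text inside tags, attributes and scripts, which is part of the trade-off being discussed in this thread.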
 
Paul Rubin

Xavier Morel said:
BeautifulSoup...
The API of the package is extremely simple, straightforward and... obvious.

I did not find that. I spent a few minutes looking at the
documentation and it wasn't obvious at all how to use it. Maybe I
could have figured it out with more effort, but I got whatever the
immediate task was done without it instead. It does look like a nice
package but the docs need improvement.
 
Xavier Morel

Paul said:
Generally I use urllib.urlopen(url).read() to get
the whole HTML page as a string, then process it from there. I just
look for the substrings I'm interested in, making no attempt to
actually parse the HTML into a DOM or anything like that.

BeautifulSoup works *really* well when you want to parse the source
(e.g. when you don't want to use string matching, or when the structures
you're looking for are a bit too complicated for simple string
matching/substring search).

The API of the package is extremely simple, straightforward and... obvious.
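
To give a feel for what parsing buys you over substring search, here is a small sketch that pulls each link's target and text out of a page. It deliberately uses the standard library's html.parser as a stand-in so that it runs without installing anything; it is not BeautifulSoup's API, and LinkCollector and extract_links are names made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples for the links in `html`."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Getting the same pairs with plain substring search would mean tracking tag boundaries and nested markup by hand, which is the kind of structure the post above says is better left to a parser.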
 
Tempo

I agree. I think the way that I will learn to use most of it is by
going through the source code.
 
Magnus Lycka

Tempo said:
I was wondering if Python is a good language to build a web crawler
with? For example, to construct a program that will routinely search x
number of sites to check the availability of a product. Or to search
for news articles containing the word 'XYZ'. These are just random
ideas to try to explain my question a bit further. If you have an
opinion about this, please let me know, because I am very interested to
hear what you have to say. Thanks.

I dunno, but there are these two guys, Sergey Brin and Lawrence Page,
who wrote a web crawler in Python. As far as I understood, they were
fairly successful with it. I think they called their system Koogle,
Bugle, or Gobble or something like that. Goo...can't remember.

See http://www-db.stanford.edu/~backrub/google.html

They've also employed some clever Python programmers, such as Greg
Stein, Alex Martelli (isn't he a bot?) and some obscure Dutch
mathematician called Guido van something. It seems they still like
Python.
 
Alex Martelli

Magnus Lycka said:
I dunno, but there are these two guys, Sergey Brin and Lawrence Page,
who wrote a web crawler in Python. As far as I understood, they were
fairly successful with it. I think they called their system Koogle,
Bugle, or Gobble or something like that. Goo...can't remember.

See http://www-db.stanford.edu/~backrub/google.html

Yeah, I've heard of them, too.

They've also employed some clever Python programmers, such as Greg
Stein, Alex Martelli (isn't he a bot?) and some obscure Dutch
mathematician called Guido van something. It seems they still like
Python.

Bot? me? did I fail a Turing test again without even noticing?!


Alex
 
Magnus Lycka

Simon said:
If you'd noticed the test, you'd have passed.

No no, it's just a regular expression that notices the
word 'bot' close to 'Martelli'. Wouldn't surprise me
if more or less the same message appears again as a
response to this post. ;)
 
John J. Lee

Alex Martelli said:
Yeah, I've heard of them, too.

I wonder if that little outfit has considered open-sourcing any of
their web client code?

(Declaring my interest: I'm maintaining, and very slowly developing,
some open-source libraries for web scraping and testing)


John
 
gene tani

Paul said:
I did not find that. I spent a few minutes looking at the
documentation and it wasn't obvious at all how to use it. Maybe I

1. read about Soup and mechanize
http://sig.levillage.org/?p=599

2. flip through O'Reilly's Spidering Hacks book (put on YAPH t-shirt)

3. go at your task

4. write Spidering Hacks in Python, 1st edition. Cite me as
inspiration.
 
