web crawler in python or C?

A

abhinav

Hi guys.I have to implement a topical crawler as a part of my
project.What language should i implement
C or Python?Python though has fast development cycle but my concern is
speed also.I want to strke a balance between development speed and
crawler speed.Since Python is an interpreted language it is rather
slow.The crawler which will be working on huge set of pages should be
as fast as possible.One possible implementation would be implementing
partly in C and partly in Python so that i can have best of both
worlds.But i don't know to approach about it.Can anyone guide me on
what part should i implement in C and what should be in Python?
 
W

websnarf

abhinav said:
Hi guys.I have to implement a topical crawler as a part of my
project.What language should i implement
C or Python?Python though has fast development cycle but my concern is
speed also. I want to strke a balance between development speed and
crawler speed.

Web crawling is an inherently network limited activity. The way to
speed up crawling is through parallel downloading. The language
performance is not going to have a relevant effect. Python does not
support multithreading, but it does support weak coroutines. (Of
course, C does not support any kind of multithreading, except by
platform specific extensions -- but these extensions are widespread.)

For the problem of parsing and handling data structures for this
activity, however, Python is *FAR* superior to C in terms of
development speed.
[...] Since Python is an interpreted language it is rather
slow.The crawler which will be working on huge set of pages should be
as fast as possible.One possible implementation would be implementing
partly in C and partly in Python so that i can have best of both
worlds. But i don't know to approach about it.Can anyone guide me on
what part should i implement in C and what should be in Python?

Actually, I have, in fact, done it this way myself in the past (before
Python had weak coroutines.) The way I did it is I wrote a
command-line tool for pulling down a collection of URLs from a control
file in C (the URLs would be downloaded in a multithreaded manner),
then I drove this tool from a Python program. Asymptotically, this
pegs my download bandwidth for the majority of the runtime, thus making
it basically within striking distance of theoretically optimal.

The problem is that you've picked completely the wrong newsgroup to ask
this question. Unfortunately, there is not clue to this fact from the
name of this newsgroup. This is actually a newsgroup that discusses
only the ANSI/ISO C standard as it exists, and none of platform
specific extensions (including sockets, and multithreading). Nor is
the discussion of the development of real applications considered
on-topic in this newsgroup. Neither is performance considered on topic
-- by the standard, apparently you can't know even the *relative* speed
of anything in C. comp.programming would probaby have been a better
place to post this.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,819
Latest member
masterdaster

Latest Threads

Top