developing web spider

Daniel Fetchinson

I would like to know which would be the best programming language for
developing a web spider.
The more information about spiders, the better.

I hear Larry and Sergey were not exactly unsuccessful with a Python
implementation, although you might of course try something even better
:)

If you are less ambitious, have a look at http://nikitathespider.com/
which is also a spider written in Python.

Cheers,
Daniel
 
Stefan Scholl

abeen said:
I would like to know which would be the best programming language for
developing a web spider.

Since you ask in comp.lang.python: I'd suggest APL
 
zillow10

Hello,

I would like to know which would be the best programming language for
developing a web spider.
The more information about spiders, the better.

thanks

http://www.imavista.com

Just saw this while passing by... There's a nice book by Michael
Schrenk (www.schrenk.com) called "Webbots, Spiders and Screen
Scrapers" that teaches scraping and spidering from the ground up using
PHP. Since you said you want more info on spiders, this book might be
a good way for you to acquire concept and implementation hand-in-hand.
He's also developed a nice webbot library in PHP that you can get from
his website.

Also comes with a nice webbot library (which you can download from
the website anyway).
 
zillow10

Sorry for the duplicate comment about the webbot library... the perils
of cutting and pasting to restructure sentences. :)
 
Pete Wright

The O'Reilly Spidering Hacks book is also really good, albeit a little
too focussed on Perl.
 
John Nagle

abeen said:
Hello,

I would like to know which would be the best programming language for
developing a web spider.
The more information about spiders, the better.

As someone who actually runs a Python-based web spider in production, I
should comment.

You need a very robust parser to parse real world HTML.
Even the stock version of BeautifulSoup isn't good enough. We have a
modified version of BeautifulSoup, plus other library patches, just to
keep the parser from blowing up or swallowing the entire page into
a malformed comment or tag. Browsers are incredibly forgiving in this
regard.
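
To illustrate the defensive pattern described above, here is a minimal
sketch using only the standard library's html.parser, not the patched
BeautifulSoup mentioned in this post. The names LinkExtractor and
extract_links are made up for the example. The idea is simply to keep
whatever links were collected even if the parser gives up partway through
a broken page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return absolute link URLs, keeping partial results on parser failure."""
    parser = LinkExtractor(base_url)
    try:
        parser.feed(html_text)
        parser.close()
    except Exception:
        # Assumption: for a spider, a partial link list beats a crash.
        pass
    return parser.links
```

A production spider would likely need far more than this (encoding
detection, size limits, handling of runaway comments), so treat it as a
starting point only.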

"urllib" needs extra robustness, too. The stock timeout mechanism
isn't good enough. Some sites do weird things, like open TCP connections
for HTTP but not send anything.
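
One way to guard against servers that stall is sketched below. It is an
assumption about how one might do it, not the patched urllib referred to
above. The function name fetch and its parameters op_timeout,
total_deadline and max_bytes are invented for the example. The timeout
passed to urlopen only bounds each individual socket operation, so the
sketch also enforces an overall deadline and a size cap while reading.

```python
import http.client
import time
import urllib.request


def fetch(url, op_timeout=10.0, total_deadline=30.0, max_bytes=2_000_000):
    """Return the body as bytes, or None if the fetch fails, stalls or is too big."""
    start = time.monotonic()
    try:
        # op_timeout bounds the connect and each blocking read, not the whole fetch.
        with urllib.request.urlopen(url, timeout=op_timeout) as resp:
            chunks = []
            size = 0
            while True:
                # The deadline is checked between reads, so the worst case
                # is roughly total_deadline + op_timeout.
                if time.monotonic() - start > total_deadline:
                    return None
                chunk = resp.read(8192)
                if not chunk:
                    break
                size += len(chunk)
                if size > max_bytes:
                    return None
                chunks.append(chunk)
            return b"".join(chunks)
    except (OSError, http.client.HTTPException, ValueError):
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError covers malformed URLs.
        return None
```

A server that accepts the connection and then sends nothing makes the
first read time out after op_timeout seconds, and fetch returns None
instead of hanging.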

Python is on the slow side for this. Python is about 60x
slower than C, and for this application, you definitely see that.
A Python-based spider will go compute-bound for seconds per page
on big pages. The C-based parsers for XML/HTML aren't robust enough for
this application. And then there's the Global Interpreter Lock; a multicore
CPU won't help a multithreaded compute-bound process.
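
A common workaround for the GIL, sketched here under the assumption that
parsing is the bottleneck, is to spread the parsing over several
processes rather than threads, since each process has its own interpreter
and its own lock. The names count_tags and parse_all are hypothetical,
and count_tags is only a stand-in for real page parsing.

```python
from concurrent.futures import ProcessPoolExecutor
from html.parser import HTMLParser


class _TagCounter(HTMLParser):
    """Counts start tags; a stand-in for real, heavier parsing work."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1


def count_tags(html_text):
    """CPU-bound work done per page: return the number of start tags."""
    counter = _TagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.count


def parse_all(pages, workers=None):
    """Parse pages in worker processes so several cores can be used at once.

    Call this from under an `if __name__ == "__main__":` guard, which
    platforms that spawn (rather than fork) worker processes require.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_tags, pages, chunksize=8))
```

This does not make any single page parse faster. It only lets several
pages be parsed at the same time, at the cost of copying page text
between processes.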

I'd recommend using Java or C# for new work in this area
if you're doing this in volume. Otherwise, you'll need to buy
many, many extra racks of servers. In practice, the big spiders
are in C or C++.

Lose the ad link.

John Nagle
 
