Open source web crawler with mysql integration

dhenews

I'm looking for a crawler that can spider my site and toss the results
into MySQL so that, in turn, the database can be indexed by Sphinx Search.

Since I don't want to reinvent the wheel, is anyone aware of any open
source projects or code snippets that can already handle this?

Thanks for any advice.
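
For reference, the pipeline described above (crawl one site, store each page in a table that a search indexer can read) can be sketched in a few dozen lines. This is only a rough illustration under stated assumptions, not an existing project: the names `PageParser`, `fetch_url`, `crawl` and the `pages` table are made up for this example, and the standard-library `sqlite3` module stands in for MySQL so the sketch runs without a database server.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collect link targets and text nodes from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Note: this also picks up script/style content; a real crawler
        # would filter those out.
        if data.strip():
            self.text_parts.append(data.strip())


def fetch_url(url):
    """Default fetcher: download a page and decode it as UTF-8."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, conn, fetch=fetch_url, max_pages=50):
    """Breadth-first crawl of a single host; returns the number of pages stored."""
    # Hypothetical schema: one row per URL with the extracted text.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
    )
    host = urlparse(start_url).netloc
    queue, seen = [start_url], {start_url}
    stored = 0
    while queue and stored < max_pages:
        url = queue.pop(0)
        try:
            html = fetch(url)
        except OSError:
            # Skip pages that cannot be fetched (URLError is an OSError).
            continue
        parser = PageParser()
        parser.feed(html)
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
            (url, " ".join(parser.text_parts)),
        )
        stored += 1
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve and drop fragment
            # Stay on the starting host and never queue a URL twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    conn.commit()
    return stored
```

Calling `crawl("http://example.com/", sqlite3.connect("site.db"))` would fill the `pages` table. Pointing this at MySQL should mostly be a matter of swapping in a DB-API driver for the connection, but the SQL would need adjusting: `INSERT OR REPLACE` is SQLite syntax (MySQL has `REPLACE INTO`), and MySQL drivers typically use `%s` placeholders instead of `?`. The sketch also ignores robots.txt and rate limiting, which a polite crawler should not.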
 
Daniel Fetchinson

Have a look at http://nikitathespider.com/python/

HTH,
Daniel
 
Philip Semanchuk



As the author of Nikita, I can say that (a) she used Postgres and (b)
the code wasn't open sourced except for a couple of small parts. The
service is now defunct. It wasn't making money. Ideally I'd like to
open source the code one day, but it would take a lot of documentation
work to make it installable by others, and I won't have the time to do
that for the foreseeable future.

At the URL provided there's a nice module for parsing robots.txt files
(better than the one in the standard library IMHO) but that's about it.
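
For comparison, here is a minimal sketch of robots.txt checking with the standard library parser, not the module mentioned above (whose API isn't shown in this thread). In Python 3 the parser lives in `urllib.robotparser`; in Python 2 it was the top-level `robotparser` module. The sample rules, the `build_parser` helper and the `ExampleSpider` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a crawler would normally download this
# from the site's /robots.txt URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# Allowed: no rule matches this path.
print(parser.can_fetch("ExampleSpider", "http://example.com/index.html"))
# Disallowed: the path falls under /private/.
print(parser.can_fetch("ExampleSpider", "http://example.com/private/page.html"))
```

A crawler would call `can_fetch` before requesting each URL and skip any URL for which it returns a false value.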

FYI, I wrote my spider in Python because I couldn't find a decent one
written in Python. There's Nutch, but that's not Python (Java I think).

Good luck
Philip
 
Support Desk

Sounds interesting. When it's done, would you care to share it?

Sincerely,
Michael H.

-----Original Message-----
From: Philip Semanchuk
Sent: Thursday, April 09, 2009 9:46 PM
To: Python
Subject: Re: Open source web crawler with mysql integration
 
Philip Semanchuk

Hi Michael,
The coding has been done (as much as software is ever "done") for a
couple of years now. It's mothballed, sitting on my hard drive.
The problem with open sourcing it isn't that the code is incomplete;
the problem is that it's insufficiently documented, features a
byzantine install procedure, and contains a lot of code and assumptions
that were relevant to my business but would not be of interest to most
people looking to download a general-purpose spider. I'd love to open
source it, and if someone wants to pay me to make it open source-able,
let's talk! But if I have to do it on my own time for free, it will be
a while (maybe never, although I hope not) before I can make the time.

Regards
Philip



 
Lawrence D'Oliveiro

Philip said:
I'd love to open source it and if someone wants to pay me to make it open
source-able, let's talk!

Nobody's going to pay you for something of doubtful value--it's up to you to
prove the value of the code first. You must go to the community, the
community will not come to you.
 
Philip Semanchuk

Not true, people pay for things of doubtful value all the time! I just
need a better sales team. =)

Seriously, if I had expectations of talking someone into fronting
money, do you think I'd use words like "insufficiently documented" and
"byzantine"? I was just trying to emphasize what I thought was an
obvious point: I can't afford the time to open source my code right
now, but if someone were to make it worth my while, that'd be a
different story. I may as well have said "if I win the lottery". It
*could* happen, but I'm not holding my breath (or buying lottery
tickets).

Cheers
Philip
 
Daniel Fetchinson

This is what http://www.uselesspython.com/ is for! You can dump it
there and interested people can be pointed to it, but since you
expressly say that "it's useless!", your reputation will not be damaged.
Actually, since you say the code is working and there are not many good
open source web crawlers out there, I'm sure people will be quite happy
with it. I don't think your worry that 90% of people will only be
frustrated by it will come true if you put it on uselesspython, because
expectations will not be high anyway.

Cheers,
Daniel
 
