...How to parse search engine results fast?

VB

Hi,

I'm building a metasearch engine based on data mining techniques... but
this is not important...

My question is about the performance of scraping search
engine results from an HTML response page.

I see that some metasearch engines (Mamma, DogPile, Vivisimo & co.)
present the top 50 results from 3-5 search engines in about 1 second.

With my Perl script I am able to retrieve the top 100 results from Google in
about 1.5 seconds, but from only one search engine!

Can somebody (very skilled in Perl) tell me some advanced
technique (parallelism, threads...?) to retrieve results from 3-5 search
engines very fast? (Hardware is not part of this issue; I have fast
hardware.)


Excuse my English (I'm Italian) and my poor Perl skills.

Thanks,

VB
 
phaylon

VB said:
My question is about the performance of scraping search
engine results from an HTML response page.

Maybe they /asked/ and used the API provided by some SE's?

Can somebody (very skilled in Perl) tell me some advanced technique
(parallelism, threads...?) to retrieve results from 3-5 search engines very fast?
(Hardware is not part of this issue; I have fast hardware.)

- Very fast hardware with enough resources.
- Internet Connection
- Interfaces, see above.

There may be much more, but I can't see the relation to Perl ("I've written it in
Perl" may not be enough; this group is more about coding Perl than about
technologies that can be coded in Perl, which would be a wide field).

hth,
p
 
Anno Siegel

VB said:
Hi,

I'm building a metasearch engine based on data mining techniques... but
this is not important...

Then why mention it?

My question is about the performance of scraping search
engine results from an HTML response page.

Looks like you use "scraping" to mean, roughly, parsing.

The answer would depend on the format of the response page. Since you
don't mention which search engines you are tapping into, there is
nothing we can say about that.

Except that the time needed to parse the results will most likely
be small compared to the time taken to retrieve them.

I see that some metasearch engines (Mamma, DogPile, Vivisimo & co.)
present the top 50 results from 3-5 search engines in about 1 second.

With my Perl script I am able to retrieve the top 100 results from Google in
about 1.5 seconds, but from only one search engine!

Well, Perl isn't the fastest of languages. If you want super-fast,
don't use Perl.

Can somebody (very skilled in Perl) tell me some advanced
technique (parallelism, threads...?) to retrieve results from 3-5 search
engines very fast? ...

Finally you get to the core of your question. The one thing you can
do to arrive at results faster is to handle multiple queries in
parallel. See "perldoc perlipc" for general techniques, and "perldoc
-f fork" and "perldoc -f open" for the basic methods. Also check out
LWP::Parallel on CPAN; it could be useful.

... (Hardware is not part of this issue; I have fast hardware.)

Again, your local processing speed will not be the limiting factor. Net
delay and the vastly more extensive processing on the actual search engine
will.
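
A minimal sketch of the fork-and-pipe approach described in perlipc, assuming each query can be wrapped in a code ref that returns a string. The run_parallel name, the engine names and the sleep-based stand-ins for HTTP requests are all made up for illustration; in a real script each job would fetch a page, for example with LWP::UserAgent.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Run each job (a code ref returning a string) in its own child process
# and collect the output through a pipe. Returns a hash ref: name => output.
sub run_parallel {
    my (%jobs) = @_;
    my (%reader, %pid);

    for my $name (sort keys %jobs) {
        pipe(my $r, my $w) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {                 # child: run the job, write, exit
            close $r;
            print {$w} $jobs{$name}->();
            close $w;
            exit 0;
        }

        close $w;                        # parent keeps only the read end
        $reader{$name} = $r;
        $pid{$name}    = $pid;
    }

    my %result;
    for my $name (sort keys %reader) {
        my $fh = $reader{$name};
        local $/;                        # slurp everything the child wrote
        $result{$name} = <$fh>;
        close $fh;
        waitpid $pid{$name}, 0;          # reap the child
    }
    return \%result;
}

# Hypothetical stand-ins for real HTTP requests: each "engine" just
# sleeps for half a second to mimic network delay.
my %jobs = map {
    my $engine = $_;
    ($engine => sub { sleep 0.5; return "results from $engine" });
} qw(engine_a engine_b engine_c);

my $start   = time;
my $results = run_parallel(%jobs);
my $elapsed = time - $start;

print "$results->{$_}\n" for sort keys %$results;
printf "elapsed: %.2f s\n", $elapsed;
```

Because the three jobs wait at the same time, the total should be close to the slowest single job rather than the sum of all three. The parsing of each response can still be done in the parent, or inside each child before it writes to the pipe.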

Anno
 
Gregory Toomey

VB said:
Hi,

I'm building a metasearch engine based on data mining techniques... but
this is not important...

My question is about the performance of scraping search
engine results from an HTML response page.

Use C or possibly Perl, but that's not your problem.

The problem is the copyright lawsuit that's heading your way.

gtoomey
 
