Looking for Java web crawler API

pm

Hello, I am working on a project that requires me to run custom searches on
different websites. I am using Java, and while I could write this from
the ground up, I would rather use an existing API because of time
constraints. So far I have come across Apache's HttpClient.
I am wondering if there are any others that are effective or
give more options for web searching/scraping. I plan to create a GUI-based
application and need something quick and effective without being
too complex.
I appreciate any feedback.
 
Bent C Dalager

I found JSoup (jsoup.org) to be a fine library for web scraping. It
lets you easily set cookies and headers, fetches the URL for you, and
converts the tangled mess of HTML you tend to receive into a
well-formed XML document model.

Cheers,
Bent D.
 
Durango2011

Thank you very much, that looks like what I am looking for.
 
iadb

Look at the example linked below; it works fine with little
customization:
http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/


http://www.internetarticlesdb.com
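
For readers who want to see the basic shape of a crawler before picking a library, here is a minimal sketch using only the JDK. It is not the code from the linked article. The class name CrawlSketch, the method names, and the in-memory `pages` map (which stands in for real HTTP fetches) are all assumptions made for illustration. The regular expression only handles double-quoted href attributes; a real crawler should use an HTML parser.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrawlSketch {
    // Naive pattern for double-quoted href attributes on <a> tags.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute http(s) links in html, resolved against baseUrl. */
    static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim());
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }

    /**
     * Breadth-first crawl starting at start. The pages map (URL to HTML)
     * is a hypothetical stand-in for fetching each URL over HTTP.
     */
    static List<String> crawl(String start, Map<String, String> pages, int maxPages) {
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        List<String> visited = new ArrayList<>();
        frontier.add(start);
        seen.add(start);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url);
            String html = pages.get(url);
            if (html == null) {
                continue; // Nothing fetched for this URL, so no links to follow.
            }
            for (String link : extractLinks(url, html)) {
                if (seen.add(link)) {
                    frontier.add(link); // Queue each URL only once.
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, String> pages = Map.of(
                "http://example.com/index.html",
                "<a href=\"/a.html\">A</a> <a href=\"b.html\">B</a>",
                "http://example.com/a.html",
                "<a href=\"/index.html\">home</a>");
        System.out.println(crawl("http://example.com/index.html", pages, 10));
    }
}
```

To crawl real sites, the lookup in `pages` would be replaced by an HTTP fetch (for example with HttpClient or jsoup, as mentioned earlier in the thread), and a polite crawler would also add delays and honor robots.txt.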
 
Durango2011

On Tue, 12 Jul 2011 07:14:45 +0000, pm wrote:


Thanks for all the great feedback :)
 
Arne Vajhøj

Apache Nutch (http://nutch.apache.org/) includes a crawler, and it comes
with a searchable index (Lucene).

Arne
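
To give a rough idea of what a "searchable index" over crawled pages means, here is a toy inverted index using only the JDK. It is a sketch of the general idea only and does not use or imitate the Lucene or Nutch APIs. The class name TinyIndex and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyIndex {
    // Maps each lowercase token to the sorted set of URLs whose text contains it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Splits text into alphanumeric tokens and records url under each one. */
    void add(String url, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(url);
            }
        }
    }

    /** Returns the URLs that contain every term of the query, in sorted order. */
    List<String> search(String query) {
        Set<String> result = null;
        for (String token : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            Set<String> urls = postings.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(urls); // First term: start from its URLs.
            } else {
                result.retainAll(urls);       // Later terms: keep the intersection.
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }
}
```

A crawler would call `add` with each page's URL and extracted text, and the GUI would call `search` with the user's query. A real index such as Lucene adds ranking, stemming, and on-disk storage on top of this idea.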