Java HTML Parser

Anony!

Hi

2 questions:

1. I'm looking for a Java HTML parser. I realize that the Java Swing HTML
parser is one option I could use, but I would like some other
opinions/alternatives.

2. I am hoping to parse a batch of HTML Web pages. I believe it should be
relatively easy to do a single HTML page, but any tips for multiple HTML
pages? How will the parser know to go to the next HTML page? I have
thousands of HTML pages to parse.

Any help appreciated.


Regards
AaA
 
Luca Paganelli

Anony! said:
2. I am hoping to parse a batch of HTML Web pages. I believe it should be
relatively easy to do a single HTML page, but any tips for multiple HTML
pages? how will the parser know to go to the next HTML page? I have like
thousands of HTML pages to parse.

I don't think the parser would go to the 'next HTML pages' automatically.
Anyway, you can look for any linked pages in the parsed document and then
start parsing those new pages.

Luca Paganelli
 
Anony!

Luca Paganelli said:
I don't think the parser would go to the 'next HTML pages' automatically.
Anyway, you can look for any linked pages in the parsed document and then
start parsing those new pages.

There are no links in those HTML pages. And yes, I want something that will
automatically parse the next HTML file in a given directory.

AaA
 
Markus Schaber

Hi, Anony,

Anony! said:
There are no links in those HTML pages. And yes, I want something that
will automatically parse the next HTML file in a given directory.

Then you use the java.io API to iterate over the directory and parse
the files one after another. It should take fewer than 20 lines of code.

Gruss,
Markus
 
Anony!

Markus Schaber said:
Then you use the java.io API to iterate over the directory and parse
the files one after another. It should take fewer than 20 lines of code.

Do you mean storing the files in a tree structure and iterating through it?

AaA
 
Markus Schaber

Hi, Anony,

Anony! said:
Do you mean storing the files in a tree structure and iterating through it?

Why a tree structure?

You create a File object for the directory (isDirectory() should then
return true) and use listFiles(filter) to get an array of the matching
files in that directory. Then you can pass each of them to your HTML parser.
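
The approach described above can be sketched as follows. This is only a minimal illustration, not code from the thread: the class name HtmlBatch, the method listHtmlFiles, and the choice to match on the .html/.htm extensions are assumptions made for the example, and the actual parsing step is left as a placeholder.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; the name is not from the thread.
final class HtmlBatch {

    /** Returns the .html/.htm files directly inside dir, sorted by path. */
    static List<File> listHtmlFiles(File dir) {
        if (!dir.isDirectory()) {
            throw new IllegalArgumentException("Not a directory: " + dir);
        }
        FilenameFilter htmlOnly = new FilenameFilter() {
            @Override
            public boolean accept(File parent, String name) {
                String lower = name.toLowerCase(Locale.ROOT);
                boolean looksLikeHtml = lower.endsWith(".html") || lower.endsWith(".htm");
                // Skip subdirectories that merely have an HTML-looking name.
                return looksLikeHtml && new File(parent, name).isFile();
            }
        };
        File[] found = dir.listFiles(htmlOnly); // null if the listing fails
        List<File> result = new ArrayList<File>();
        if (found != null) {
            Arrays.sort(found); // File is Comparable, so this orders by path
            result.addAll(Arrays.asList(found));
        }
        return result;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File page : listHtmlFiles(dir)) {
            // Placeholder: hand each file to whichever HTML parser you choose.
            System.out.println("Would parse: " + page.getPath());
        }
    }
}
```

Note that listFiles does not descend into subdirectories, so a nested layout would need a recursive call for each entry where isDirectory() is true.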

Markus
 
Rogan Dawes

Anony! said:
Hi

2 questions:

1. I'm looking for a Java HTML parser. I realize that the Java Swing HTML
parser is one option I could use, but I would like some other
opinions/alternatives.

Have a look at htmlparser on sourceforge.net
(http://htmlparser.sourceforge.net), which is probably more robust than
the standard Sun parser.

Anony! said:
2. I am hoping to parse a batch of HTML Web pages. I believe it should be
relatively easy to do a single HTML page, but any tips for multiple HTML
pages? how will the parser know to go to the next HTML page? I have like
thousands of HTML pages to parse.

Either you have a list of the pages/URLs that you provide to the parser,
or you parse additional URLs from the pages as you read them. Since you
said in another response in this thread that the pages will not have
links to other pages, you must have a list yourself.

Clearly, your computer cannot simply "guess" which pages to parse. If
the pages are stored locally, simply iterate over the directory(ies) in
which they are stored, parsing them one by one. If the pages are stored
on a server, perhaps there is an index page that you can parse to get a
list of pages.
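
As a rough illustration of parsing an index page to get a list of pages, here is a sketch that uses the Swing HTML parser from the JDK (the one mentioned in the original question) to collect the href targets of anchor tags. The class name LinkLister, the method extractLinks, and the example base address are assumptions for the example only. A third-party parser such as htmlparser could fill the same role.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is not from the thread.
final class LinkLister {

    /** Collects the targets of all <a href="..."> tags, resolved against base. */
    static List<URI> extractLinks(Reader html, final URI base) throws IOException {
        final List<URI> links = new ArrayList<URI>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.A) {
                    return;
                }
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href == null) {
                    return; // a named anchor without a link target
                }
                try {
                    links.add(base.resolve(href.toString().trim()));
                } catch (IllegalArgumentException e) {
                    // Malformed href: skip it rather than abort the whole page.
                }
            }
        };
        // The final argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String index = "<html><body><a href=\"page1.html\">One</a></body></html>";
        URI base = URI.create("http://example.com/list/index.html"); // made-up address
        for (URI link : extractLinks(new StringReader(index), base)) {
            System.out.println(link);
        }
    }
}
```

Relative links are resolved against the address of the index page, so the resulting list can be fed straight to whatever fetches or parses the pages next.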

Rogan
 
Anony!

Rogan Dawes said:
Have a look at htmlparser on sourceforge.net
(http://htmlparser.sourceforge.net), which is probably more robust than
the standard Sun parser.

Either you have a list of the pages/URLs that you provide to the parser,
or you parse additional URLs from the pages as you read them. Since you
said in another response in this thread that the pages will not have
links to other pages, you must have a list yourself.

Clearly, your computer cannot simply "guess" which pages to parse. If
the pages are stored locally, simply iterate over the directory(ies) in
which they are stored, parsing them one by one. If the pages are stored
on a server, perhaps there is an index page that you can parse to get a
list of pages.

Let me describe what I am trying to parse in greater detail.

I have a Web page that has a list of hyperlinks. Each of these hyperlinks
points to another page with a list of hyperlinks. Each of those links points
to a unique page I want to parse. Anyone lost yet? I don't know how to
download all of these pages in an automated fashion for parsing.

I think I can handle storing these files in a file system that Java can read
and then parsing each of the files. It's the automated download of all
these Web pages that leaves me clueless.
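
For the download step described above, one possible sketch uses only java.net.URL and java.nio.file. It assumes the addresses of the final pages have already been collected, for example by reading the links from the top-level list page and then from each second-level list page. The class name PageDownloader, the method downloadAll, and the page-00000.html naming scheme are made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; the name is not from the thread.
final class PageDownloader {

    /** Saves each page into targetDir and returns the local paths, in input order. */
    static List<Path> downloadAll(List<URL> pages, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        List<Path> saved = new ArrayList<Path>();
        int counter = 0;
        for (URL page : pages) {
            // Numbered names avoid clashes when two pages share a file name.
            String name = String.format(Locale.ROOT, "page-%05d.html", counter++);
            Path target = targetDir.resolve(name);
            try (InputStream in = page.openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
            saved.add(target);
        }
        return saved;
    }
}
```

Once the files are on disk, parsing them is the same directory iteration discussed earlier in the thread. For thousands of pages it is probably worth adding a short pause between requests and catching IOException per page so that one failed download does not stop the batch.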

Any help appreciated.

AaA
 
William Brogden

Anony! said:
Let me describe what I am trying to parse in greater detail.

I have a Web page that has a list of hyperlinks. Each of these hyperlinks
points to another page with a list of hyperlinks. Each of those links points
to a unique page I want to parse. Anyone lost yet? I don't know how to
download all of these pages in an automated fashion for parsing.

I think I can handle storing these files in a file system that Java can read
and then parsing each of the files. It's the automated download of all
these Web pages that leaves me clueless.

Any help appreciated.

AaA

You might find the code to JTidy to be useful
http://sourceforge.net/projects/jtidy

"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty
printer. Like its non-Java cousin, JTidy can be used as a tool for
cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM
parser for real-world HTML."

Bill
 
