Parsing HTML without Perl

TLOlczyk

Is there a library that will allow me to parse bad HTML?
(Good HTML too, but any lib will do that.)

Yes, I can use Perl, but I want the flexibility to use any one
of several languages, so a shared object/DLL would be best.
I've looked at libxml (actually libxml2) and expat (I know
they are really XML parsers, but one can hope), and neither
handles HTML well enough. I'm totally confused by libwww.
The libwww people suggest looking at the parser in Amaya,
but I don't know how good it is or if I can extract it from
the rest of Amaya.

Suggestions?


The reply-to email address is (e-mail address removed).
This is an address I ignore.
To reply via email, remove 2002 and change yahoo to
interaccess,

**
Thaddeus L. Olczyk, PhD

There is a difference between
*thinking* you know something,
and *knowing* you know something.
 
David Christopher Weichert

Georg Rehm describes (in: Mehler & Lobin: Automatische Textanalyse,
2004) a two-step process for converting arbitrary HTML web pages to
XHTML. According to him it works in 98.7 % of all cases:

1) use tidy to read and try to convert the HTML to XHTML

2) if 1) fails they use HTML::TreeBuilder (Perl module, see:
http://www.cpan.org) and then again tidy.

Of 10,000 files picked at random, 9,872 could be converted successfully in
this fashion. Only 5 of the resulting files were not well-formed afterwards
(tested with expat).
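
The control flow of that two-step process can be sketched roughly as follows. This is only an illustration in Python, not Rehm's code: the names `convert_with_fallback`, `fake_tidy` and `fake_repair` are made up, and the two stand-in functions merely simulate tidy and the tree-builder step so that the sketch runs on its own.

```python
from typing import Callable, Optional

# A converter takes HTML text and returns XHTML text, or None on failure.
Converter = Callable[[str], Optional[str]]


def convert_with_fallback(html: str, primary: Converter,
                          repair: Converter) -> Optional[str]:
    """Try `primary`; if it fails, run `repair` and then `primary` again."""
    result = primary(html)
    if result is not None:
        return result
    repaired = repair(html)
    if repaired is None:
        return None
    return primary(repaired)


def success_rate(converted: int, total: int) -> float:
    """Percentage of files that were converted."""
    return 100.0 * converted / total


def fake_tidy(html: str) -> Optional[str]:
    # Stand-in only: "succeeds" when every <p> has a matching </p>.
    return html if html.count("<p>") == html.count("</p>") else None


def fake_repair(html: str) -> Optional[str]:
    # Stand-in only: appends whatever </p> tags are missing.
    missing = html.count("<p>") - html.count("</p>")
    return html + "</p>" * max(missing, 0)


if __name__ == "__main__":
    print(convert_with_fallback("<p>unclosed", fake_tidy, fake_repair))
    print(success_rate(9872, 10000))
```

In a real setup the two callables would wrap tidy and whatever tolerant parser is available; the point is only that the second tool runs on the files the first one rejects.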


Kind regards
David
 
TLOlczyk

> Georg Rehm describes (in: Mehler & Lobin: Automatische Textanalyse,
> 2004) a two-step process for converting arbitrary HTML web pages to
> XHTML. According to him it works in 98.7 % of all cases:
>
> 1) use tidy to read and try to convert the HTML to XHTML

No. Tidy chokes on embedded < > and a few other bad constructs.
At least it did last I looked. I will have to try it again.
Also, it comes as a standalone application. If I were to use
something like tidy, I would prefer it to be a lib (though I
understand they are working on it).

PS: I never tried it with JavaScript. How does it handle that?
Both expat and libxml choke on the first for loop.

> 2) if 1) fails they use HTML::TreeBuilder (Perl module, see:
> http://www.cpan.org) and then again tidy.

As I said before, I would like to be independent of Perl.



David Christopher Weichert

> No. Tidy chokes on embedded < > and a few other bad constructs.
> At least it did last I looked. I will have to try it again.
> Also, it comes as a standalone application. If I were to use
> something like tidy, I would prefer it to be a lib (though I
> understand they are working on it).

tidy also comes as a lib.

> PS: I never tried it with JavaScript. How does it handle that?

I just tried tidy on a randomly picked page containing JavaScript
(http://javascript.internet.com/). Tidy reported warnings and errors, but
could not repair them automatically. It seems Rehm used easier examples. I
can't say whether this behaviour is down to the JavaScript or to other
things wrong with that particular file.

> Both expat and libxml choke on the first for loop.
>
> As I said before, I would like to be independent of Perl.

Rehm states that he needed HTML::TreeBuilder in only 2.7 % of all cases and
that otherwise tidy was sufficient. This may be because the pages he
sampled apparently worked better with tidy than the random sample I
picked. (Rehm sampled pages from German educational institutions.)

It looks like tidy on its own is not the solution, but it might be of some use.


Good luck
David
 
TLOlczyk

> tidy also comes as a lib.
>
> I just tried tidy on a randomly picked page containing JavaScript
> (http://javascript.internet.com/). Tidy reported warnings and errors, but
> could not repair them automatically. It seems Rehm used easier examples. I
> can't say whether this behaviour is down to the JavaScript or to other
> things wrong with that particular file.

That's because with anything but the simplest JavaScript,
you are going to encounter something like:
    for (var i=0; i < something; i++)
which is going to choke any XML-based parser.
You can't change it to:
    for (var i=0; i &lt; something; i++)
because that will break any JavaScript interpreter.
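
A small Python sketch of the problem, using the expat binding from the standard library (`xml.parsers.expat`). The helper name `is_well_formed` and the three sample documents are made up for illustration. The last sample wraps the script in a CDATA section, which is the usual XML way of letting a literal `<` through; whether a given browser copes with that is a separate question.

```python
from xml.parsers import expat


def is_well_formed(document: str) -> bool:
    """Return True if expat parses the whole document without error."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True marks this as the final chunk
    except expat.ExpatError:
        return False
    return True


# The loop as it appears in ordinary HTML pages.
RAW = "<html><body><script>for (var i=0; i < n; i++) {}</script></body></html>"

# Same page with the comparison escaped; fine for XML, no longer plain JavaScript.
ESCAPED = RAW.replace(" < ", " &lt; ")

# Same page with the script body wrapped in a CDATA section.
CDATA = (RAW.replace("<script>", "<script><![CDATA[")
            .replace("</script>", "]]></script>"))

if __name__ == "__main__":
    for name, doc in (("raw", RAW), ("escaped", ESCAPED), ("cdata", CDATA)):
        print(name, is_well_formed(doc))
```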

If you really want to parse HTML, you need to pick out the
JavaScript on the fly.
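
To illustrate what picking out the JavaScript on the fly could look like, here is a rough Python sketch built on the standard library's `html.parser`, which tolerates unclosed tags and does not treat `<` inside a script element as markup. The class name `ScriptSplitter`, the helper `split_scripts` and the sample page are invented for this example, and this is not a shared-object parser, just the idea.

```python
from html.parser import HTMLParser


class ScriptSplitter(HTMLParser):
    """Collect script bodies separately from ordinary page text."""

    def __init__(self):
        super().__init__()
        self.scripts = []       # one string per <script> element
        self.text = []          # text chunks found outside scripts
        self._in_script = False
        self._chunks = []       # pieces of the script currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._chunks))
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            # Script content may arrive in several pieces; keep them all.
            self._chunks.append(data)
        elif data.strip():
            self.text.append(data.strip())


def split_scripts(html: str) -> ScriptSplitter:
    """Feed a whole page to a fresh ScriptSplitter and return it."""
    parser = ScriptSplitter()
    parser.feed(html)
    parser.close()
    return parser


# Deliberately sloppy page: unclosed <p> and <b>, raw "<" inside the script.
SAMPLE = ("<p>Hello <b>world"
          "<script>for (var i=0; i < n; i++) { total += i; }</script>"
          "<p>Bye")

if __name__ == "__main__":
    result = split_scripts(SAMPLE)
    print(result.scripts)
    print(result.text)
```

The script body comes out untouched, so it can be handed to something that understands JavaScript while the rest of the page goes through the HTML handling.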

> Rehm states that he needed HTML::TreeBuilder in only 2.7 % of all cases and
> that otherwise tidy was sufficient. This may be because the pages he
> sampled apparently worked better with tidy than the random sample I
> picked. (Rehm sampled pages from German educational institutions.)
>
> It looks like tidy on its own is not the solution, but it might be of some use.

No. Tidy as it works now is a recipe for disaster. There are a few
known exceptions, so you spend a lot of time writing code to get
around them. More and more exceptions pop up, requiring
more and more complex code, until the burden of maintaining all the
workarounds grows so large that it takes up all your time and the
project collapses.

The answer to the problem is not to use a program which
can take relatively good HTML and produce really good HTML.

The solution to the problem is to start with a parser that can handle
badly formed but serviceable HTML in the first place.

Which points out to me that we have gone far afield.

So back to my main question:
Anyone out there know of an HTML parser that comes as a shared
object/DLL?


 
