XML Parser VS HTML Parser

ZOCOR · Oct 3, 2004

Hi

Can an XML parser be used to parse an HTML document, even if it is not
well-formed?

If the answer is yes to both, can you recommend a Java XML parser class
(from the standard API)?

Cheers

ZOCOR

Sudsy · Oct 3, 2004

ZOCOR said:
Hi

Can an XML parser be used to parse an HTML document, even if it is not
well-formed?

No; an XML parser will balk on a lot of HTML. It's not well-formed.

If the answer is yes to both, can you recommend a Java XML parser class
(from the standard API)?

Search the archives for alternate approaches.

[private] · Oct 3, 2004

ZOCOR said:
Can an XML parser be used to parse an HTML document, even if it is not
well-formed?

It can parse it as long as the HTML is well-formed. XML isn't as
relaxed as HTML, so any unclosed element will make the parser throw an
exception (probably org.xml.sax.SAXException, but can't verify right now).
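
For anyone who wants to check this for themselves, here is a minimal sketch using the JAXP classes that ship with the JDK. The class name WellFormedCheck and the helper parsesAsXml are made up for illustration; the sample strings are not from the original question. It only reports whether the parser accepts the input or rejects it with a SAXException (or a subclass).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class WellFormedCheck {

    /** Returns true if the markup parses as XML, false if the parser rejects it. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Not well-formed: the parser gives up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML-style <br> with no end tag: rejected.
        System.out.println(parsesAsXml("<p>one<br>two</p>"));
        // XHTML-style <br/>: accepted.
        System.out.println(parsesAsXml("<p>one<br/>two</p>"));
    }
}
```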

Martin Honnen · Oct 3, 2004

ZOCOR wrote:

Can an XML parser be used to parse an HTML document, even if it is not
well-formed?

No, an XML parser can't parse HTML, unless of course it is XHTML. But
HTML 3.2 or HTML 4.01 cannot be parsed with an XML parser.

Darryl L. Pierce · Oct 3, 2004

ZOCOR said:
Can an XML parser be used to parse an HTML document, even if it is not
well-formed?

A SAX or DOM parser will throw exceptions on data that's not well-formed.
So, the answer is no, it cannot.

--
/**
* @author Darryl L. Pierce <[email protected]>
* @see The Infobahn Offramp <http://mcpierce.mypage.org>
* @quote "Lobby, lobby, lobby, lobby, lobby, lobby..." - Adrian Monk
*/

Tor Iver Wilhelmsen · Oct 3, 2004

It can parse it as long as the HTML is well-formed.

Except for XHTML, HTML cannot be assumed to be well-formed since HTML
does not "end" empty elements properly; they are only empty by
implication, like <br>.

Also, real-world HTML is packed full of implicit begin and end tags a
parser needs to be aware of.

CarlosRivera · Oct 3, 2004

You could use Tidy or a similar tool to turn the HTML into XHTML and then
use an XML parser on the result.

ZOCOR · Oct 4, 2004

Darryl L. Pierce said:
A SAX or DOM parser will throw exceptions on data that's not well-formed.
So, the answer is no, it cannot.

Well, can't I catch the exceptions so that processing can continue?

What's the problem?

ZOCOR

Tor Iver Wilhelmsen · Oct 4, 2004

ZOCOR said:
What's the problem?

<br> and the like, which are (implicitly) empty elements that a SAX
parser will not report an end element for, since they are start tags
for containing elements as far as the parser knows.

So you need to add a bunch of logic that handles optional start
elements, implicit end elements, and non-terminated empty elements.

But, hey, if you don't consider that a problem...
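
To make the "catch the exception and carry on" idea concrete, here is a rough sketch with the JDK's SAX parser. The class name SaxStopsAtFirstError, the events helper, and the sample markup are all invented for illustration. The handler just records element events; a well-formedness error is fatal for an XML parser, so once it is thrown the parse is over and catching it only tells you where it stopped.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
class SaxStopsAtFirstError {

    /** Parses the markup with SAX and returns the events seen before parsing ended. */
    static List<String> events(String markup) {
        List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                seen.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                seen.add("end:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), handler);
        } catch (SAXException e) {
            // Caught, but the parser cannot resume from here.
            seen.add("fatal");
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        // The <br> is treated as an open container, so </p> is a mismatch.
        // The <b> element that follows is never reported to the handler.
        System.out.println(events("<div><p>one<br>two</p><b>wanted</b></div>"));
    }
}
```

So even if the tags you care about are properly closed, anything that comes after the first unclosed `<br>` (or similar) never reaches your handler.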

ZOCOR · Oct 4, 2004

What's the problem?

<br> and the like, which are (implicitly) empty elements that a SAX
parser will not report an end element for, since they are start tags
for containing elements as far as the parser knows.

So you need to add a bunch of logic that handles optional start
elements, implicit end elements, and non-terminated empty elements.

But, hey, if you don't consider that a problem...

Well, I'm only after specific text contained in certain tags, which
fortunately do have end tags. As for the other tags, I couldn't give 2
rats about them.

ZOCOR

Brusque · Oct 4, 2004

ZOCOR said:
Hi

Can an XML parser be used to parse an HTML document, even if it is not
well-formed?

If the answer is yes to both, can you recommend a Java XML parser class
(from the standard API)?

Cheers

ZOCOR

Never used it myself, but maybe this is worth a try:
http://www.apache.org/~andyc/neko/doc/html/

Paul King · Oct 5, 2004

Brusque said:
Never used it myself, but maybe this is worth a try:
http://www.apache.org/~andyc/neko/doc/html/

CyberNeko HTML Parser (above link) works well in my experience. If that
doesn't suit, you might like to try TagSoup (which also works well):
http://mercury.ccil.org/~cowan/XML/tagsoup/

If you find them too heavyweight, regex might be what you are after.

Cheers, Paul.
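
A rough sketch of the regex route, using only java.util.regex from the standard API. The class name TagTextExtractor, the textOf helper, and the sample markup are invented for illustration. It assumes the tags of interest have explicit end tags and are not nested inside themselves; it will get confused by comments, scripts, or attribute values containing ">", so treat it as a quick hack rather than a parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TagTextExtractor {

    /** Returns the raw content of every <tag ...>...</tag> pair found in the HTML. */
    static List<String> textOf(String html, String tag) {
        String name = Pattern.quote(tag);
        // \b keeps "b" from matching "<br>" or "<body>"; .*? stops at the first end tag.
        Pattern pattern = Pattern.compile(
                "<" + name + "\\b[^>]*>(.*?)</" + name + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>Hello</TITLE></head>"
                + "<body><p>one<br>two<p>three<b class=\"k\">wanted</b></body></html>";
        System.out.println(textOf(html, "title"));
        System.out.println(textOf(html, "b"));
    }
}
```

The unclosed `<br>` and `<p>` tags in the sample are simply ignored, which is the whole appeal when you only care about a few tags.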
