parsing non-well-formed XML (SAX)

Timo Nentwig

Hi!

I need to parse multi-megabyte "XML" files which are not well-formed, i.e.
there are plenty of <TAGS> in there instead of <TAGS />. I'm also not
sure about case sensitivity.

Any ready-to-use solutions? :)

Timo
 
Andy Fish

Well, I shouldn't think there are any XML parsers you can use.

The trouble with non-well-formed documents is that only you know which
kinds of non-well-formedness are acceptable and how to interpret them. After
all, any piece of data that is not a well-formed XML document is, in a
sense, a badly formed XML document!

So the key to a successful solution is to write down your definition
of a valid input document. Only once you have done this can you evaluate
different approaches.

If there are only a few well-known kinds of badly formed tags, you could
pre-process the input first to generate real XML. E.g. say you knew that the
TAGS element could never have any content but might be missing the end-tag
delimiter (like <br> in HTML); then it would be easy to pick up and fix.
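To illustrate the pre-processing idea, here is a rough sketch in Python (chosen only for brevity; the same approach works in front of any SAX parser). It assumes that the listed elements are always empty, never carry a separate end tag, and never contain ">" inside an attribute value. The names EMPTY_TAGS, close_empty_tags, NameCollector and parse_names are made up for this example.

```python
import re
import xml.sax

# Hypothetical list: elements assumed to be always empty in the input.
EMPTY_TAGS = ("TAGS",)

# Matches <TAGS>, <TAGS ...> and <TAGS />, but not </TAGS> or <TAGSX>.
# IGNORECASE is an assumption, since the casing of the input is uncertain.
_EMPTY_TAG_RE = re.compile(
    r"<(%s)\b([^<>]*?)/?>" % "|".join(map(re.escape, EMPTY_TAGS)),
    re.IGNORECASE,
)


def close_empty_tags(text):
    """Rewrite every known-empty start tag as a self-closing tag."""
    return _EMPTY_TAG_RE.sub(
        lambda m: "<%s%s/>" % (m.group(1), m.group(2).rstrip()), text
    )


class NameCollector(xml.sax.ContentHandler):
    """Toy SAX handler that records element names in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_names(text):
    """Repair the input, then hand it to the ordinary SAX parser."""
    handler = NameCollector()
    xml.sax.parseString(close_empty_tags(text).encode("utf-8"), handler)
    return handler.names
```

The repaired text keeps whatever casing the input used, so if the same element shows up as both <TAGS> and <tags> the handler would still have to normalise the names itself. For very large files the same substitution could be applied chunk by chunk, provided no tag is split across a chunk boundary.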

Failing that, ANTLR is a well-known parser generator which would be a
building block on the way to writing your own parser.
 
