Hacks for parsing non-well-formed XML?

Andy Dingley · Mar 16, 2007

Given this badly-formed fragment, any suggestions on how best to parse
it?

[...]
<dc:title><Browse By Subject></dc:title>
[...]

The minimal problem is "unexpected < character at the beginning of
character data"

I don't know how it arises. I suspect that it's a character string
with "<" in it that isn't being encoded properly, although it might be
some crazy tag-name getting squirted into the wrong end of the XML
generator. Anyway, it's the badly-formed output of a major bluechip
dot-com and it's likely to stay that way. Our problem is how to chow
down on it, despite its bad formation. 8-(

It's not too important to preserve the content here. The good stuff is
elsewhere in the document; this is just grit in the way.

So, any suggestions on how best to abuse XML standards or tools and
get it parsed with minimum work?

I've wondered about hacking to recognise tag closure as being
triggered by any whitespace, or by discarding starttags that aren't
from a small known list. I don't much like either though. Most robust
so far seems to be a parser where "<dc:title>" becomes part of the
syntax itself and has special handling. Any better ideas?

Richard Tobin · Mar 16, 2007

Andy Dingley said:
Given this badly-formed fragment, any suggestions on how best to parse
it?

<dc:title><Browse By Subject></dc:title>
[...]

I've wondered about hacking to recognise tag closure as being
triggered by any whitespace, or by discarding starttags that aren't
from a small known list.

You could make a pass through to determine probably-legal element
names, by looking for end tags. "</Browse" is much less likely to
occur than "<Browse". Then escape less-thans that don't precede an
element name for which you found a plausible end tag. Empty tags
are less clear cut, but you could probably find a 99% solution.

-- Richard
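
Richard's two-pass idea can be sketched roughly as follows. This is only
an illustration under stated assumptions, not code from the thread: the
function name, the regular expressions and the sample record are all made
up for the example, the name pattern is a loose approximation of an XML
name, and the empty-tag check is a heuristic. It also ignores "<" inside
comments, CDATA sections and attribute values.

```python
import re
import xml.etree.ElementTree as ET

# Rough approximation of an XML name (ASCII start, then word chars . - :).
NAME = r"[A-Za-z_][\w.\-:]*"

END_TAG = re.compile(r"</(" + NAME + r")\s*>")
# Heuristic for empty-element tags such as <br/> or <img src="x"/>.
EMPTY_TAG = re.compile(r"<(" + NAME + r")(?:\s[^<>]*)?/>")
# Any "<" that does not open an end tag, PI, comment, CDATA or DOCTYPE.
LESS_THAN = re.compile(r"<(?![/?!])(" + NAME + r")?")


def escape_stray_less_thans(text):
    # Pass 1: collect names that look like real elements.
    known = set(END_TAG.findall(text)) | set(EMPTY_TAG.findall(text))

    # Pass 2: keep "<name" for known names, escape every other "<".
    def fix(match):
        name = match.group(1)
        if name in known:
            return match.group(0)
        return "&lt;" + (name or "")

    return LESS_THAN.sub(fix, text)


# Hypothetical sample record built around the fragment from the thread.
bad = (
    '<rec xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title><Browse By Subject></dc:title><id>42</id></rec>"
)
root = ET.fromstring(escape_stray_less_thans(bad))
print(root.find("id").text)
```

The stray ">" left behind in the title is legal in character data, so
only the "<" needs escaping for the parser to accept the result.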

Joe Kesselman · Mar 16, 2007

Andy said:
Given this badly-formed fragment, any suggestions on how best to parse
it?

Best suggestions I've got are:

1) XML tools won't touch this. Write a text-processing layer which finds
and fixes these abuses before even thinking about it as XML. It's going
to be messy, fragile, ad-hoc programming.

2) Fix the code that generates it. Seriously. This is going to be an
ongoing hassle, and cost, until you do.

Andy Dingley · Mar 16, 2007

2) Fix the code that generates it. Seriously. This is going to be an
ongoing hassle, and cost, until you do.

It's! a! big! famous! dotcom! not! my! own! code!
(Can you guess who it is yet?)

Do You Snafu!

Simon Brooke · Mar 16, 2007

Andy said:
Given this badly-formed fragment, any suggestions on how best to parse
it?

[...]
<dc:title><Browse By Subject></dc:title>
[...]

The minimal problem is "unexpected < character at the beginning of
character data"

sed 's/<Browse By Subject>//'

There's no particular reason why you shouldn't use old and proven text
manipulation tools on XML.

Joseph Kesselman · Mar 16, 2007

It's! a! big! famous! dotcom! not! my! own! code!

Talk! To! Them! About! It!

Though you may find that this is a deliberate poison-pill to prevent
unauthorized folks mining their servers... in which case you should
probably be talking to them about getting more official access, since
they're probably changing the poison on a regular basis and anything you
attempt to do to bypass it is likely to break again in a few weeks.

Peter Flynn · Mar 16, 2007

Andy said:
It's! a! big! famous! dotcom! not! my! own! code!
(Can you guess who it is yet?)

Do You Snafu!

Nevertheless, charge them extra and mark it on the invoice as overhead
for manual handling of non-XML material. If they're that big, they'll
pay, and if they're that stupid, they'll continue to pay you rather than
fix the bug.

///Peter

Andy Dingley · Mar 19, 2007

Though you may find that this is a deliberate poison-pill to prevent
unauthorized folks mining their servers...

Oh, I _wish_ they were that smart.

Just to clarify, it's a public interface to their services that they
encourage (sic) the use of. The likelihood of them fixing it is on the
avian-pig scale. It's also not a static string, so any sed-ing would
need a slightly more sophisticated regex to work on it, although that's
entirely viable. Sadly it's also an embedded app, so Unix tools just
aren't present. A similar pre-processor approach seems best though,
rather than frobbing a parser.
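
A pre-processor of that kind might look something like the sketch below.
It is written in Python purely for illustration (the embedded target may
well need the same logic in another language), and it assumes things the
thread doesn't confirm: that the damage is confined to <dc:title>, that
the start tag carries no attributes, and that the element holds only
text. The function name and sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumes an attribute-free <dc:title> start tag and text-only content.
TITLE = re.compile(r"(<dc:title>)(.*?)(</dc:title>)", re.DOTALL)


def escape_title_text(text):
    # Escape every "<" found between the title's start and end tags.
    # "&" is left alone so existing entities are not double-escaped.
    def fix(match):
        body = match.group(2).replace("<", "&lt;")
        return match.group(1) + body + match.group(3)

    return TITLE.sub(fix, text)


# Hypothetical sample with one damaged and one healthy title.
doc = (
    '<r xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title><Browse By Subject></dc:title>"
    "<dc:title>Plain</dc:title></r>"
)
clean = escape_title_text(doc)
tree = ET.fromstring(clean)
```

Since the title content isn't needed here, replacing the body with an
empty string instead of escaping it would work just as well.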

Thanks for all your suggestions.
