Hacks for parsing non-well-formed XML?


Andy Dingley

Given this badly-formed fragment, any suggestions on how best to parse
it?

[...]
<dc:title><Browse By Subject></dc:title>
[...]

The minimal problem is "unexpected < character at the beginning of
character data"

I don't know how it arises. I suspect that it's a character string
containing "<" that isn't being escaped properly, although it might be
some crazy tag name getting squirted into the wrong end of the XML
generator. Anyway, it's the badly-formed output of a major blue-chip
dot-com and it's likely to stay that way. Our problem is how to chow
down on it, despite its bad formation. 8-(

It's not too important to preserve the content here. The good stuff is
elsewhere in the document, this is just grit in the way.

So, any suggestions on how best to abuse XML standards or tools and
get it parsed with minimum work?

I've wondered about hacking to recognise tag closure as being
triggered by any whitespace, or by discarding starttags that aren't
from a small known list. I don't much like either though. Most robust
so far seems to be a parser where "<dc:title>" becomes part of the
syntax itself and has special handling. Any better ideas?
 

Richard Tobin

Andy Dingley said:
Given this badly-formed fragment, any suggestions on how best to parse
it?
<dc:title><Browse By Subject></dc:title>
[...]

I've wondered about hacking to recognise tag closure as being
triggered by any whitespace, or by discarding starttags that aren't
from a small known list.

You could make a pass through to determine probably-legal element
names, by looking for end tags. "</Browse" is much less likely to
occur than "<Browse". Then escape less-thans that don't precede an
element name for which you found a plausible end tag. Empty tags
are less clear cut, but you could probably find a 99% solution.

-- Richard
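
A rough sketch of that two-pass idea, for anyone who wants to try it. Python is an assumption here (the thread names no language), and `END_TAG`, `OPEN_ANGLE` and `escape_stray_less_thans` are made-up names. The name pattern is a simplified ASCII approximation of an XML name, not the full grammar.

```python
import re

# Pass 1 pattern: the name inside an end tag such as </dc:title>.
END_TAG = re.compile(r"</([A-Za-z_][\w:.-]*)\s*>")

# Pass 2 pattern: any '<' that does not open an end tag, a comment,
# a declaration or a processing instruction, plus the name after it (if any).
OPEN_ANGLE = re.compile(r"<(?![/?!])([A-Za-z_][\w:.-]*)?")

def escape_stray_less_thans(text):
    """Escape each '<' that is not followed by a name seen in some end tag."""
    known = set(END_TAG.findall(text))  # probably-legal element names

    def fix(match):
        name = match.group(1)
        if name in known:
            return match.group(0)      # plausible start tag, keep as is
        return "&lt;" + (name or "")   # stray '<', escape it

    return OPEN_ANGLE.sub(fix, text)
```

As Richard says, empty tags are the weak spot: something like `<item/>` has no end tag, so this sketch would escape it unless its name also appears in an end tag elsewhere. A "<" inside a CDATA section or comment would be escaped too, so treat it as the 99% solution at best.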
 

Joe Kesselman

Andy said:
Given this badly-formed fragment, any suggestions on how best to parse
it?

Best suggestions I've got are:

1) XML tools won't touch this. Write a text-processing layer which finds
and fixes these abuses before even thinking about it as XML. It's going
to be messy, fragile, ad-hoc programming.

2) Fix the code that generates it. Seriously. This is going to be an
ongoing hassle, and cost, until you do.
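
One way to keep the text-processing layer from option 1 contained is to run it only when a strict parse fails. This is a minimal sketch, assuming Python and the standard `xml.etree.ElementTree` parser; `parse_with_repairs` and `drop_known_junk` are hypothetical names, and the repair shown only handles the one string quoted in this thread.

```python
import xml.etree.ElementTree as ET

def parse_with_repairs(text, repairs):
    """Try a strict parse first; apply repairs one by one only if it fails."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        pass  # not well-formed, fall through to the ad-hoc fixes
    for repair in repairs:
        text = repair(text)  # repairs accumulate in the order given
        try:
            return ET.fromstring(text)
        except ET.ParseError:
            continue
    raise ValueError("still not well-formed after all repairs")

def drop_known_junk(text):
    # Hypothetical repair: remove the one offending string seen so far.
    return text.replace("<Browse By Subject>", "")
```

Well-formed input never touches the repair code, and a new kind of breakage shows up as an exception rather than as silently mangled data.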
 

Andy Dingley

Joe Kesselman said:
2) Fix the code that generates it. Seriously. This is going to be an
ongoing hassle, and cost, until you do.

It's! a! big! famous! dotcom! not! my! own! code!
(Can you guess who it is yet?)

Do You Snafu! :cool:
 

Simon Brooke

Andy said:
Given this badly-formed fragment, any suggestions on how best to parse
it?

[...]
<dc:title><Browse By Subject></dc:title>
[...]

The minimal problem is "unexpected < character at the beginning of
character data"

sed 's/<Browse By Subject>//'

There's no particular reason why you shouldn't use old and proven text
manipulation tools on XML.
 

Joseph Kesselman

Andy said:
It's! a! big! famous! dotcom! not! my! own! code!

Talk! To! Them! About! It!.

Though you may find that this is a deliberate poison-pill to prevent
unauthorized folks mining their servers... in which case you should
probably be talking to them about getting more official access, since
they're probably changing the poison on a regular basis and anything you
attempt to do to bypass it is likely to break again in a few weeks.
 

Peter Flynn

Andy said:
It's! a! big! famous! dotcom! not! my! own! code!
(Can you guess who it is yet?)

Do You Snafu! :cool:

Nevertheless, charge them extra and mark it on the invoice as overhead
for manual handling of non-XML material. If they're that big, they'll
pay, and if they're that stupid, they'll continue to pay you rather than
fix the bug.

///Peter
 

Andy Dingley

Joseph Kesselman said:
Though you may find that this is a deliberate poison-pill to prevent
unauthorized folks mining their servers...

Oh, I _wish_ they were that smart.

Just to clarify, it's a public interface to their services that they
encourage(sic) the use of. The likelihood of them fixing it is on the
avian-pig scale. It's also not a static string, so any sed-ing would
need a slightly more sophisticated regex to work on it, although it's
entirely viable. Sadly it's also an embedded app, so Unix tools just
aren't present. A similar pre-processor approach seems best though,
rather than frobbing a parser.
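
For the non-static case, a pre-processor along these lines might do. It is only a sketch: Python is an assumption (the embedded platform isn't named in the thread), `TITLE` and `neutralise_titles` are invented names, and it assumes `<dc:title>` carries no attributes and is only ever meant to hold plain text.

```python
import re

# Everything between a <dc:title> start tag and the next </dc:title>.
TITLE = re.compile(r"(<dc:title>)(.*?)(</dc:title>)", re.DOTALL)

def neutralise_titles(text):
    """Escape angle brackets inside every dc:title, whatever the content."""
    def fix(match):
        # Only '<' and '>' are touched, so existing entity references
        # such as &amp; are not double-escaped.
        body = match.group(2).replace("<", "&lt;").replace(">", "&gt;")
        return match.group(1) + body + match.group(3)

    return TITLE.sub(fix, text)
```

Since the title content doesn't need preserving, returning `match.group(1) + match.group(3)` instead would simply empty the element. The same pattern should carry over to any regex library with non-greedy matching.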

Thanks for all your suggestions.
 
