Andy Dingley
Given this badly-formed fragment, any suggestions on how best to parse
it?
[...]
<dc:title><Browse By Subject></dc:title>
[...]
The minimal problem is "unexpected < character at the beginning of
character data"
I don't know how it arises. I suspect it's a character string
containing "<" that isn't being escaped properly, although it might be
some crazy tag name getting squirted into the wrong end of the XML
generator. Anyway, it's the badly-formed output of a major blue-chip
dot-com and it's likely to stay that way. Our problem is how to chow
down on it, despite its bad formation. 8-(
It's not too important to preserve the content here. The good stuff is
elsewhere in the document, this is just grit in the way.
So, any suggestions on how best to abuse XML standards or tools and
get it parsed with minimum work?
I've wondered about hacking the parser to treat any whitespace as
closing a tag, or to discard start tags that aren't on a small known
list. I don't much like either, though. Most robust so far seems to be
a parser where "<dc:title>" becomes part of the syntax itself and has
special handling. Any better ideas?
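
For what it's worth, here is a rough sketch of that last idea done as a
text pre-filter in front of an ordinary parser, rather than as a
modified parser. It is Python, and only an illustration: the function
name fix_titles, the <rdf> wrapper and the namespace declaration in the
sample are my own assumptions, not anything taken from the real feed.
It also assumes that dc:title elements never nest and that "&" is
already escaped correctly inside them.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a dc:title body is everything up to the first </dc:title>.
# The start-tag pattern deliberately refuses self-closing <dc:title/>.
TITLE_RE = re.compile(
    r"(<dc:title(?:\s[^>]*)?(?<!/)>)(.*?)(</dc:title\s*>)",
    re.DOTALL,
)

def _escape_body(match):
    open_tag, body, close_tag = match.groups()
    # Treat the body as opaque text: neutralise any angle brackets in it.
    # "&" is left alone so existing entities are not double-escaped.
    body = body.replace("<", "&lt;").replace(">", "&gt;")
    return open_tag + body + close_tag

def fix_titles(text):
    """Return text with stray < and > inside dc:title elements escaped."""
    return TITLE_RE.sub(_escape_body, text)

# Hypothetical sample document; the real one is much larger.
sample = (
    '<rdf xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title><Browse By Subject></dc:title>"
    "<dc:subject>Good stuff</dc:subject>"
    "</rdf>"
)

root = ET.fromstring(fix_titles(sample))
title = root.find("{http://purl.org/dc/elements/1.1/}title")
print(title.text)
```

The attraction is that the parser itself stays standard; the hack lives
in one regular expression that can be thrown away if the feed ever gets
fixed. The obvious weakness is that it only repairs the one element it
knows about.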