Ragnar said:
How can I transform this source-xml into valid xml without using the
workaround of getting an HTML-output?
Find some non-Tidy, Tidy-like XML tool? Maybe write one for your
specific task?
Tidy uses an approximation of an SGML parser and a tag-soup strainer to
take "approximate HTML", turn it into the best-guess internal
(DOM-like) model of the intended page, then serialise it accurately.
This relies on three things that you don't have available:
* SGML parsing (omitted tags can often be inferred cleanly)
* A known HTML DTD
* Fix-up code outside the SGML parser that has assumed HTML-soup
behaviours coded explicitly into it.
If your problem is "bad XML" that isn't even approximating HTML, then I
sympathise, but Tidy has three of its hands tied.
Why is your bad XML bad? What's the problem? Can you build some specific
tool that fixes some specific problem? Even if it has to work with
simple text-file processing and can't support more than one encoding,
it might be enough.
I've done a lot of work with RSS, which is only approximate XML at best
and often significantly invalid. Typically it includes HTML entity
references (e.g. &eacute;) that aren't part of XML. It's not too hard to
scan the whole document with a crude entity reference expander that can
map these (from a known list) onto the numeric form. I usually try to
XML-parse the feed first; if this fails, I check for the presence of such
entities, convert them and then attempt to re-parse.
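
A rough Python sketch of that parse, fix up, re-parse loop might look
like the following. The function names are made up for illustration, and
the "known list" is assumed here to be the name2codepoint table from
Python's standard html.entities module. The scan is deliberately crude:
it would also rewrite entity-like text inside CDATA sections, which may
or may not matter for your data.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities that XML itself predefines; leave these untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named references such as &eacute; (not numeric ones like &#233;).
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_html_entities(text):
    """Map known HTML named entities onto numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Not ours to fix: keep the reference exactly as written.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(replace, text)


def parse_lenient(text):
    """Try a strict XML parse; on failure, expand entities and retry once."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        if not ENTITY_RE.search(text):
            raise  # broken for some other reason; nothing to convert
        return ET.fromstring(expand_html_entities(text))


if __name__ == "__main__":
    feed = "<rss><title>Caf&eacute; &amp; bar</title></rss>"
    print(parse_lenient(feed).find("title").text)
```

If the second parse also fails, the exception propagates, which is
usually what you want: the document is broken in some way that entity
expansion alone can't repair.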