XML looks deceptively simple. Maybe it actually is simple, as far as syntax
goes. So why the need for all these complicated libraries? And
what could xsltproc do to an XML file that would render it unreadable to a
simple parser?
To an actual XML parser: nothing.
The problem comes when people try to parse XML using a bunch of regexps
which were obtained through trial and error (i.e. testing them on some
sample XML files and tweaking them until they work on those test cases).
That approach often produces a "parser" which can't even handle whitespace
in any context where it didn't occur in the sample files.
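A minimal sketch of that failure mode (the regex and the documents are made
up, but the behaviour is typical): a pattern derived from one layout dies
the moment a start-tag spans two lines, while a real parser doesn't notice.

    import re
    import xml.etree.ElementTree as ET

    # A regex "parser" tweaked until it worked on one sample file.
    NAIVE_ITEM = re.compile(r'<item name="([^"]*)">')

    sample      = '<doc><item name="a">1</item></doc>'
    reformatted = '<doc>\n  <item\n      name="a">1</item>\n</doc>'

    # The regex only matches the exact layout it was tested against...
    print(NAIVE_ITEM.findall(sample))       # ['a']
    print(NAIVE_ITEM.findall(reformatted))  # [] -- a newline killed it

    # ...while a real parser treats the two documents as equivalent.
    for doc in (sample, reformatted):
        print([el.get("name") for el in ET.fromstring(doc).iter("item")])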
You've got start-tags, end-tags, and attributes; what else is there?
Just getting those right is apparently too hard for some people. For
example, attributes can appear in any order (many XML parsers store
attributes in an associative array, so the original order is unlikely to be
preserved), and wherever whitespace is allowed it can be any combination of
whitespace characters.
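To make that concrete (a made-up illustration): the two start-tags below
name the same element to any conforming parser, but a regex keyed to the
attribute order seen in the sample files only matches one of them.

    import re
    import xml.etree.ElementTree as ET

    a = '<img src="x.png" alt="x"/>'
    b = '<img alt="x" src="x.png"/>'  # same element, attributes swapped

    # A regex written against samples where src always came first:
    pattern = re.compile(r'<img src="([^"]*)" alt="([^"]*)"/>')
    print(bool(pattern.search(a)))  # True
    print(bool(pattern.search(b)))  # False -- order was never guaranteed

    # A real parser returns an associative structure; order is irrelevant.
    for doc in (a, b):
        el = ET.fromstring(doc)
        print(el.get("src"), el.get("alt"))  # x.png x, both times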
Unsupported character escapes, or minor things like that? Surely it would be
easier to add support for those than to struggle with someone else's
over-the-top implementation!
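For concreteness, "character escapes" means entity and character references,
which a conforming parser expands before the application ever sees the text.
A made-up example of what "adding support" would have to reproduce:

    import xml.etree.ElementTree as ET

    # The predefined entities plus decimal and hex character references:
    doc = '<msg>&quot;Fish &amp; chips&quot;: &#163;5, caf&#xE9; extra</msg>'
    print(ET.fromstring(doc).text)  # "Fish & chips": £5, café extra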
And what would you want to do to the file anyway? The data will obviously
only make sense to this specific application; if there's a problem with the
content, that is going to be a problem whatever library is used to read it.
A good example is performing "bulk" processing, e.g. a simple search and
replace in many files (where the original application requires a dozen
mouse clicks to load and save each file plus another half a dozen for each
individual change).
If the data is in XML, you just need to cook up an XSL transformation (or
similar), and then you can process all of the files with one command. Well,
unless the application's "XML" parser can't actually read anything other
than its own output, as you probably aren't going to find off-the-shelf
XML tools which offer the option of restricting their output to whatever
John Doe's pseudo-XML subset can read.
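As a sketch of the "(or similar)" route, here is the same kind of bulk edit
using Python's standard-library parser instead of XSLT (the element and file
names are made up): rename every <oldname> element to <newname> across all
files with one command.

    import glob
    import xml.etree.ElementTree as ET

    # Bulk search-and-replace: no loading, clicking, or saving by hand.
    for path in glob.glob("*.xml"):
        tree = ET.parse(path)
        for el in tree.iter("oldname"):
            el.tag = "newname"
        tree.write(path, encoding="utf-8", xml_declaration=True)

The xsltproc version is the same idea: an identity-transform stylesheet with
one overriding template, run in a shell loop over the files.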
On the plus side, most of the real XML parsers were written by people who
still have the scars from trying to deal with what either Netscape or
Microsoft thought "HTML" meant. Consequently, they don't attempt to be
fault-tolerant (fault tolerance may seem like a good idea in theory, but in
practice it means that every bug in a popular implementation ends up
redefining the de-facto standard until it's so complex that writing a parser
which can handle more than 50% of "HTML as deployed" is more work than the
Apollo program).
So at least we don't normally have to worry about the output being a
superset of the standard (if it doesn't conform, hardly anything will
parse it). We just have to worry about the hordes of strcmp-and-regexp
parsers turning the de-facto standard into an ever-shrinking subset of the
real thing.