yawnmoth said:
XSL stylesheets can be used to convert an XML file into whatever
binary format you want (DocBook, for example, does PDFs). My
question is... what if you wanted to go in the other direction? (e.g.,
convert a PDF to DocBook) Could you do that with existing XML
utilities or would you have to write your own program to do that?
In the DocBook case, the XSL stylesheets actually turn the XML into
XSL-FO, which is a specific XML vocabulary that a separate formatter
then turns into PDF. That is why you need Apache FOP or XEP or some
other FO processor.
XSLT on its own only does XML-to-XML, XML-to-text, and XML-to-HTML.
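You can go from some text-based formats into XML using XSLT 2.0's
unparsed-text() function, which reads a URI-addressable resource into a
string that the stylesheet can then split up and wrap in elements, but
that's pretty much it. It won't help with a binary format like PDF.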
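To make the "text in, XML out" idea concrete, here is a rough sketch of the same approach in Python rather than XSLT: read the resource as one string, split it into lines and fields, and wrap each piece in an element. The function name, the element names (rows, row, field) and the comma separator are all made up for illustration; this is not what unparsed-text() does internally, just the shape of the job you would do with the string it returns.

```python
import xml.etree.ElementTree as ET


def lines_to_xml(text, root_name="rows", row_name="row", sep=","):
    """Wrap delimited plain text in XML elements (hypothetical element names)."""
    root = ET.Element(root_name)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = ET.SubElement(root, row_name)
        for value in line.split(sep):
            field = ET.SubElement(row, "field")
            field.text = value.strip()
    return root


if __name__ == "__main__":
    sample = "alpha, 1\nbeta, 2\n"
    print(ET.tostring(lines_to_xml(sample), encoding="unicode"))
```

This only works because the input is already text with a regular structure. Nothing here would cope with a binary format.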
There are products which will convert from non-XML to XML; we sell some,
other companies sell others, and some are open source.
PDF to XML is hard, since the data isn't always rendered in the order
in which it went in. Each piece of text is stored pretty much as x, y,
value (simplifying a lot!), and the x's and y's aren't necessarily
sorted. There is a Java library called PDFBox which is open source and
has some useful tools for parsing PDFs; there was another company that
specialized in it somewhere, but just Google for "pdf to xml" and see
what you find.
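To show why the unsorted x, y, value storage matters, here is a minimal sketch in Python of putting such fragments back into reading order. It assumes you have already extracted (x, y, text) tuples with some tool, that y grows downward (native PDF coordinates have the origin at the bottom left with y growing upward, so you would flip them first), and that fragments within a couple of units vertically belong to the same line. The function name and the tolerance value are hypothetical, and real documents with columns, tables, or rotated text need much more than this.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right.

    Assumes y grows downward; line_tolerance is a made-up threshold.
    """
    # Sort by vertical position first, then horizontal position.
    ordered = sorted(fragments, key=lambda f: (f[1], f[0]))

    lines = []
    current = []
    current_y = None
    for x, y, text in ordered:
        # Start a new line when the fragment is too far below the line's first y.
        if current_y is None or abs(y - current_y) > line_tolerance:
            if current:
                lines.append(current)
            current = []
            current_y = y
        current.append((x, text))
    if current:
        lines.append(current)

    # Within each line, order fragments left to right and join them.
    return [
        " ".join(text for _, text in sorted(line, key=lambda item: item[0]))
        for line in lines
    ]


if __name__ == "__main__":
    scrambled = [
        (60, 30.4, "line"),
        (50, 10.5, "world"),
        (10, 30, "Second"),
        (10, 10, "Hello"),
    ]
    for line in reading_order(scrambled):
        print(line)
```

Even once the text is back in order, you only have lines of text, not structure. Deciding what is a heading, a paragraph, or a table cell is the part that makes PDF to DocBook hard.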