converting stuff to xml files?

yawnmoth

XSL stylesheets can be used to convert an XML file into whatever
binary format you want (DocBook, for example, does PDF's). My
question is... what if you wanted to go in the other direction? (eg.
convert a PDF to DocBook) Could you do that with existing XML
utilities or would you have to write your own program to do that?
 
Tony Lavinio

yawnmoth said:
XSL stylesheets can be used to convert an XML file into whatever
binary format you want (DocBook, for example, does PDF's). My
question is... what if you wanted to go in the other direction? (eg.
convert a PDF to DocBook) Could you do that with existing XML
utilities or would you have to write your own program to do that?

In this case, XSL stylesheets actually turn XML to XSL-FO, which is a
specific type of XML that post-processors then turn into PDF's. That
is why you need Apache FOP or XEP or something else.

XSL only does XML-to-XML, XML-to-text, and XML-to-HTML.

You can go from some formats into XML, using XSLT 2.0's unparsed-text()
function, which reads a URI-addressable resource into a string, but
that's pretty much it.

There are products which will convert from non-XML to XML; we sell some,
other companies sell others, and some are open source.

PDF to XML is hard, since the data isn't always rendered in the order
in which it went in. Each piece of text is stored pretty much as x, y,
value (simplifying a lot!), and the x's and y's aren't necessarily
sorted. There is an open-source Java library called PDFBox which
has some useful tools for parsing PDF's; there was another company that
specialized in it somewhere, but just Google for "pdf to xml" and see
what you find.
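
To illustrate the ordering problem, here is a minimal sketch in Python. It assumes a heavily simplified model where each piece of text is just an (x, y, text) tuple; the `reading_order` function and its `line_tolerance` parameter are made up for this example and are not part of PDFBox or any other tool. Real PDFs also involve fonts, rotation, columns and so on, which this ignores.

```python
from typing import List, Tuple

# Hypothetical simplified model of a PDF text fragment: (x, y, text).
Fragment = Tuple[float, float, str]


def reading_order(fragments: List[Fragment], line_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines by y, then order each line left to right by x."""
    # PDF y coordinates grow upward, so the top of the page has the largest y.
    top_down = sorted(fragments, key=lambda f: -f[1])
    lines: List[List[Fragment]] = []
    for frag in top_down:
        # Fragments whose y is close to the current line's first fragment
        # are treated as belonging to the same line.
        if lines and abs(lines[-1][0][1] - frag[1]) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [
        " ".join(text for _, _, text in sorted(line, key=lambda f: f[0]))
        for line in lines
    ]


# Fragments stored in an arbitrary order, as they might be in the file.
fragments = [
    (120.0, 700.0, "world"),
    (72.0, 650.0, "Second"),
    (72.0, 700.5, "Hello"),
    (110.0, 650.0, "line"),
]
print(reading_order(fragments))  # ['Hello world', 'Second line']
```

Even this toy version needs a tolerance, because fragments on the same visual line do not always share exactly the same y.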
 
Dimitre Novatchev

yawnmoth said:
XSL stylesheets can be used to convert an XML file into whatever
binary format you want (DocBook, for example, does PDF's). My
question is... what if you wanted to go in the other direction? (eg.
convert a PDF to DocBook) Could you do that with existing XML
utilities or would you have to write your own program to do that?

Using the unparsed-text() function one can read and process any text file
with XSLT 2.0.

See for example the JSON to XML converter[1] (the FXSL function
f:json-document), which uses the LR-Parsing Framework[2] of FXSL[3] (all
completely written in pure XSLT 2.0).

Given an LR(1) grammar of a language one can produce a language processor in
pure XSLT 2.0 in a straightforward manner.
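
For comparison, the same kind of JSON-to-XML conversion can be sketched outside XSLT. The Python below is not FXSL's f:json-document and does not use an LR parser; it leans on the standard `json` module for parsing. The output vocabulary (`json`, `member`, `item`, and the `type` and `name` attributes) is an assumption made up for this sketch.

```python
import json
import xml.etree.ElementTree as ET


def json_to_xml(value, tag="json"):
    """Convert a parsed JSON value into an ElementTree element.

    The element and attribute names are illustrative only.
    """
    elem = ET.Element(tag)
    if isinstance(value, dict):
        elem.set("type", "object")
        for key, child in value.items():
            member = json_to_xml(child, "member")
            member.set("name", key)
            elem.append(member)
    elif isinstance(value, list):
        elem.set("type", "array")
        for child in value:
            elem.append(json_to_xml(child, "item"))
    elif isinstance(value, bool):
        # bool must be tested before numbers, since bool is a subclass of int.
        elem.set("type", "boolean")
        elem.text = "true" if value else "false"
    elif value is None:
        elem.set("type", "null")
    elif isinstance(value, (int, float)):
        elem.set("type", "number")
        elem.text = json.dumps(value)
    else:
        elem.set("type", "string")
        elem.text = str(value)
    return elem


doc = json_to_xml(json.loads('{"title": "Guide", "pages": [1, 2]}'))
print(ET.tostring(doc, encoding="unicode"))
```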

Cheers,
Dimitre Novatchev

1.
http://fxsl.cvs.sourceforge.net/fxsl/fxsl-xslt2/f/func-json-document.xsl?view=markup&sortby=date

2.
http://fxsl.cvs.sourceforge.net/fxsl/fxsl-xslt2/f/func-lrParse.xsl?view=markup&sortby=date

3. http://fxsl.sf.net
 
yawnmoth

Tony Lavinio said:
In this case, XSL stylesheets actually turn XML to XSL-FO, which is a
specific type of XML that post-processors then turn into PDF's. That
is why you need Apache FOP or XEP or something else.

XSL only does XML-to-XML, XML-to-text, and XML-to-HTML.

Binary files kinda are text files. Sure, the average text file might
not contain null bytes, but who's to say one can't?
 
Kenneth Porter

Sure, the average text file might
not contain null bytes, but who's to say one can't?

Text is printable. How do you print a null byte?

OTOH, text must be encoded. Certain encodings do, in fact, contain null
bytes.
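
A quick way to see this, sketched in Python with the standard codecs (the sample string is arbitrary):

```python
text = "Hi"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # little-endian, no byte-order mark
utf32 = text.encode("utf-32-le")

print(utf8)   # b'Hi'
print(utf16)  # b'H\x00i\x00'

# The same two printable characters contain null bytes in some encodings
# and none in others.
print(b"\x00" in utf8, b"\x00" in utf16, b"\x00" in utf32)  # False True True
```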
 
Kenneth Porter

what if you wanted to go in the other direction? (eg.
convert a PDF to DocBook) Could you do that with existing XML
utilities or would you have to write your own program to do that?

How do you get toothpaste back into a tube? How do you get milk back into a
cow? Certain transformations are straightforward, while others may be
impossible.

To convert from PDF, you need to completely specify the transform rules.
You need to have a good understanding of the PDF format, including all the
odd cases that inevitably get used by some PDF-writing tool. (Check out the
kinds of HTML that the many versions of Office generate to see how badly a
tool can abuse a format.)
 
Ken Starks

yawnmoth said:
XSL stylesheets can be used to convert an XML file into whatever
binary format you want (DocBook, for example, does PDF's). My
question is... what if you wanted to go in the other direction? (eg.
convert a PDF to DocBook) Could you do that with existing XML
utilities or would you have to write your own program to do that?


Adobe have an experimental project, called "Mars", for what they call
an "XML-friendly" format for PDF.

See more at:
http://labs.adobe.com/technologies/mars/
 
Andy Dingley

XSL stylesheets can be used to convert an XML file into whatever
binary format you want (DocBook, for example, does PDF's). My
question is... what if you wanted to go in the other direction?

2nd law of thermodynamics (wiki it) applies.
You can always "lose" information, but you can't re-generate it.

"Information" can mean content or structure equally well here. Turning
text in different XML elements into plain rendered text (such as a
bitmap or PDF) is "lossy", because you lose the knowledge of which
class of element it came from.

This applies to a "closed system", so sticking information or hints
back onto it from outside counts as "cheating" :cool: It's also hard to
do, very hard if you're talking about a production-grade bulk system.


So, the upshot of all this is: keep your content in a semantically-
rich, structure-preserving format for as long as possible. Transform
it into "simple" presentation formats at the very last moment.
Investigate ways to keep the semantics intact, even when published to
these simple formats, e.g. HTML might lose the XML element names in
favour of making everything a <div>, but you can still preserve that
information by adding suitable class attributes.
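
As a rough sketch of the class-attribute idea, in Python with the standard library: every element becomes a <div> whose class is the original element name, and that is enough to rebuild the original names afterwards. The `to_divs` and `from_divs` helpers and the sample document are invented for this example, and attributes on the source elements are ignored to keep it short.

```python
import xml.etree.ElementTree as ET


def to_divs(elem):
    """Turn every element into a <div>, keeping the old name as the class."""
    div = ET.Element("div", {"class": elem.tag})
    div.text = elem.text
    div.tail = elem.tail
    for child in elem:
        div.append(to_divs(child))
    return div


def from_divs(div):
    """Rebuild the original element names from the class attributes."""
    elem = ET.Element(div.get("class"))
    elem.text = div.text
    elem.tail = div.tail
    for child in div:
        elem.append(from_divs(child))
    return elem


source = ET.fromstring(
    "<chapter><title>Intro</title><para>Some text.</para></chapter>"
)
html = to_divs(source)
print(ET.tostring(html, encoding="unicode"))

restored = from_divs(html)
print(ET.tostring(restored, encoding="unicode"))
```

Without the class attributes the second step would have nothing to work from, which is the "lossy" case described above.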
 
