How to convert pdf to xml

fancyerii · Nov 12, 2007

Hi, all.
I want to convert a PDF file to an XML file that contains not only the text
of the PDF but also its layout information (e.g. fonts). There are
dozens of open source libraries (PDFBox, iText, ...). Which one would
suit me best? Thank you.

Andrew Thompson · Nov 12, 2007

fancyerii wrote:
...

I want to convert a PDF file to an XML file that contains not only the text
of the PDF but also its layout information (e.g. fonts).

Note that XML was originally designed to hold data and
objects *as opposed to* representation/rendering of that
data - 'how it will look'. Rendering instructions for the data
in XML might be encoded into an XSLT File.

Trying to shove a WYSIWYG document into 'XML' seems
contrary to what XML is for.

What is it exactly you want to do with the converted data?
What does this ability offer to the end user?

--
Andrew Thompson
http://www.athompson.info/andrew/


Pavel Lepin · Nov 12, 2007

Andrew Thompson said:
Note that XML was originally designed to hold data and
objects *as opposed to* representation/rendering of that
data - 'how it will look'.

Um, I believe that's incorrect. One of XML's design goals, as
stated in XML 1.0 (Fourth Edition), section 1.1, is:

XML shall support a wide variety of applications.

XML, essentially, is pure syntax with no semantics. What the
semantics of a particular XML application are meant to
express is up to the people who designed that
application. It might be semantic document markup
(DocBook), presentational document markup (XSL-FO),
something that would take a Semantic Web zealot[*] to
describe properly (RDF/XML), a program in a Turing-complete
programming language (XSLT), remote method invocation
(SOAP), vector graphics (SVG) or pretty much anything else
that could be represented as a tree.

Rendering instructions for the data in XML might be
encoded into an XSLT File.

You're probably thinking of XSL-FO (which is an XML
application itself). XSLT's problem domain is a bit wider:
it's a fairly powerful document transformation language,
although, indeed, one of the primary use cases considered
when the spec was being developed was transformation of
arbitrary XML documents to XSL-FO. Oh, and XSLT is an XML
application as well. *shrug*

Trying to shove a WYSIWYG document into 'XML' seems
contrary to what XML is for.

XSL-FO is an XML application that is, in short,
presentational markup for paged media. Apache FOP is a FOSS
XSL-FO processor oft-used to convert XSL-FO documents
(usually generated from something else) to PDF documents. I
haven't heard of anyone doing it the other way 'round
(PDF -> XSL-FO), but Google might have.

[*] meaning no disrespect to Semantic Web zealots

fancyerii · Nov 12, 2007

I want to get the text's layout information in order to analyse the PDF
file. The XML is just a way of expressing it.
For example, I want to extract the text of a PDF file like this:
<line fontname="..." fontsize=" " startx=" " starty=" " endx
endy ....>
a line.
</line>

Andrew Thompson · Nov 12, 2007

fancyerii said:
I want ..

Please refrain from top-posting.

fancyerii · Nov 12, 2007

fancyerii wrote:

..

Note that XML was originally designed to hold data and
objects *as opposed to* representation/rendering of that
data - 'how it will look'. Rendering instructions for the data
in XML might be encoded into an XSLT File.

I just want to get the text of a PDF file and its layout information. The XML file is just a way of storing it.

A possible result may be:
<line number="1" fontname="" fontsize="" startx="" ....>
a line
</line>
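
To make the target format concrete, here is a minimal sketch of writing one such <line> element with the JDK's own StAX writer (javax.xml.stream). It covers only the XML side, not the PDF extraction. The class name LineXmlSketch, the method toXml and its parameter list are assumptions made for illustration; the attribute names are taken from the example above, and the values would have to come from whichever PDF library ends up being used.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: turns one extracted line plus its layout data into XML.
final class LineXmlSketch {

    /** Serializes a single line as a <line> element with layout attributes. */
    static String toXml(int number, String fontName, float fontSize,
                        float startX, float startY, String text) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("line");
            xml.writeAttribute("number", Integer.toString(number));
            xml.writeAttribute("fontname", fontName);
            xml.writeAttribute("fontsize", Float.toString(fontSize));
            xml.writeAttribute("startx", Float.toString(startX));
            xml.writeAttribute("starty", Float.toString(startY));
            // The writer escapes characters such as '<' and '&' in the text.
            xml.writeCharacters(text);
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write line element", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample values only; real ones would come from the PDF library.
        System.out.println(toXml(1, "Times-Roman", 12f, 72f, 720f, "a line"));
    }
}
```

Using a writer rather than string concatenation means text containing '<' or '&' still produces well-formed XML.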

Wildemar Wildenburger · Nov 12, 2007

fancyerii said:
I want to get the text's layout information in order to analyse the PDF
file. The XML is just a way of expressing it.
For example, I want to extract the text of a PDF file like this:
<line fontname="..." fontsize=" " startx=" " starty=" " endx
endy ....>
a line.
</line>

If that is all you want to do, I would suggest you don't bother with the
XML at all. You can analyse the properties of the text just as
well (heck, better?) using plain Java data structures. Right?

/W
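
As a rough illustration of the "plain Java data structures" route, the sketch below keeps each extracted line in a small value class and runs one simple analysis on it: treat the most frequent font size as body text and flag anything larger as a probable heading. The class TextLine, its fields, and the heading heuristic are all assumptions for the sake of example; nothing here comes from a particular PDF library, which would only be responsible for filling the list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of extracted lines, with one example analysis.
final class LayoutAnalysisSketch {

    /** One extracted line of text plus its layout properties. */
    static final class TextLine {
        final String text;
        final String fontName;
        final float fontSize;
        final float startX;
        final float startY;

        TextLine(String text, String fontName, float fontSize, float startX, float startY) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
            this.startX = startX;
            this.startY = startY;
        }
    }

    /** Most frequent font size, or 0 for an empty list; ties are broken arbitrarily. */
    static float dominantFontSize(List<TextLine> lines) {
        Map<Float, Integer> counts = new HashMap<>();
        for (TextLine line : lines) {
            counts.merge(line.fontSize, 1, Integer::sum);
        }
        float best = 0f;
        int bestCount = 0;
        for (Map.Entry<Float, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    /** Lines set in a larger font than the dominant (body) size. */
    static List<TextLine> likelyHeadings(List<TextLine> lines) {
        float bodySize = dominantFontSize(lines);
        List<TextLine> headings = new ArrayList<>();
        for (TextLine line : lines) {
            if (line.fontSize > bodySize) {
                headings.add(line);
            }
        }
        return headings;
    }

    public static void main(String[] args) {
        // Sample data only; a PDF library would supply the real values.
        List<TextLine> lines = new ArrayList<>();
        lines.add(new TextLine("Title", "Helvetica-Bold", 18f, 72f, 720f));
        lines.add(new TextLine("First body line", "Times-Roman", 10f, 72f, 690f));
        lines.add(new TextLine("Second body line", "Times-Roman", 10f, 72f, 678f));
        lines.add(new TextLine("Third body line", "Times-Roman", 10f, 72f, 666f));
        for (TextLine heading : likelyHeadings(lines)) {
            System.out.println(heading.text); // prints "Title"
        }
    }
}
```

If the data later needs to be stored or exchanged, the same list can still be written out as XML afterwards; the analysis itself does not depend on it.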

Joshua Cranmer · Nov 12, 2007

fancyerii said:
Hi, all.
I want to convert a PDF file to an XML file that contains not only the text
of the PDF but also its layout information (e.g. fonts). There are
dozens of open source libraries (PDFBox, iText, ...). Which one would
suit me best? Thank you.

Adobe has a version of PDF, called the Mars format, that is essentially
a zip file of several XML documents. Is that sufficient for you?
