pdf2txt

B P

Is there a way via Python or even Perl to capture records from a pdf and
output a delimited text file? My work has a situation with a truckload
of data forms that were scanned as pdfs.

The data needs to be taken from the forms and moved into a database, so
I figure that comma-delimited format will work fine. The number of
man-hours it would take to do this manually is cost-prohibitive for
what we have to work with.

I know that a txt2pdf exists; I was checking to see whether the opposite
exists as well.

BP
 
LB

B P said:
I know that a txt2pdf exists; I was checking to see whether the opposite
exists as well.

I'm sure that Acrobat can save a .pdf as .rtf (which is text...).
Then it will be easy to do anything with it.
I also remember some "pdf2txt" utilities; try a search on Google.

LB
 
Aurelio Martin

You may try XPDF

http://www.foolabs.com/xpdf/

They include source code and some utilities like pdfimages or pdftotext.
Maybe you can call these from Python, or link via a C extension.
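
A minimal sketch of the "call these from Python" route, assuming the pdftotext binary is installed and on the PATH. The function names and the choice of the -layout and -enc options are illustrative assumptions, not something prescribed in this thread:

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, keep_layout=True):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path):
    """Return the text layer of pdf_path as a string."""
    if shutil.which("pdftotext") is None:
        # Assumption: the tool is installed separately; fail with a clear message.
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftotext exits non-zero
    )
    return result.stdout
```

Calling pdf_to_text("form.pdf") (a hypothetical file name) would return whatever text layer the PDF carries; for a purely scanned page that may be nothing useful.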

Hope this helps

Aurelio
 
Benjamin Niemann

Have a look at pdftotext, part of xpdf
(http://www.foolabs.com/xpdf/home.html). This will convert the pdf into
plaintext format. You will probably have to parse this plaintext to
convert it into something useful.
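
As a rough illustration of that parsing step, the sketch below turns column-aligned plain text into comma-delimited rows. It assumes the fields on each line are separated by at least two spaces, which may or may not hold for the real forms; the sample text and the function name are made up for the example:

```python
import csv
import io
import re


def text_to_csv(plain_text):
    """Split each non-empty line on runs of 2+ whitespace and emit CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in plain_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        # Assumption: single spaces occur inside a field, wider gaps between fields.
        writer.writerow(re.split(r"\s{2,}", line))
    return out.getvalue()


# Made-up sample resembling layout-preserving text output.
sample = "Name          City        Amount\nJ. Smith      Austin, TX  12.50\n"
print(text_to_csv(sample))
```

Using the csv module rather than joining on commas by hand means a field that itself contains a comma gets quoted properly.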
 
Steve Holden

LB said:
I'm sure that Acrobat can save a .pdf as .rtf (which is text...).
Then it will be easy to do anything with it.
I also remember some "pdf2txt" utilities; try a search on Google.

LB
Unfortunately the text you get from Acrobat, or from most other
transformations on PDF, isn't guaranteed to preserve any particular order
of the elements. This will make parsing difficult, but if all your
documents are similar you may get enough consistency from a text (not,
IIRC, rich text) file exported from Acrobat.
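
One way to cope with unpredictable element order, sketched under the assumption that each field appears on its own line as "Label: value", is to look fields up by label instead of by position. The labels below are hypothetical placeholders, not taken from any real form:

```python
import re

# Hypothetical field labels; replace with the labels printed on the real forms.
FIELD_LABELS = ["Name", "Date", "Account"]


def extract_fields(text, labels=FIELD_LABELS):
    """Find each "Label: value" line wherever it appears in the text."""
    record = {}
    for label in labels:
        pattern = r"^\s*" + re.escape(label) + r"\s*:\s*(.+?)\s*$"
        match = re.search(pattern, text, re.MULTILINE)
        # A missing label yields an empty string rather than an error.
        record[label] = match.group(1) if match else ""
    return record
```

Because each label is searched for independently, two exports of the same form with the lines in a different order produce the same record.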

For extra marks you can use Acrobat's automation interfaces to actually
convert the PDFs. Good luck!

regards
Steve
 
Tim Roberts

B P said:
Is there a way via Python or even Perl to capture records from a pdf and
output a delimited text file? My work has a situation with a truckload
of data forms that were scanned as pdfs.

SCANNED as PDFs? Do you mean these were paper forms, filled in using
printed handwriting, then scanned into a TIFF and wrapped up in a PDF?

If so, your job is next to impossible. You can extract the original
bitmapped image out of the PDF, and from that you MIGHT be able to use an
OCR program to extract the text, but unless the forms were specifically
designed for machine reading, that process tends to be error-prone. It
might be more efficient to have human beings translate them.
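
Before going down the OCR road, it may be worth checking whether a given PDF is image-only at all. A minimal sketch: run the file through a text extractor and see whether anything resembling text comes back. The function name and the 20-character threshold are arbitrary assumptions, and the heuristic can be fooled by forms that carry a few printed words as real text:

```python
def looks_scanned(extracted_text, min_chars=20):
    """Guess whether a PDF is image-only from its extracted text.

    Pages that contain only a bitmap usually yield little more than
    whitespace and page-break characters, so a very low count of
    alphanumeric characters suggests that OCR would be needed.
    """
    alnum = sum(ch.isalnum() for ch in extracted_text)
    return alnum < min_chars
```

A file for which this returns False already has a text layer and can be parsed directly; only the rest would need the image-extraction and OCR treatment.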
 
