pdf2txt

B P · May 28, 2004

Is there a way via Python or even Perl to capture records from a pdf and
output a delimited text file? My work has a situation with a trunk
load of data forms that were scanned as pdfs.

The data needs to be taken from the forms and moved into a database, so
I figure that comma-delimited format will work fine. The amount of
man-hours it would take to manually do this is very cost-prohibitive for
what we have to work with.

I know that a txt2pdf exists, was checking to see if the opposite would
as well.

BP

LB · May 28, 2004

B P said:
I know that a txt2pdf exists, was checking to see if the opposite would
as well.

I'm sure that Acrobat can save a .pdf as .rtf (which is text...).
After that it should be easy to process.
I also remember some "pdf2txt" utilities; try a search on Google.

LB

Aurelio Martin · May 28, 2004

B said:
Is there a way via Python or even Perl to capture records from a pdf and
output a delimited text file? My work has a situation with a trunk
load of data forms that were scanned as pdfs.

The data needs to be taken from the forms and moved into a database, so
I figure that comma-delimited format will work fine. The amount of
man-hours it would take to manually do this is very cost-prohibitive for
what we have to work with.

I know that a txt2pdf exists, was checking to see if the opposite would
as well.

BP

You may try XPDF

http://www.foolabs.com/xpdf/

It includes source code and some utilities such as pdfimages and pdftotext.
Maybe you can call these from Python, or link via a C extension.

Hope this helps

Aurelio
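
Calling the xpdf utilities from Python, as suggested above, could look roughly like the sketch below. It assumes the pdftotext binary is installed and on the PATH; the helper names pdftotext_command and convert_directory are made up for illustration. The -layout option asks pdftotext to keep the physical page layout, which may help keep form fields on predictable lines.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for xpdf's pdftotext utility."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_directory(pdf_dir, out_dir):
    """Run pdftotext on every .pdf in pdf_dir; return the text files written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        txt = out_dir / (pdf.stem + ".txt")
        # check=True raises CalledProcessError if pdftotext fails on a file.
        subprocess.run(pdftotext_command(pdf, txt), check=True)
        written.append(txt)
    return written
```

Something like convert_directory("scans", "text") would then leave one .txt file per PDF. Note that a PDF which only wraps a scanned image contains no text layer, so the output for such files will be empty or nearly so.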

Benjamin Niemann · May 28, 2004

B said:
Is there a way via Python or even Perl to capture records from a pdf and
output a delimited text file? My work has a situation with a trunk
load of data forms that were scanned as pdfs.

The data needs to be taken from the forms and moved into a database, so
I figure that comma-delimited format will work fine. The amount of
man-hours it would take to manually do this is very cost-prohibitive for
what we have to work with.

I know that a txt2pdf exists, was checking to see if the opposite would
as well.

BP

Have a look at pdftotext, part of xpdf
(http://www.foolabs.com/xpdf/home.html). This will convert the PDF into
plain text. You will probably have to parse that plain text to
convert it into something useful.
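
As a rough illustration of that parsing step, the sketch below turns "Label: value" lines into comma-delimited output. The field labels in FIELDS and the "Label: value" line shape are assumptions made for the example, not something known about the actual forms. Values are looked up by label rather than by position, so the order in which the lines appear in the extracted text does not matter.

```python
import csv
import io
import re

# Hypothetical field labels; replace with the labels printed on the real forms.
FIELDS = ["Name", "Date", "Amount"]

# Matches lines such as "Name:   Jane Doe" (label, colon, value).
LINE_RE = re.compile(r"^\s*(?P<label>[^:]+?)\s*:\s*(?P<value>.*?)\s*$")


def parse_record(text):
    """Collect the known fields from one form's extracted text."""
    record = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group("label") in FIELDS:
            record[m.group("label")] = m.group("value")
    return record


def records_to_csv(records):
    """Render a list of record dicts as comma-delimited text with a header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

The csv module takes care of quoting, so a value that itself contains a comma still ends up in a single column.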

Marco Aschwanden · May 28, 2004

For me 'ps2ascii' did the job...

Steve Holden · May 28, 2004

LB said:
I'm sure that from Acrobat you can save a .pdf as .rtf (that is text...).
Then it will be easy to do anything on it.
I remember also some utilities to "pdf2txt", try a search on google.

LB

Unfortunately the text you get from Acrobat, or from most other
transformations on PDF, isn't guaranteed to come out in any particular
order. This will make parsing difficult, but if all your documents are
similar you may get consistent enough output from a plain-text (not,
IIRC, rich text) export from Acrobat.

For extra marks you can use Acrobat's automation interfaces to actually
convert the PDFs. Good luck!

regards
Steve

Cameron Laird · May 28, 2004

Is there a way via Python or even Perl to capture records from a pdf and
output a delimited text file? My work has a situation with a trunk

.
.
.
<URL: http://phaseit.net/claird/comp.text.pdf/PDF_converters.html#pdf2txt >

Tim Roberts · May 30, 2004

B P said:
Is there a way via Python or even Perl to capture records from a pdf and
output a delimited text file? My work has a situation with a trunk
load of data forms that were scanned as pdfs.

SCANNED as PDFs? Do you mean these were paper forms, filled in using
printed handwriting, then scanned into a TIFF and wrapped up in a PDF?

If so, your job is next to impossible. You can extract the original
bitmapped image out of the PDF, and from that you MIGHT be able to use an
OCR program to extract the text, but unless the forms were specifically
designed for machine reading, that process tends to be error-prone. It
might be more efficient to have human beings translate them.
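
For anyone who wants to try the extract-then-OCR route anyway, a minimal sketch follows. It assumes xpdf's pdfimages is on the PATH, and it assumes the tesseract command-line program as the OCR tool (the post above does not name one, so this is only one possible choice). The helper names are hypothetical, and the accuracy caveats above still apply, especially for handwriting.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, image_root):
    """pdfimages writes files named image_root-NNN.ppm / .pbm / .pgm."""
    return ["pdfimages", str(pdf_path), str(image_root)]


def ocr_command(image_path, output_base):
    """Assumed OCR tool: the tesseract CLI, which writes output_base.txt."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_scanned_pdf(pdf_path, work_dir):
    """Extract the page bitmaps from a scanned PDF and OCR each one."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    root = work_dir / Path(pdf_path).stem
    subprocess.run(pdfimages_command(pdf_path, root), check=True)
    texts = []
    # Pick up the .pbm/.pgm/.ppm files that pdfimages just produced.
    for image in sorted(work_dir.glob(root.name + "-*.p?m")):
        base = image.with_suffix("")
        subprocess.run(ocr_command(image, base), check=True)
        ocr_file = Path(str(base) + ".txt")
        texts.append(ocr_file.read_text(encoding="utf-8", errors="replace"))
    return texts
```

The returned text would still need manual spot checks before anything is loaded into a database.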
