convert .pdf files to .txt files

Davor · Jun 10, 2006

Hi, my name is David.
I need to read information from .pdf files and convert it to .txt files,
and I have to do this in Python.
I have been looking for Python libraries and pdftools seems to
be the solution, but I do not know how to use it well.
This is the example that I found on the internet:

from pdftools.pdffile import PDFDocument
from pdftools.pdftext import Text

def contents_to_text (contents):
    for item in contents:
        if isinstance (item, type ([])):
            for i in contents_to_text (item):
                yield i
        elif isinstance (item, Text):
            yield item.text

doc = PDFDocument ("/home/dave/pruebas_ficheros/carlos.pdf")
n_pages = doc.count_pages ()
text = []

for n_page in range (1, (n_pages+1)):
    print "Page", n_page
    page = doc.read_page (n_page)
    contents = page.read_contents ().contents
    text.extend (contents_to_text (contents))

print "".join (text)

The problem is that on some PDFs it joins words together, and in
Spanish the accents ("acentos") come out wrong:
words like "camión" become "cami/86n", and
"IMPLEMENTACIÓN" becomes "IMPLEMENTACI?", with strange
characters.
If someone knows how to use pdftools and can help me, it would make me
very happy.

Another thing is that I can see the letters read from the .pdf on the
screen, but I do not know how to create a .txt file and save this
information inside it.

Sorry for my English.
Thanks for everything.

Baiju M · Jun 10, 2006

Davor said:
Hi, my name is David.
I need to read information from .pdf files and convert it to .txt files,
and I have to do this in Python.

If you have 'xpdf' installed on your system, the
'pdftotext' command will be available.

To convert a PDF to text from Python, use a system call.
For example:

import os
os.system("pdftotext -layout my_pdf_file.pdf")

This will create a 'my_pdf_file.txt' file.
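
A slightly more defensive variant of the same idea, as a sketch only: it uses the standard subprocess module instead of os.system, so file names with spaces are passed safely and a failed conversion raises an exception. It is written for current Python 3. The function names build_pdftotext_command and pdf_to_text are made up for this illustration; the optional second argument to pdftotext names the output file.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path=None, layout=True):
    """Return the argument list for a pdftotext call (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    if txt_path is None:
        # Same default as pdftotext itself: swap the .pdf suffix for .txt
        txt_path = pdf_path.with_suffix(".txt")
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def pdf_to_text(pdf_path, txt_path=None, layout=True):
    """Run pdftotext and return the path of the text file it was asked to write."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install xpdf first")
    cmd = build_pdftotext_command(pdf_path, txt_path, layout)
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Calling pdf_to_text("my_pdf_file.pdf") would then write my_pdf_file.txt next to the PDF, provided pdftotext is installed.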

Regards,
Baiju M

vasudevram · Jun 10, 2006

If you don't already have xpdf, you can get it here:

http://glyphandcog.com/Xpdf.html

Install it and then try what Baiju said; it should work.
I've used it and it's good, which is why I say it should work. If you have any
problems, post here again.

-------------------------------------------------------------------------------------------
Vasudev Ram
Independent software consultant
Personal site: http://www.geocities.com/vasudevram
PDF conversion tools: http://sourceforge.net/projects/xtopdf
-------------------------------------------------------------------------------------------

David Boddie · Jun 10, 2006

Davor said:
Hi, my name is David.
I need to read information from .pdf files and convert it to .txt files,
and I have to do this in Python.
I have been looking for Python libraries and pdftools seems to
be the solution, but I do not know how to use it well.
This is the example that I found on the internet:
[...]

for n_page in range (1, (n_pages+1)):
    print "Page", n_page
    page = doc.read_page (n_page)
    contents = page.read_contents ().contents
    text.extend (contents_to_text (contents))

print "".join (text)

The problem is that on some PDFs it joins words together, and in
Spanish the accents ("acentos") come out wrong:
words like "camión" become "cami/86n", and
"IMPLEMENTACIÓN" becomes "IMPLEMENTACI?", with strange
characters.

pdftools just extracts the textual data in the file and stores it in
Text instances - it doesn't try to interpret or decode the text. I'd
like to fix the library so that it does try and decode the text
properly and put it into unicode strings, but I don't have the time
right now.

Remember that text can be stored in PDF files in many different
ways, and that the text cannot always be extracted in its original
form.
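
As a general illustration of why undecoded text shows up with strange characters (this is not a description of what pdftools does internally, and the encodings chosen here are assumptions for the example), the same bytes read with the wrong encoding lose their accented letters:

```python
# Hypothetical illustration, written for Python 3: the encodings below are
# assumptions for the example, not what any particular PDF actually uses.
original = "camión"

# Store the word as Latin-1 bytes; the accented letter becomes one byte.
raw = original.encode("latin-1")

# Reading those bytes as UTF-8 fails on that byte, so it is replaced
# by the Unicode replacement character.
wrong = raw.decode("utf-8", errors="replace")

# Reading them with the encoding they were written in recovers the word.
right = raw.decode("latin-1")

print(repr(raw))
print(wrong)
print(right)
```

A tool that knows which encoding each font in the PDF uses can do the equivalent of the last step; one that does not can only hand back the raw data.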

If someone knows how to use pdftools and can help me, it would make me
very happy.

Another thing is that I can see the letters read from the .pdf on the
screen, but I do not know how to create a .txt file and save this
information inside it.

You need to do something like this:

with open("myfilename", "w") as f:
    f.write("".join (text))
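
For anyone doing the same step in current Python 3, here is a minimal sketch that also sets the file encoding explicitly, which matters for accented text. The helper name save_text, the sample strings, and the output.txt file name are made up for this illustration.

```python
import os
import tempfile


def save_text(chunks, path, encoding="utf-8"):
    """Join the text chunks, write them to path, and return the character count."""
    data = "".join(chunks)
    # The with block closes the file even if the write fails
    with open(path, "w", encoding=encoding) as f:
        f.write(data)
    return len(data)


# Example usage with a throwaway directory and sample text
text = ["cami", "ón\n", "IMPLEMENTACIÓN\n"]
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
written = save_text(text, out_path)
```

Reading the file back with the same encoding should return the joined text unchanged.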

Sorry for my English.

Don't worry about it. It's much better than my Spanish will ever be.

Sorry I couldn't give you more help with this. You may find that the
other tools mentioned by people in this thread will do what you
need better than pdftools can at the moment.

David

Davor · Jun 14, 2006

Thanks for everything you wrote, it will be very useful to me. In the end I
used this code: the file I give it is converted to .txt in the
directory where the file is located, and documents written in Spanish
no longer have problems with accents in words like "camión" or
"introducción", which was very important to me. Thanks!

import os
os.system("pdftotext -layout my_pdf_file.pdf")

#This will create 'my_pdf_file.txt' file.
