Davor
Hi, my name is David.
I need to read information from .pdf files and convert it to .txt files,
and I have to do this in Python.
I have been looking for Python libraries, and pdftools seems to
be the solution, but I do not know how to use it well.
This is the example that I found on the internet:
from pdftools.pdffile import PDFDocument
from pdftools.pdftext import Text

def contents_to_text(contents):
    for item in contents:
        if isinstance(item, list):
            for i in contents_to_text(item):
                yield i
        elif isinstance(item, Text):
            yield item.text

doc = PDFDocument("/home/dave/pruebas_ficheros/carlos.pdf")
n_pages = doc.count_pages()
text = []
for n_page in range(1, n_pages + 1):
    print "Page", n_page
    page = doc.read_page(n_page)
    contents = page.read_contents().contents
    text.extend(contents_to_text(contents))
print "".join(text)
The problem is that on some PDFs it runs words together, and
Spanish accented characters ("acentos") come out wrong.
For example, "camión" becomes "cami/86n", and
"IMPLEMENTACIÓN" becomes "IMPLEMENTACI?", with strange
characters.
If someone knows how to use pdftools and can help me, I would be
very happy.
Another thing: I can see the text read from the .pdf on the
screen, but I do not know how to create a .txt file and save this
information in it.
Sorry for my English.
Thanks to all.
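
On the last point (saving the text to a .txt file instead of printing it), here is a minimal sketch using only the standard library. The helper name save_text and the output path are hypothetical and not part of pdftools. It assumes the pieces collected in the text list are Unicode strings. If pdftools returns raw bytes, they would need decoding first, and this sketch does not address the accent problem inside the PDF itself.

```python
import io

def save_text(chunks, out_path, encoding="utf-8"):
    """Join the text chunks and write them to out_path.

    Hypothetical helper, not part of pdftools. Assumes every chunk
    is a Unicode string. Returns the number of characters written.
    """
    joined = u"".join(chunks)
    # io.open accepts an explicit encoding, so accented characters
    # such as "ó" are stored as UTF-8 rather than being replaced.
    with io.open(out_path, "w", encoding=encoding) as handle:
        handle.write(joined)
    return len(joined)
```

In the script above, this would replace the final print, for example save_text(text, "carlos.txt"), where "carlos.txt" is only an illustrative file name.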