flebber
I was trying to use pyPdf following a recipe from the ActiveState
cookbook. However, I cannot get it to work. I am unsure whether it is me or
whether it is because sets are deprecated.
I have placed a PDF on my C:\ drive. It is called
"Components-of-Dot-NET.pdf". You could use anything; I was just testing with it.
I was using the last script on that page, the one most recently
updated. I am using Python 2.6.
http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/
import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")
This is my error.
Warning (from warnings module):
  File "C:\Documents and Settings\Family\Application Data\Python\Python26\site-packages\pyPdf\pdf.py", line 52
    from sets import ImmutableSet
DeprecationWarning: the sets module is deprecated
Traceback (most recent call last):
  File "C:/Python26/Pdfread", line 15, in <module>
    print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")
  File "C:/Python26/Pdfread", line 6, in getPDFContent
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'
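The traceback ends in the IOError, not in the sets warning, so one thing I could check is where Python actually looks for the file. Below is a minimal sketch of that check. It assumes the bare filename is being resolved against the current working directory rather than C:\; `resolve_pdf_path` and its `base_dir` parameter are hypothetical names I made up for illustration and are not part of pyPdf.

```python
import os

def resolve_pdf_path(name, base_dir=None):
    """Return the absolute path that opening `name` would refer to.

    base_dir is a hypothetical helper parameter: when given, a relative
    name is joined to it instead of the current working directory.
    """
    if base_dir is not None and not os.path.isabs(name):
        name = os.path.join(base_dir, name)
    # abspath resolves a relative name against os.getcwd()
    return os.path.abspath(name)

path = resolve_pdf_path("Components-of-Dot-NET.pdf")
print(path)                  # the location Python would try to open
print(os.path.exists(path))  # whether a file is actually there
```

If the printed path is not on C:\ directly, then passing the full path (for example `resolve_pdf_path("Components-of-Dot-NET.pdf", base_dir="C:\\")`) to `getPDFContent` might be what is needed.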