Errors with PyPdf

flebber · Sep 27, 2010

I was trying to use pyPdf, following a recipe from the ActiveState
cookbook, but I cannot get it to work. I am unsure whether it is me or
whether it is because sets are deprecated.

I have placed a PDF in my C:\ drive. It is called
"Components-of-Dot-NET.pdf". You could use anything; I was just testing
with it.

I was using the last script on that page, the most recently updated
one. I am using Python 2.6.

http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")

This is my error.

Warning (from warnings module):
  File "C:\Documents and Settings\Family\Application Data\Python\Python26\site-packages\pyPdf\pdf.py", line 52
    from sets import ImmutableSet
DeprecationWarning: the sets module is deprecated

Traceback (most recent call last):
  File "C:/Python26/Pdfread", line 15, in <module>
    print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")
  File "C:/Python26/Pdfread", line 6, in getPDFContent
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'

MRAB · Sep 27, 2010

The 'sets' module pre-dates the built-in 'set' class. The warning is
just to inform you that the module will be removed in due course (it's
still in Python 2.7, but not Python 3), so you can still use it in
those versions.
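
For anyone who wants to avoid the deprecated module in their own code, the built-in 'set' and 'frozenset' types cover what 'sets.Set' and 'sets.ImmutableSet' provided. A minimal sketch (the variable names are purely illustrative and have nothing to do with pyPdf's internals):

```python
# Built-in replacements for the classes of the old 'sets' module.
vowels = frozenset("aeiou")   # immutable, plays the role of sets.ImmutableSet
seen = set()                  # mutable, plays the role of sets.Set

for ch in "deprecation":
    if ch in vowels:
        seen.add(ch)

# A frozenset is hashable, so it can be used as a dictionary key.
lookup = {vowels: "vowel letters"}

print(sorted(seen))
```

The warning itself comes from inside pyPdf, so nothing in the calling script needs to change to keep using the library on Python 2.6.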

You put the file in C:\, but you didn't tell Python where it is. You
gave just the filename "Components-of-Dot-NET.pdf", and it's looking in
the current directory, which probably isn't C:\.

Try providing the full pathname:

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
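
To see where a bare filename would actually be looked up, it can help to print the current directory and the absolute path Python derives from it. This is only a sketch; the helper name 'resolve' is made up for the illustration:

```python
import os

def resolve(path):
    """Return the absolute path that open(path) would end up using."""
    return os.path.abspath(path)

relative = "Components-of-Dot-NET.pdf"
print(os.getcwd())               # the current directory
print(resolve(relative))         # current directory joined with the filename
print(os.path.exists(relative))  # True only if the file is in the current directory

# A raw string keeps the backslash literal, which is the safe way to
# spell a Windows path.
full = r"C:\Components-of-Dot-NET.pdf"
print(full)
```

The unprefixed "C:\Components-of-Dot-NET.pdf" happens to give the same string only because "\C" is not a recognised escape sequence; a path such as "C:\new.pdf" would silently turn "\n" into a newline.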

w.g.sneddon · Sep 27, 2010

---> IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'

Looks like an issue with finding the file.
How do you pass the path?

flebber · Sep 27, 2010

Okay, thanks. I thought that when I set content here

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"

I was defining where the file is.

But yes, I updated the script to the version below and it works; that
is, the contents are displayed in the interpreter. How do I output to a
.txt file?

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")

flebber · Sep 27, 2010

I have found far more advanced scripts while searching around, but I
will have to keep trying, as I cannot get an output file or specify the
path.

Edit: very strangely, whilst searching for examples I found my own post,
just written here, ranking number 5 on Google within 2 hours. Bizarre.

http://www.eggheadcafe.com/software/aspnet/36237766/errors-with-pypdf.aspx

It replicates our thread as theirs. I was searching Google with "pypdf
return to txt file".

flebber · Sep 27, 2010

I get this error:

Traceback (most recent call last):
  File "C:/Python26/Pdfread", line 16, in <module>
    open('x.txt', 'w').write(content)
NameError: name 'content' is not defined

when I use:

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.txt"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
open('x.txt', 'w').write(content)

MRAB · Sep 27, 2010

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"

That simply binds to a local name; 'content' is a local variable in the
function 'getPDFContent'.

    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))

You're opening a file whose path is in 'path'.

    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"

That appends to 'content'.

    # Collapse whitespace

'content' now contains the text of the PDF, starting with
r"C:\Components-of-Dot-NET.pdf".

    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")

Outputting to a .txt file is simple: open the file for writing using
'open', write the string to it, and then close it.
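
Putting those two points together, a rough sketch might look like the following. It assumes the page texts are already available as a list of strings (here a made-up 'pages' list stands in for what extractText() would return), starts 'content' as an empty string so the path does not end up in the output, and writes to a temporary-directory file chosen only for the example:

```python
import os
import tempfile

def join_pages(pages):
    """Concatenate page texts and collapse runs of whitespace."""
    content = u""                     # start empty, not with the file path
    for page_text in pages:
        content += page_text + u"\n"
    return u" ".join(content.replace(u"\xa0", u" ").split())

# Stand-in data; in the thread these strings would come from extractText().
pages = [u"First  page\xa0text", u"Second page\n text"]
text = join_pages(pages)

# Hypothetical output location, used only for this example.
out_path = os.path.join(tempfile.gettempdir(), "x.txt")
with open(out_path, "w") as outfile:  # the file is closed when the block ends
    outfile.write(text)

print(text)
```

The 'with' statement is available in Python 2.6 and saves having to remember the explicit close().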

flebber · Sep 27, 2010

That's what I was trying to do with

open('x.txt', 'w').write(content)

The rest of the script works, but it won't output the text.

Dave Angel · Sep 27, 2010

There's no global variable content; that name was local to the function,
so it's lost when the function exits. The function does return the
value, but you give it to print, and don't save it anywhere.

data = getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")

outfile = open('x.txt', 'w')
outfile.write(data)
outfile.close()

I used a different name to emphasize that this is *not* the same
variable as content inside the function. In this case, it happens to
have the same value. And if you used the same name, you could be
confused about which is which.

DaveA
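
The scoping point can be reproduced without pyPdf at all. In this small sketch the function 'get_content' and the flag 'lookup_failed' are invented for the illustration:

```python
def get_content():
    content = "text built inside the function"  # local to get_content
    return content

# Outside the function the name 'content' was never bound, so looking
# it up raises NameError, just like the traceback in the thread.
lookup_failed = False
try:
    content
except NameError:
    lookup_failed = True

data = get_content()   # keep the returned value under a module-level name
print(lookup_failed)
print(data)
```

Only the value survives the call, and only if the caller binds it to a name of its own.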

flebber · Sep 27, 2010

Thank you, everyone.
