Errors with PyPdf

flebber

I was trying to use pyPdf following a recipe from the ActiveState
cookbooks. However, I cannot get it to work. I am unsure whether it is me or
whether it is because sets are deprecated.

I have placed a PDF in my C:\ drive. It is called
"Components-of-Dot-NET.pdf". You could use anything; I was just testing with it.

I was using the last script on that page, the most recently
updated one. I am using Python 2.6.

http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")

This is my error.

Warning (from warnings module):
  File "C:\Documents and Settings\Family\Application Data\Python\Python26\site-packages\pyPdf\pdf.py", line 52
    from sets import ImmutableSet
DeprecationWarning: the sets module is deprecated

Traceback (most recent call last):
  File "C:/Python26/Pdfread", line 15, in <module>
    print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")
  File "C:/Python26/Pdfread", line 6, in getPDFContent
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'
 
MRAB

The 'sets' module pre-dates the built-in 'set' class. The warning is
just to inform you that the module will be removed in due course (it's
still in Python 2.7, but not Python 3), so you can still use it in
those versions.
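
For what it's worth, the built-in types cover what the old module offered. A minimal sketch, assuming nothing beyond the standard library (the names `page_ids` and `more_ids` are made up for illustration): `frozenset` plays the role of `sets.ImmutableSet`, and the `warnings` module can hide the message if it is too noisy.

```python
import warnings

# Hide DeprecationWarning messages such as the one shown when pyPdf is
# imported. This has to run before the import that triggers the warning,
# and it only silences the message; it does not change what the code does.
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Built-in replacement for sets.ImmutableSet.
page_ids = frozenset([1, 2, 3])   # illustrative values only
more_ids = frozenset([3, 4])

# The usual set operations work on frozenset objects.
common = page_ids & more_ids
combined = page_ids | more_ids
print(sorted(common))
print(sorted(combined))
```

Silencing the warning is optional; the script runs the same either way.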

You put the file in C:\, but you didn't tell Python where it is. You
gave just the filename "Components-of-Dot-NET.pdf", and it's looking in
the current directory, which probably isn't C:\.

Try providing the full pathname:

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
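
If you want to see where Python is actually looking, something like the following may help. It is only a sketch: `describe_path` is a hypothetical helper written for this post, not part of pyPdf, and it uses nothing but the standard `os` module.

```python
import os

def describe_path(path):
    """Return (absolute path, exists?) for a path as Python would resolve it."""
    full = os.path.abspath(path)
    return full, os.path.exists(full)

# A bare filename is resolved against the current working directory,
# which is often the directory the script or IDE was started from.
print(os.getcwd())

full, found = describe_path("Components-of-Dot-NET.pdf")
print(full)
print(found)
```

If the second line printed is not the location of the PDF, pass the full path instead, as above.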
 
w.g.sneddon

---> IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'

Looks like an issue with finding the file.
How do you pass the path?
 
flebber

Okay, thanks. I thought that when I set content here

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"

that I was defining where the file is.

But yes, I updated the script to the version below and it works; that is, the
contents are displayed in the interpreter. How do I output to a .txt file?

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
 
flebber

I have found far more advanced scripts while searching around, but I will have
to keep trying, as I cannot get an output file or specify the path.

Edit: very strangely, whilst searching for examples I found my own post,
just written here, ranking number 5 on Google within 2 hours. Bizarre.

http://www.eggheadcafe.com/software/aspnet/36237766/errors-with-pypdf.aspx

It replicates our thread as theirs. I was searching Google for "pypdf
return to txt file".
 
flebber

Traceback (most recent call last):
  File "C:/Python26/Pdfread", line 16, in <module>
    open('x.txt', 'w').write(content)
NameError: name 'content' is not defined

I get this when I use:

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.txt"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
open('x.txt', 'w').write(content)
 
MRAB

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"

That simply binds to a local name; 'content' is a local variable in the
function 'getPDFContent'.

    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))

You're opening a file whose path is in 'path'.

    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"

That appends to 'content'.

    # Collapse whitespace

'content' now contains the text of the PDF, starting with
r"C:\Components-of-Dot-NET.pdf".

    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")

Outputting to a .txt file is simple: open the file for writing using
'open', write the string to it, and then close it.
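
Those three steps might look roughly like this. It is a sketch rather than the one true way: `save_text` is a hypothetical helper, the sample string stands in for whatever getPDFContent returns, and the output goes to a temporary directory so nothing on your disk is overwritten.

```python
import io
import os
import tempfile

def save_text(text, out_path):
    """Write a text string to out_path as UTF-8."""
    # The 'with' block closes the file automatically, even if write() fails.
    with io.open(out_path, "w", encoding="utf-8") as outfile:
        outfile.write(text)

# Stand-in for the string returned by getPDFContent(...).
sample = u"text extracted from the PDF"

# Write into a fresh temporary directory (an assumption for this demo).
out_path = os.path.join(tempfile.mkdtemp(), "x.txt")
save_text(sample, out_path)

# Read the file back to check that the text was written.
with io.open(out_path, "r", encoding="utf-8") as infile:
    read_back = infile.read()
print(read_back == sample)
```

Writing the unicode text as UTF-8 also avoids throwing characters away with encode("ascii", "ignore").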
 
flebber

That's what I was trying to do with

open('x.txt', 'w').write(content)

The rest of the script works, but it won't output the text.
 
Dave Angel

There's no global variable content; that name was local to the function, so
it's lost when the function exits. The function does return the value, but you
give it to print and don't save it anywhere.

data = getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")

outfile = open('x.txt', 'w')
outfile.write(data)
outfile.close()

I used a different name to emphasize that this is *not* the same
variable as content inside the function. In this case, it happens to
have the same value. And if you used the same name, you could be
confused about which is which.
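
To see the scoping rule on its own, without pyPdf, here is a small sketch. The function `get_content` and its string are made up for illustration; the point is only that the name inside the function is not visible outside it.

```python
def get_content():
    # 'content' exists only while get_content() is running.
    content = "text from the PDF"
    return content

# The return value must be bound to a name, otherwise it is discarded.
data = get_content()

# Looking up 'content' at module level fails, because that name was
# only ever bound inside the function.
try:
    content
    visible = True
except NameError:
    visible = False

print(data)
print(visible)
```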


DaveA
 
flebber

Thank You everyone.
 
