Python, Perl & PDF files

Cameron Laird

.
.
.
OK, I'm done seeking to provoke. So, it's official. Perl has *much*,
*much* better support for dealing with PDF files than does Python.
.
.
.
No, it's not; maybe Perl even has *worse* support. At the very least,
there's more redundancy in what Perl offers.

I understand that <URL: http://cpan.uwinnipeg.ca/search?query=pdf&mode=dist >
looks rather convincing; in fact, a couple of years ago, we argued <URL:
http://www.unixreview.com/documents/s=7822/ur0304g/ > that indeed Perl was
ahead. I'm not sure that's true now.

Rather than wander into all the details, let's start over: what kinds of
things do you think you want to do with PDFs? You might be surprised to
find that, despite all that CPAN *seems* to offer, your needs aren't met
at all. Or maybe they are. It depends. Let's get specific.
 
rbt

Cameron said: "... what kinds of things do you think you want to do with
PDFs? ... Let's get specific."

Read and search them for strings. If I could do that on Windows, Linux
and Mac with the *same* bit of Python code, I'd be very happy ;)
 
Cameron Laird

.
.
.
Read and search them for strings. If I could do that on Windows, Linux
and Mac with the *same* bit of Python code, I'd be very happy ;)

Textual content, right? Without regard to font funniness, or
whether the string is in or out of a table, and so on?

'Might be a few days before I answer; I'm crashing into end-of-
the-month deadlines.
 
rbt

Cameron said: "Textual content, right? Without regard to font funniness, or
whether the string is in or out of a table, and so on?"

That's right. More specifically, I've written a script that uses a RE to search
through documents for social security numbers. You can see it here:

http://filebox.vt.edu/users/rtilley/public/find_ssns/find_ssns.html

This works on Word, Excel, HTML, RTF or any ANSI-based text. I need the ability to
read and make sense of PDF files as well, so I can apply the RE to their content. It's
been frustrating, to say the least. Nothing at all against Python... I'm mostly just sick
of hearing about the 'Portable' document format that isn't string or RE searchable...
at least not easily, anyway.
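
As a rough illustration of the regex search described above (this is not the actual find_ssns script, whose pattern is not shown in this thread), a minimal Python sketch might look like the following. The pattern, the function name `find_ssns`, and the sample text are all assumptions made up for the example.

```python
import re

# Hypothetical pattern: 3 digits, 2 digits, 4 digits, separated by hyphens
# or spaces, and not embedded in a longer run of digits. The real script
# may well use a different RE.
SSN_RE = re.compile(r"(?<!\d)\d{3}[- ]\d{2}[- ]\d{4}(?!\d)")


def find_ssns(text):
    """Return every SSN-shaped substring found in text."""
    return SSN_RE.findall(text)


# Made-up sample text; the phone number should not match.
sample = "Employee 7: 123-45-6789, phone 555-1234, alt id 987 65 4321."
matches = find_ssns(sample)
print(matches)
```

This only works once a document has been reduced to plain text, which is exactly the step that is hard for PDF.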

Cameron said: "'Might be a few days before I answer; I'm crashing into
end-of-the-month deadlines."

No problem. Thanks for the help.
 
paron

Hopefully, Adobe will choose to support SVG as a response to
Microsoft's "Metro", and take us all off the hook with respect to
cracking open their proprietary format.
 
Cameron Laird

That's right. More specifically, I've written a script that uses a RE to search
through documents for social security numbers. You can see it here:

http://filebox.vt.edu/users/rtilley/public/find_ssns/find_ssns.html

This works on Word, Excel, HTML, RTF or any ANSI-based text. I need the
ability to read and make sense of PDF files as well, so I can apply the RE
to their content. It's been frustrating, to say the least. Nothing at all
against Python... I'm mostly just sick of hearing about the 'Portable'
document format that isn't string or RE searchable... at least not easily,
anyway.
.
.
.
PDF is NOT easy to search. 'Fact, many times it's not even feasible,
in any automated sense.
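
One common reason, offered here as background rather than something claimed in the post: the text in a PDF usually sits in content streams that are compressed (often with Flate, i.e. zlib/deflate), so a regex run over the raw file bytes finds nothing. A minimal sketch using only the standard library follows. The sample text is made up, and the compressed blob merely stands in for a PDF stream; a real PDF also needs its object structure and font encodings handled, which this does not attempt.

```python
import re
import zlib

# Hypothetical byte-level pattern for SSN-shaped strings.
SSN_RE = re.compile(rb"\d{3}-\d{2}-\d{4}")

# Made-up page text, compressed the way a Flate-encoded stream would be.
page_text = b"Applicant SSN: 123-45-6789. " * 3
stream = zlib.compress(page_text)


def search_raw(data):
    """Search the bytes exactly as they are stored on disk."""
    return SSN_RE.findall(data)


def search_inflated(data):
    """Decompress first, then search the recovered text."""
    return SSN_RE.findall(zlib.decompress(data))


print(search_raw(stream))       # searching the compressed bytes
print(search_inflated(stream))  # searching after decompression
```

Even after decompression, a real content stream holds drawing operators and font-specific character codes rather than plain sentences, which is part of why a general-purpose extractor is needed.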

When I can make time, I want to look into your Word and Excel searching;
there are several tricks for doing these in full generality.

Unless I've missed late-breaking news, Perl does NOT help, despite the
flashy appearance of the CPAN search page you referenced. None of that
stuff gets at content in a sense that'll serve you well.

Neither does anything open-sourced in Python. The best I know is what
I'm slowly documenting at <URL:
http://phaseit.net/claird/comp.text.pdf/PDF_converters.html#pdf2txt >,
as David mentioned earlier.
 
