Parsing some pdf files failed

Peter Jamieson

I am using the following script to parse a collection of
supplied pdf files. Most of the files parsed as expected but
with some the script fell over, no output was produced,
as though the file was invisible. The files are only
alpha-numeric text, no images or graphics.

On looking at the file Properties I noticed the successfully
parsed files had PDF Producer: doPDF Ver 6.0 Build 262
PDF Version 1.4

whilst the failures had PDF Producer: GPL Ghostscript 8.15
PDF Version 1.3

Anyone have a clue as to how I could get these errant files parsed?
Any suggestions appreciated! Cheers, Peter

#!/usr/bin/perl -w
use strict;
use warnings;
use CAM::PDF;

my $file = 'C:/test.pdf';
my $pdf = CAM::PDF->new($file);

for my $page (1 .. $pdf->numPages()) {
    my $text = $pdf->getPageText($page);

    my @lines = split (/\n/, $text);

    foreach my $line (@lines) {
        # parse out useful information
    }
}
 
Peter Jamieson

Eli the Bearded said:
In comp.lang.perl.misc,
Peter Jamieson said:
I am using the following script to parse a collection of
supplied pdf files. Most of the files parsed as expected but
with some the script fell over, no output was produced,
as though the file was invisible. The files are only
alpha-numeric text, no images or graphics.

Looks like a CAM::PDF problem.

I modified your script to look like this:

#!/usr/bin/perl -w
use strict;
use warnings;
use CAM::PDF;

my $file = shift;
my $pdf = CAM::PDF->new($file);

printf "%s: %d pages\n", $file, ($pdf->numPages());

for my $page (1 .. $pdf->numPages()) {
    my $text = $pdf->getPageText($page);

    my @lines = split (/\n/, $text);

    foreach my $line (@lines) {
        print $line;
    }
}
__END__

And plugged in various PDFs I have lying around. One called "Hektor.pdf"
produces copious non-text output with that script, but pdftotext gives
me lots of clean text output.

$ perl /tmp/campdfprint Hektor.pdf | strings | wc
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
1 3 21
$ pdftotext Hektor.pdf ; strings Hektor.txt |wc
161 4023 23930
$ strings Hektor.pdf |head -1
%PDF-1.4
$

Elijah
------
[1] <URL:http://www.hektor.ch/Book/Hektor.pdf/>
Yes, it has a trailing slash in the link.
[2] pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
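
The "Use of uninitialized value $text" warnings in the run above suggest that getPageText() handed back undef for those pages. A minimal sketch of guarding against that, so the failing pages are listed instead of quietly yielding nothing. The helper name page_lines is made up for illustration; it relies only on numPages() and getPageText(), the two methods already used in the scripts in this thread. It does not make the text extractable, it only shows where extraction fails.

```perl
use strict;
use warnings;

# Collect the text lines of every page of $pdf.
# $pdf is any object offering numPages() and getPageText($n),
# for example the object returned by CAM::PDF->new($file).
# Returns two array refs: the lines found, and the numbers of the
# pages for which getPageText() returned undef.
sub page_lines {
    my ($pdf) = @_;
    my (@lines, @failed);

    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);

        if (!defined $text) {
            # Remember the page instead of passing undef to split.
            push @failed, $page;
            next;
        }

        push @lines, split /\n/, $text;
    }

    return (\@lines, \@failed);
}
```

With CAM::PDF this would be called as page_lines($pdf) once CAM::PDF->new($file) has been checked to return an object; a non-empty second list then names the pages worth trying with another tool.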

Hi Eli!
Thank you for your comments and assistance.
I had a look at the pdftotext prog but it was
unsuitable as my pdf's had tables and they were
being mangled in the conversion process.
I tried the shareware prog "PDF to Excel" and
it is promising but still needs a bit of cleaning
of the data and the files to be eyeballed, tedious
as I have many files each of many pages.
The pdf files seem to have been created originally
from .xls files so it should be possible to programmatically
reverse them maybe but beyond me.

On an unrelated matter I noticed the links to your
Hektor site: fascinating work! My daughter is a performance
artist and film maker so I will pass your site address to her!
Cheers, Peter
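
On the mangled tables: pdftotext has a -layout option that tries to keep the physical layout of the page, which may or may not help with these particular files. A rough sketch of reading that output from Perl and cutting each line into cells. The helper names split_columns and table_rows are made up for illustration, and the rule that columns are separated by two or more spaces is an assumption that would need checking against the real files. The list form of pipe open used here may need a fairly recent Perl on Windows.

```perl
use strict;
use warnings;

# Split one line of layout-preserving text into table cells.
# Assumption: columns are separated by runs of two or more spaces,
# while single spaces occur only inside a cell.
sub split_columns {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return split /\s{2,}/, $line;
}

# Run "pdftotext -layout" on $file, writing to standard output ("-"),
# and return a list of rows, each an array ref of cells.
# Lines containing only whitespace are skipped.
sub table_rows {
    my ($file) = @_;
    open my $fh, '-|', 'pdftotext', '-layout', $file, '-'
        or die "cannot run pdftotext: $!";

    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;
        push @rows, [ split_columns($line) ];
    }
    close $fh;

    return @rows;
}
```

Each row can then be joined with tabs or commas for loading into a spreadsheet; cells that themselves contain double spaces, or empty cells, would break the simple splitting rule.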
 
