Parsing some pdf files failed

Peter Jamieson · Mar 21, 2011

I am using the following script to parse a collection of
supplied pdf files. Most of the files parsed as expected but
with some the script fell over, no output was produced,
as though the file was invisible. The files are only
alpha-numeric text, no images or graphics.

On looking at the file Properties I noticed the successfully
parsed files had PDF Producer: doPDF Ver 6.0 Build 262
PDF Version 1.4

whilst the failures had PDF Producer: GPL Ghostscript 8.15
PDF Version 1.3

Anyone have a clue as to how I could get these errant files parsed?
Any suggestions appreciated! Cheers, Peter

#!/usr/bin/perl -w
use strict;
use warnings;
use CAM::PDF;

my $file = 'C:/test.pdf';
my $pdf = CAM::PDF->new($file);

for my $page (1 .. $pdf->numPages()) {
    my $text = $pdf->getPageText($page);

    my @lines = split (/\n/, $text);

    foreach my $line (@lines) {
        # parse out useful information
    }
}

Peter Jamieson · Mar 25, 2011

Eli the Bearded said:
In comp.lang.perl.misc,

Peter Jamieson said:

I am using the following script to parse a collection of
supplied pdf files. Most of the files parsed as expected but
with some the script fell over, no output was produced,
as though the file was invisible. The files are only
alpha-numeric text, no images or graphics.


Looks like a CAM::PDF problem.

I modified your script to look like this:

#!/usr/bin/perl -w
use strict;
use warnings;
use CAM::PDF;

my $file = shift;
my $pdf = CAM::PDF->new($file);

printf "%s: %d pages\n", $file, ($pdf->numPages());

for my $page (1 .. $pdf->numPages()) {
    my $text = $pdf->getPageText($page);

    my @lines = split (/\n/, $text);

    foreach my $line (@lines) {
        print $line;
    }
}
__END__

And plugged in various PDFs I have lying around. One called "Hektor.pdf"
produces copious non-text output with that script, but pdftotext gives
me lots of clean text output.

$ perl /tmp/campdfprint Hektor.pdf | strings | wc
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
Use of uninitialized value $text in split at /tmp/campdfprint line 14.
1 3 21
$ pdftotext Hektor.pdf ; strings Hektor.txt |wc
161 4023 23930
$ strings Hektor.pdf |head -1
%PDF-1.4
$

Elijah
------
[1] <URL:http://www.hektor.ch/Book/Hektor.pdf/>
Yes, it has a trailing slash in the link.
[2] pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
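The "uninitialized value $text" warnings above suggest that getPageText() came back undefined for those pages. A minimal sketch of a more defensive loop follows, assuming that an undefined return means "no text could be extracted" and that CAM::PDF->new() returns false and sets $CAM::PDF::errstr when it cannot open the file. The helper names lines_from_page_text and report_pages are made up for illustration and are not part of CAM::PDF.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split page text into lines; an undefined value (extraction failed)
# is treated as "no lines" instead of triggering a warning in split.
sub lines_from_page_text {
    my ($text) = @_;
    return () unless defined $text;
    return split /\n/, $text;
}

# Print the text of every page that can be extracted and return the
# numbers of the pages for which getPageText() gave nothing back.
sub report_pages {
    my ($file) = @_;
    require CAM::PDF;    # loaded lazily, only when a file is processed
    no warnings 'once';  # $CAM::PDF::errstr is mentioned only once here
    my $pdf = CAM::PDF->new($file)
        or die "Cannot open $file: $CAM::PDF::errstr\n";

    my @failed;
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        if (!defined $text) {
            push @failed, $page;
            next;
        }
        print "$_\n" for lines_from_page_text($text);
    }
    return @failed;
}

if (@ARGV) {
    my @failed = report_pages($ARGV[0]);
    warn "No text extracted from page(s): @failed\n" if @failed;
}
```

Run against a problem file, this should at least show whether the failure is in opening the PDF or in extracting text from particular pages. It does not fix the extraction itself.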

Hi Eli!
Thank you for your comments and assistance.
I had a look at the pdftotext prog but it was
unsuitable as my pdf's had tables and they were
being mangled in the conversion process.
I tried the shareware prog "PDF to Excel" and
it is promising but still needs a bit of cleaning
of the data and the files to be eyeballed, tedious
as I have many files each of many pages.
The pdf files seem to have been created originally
from .xls files, so it should be possible to reverse them
programmatically, but that is beyond me.

On an unrelated matter I noticed the links to your
Hektor site: fascinating work! My daughter is a performance
artist and film maker so I will pass your site address to her!
Cheers, Peter
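On the mangled tables: pdftotext has a -layout option that tries to keep the physical layout of the page, which may leave table columns lined up well enough to split apart. A rough sketch follows, assuming pdftotext is on the PATH, that columns end up separated by at least two spaces, and that this Perl supports list-form pipe opens (on Windows that needs a fairly recent Perl). The helper names split_columns and table_rows are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one layout-preserved line into cells on runs of two or more
# whitespace characters, after trimming both ends of the line.
sub split_columns {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    return () if $line eq '';
    return split /\s{2,}/, $line;
}

# Run "pdftotext -layout FILE -" (text to stdout) and return one
# array reference of cells for each non-empty output line.
sub table_rows {
    my ($file) = @_;
    open(my $fh, '-|', 'pdftotext', '-layout', $file, '-')
        or die "Cannot run pdftotext: $!\n";
    my @rows;
    while (my $line = <$fh>) {
        my @cells = split_columns($line);
        push @rows, \@cells if @cells;
    }
    close($fh) or warn "pdftotext exited with status $?\n";
    return @rows;
}

if (@ARGV) {
    # Emit tab-separated rows, which a spreadsheet can import.
    print join("\t", @$_), "\n" for table_rows($ARGV[0]);
}
```

Cells that contain double spaces, or empty cells in the middle of a row, will throw the columns out of step, so the output would still need checking against the original.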
