PDF: finding a blank image

DrLeif

I have about 6000 PDF files which have been produced using a scanner
with more being produced each day. The PDF files contain old paper
records which have been taking up space. The scanner is set to
detect when there is information on the backside of the page (duplex
scan). The problem, of course, is that it's not always reliable and we
wind up with a number of PDF files containing blank pages.

What I would like to do is have Python detect "blank" pages in a PDF
file and remove them. Any suggestions?


Thanks,
DrL
 
David Bolen

DrLeif said:
What I would like to do is have Python detect "blank" pages in a PDF
file and remove them. Any suggestions?

The odds are good that even a blank page is being "rendered" within
the PDF as having some small bits of data due to scanner resolution,
imperfections on the page, etc. So I suspect you won't be able to just
look for a well-defined pattern in the resulting PDF or anything.

Unless you're using OCR, the odds are good that the scanner is
rendering the PDF as an embedded image. What I'd probably do is
extract the image of the page, and then use image processing on it to
try to identify blank pages. I haven't had the need to do this
myself, and tool availability would depend on platform, but for
example, I'd probably try ImageMagick's convert operation to turn the
PDF into images (like PNGs). I think Gimp can also do a similar
conversion, but you'd probably have to script it yourself.
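As a rough sketch of the conversion step described above, something like
the following could drive ImageMagick from Python. The function names,
the output pattern and the 72 dpi default are my own placeholders, not
anything from the thread. It assumes the "convert" executable is on the
PATH (ImageMagick 7 installs it as "magick" instead) and that Ghostscript
is installed, since ImageMagick relies on it to read PDFs.

```python
import subprocess


def convert_command(pdf_path, out_pattern="page-%03d.png", dpi=72):
    """Build the ImageMagick command line that renders each PDF page to an image.

    out_pattern and dpi are illustrative defaults; a low resolution is
    usually enough for a blank/non-blank decision and keeps things fast.
    """
    return ["convert", "-density", str(dpi), pdf_path, out_pattern]


def pdf_to_images(pdf_path, out_pattern="page-%03d.png", dpi=72):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(convert_command(pdf_path, out_pattern, dpi), check=True)
```

Calling pdf_to_images("scan.pdf") would then leave one PNG per page in
the current directory, ready for the image-processing step.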

Once you have an image of a page, you could then use something like
OpenCV to process the page (perhaps a morphology operation to remove
small noise areas, then a threshold or non-zero counter to judge
"blankness"), or probably just something like PIL depending on
complexity of the processing needed.
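To make the "blankness" test a bit more concrete, here is a minimal
sketch using PIL (Pillow) only. It swaps the morphology operation for a
simpler median filter to knock out isolated specks, then measures what
fraction of the pixels are dark. The function names and both thresholds
(128 for "dark", 0.5% of the page as the cut-off) are guesses that would
need tuning against real scans.

```python
def dark_fraction(histogram, dark_below=128):
    """Fraction of pixels darker than dark_below, given a 256-bin grayscale histogram."""
    total = sum(histogram)
    if total == 0:
        raise ValueError("empty histogram")
    return sum(histogram[:dark_below]) / total


def looks_blank(histogram, dark_below=128, max_dark_fraction=0.005):
    """Treat a page as blank when almost none of its pixels are dark."""
    return dark_fraction(histogram, dark_below) <= max_dark_fraction


def page_is_blank(image_path, dark_below=128, max_dark_fraction=0.005):
    """Load a page image, smooth away small noise, and apply the test above."""
    # Imported here so the pure functions above work without Pillow installed.
    from PIL import Image, ImageFilter

    with Image.open(image_path) as img:
        gray = img.convert("L")  # 8-bit grayscale
        # A 3x3 median filter removes single-pixel specks (a stand-in
        # for the morphology step, not the same operation).
        smoothed = gray.filter(ImageFilter.MedianFilter(size=3))
        return looks_blank(smoothed.histogram(), dark_below, max_dark_fraction)
```

Working from the histogram keeps the decision separate from the image
loading, so the thresholds can be tried out on saved histograms without
re-reading every page.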

Once you identify a blank page, removing it could either be with pure
Python (there have been other posts recently about PDF libraries) or
with external tools (such as pdftk under Linux for example).
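For the removal step with pdftk, a sketch along these lines might do. It
uses pdftk's "cat" operation to copy only the pages that should be kept.
The helper names are made up for illustration, page numbers are 1-based
as pdftk expects, and the page count and the list of blank pages are
assumed to come from the earlier detection step.

```python
import subprocess


def keep_ranges(total_pages, blank_pages):
    """Return pdftk-style page ranges covering every page not in blank_pages."""
    blank = set(blank_pages)
    ranges = []
    start = None
    for page in range(1, total_pages + 1):
        if page in blank:
            if start is not None:
                ranges.append((start, page - 1))  # close the current run
                start = None
        elif start is None:
            start = page  # open a new run of kept pages
    if start is not None:
        ranges.append((start, total_pages))
    return [str(a) if a == b else f"{a}-{b}" for a, b in ranges]


def pdftk_command(src, dst, total_pages, blank_pages):
    """Build the pdftk command line that writes dst without the blank pages."""
    ranges = keep_ranges(total_pages, blank_pages)
    if not ranges:
        raise ValueError("every page was flagged blank; refusing to write an empty PDF")
    return ["pdftk", src, "cat", *ranges, "output", dst]


def remove_blank_pages(src, dst, total_pages, blank_pages):
    """Run pdftk; raises CalledProcessError if it fails."""
    subprocess.run(pdftk_command(src, dst, total_pages, blank_pages), check=True)
```

Writing to a new file rather than overwriting the original leaves room
to spot-check the results before deleting anything.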

-- David
 
Scott David Daniels

I'd check into ReportLab's commercial product, it may well be easily
capable of that.  If no success, you might contact PJ at Groklaw, she
has dealt with a _lot_ of PDFs (and knows people who deal with PDFs
in bulk).

--Scott David Daniels
(e-mail address removed)
 
DrLeif

Scott David Daniels said:
I'd check into ReportLab's commercial product, it may well be easily
capable of that.  If no success, you might contact PJ at Groklaw, she
has dealt with a _lot_ of PDFs (and knows people who deal with PDFs
in bulk).

--Scott David Daniels
(e-mail address removed)


Thanks everyone for the quick reply.

I had considered using ReportLab; however, I was uncertain about its
ability to detect a blank page.

Scott, I'll drop an email to ReportLab and PJ....

Thanks again,
DrLeif
 
