PDF: finding a blank image

DrLeif · Jul 13, 2009

I have about 6000 PDF files which have been produced using a scanner
with more being produced each day. The PDF files contain old paper
records which have been taking up space. The scanner is set to
detect when there is information on the backside of the page (duplex
scan). The problem, of course, is that it's not always reliable, and we
wind up with a number of PDF files containing blank pages.

What I would like to do is have Python detect "blank" pages in a PDF
file and remove them. Any suggestions?

Thanks,
DrL

David Bolen · Jul 14, 2009

DrLeif said:
What I would like to do is have Python detect "blank" pages in a PDF
file and remove them. Any suggestions?

The odds are good that even a blank page is being "rendered" within
the PDF as having some small bits of data due to scanner resolution,
imperfections on the page, etc. So I suspect you won't be able to just
look for a well-defined pattern in the resulting PDF or anything.

Unless you're using OCR, the odds are good that the scanner is
rendering the PDF as an embedded image. What I'd probably do is
extract the image of the page, and then use image processing on it to
try to identify blank pages. I haven't had the need to do this
myself, and tool availability would depend on platform, but for
example, I'd probably try ImageMagick's convert operation to turn the
PDF into images (like PNGs). I think Gimp can also do a similar
conversion, but you'd probably have to script it yourself.
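
The conversion step could be scripted roughly as below. This is only a sketch: it assumes ImageMagick's convert is on the PATH with Ghostscript available for reading PDFs, and the function names, the output file pattern, and the 72 dpi density are illustrative choices rather than anything tested against real scans.

```python
import subprocess


def convert_command(pdf_path, out_pattern="page-%04d.png", dpi=72):
    """Build an ImageMagick command that writes one grayscale PNG per page."""
    return [
        "convert",
        "-density", str(dpi),      # resolution used to rasterize the PDF
        pdf_path,
        "-colorspace", "Gray",     # blank detection only needs gray levels
        out_pattern,               # %04d is filled in with a per-page number
    ]


def pdf_to_pngs(pdf_path, out_pattern="page-%04d.png", dpi=72):
    """Run the conversion; raises CalledProcessError if convert fails."""
    subprocess.run(convert_command(pdf_path, out_pattern, dpi), check=True)
```

A low density keeps the images small, which should be enough for judging blankness; a higher value may be needed if faint marks matter.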

Once you have an image of a page, you could then use something like
OpenCV to process the page (perhaps a morphology operation to remove
small noise areas, then a threshold or non-zero counter to judge
"blankness"), or probably just something like PIL depending on
complexity of the processing needed.
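
As a rough illustration of the simpler end of that range, here is a plain threshold-and-count test with no morphology step. The names and both thresholds (128 for "dark", 0.5% of the page for "blank") are assumptions that would need tuning against actual scans.

```python
def dark_fraction(pixels, dark_threshold=128):
    """Return the fraction of pixels darker than dark_threshold.

    pixels is any iterable of 8-bit grayscale values
    (0 = black, 255 = white); a bytes object works.
    """
    total = 0
    dark = 0
    for value in pixels:
        total += 1
        if value < dark_threshold:
            dark += 1
    return dark / total if total else 0.0


def is_blank(pixels, dark_threshold=128, max_dark_fraction=0.005):
    """Treat a page as blank when only a tiny share of it is dark."""
    return dark_fraction(pixels, dark_threshold) <= max_dark_fraction
```

With PIL installed, the pixel values for one page image could come from Image.open("page-0001.png").convert("L").tobytes(), where the file name is hypothetical. Allowing a small dark fraction rather than requiring zero is what lets scanner specks and dust pass as "blank".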

Once you identify a blank page, removing it could either be with pure
Python (there have been other posts recently about PDF libraries) or
with external tools (such as pdftk under Linux for example).
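
For the pdftk route, one possible sketch is to build a "cat" command that lists only the pages to keep. It assumes pdftk is installed and that the page count and the 1-based blank page numbers are already known; the helper names are made up for illustration.

```python
def keep_ranges(page_count, blank_pages):
    """Collapse the non-blank 1-based page numbers into pdftk range strings."""
    blank = set(blank_pages)
    ranges = []
    start = None
    for page in range(1, page_count + 1):
        if page in blank:
            if start is not None:
                ranges.append((start, page - 1))  # close the current run
                start = None
        elif start is None:
            start = page                          # open a new run
    if start is not None:
        ranges.append((start, page_count))
    return [str(a) if a == b else "%d-%d" % (a, b) for a, b in ranges]


def pdftk_command(src, dst, page_count, blank_pages):
    """Build the argument list for: pdftk SRC cat RANGES output DST."""
    ranges = keep_ranges(page_count, blank_pages)
    if not ranges:
        raise ValueError("every page is blank; nothing to keep")
    return ["pdftk", src, "cat"] + ranges + ["output", dst]
```

For a six-page file with pages 2 and 5 blank, this gives the equivalent of "pdftk in.pdf cat 1 3-4 6 output out.pdf", which can then be passed to subprocess.run. Writing to a new file rather than overwriting the scan leaves the original available in case a page was misjudged.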

-- David

Scott David Daniels · Jul 14, 2009

I'd check into ReportLab's commercial product, it may well be easily
capable of that. If no success, you might contact PJ at Groklaw, she
has dealt with a _lot_ of PDFs (and knows people who deal with PDFs
in bulk).

--Scott David Daniels

DrLeif · Jul 14, 2009

Scott David Daniels said:
I'd check into ReportLab's commercial product, it may well be easily
capable of that. If no success, you might contact PJ at Groklaw, she
has dealt with a _lot_ of PDFs (and knows people who deal with PDFs
in bulk).

Thanks everyone for the quick reply.

I had considered using ReportLab; however, I was uncertain about its
ability to detect a blank page.

Scott, I'll drop an email to ReportLab and PJ....

Thanks again,
DrLeif
