Cut pages for OCR with RMagick?

Axel Etzold · Sep 29, 2007

Dear all,

I have many scanned pages which I'd like to cut to prepare them
for OCR.
There are two things I'd like to do:

1.) Cut off a header of each page containing the page number,

2.) Find the largest horizontal blanks in a page (which are supposed
to separate chapters) like this:

Chapter1's text      Chapter1's text
Chapter1's text      Chapter1's text
Chapter1's text      Chapter1's text
Chapter1's text      Chapter1's text
        <---- cut here, at this blank
Chapter2's text      Chapter2's text
Chapter2's text      Chapter2's text
Chapter2's text      Chapter2's text
Chapter2's text      Chapter2's text
                 ^
                 |
                 --- (Then cut vertically)

I have tried converting my pages, which are A4 at 600 dpi, to pixel arrays,
but this is quite slow. Is there a better method, i.e. using to_blob?

Thank you very much,

Axel

Ilmari Heikkinen · Sep 29, 2007

Axel said:
I have many scanned pages which I'd like to cut to prepare them
for OCR.
There are two things I'd like to do:

1.) Cut off a header of each page containing the page number,

2.) Find the largest horizontal blanks in a page (which are supposed
to separate chapters)

If you have a binary string of the pixel data in the image
(I guess to_blob gives that), you can do something like this
for cutting the vertical spans of non-white pixels:

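scanline_bytes = image_width * bytes_per_pixel
# /m lets . match newline bytes (0x0A) that occur in the binary data
scanlines = pixels.scan(/.{#{scanline_bytes}}/m)
chapters = [[]]
scanlines.each{|sl|
  if is_white(sl)
    # a white scanline closes the current chapter, if it has any content
    chapters << [] unless chapters.last.empty?
  else
    chapters.last << sl
  end
}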

For finding larger spans of white, keep track of the white
scanlines seen previously. #is_white can well be something
that returns true if fewer than 50 pixels on a scanline are
non-white, or somesuch.
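
The bookkeeping described above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the post: the scanlines are 8-bit grayscale with one byte per pixel, a byte below WHITE_THRESHOLD counts as dark, and a scanline counts as blank when it has fewer than MAX_DARK_PIXELS dark pixels. The constants and largest_blank are made-up names.

```ruby
# Hypothetical helpers; assumes 8-bit grayscale data, one byte per pixel.
WHITE_THRESHOLD = 200 # assumed: bytes at or above this value count as white
MAX_DARK_PIXELS = 50  # tolerance for specks on an otherwise blank scanline

# True when the scanline has fewer than MAX_DARK_PIXELS dark pixels.
def is_white(scanline)
  dark = scanline.each_byte.count { |b| b < WHITE_THRESHOLD }
  dark < MAX_DARK_PIXELS
end

# Returns [first_row, row_count] for the longest run of consecutive
# white scanlines, or nil if no scanline is white.
def largest_blank(scanlines)
  best = nil
  start = nil
  scanlines.each_with_index do |sl, row|
    if is_white(sl)
      start ||= row # remember where the current white run began
      length = row - start + 1
      best = [start, length] if best.nil? || length > best[1]
    else
      start = nil   # a non-white scanline ends the run
    end
  end
  best
end
```

Note that the top and bottom page margins are white runs too, so in practice they would probably have to be trimmed or skipped before taking the largest one.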

To crop the margins off the chapter scanlines:

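# byte offset of the first non-white pixel, smallest over all scanlines
left_border = chapter_scanlines.map{|sl| sl =~ /#{non_white_pixel}/ }.compact.min
left_border -= left_border % bytes_per_pixel
# number of white bytes at the end of a scanline, smallest over all scanlines
right_margin = chapter_scanlines.map{|sl| sl.reverse =~ /#{non_white_pixel}/ }.compact.min
right_margin -= right_margin % bytes_per_pixel

chapter_scanlines.map!{|sl| sl[left_border...(sl.size - right_margin)] }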

The middle whitespace can be had by (tune the magic number to signify
enough pixels to not be a character space):

# white_pixel: a pattern matching one white pixel, the counterpart of non_white_pixel
left_border = chapter_scanlines.map{|sl| sl =~ /(?:#{white_pixel}){20}/ }.compact.max

and with reversed scanline for right border.
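
As an alternative to the regexp approach (this is a different technique, a column projection, not what the snippet above does), one could count the dark pixels in each column and look for the widest run of empty columns that has ink on both sides. A minimal sketch, assuming a non-empty array of 8-bit grayscale scanlines of equal length with one byte per pixel; dark_columns, widest_gutter and the threshold of 200 are made-up for illustration:

```ruby
# Counts, for every column, how many scanlines have a dark pixel there.
# Assumes equal-length scanlines, one byte per pixel.
def dark_columns(scanlines, threshold = 200)
  counts = Array.new(scanlines.first.bytesize, 0)
  scanlines.each do |sl|
    sl.each_byte.with_index { |b, x| counts[x] += 1 if b < threshold }
  end
  counts
end

# Returns [first_column, width] of the widest run of empty columns that
# has ink on both sides, or nil when no such run reaches min_width.
def widest_gutter(scanlines, min_width = 20)
  best = nil
  start = nil
  dark_columns(scanlines).each_with_index do |n, x|
    if n.zero?
      start ||= x
    else
      # Runs that begin at column 0 are the left margin, not a gutter.
      if start && start > 0
        width = x - start
        if width >= min_width && (best.nil? || width > best[1])
          best = [start, width]
        end
      end
      start = nil
    end
  end
  best # a run touching the right edge never closes, so it is ignored
end
```

As with the magic number 20 above, min_width needs tuning so that ordinary spaces between words are not mistaken for the gap between columns.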

</imaging regexps for fun and profit>

HTH,

Tim Hunter · Sep 29, 2007

Axel said:
I have tried converting my pages, which are A4 at 600 dpi, to pixel arrays,
but this is quite slow. Is there a better method, i.e. using to_blob?

to_blob just gives you an in-memory copy of the image file. If the image
is in JPG format, for example, then the blob is an in-memory JPG file.
So, there's no help there.
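
If what you need is the decoded pixel data as a string rather than the encoded file, RMagick's export_pixels_to_str may be closer to what is wanted. The following is a rough sketch, not a tested recipe: it assumes an 8-bit grayscale export (map "I", default storage type), gray_scanlines is a made-up helper name, and "page.png" is a placeholder.

```ruby
# Sketch only: `image` is expected to be a Magick::Image, or anything
# that responds to columns, rows and export_pixels_to_str.
def gray_scanlines(image)
  width = image.columns
  # Map "I" asks for one intensity sample per pixel; with the default
  # storage type (Magick::CharPixel) that is one byte per pixel.
  pixels = image.export_pixels_to_str(0, 0, width, image.rows, "I")
  # Slice the flat string into one binary string per row.
  (0...image.rows).map { |row| pixels.byteslice(row * width, width) }
end

# With RMagick installed, usage might look like this:
#   require "rmagick"
#   page = Magick::Image.read("page.png").first
#   scanlines = gray_scanlines(page)
```

An A4 page at 600 dpi is roughly 4960 x 7016 pixels, so even at one byte per pixel the string is around 35 MB; exporting a strip at a time via the x, y, columns and rows arguments might be gentler on memory.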

Ideally you could use some RMagick method or combination of methods to
accomplish your goal. Since the ImageMagick/GraphicsMagick routines are
written in C they'd be much faster. Offhand I can't think of any such
methods, but then I'm not very clever at that sort of thing.

You might try asking the ImageMagick gurus
(http://www.imagemagick.org/discourse-server/) if there's a way to do it
with the command-line utilities. If so, you can usually translate the
commands and options into RMagick methods. See
http://www.simplesystems.org/RMagick/doc/optequiv.html for help with that.
