Cut pages for OCR with RMagick?

Axel Etzold

Dear all,

I have many scanned pages which I'd like to cut to prepare them
for OCR.
There are two things I'd like to do:

1.) Cut off a header of each page containing the page number,

2.) Find the largest horizontal blanks in a page (which are supposed
to separate chapters) like this:

Chapter1's text Chapter1's text
Chapter1's text Chapter1's text
Chapter1's text Chapter1's text
Chapter1's text Chapter1's text
<---- cut here, at this blank
Chapter2's text Chapter2's text
Chapter2's text Chapter2's text
Chapter2's text Chapter2's text
Chapter2's text Chapter2's text
^
|
--- (Then cut vertically)

I have tried to convert my pages, which are A4 at 600 dpi, to pixel arrays,
but this is quite slow. Is there a better method, e.g. using to_blob?

Thank you very much,

Axel
 
Ilmari Heikkinen

Axel said:
I have many scanned pages which I'd like to cut to prepare them
for OCR.
There are two things I'd like to do:

1.) Cut off a header of each page containing the page number,

2.) Find the largest horizontal blanks in a page (which are supposed
to separate chapters)

If you have a binary string of the pixel data in the image
(I guess to_blob gives that), you can do something like this
for cutting the vertical spans of non-white pixels:

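scanline_bytes = image_width * bytes_per_pixel
# /m lets "." also match newline bytes (0x0A); /n treats the data as raw bytes
scanlines = pixels.scan(/.{#{scanline_bytes}}/mn)
chapters = [[]]
scanlines.each{|sl|
  if is_white(sl)
    # a white scanline closes the current chapter; repeated white lines are skipped
    chapters << [] unless chapters.last.empty?
  else
    chapters.last << sl
  end
}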

For finding larger spans of white, keep track of the white
scanlines seen previously. #is_white can well be something
that returns true if fewer than 50 pixels on a scanline are
non-white, or somesuch, so that scanner specks don't count as text.
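
The "keep track of white scanlines" idea can be sketched roughly as follows. This is only a sketch under assumptions that are not from the thread: the pixel data is a raw 8-bit grayscale buffer (one byte per pixel, 255 = white, rows stored top to bottom), and the names blank_line?, largest_blank_band and max_dark are made up for illustration.

```ruby
# Assumption: one byte per pixel, 255 = white.
WHITE = 255

# A scanline counts as blank if it has at most `max_dark` non-white bytes.
def blank_line?(scanline, max_dark = 0)
  scanline.each_byte.count { |b| b != WHITE } <= max_dark
end

# Returns [first_row, height] of the tallest run of blank scanlines,
# or nil if no scanline is blank.
def largest_blank_band(pixels, width, max_dark = 0)
  best = nil
  run_start = nil
  rows = pixels.bytesize / width
  # Iterate one row past the end so a run touching the bottom is closed too.
  (0..rows).each do |y|
    blank = y < rows && blank_line?(pixels.byteslice(y * width, width), max_dark)
    if blank
      run_start ||= y
    elsif run_start
      height = y - run_start
      best = [run_start, height] if best.nil? || height > best[1]
      run_start = nil
    end
  end
  best
end

# Tiny 4x6 test page: rows 0-1 text, rows 2-4 blank, row 5 text.
text  = [0, 255, 0, 255].pack("C*")
blank = [255, 255, 255, 255].pack("C*")
page  = text * 2 + blank * 3 + text
p largest_blank_band(page, 4)   # => [2, 3]
```

On a real page the top and bottom margins are blank bands as well, so it probably makes sense to skip runs that touch the first or last row, or to cut the header off first.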

To crop the margins off the chapter scanlines:

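# byte offset of the leftmost non-white pixel over all scanlines
left_border = chapter_scanlines.map{|sl| sl =~ /#{non_white_pixel}/m }.compact.min
left_border -= left_border % bytes_per_pixel
# same from the right: search the reversed scanline, then convert the offset back
from_right = chapter_scanlines.map{|sl| sl.reverse =~ /#{non_white_pixel}/m }.compact.min
right_border = scanline_bytes - 1 - from_right
# round up to the last byte of that pixel
right_border += bytes_per_pixel - 1 - right_border % bytes_per_pixel

chapter_scanlines.map!{|sl| sl[left_border..right_border] }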
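
For comparison, here is a small self-contained variant of the margin crop. It is a sketch, not the code above: it scans bytes instead of using regexps (which sidesteps string-encoding questions), and it assumes 8-bit grayscale scanlines with 255 as white. The name crop_margins is invented for the example.

```ruby
# Assumptions: each scanline is a binary String, one byte per pixel,
# and `white` (255 by default) is the background value.
def crop_margins(scanlines, white = 255)
  # First and last non-white byte offset of every scanline (nil if all white).
  lefts  = scanlines.map { |sl| sl.bytes.index  { |b| b != white } }.compact
  rights = scanlines.map { |sl| sl.bytes.rindex { |b| b != white } }.compact
  return [] if lefts.empty?   # nothing but white, nothing to keep

  left  = lefts.min
  right = rights.max
  scanlines.map { |sl| sl.byteslice(left..right) }
end

rows = [
  [255, 255,   0, 255, 255, 255],
  [255,   0,   0,   0, 255, 255],
  [255, 255, 255,   0,   0, 255],
].map { |r| r.pack("C*") }

cropped = crop_margins(rows)
p cropped.map(&:bytes)
# => [[255, 0, 255, 255], [0, 0, 0, 255], [255, 255, 0, 0]]
```

For RGB data the offsets would additionally have to be rounded to pixel boundaries, as the left_border and right_border adjustments above do.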

The middle whitespace can be had by (tune the magic number to signify
enough pixels to not be a character space):

# white_pixel: pattern matching one white pixel (placeholder, like non_white_pixel)
left_border = chapter_scanlines.map{|sl| sl =~ /(?:#{white_pixel}){20}/ }.compact.max

and with reversed scanline for right border.
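
Another way to get at the middle whitespace, sketched here under the same assumptions as before (8-bit grayscale, 255 = white, margins already cropped), is to look for the widest run of columns that are white in every scanline. This is a different technique from the per-scanline regexp above, and widest_gutter and min_width are hypothetical names.

```ruby
# Returns the Range of byte columns forming the widest all-white vertical
# strip, or nil if no strip is at least `min_width` columns wide.
def widest_gutter(scanlines, min_width = 20, white = 255)
  width = scanlines.first.bytesize
  rows  = scanlines.map(&:bytes)
  best  = nil
  start = nil
  # Iterate one column past the end so a run touching the right edge is closed.
  (0..width).each do |x|
    white_col = x < width && rows.all? { |r| r[x] == white }
    if white_col
      start ||= x
    elsif start
      len = x - start
      best = (start...x) if len >= min_width && (best.nil? || len > best.size)
      start = nil
    end
  end
  best
end

# Two 3-pixel "text columns" separated by a 4-pixel gutter.
rows = [
  [0, 255, 0, 255, 255, 255, 255, 0,   0, 0],
  [0,   0, 0, 255, 255, 255, 255, 0, 255, 0],
].map { |r| r.pack("C*") }

p widest_gutter(rows, 3)   # => 3...7
```

As with the magic number 20 above, min_width needs tuning so that ordinary gaps between characters and words are not mistaken for the gutter.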

</imaging regexps for fun and profit>


HTH,
 
Tim Hunter

Axel said:
I have tried to convert my pages, which are A4 at 600 dpi, to pixel arrays,
but this is quite slow. Is there a better method, e.g. using to_blob?

to_blob just gives you an in-memory copy of the image file. If the image
is in JPG format, for example, then the blob is an in-memory JPG file.
So, there's no help there.

Ideally you could use some RMagick method or combination of methods to
accomplish your goal. Since the ImageMagick/GraphicsMagick routines are
written in C they'd be much faster. Offhand I can't think of any such
methods, but then I'm not very clever at that sort of thing.

You might try asking the ImageMagick gurus
(http://www.imagemagick.org/discourse-server/) if there's a way to do it
with the command-line utilities. If so, you can usually translate the
commands and options into RMagick methods. See
http://www.simplesystems.org/RMagick/doc/optequiv.html for help with that.
 
