Binary file manipulation

Monty

I have very large image files that I need to search for consecutive
values of zero. The files are around 800 MB in size and don't lend
themselves to being loaded into memory for manipulation.

I thought Tie::File might do the trick as it ties an array directly to
a file, but it's written to expect some sort of end-of-line marker,
whereas none of my data has that. Tie::File also won't let me set the
end-of-line marker to empty or null, so I can't use that module.
According to the Tie::File manpage, there doesn't seem to be a way of
connecting an array to a file without these EOL markers, and I didn't
see any options for binary files in the documentation.

Can someone recommend a method for parsing through this much data,
array style, that would let me compare values as though they were
adjacent members of a two-dimensional array?

Thanks
 
A. Sinan Unur

Monty said:
> I have very large image files that I need to search for consecutive
> values of zero. The files are around 800 MB in size and don't lend
> themselves to being loaded into memory for manipulation.

There is no need to load the whole file into memory.

Use sysread to read in chunks, then find consecutive zero bytes.

perldoc -f sysread
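
A minimal sketch of the chunked approach, for illustration only. The function name find_zero_runs and its parameters are assumptions, not something from this thread. It reads with sysread and carries an unfinished run of zero bytes over from one chunk to the next, so runs that straddle a chunk boundary are not missed. It assumes $min_run is at least 1 and that the filehandle was opened on a real file in raw mode (sysread does not work on in-memory filehandles).

```perl
use strict;
use warnings;

# Return [offset, length] pairs for every run of at least $min_run
# consecutive zero bytes, reading $chunk_size bytes at a time.
sub find_zero_runs {
    my ($fh, $min_run, $chunk_size) = @_;
    my (@runs, $run_start);
    my $run_len = 0;    # length of the run still open from the previous chunk
    my $offset  = 0;    # file offset of the first byte of the current chunk

    # Record the open run if it is long enough, then forget it.
    my $close = sub {
        push @runs, [$run_start, $run_len] if $run_len >= $min_run;
        $run_len = 0;
    };

    while (1) {
        my $got = sysread($fh, my $chunk, $chunk_size);
        die "sysread failed: $!" unless defined $got;
        last if $got == 0;    # end of file

        # An open run can only continue if this chunk starts with a zero byte.
        $close->() if substr($chunk, 0, 1) ne "\0";

        while ($chunk =~ /\0+/g) {
            my ($from, $to) = ($-[0], $+[0]);
            $run_start = $offset + $from unless $run_len;
            $run_len  += $to - $from;
            $close->() if $to < $got;    # the run ended inside this chunk
        }
        $offset += $got;
    }
    $close->();    # a run may extend to the very end of the file
    return @runs;
}
```

Typical use would be something like open my $fh, '<:raw', $path followed by find_zero_runs($fh, 10, 1 << 20). Note that this treats the file as one long byte string, so a run that wraps from the end of one image row to the start of the next is also reported.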

If you make an attempt, we will be able to help you better.

Please do read the posting guidelines for this group.

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 
xhoster

Monty said:
> I have very large image files that I need to search for consecutive
> values of zero. The files are around 800 MB in size and don't lend
> themselves to being loaded into memory for manipulation.

Why don't they lend themselves to that? Because you don't have 800 MB of
memory (plus overhead) to spare, or for some other reason?

> I thought Tie::File might do the trick as it ties an array directly to
> a file, but it's written to expect some sort of end-of-line marker,
> whereas none of my data has that.

Yep. That is inherently what Tie::File does. The whole module is centered
around line-oriented, variable line length files.

> Tie::File also won't let me set the
> end-of-line marker to empty or null, so I can't use that module.
> According to the Tie::File manpage, there doesn't seem to be a way of
> connecting an array to a file without these EOL markers, and I didn't
> see any options for binary files in the documentation.

Tie::File is not the only tying module. I don't know of a tying module,
off the top of my head, that would serve your purposes, but if you look
under the Tie::* hierarchy on CPAN you might get something. But it seems
so easy to implement what you want with seek and read, that I wouldn't
spend much time searching around for a ready-made module.
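
As a rough illustration of the seek-and-read idea, here is a small sketch. The function name byte_at is hypothetical, and the layout is an assumption: one byte per pixel, rows stored one after another, no file header.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Return the numeric value of the byte at ($row, $col), or undef past end of file.
# Assumes $width bytes per image row and no header before the pixel data.
sub byte_at {
    my ($fh, $width, $row, $col) = @_;
    seek($fh, $row * $width + $col, SEEK_SET) or die "seek failed: $!";
    my $n = read($fh, my $byte, 1);
    die "read failed: $!" unless defined $n;
    return $n ? ord($byte) : undef;
}
```

One seek per pixel is slow for a full scan of 800 MB, so this is better suited to spot checks than to searching the whole image.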

But really, this problem seems to just be begging for C, rather than
Perl.

> Can someone recommend a method for parsing through this much data,
> array style, that would let me compare values as though they were
> adjacent members of a two-dimensional array?

What does "adjacent" mean to you in a two-dimensional array?

Xho
 
jl_post

Monty said:
> I have very large image files that I need to search for consecutive
> values of zero. The files are around 800 MB in size and don't lend
> themselves to being loaded into memory for manipulation.

If your binary file is just a list of fixed-length records, you can
loop through the records one at a time (provided that you know the
length of the records) by setting the $/ variable, like this:

open(my $in, '<', 'file.binary') or die "Cannot open file.binary: $!";
binmode($in);
local $/ = \1020; # each record is 1020 bytes long
# Loop through the file, 1020 bytes at a time:
while (<$in>)
{
    # The binary record is now in $_
}
close($in);

If, by chance, you know the pack-string that corresponds to these
records (assuming they are fixed-length records), you can easily view
the data inside the file, like this:

open(my $in, '<', 'file.binary') or die "Cannot open file.binary: $!";
binmode($in);
# Set the $packString for each record:
my $packString = "i4 d2 Z12 Z256";
# Use the $packString to find the length of each record:
local $/ = \(length(pack($packString)));
# Loop through the file, one record at a time:
while (<$in>)
{
    # The binary record is now in $_
    my @values = unpack($packString, $_);
    print "*** Record found:\n @values\n";
}
close($in);

A few notes to keep in mind:

1. Since you are dealing with binary data, you really want to call
binmode() on your filehandle. Not doing so prevents your code from
being portable and may create some hard-to-find bugs.

2. Since this method uses the pack() and unpack() functions, you
really want to turn on warnings and strictures (with "use warnings;"
and "use strict;" near the top of your file). Not doing so will make
simple bugs extremely difficult to find. If you use them, they will
often point out the exact line number where an error occurs right away
(eliminating the need for you to hunt down the error's exact spot).

3. If you are not familiar with pack(), unpack(), and how to compose
pack strings, I encourage you to read the perldocs by typing "perldoc
-f pack" at any prompt ("perldoc perlpacktut" is a gentler tutorial).

4. If you are not familiar with the $/ variable, look it up in
"perldoc perlvar".

> Can someone recommend a method for parsing through this much data,
> array style, that would let me compare values as though they were
> adjacent members of a two-dimensional array?

I'm not quite sure what you mean here, but if your files contain
fixed length records, you would probably benefit by using the $/
variable. And if you know the data-types of the fixed-length records,
the pack() and unpack() functions are extremely useful.
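
To make the pack()/unpack() relationship concrete, here is a small round trip using the same pack string as the example in this post. The sample values are made up purely for illustration. The record length depends on the platform's native int size, so the sketch computes it rather than assuming a number.

```perl
use strict;
use warnings;

my $template = "i4 d2 Z12 Z256";    # 4 native ints, 2 doubles, 2 NUL-padded strings

# Packing with no values fills the record with zeros; its length is the record size.
my $record_length = length pack($template);

# Build one record from sample values, then decode it again.
my $record = pack($template, 1, 2, 3, 4, 1.5, 2.5, "abc", "some text");
my @values = unpack($template, $record);

print "record length: $record_length bytes\n";
print "values: @values\n";
```

Because the Z fields are padded to a fixed width, every packed record has the same length, which is what makes setting $/ to that length work.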

I hope this helps, Monty.

-- Jean-Luc Romano
 
Monty

First off, thanks to the respondents. Secondly, I may have posted
erroneously here or somehow committed a breach of etiquette, for which I
apologize. I've read what I could find on guidelines for posting and
thought I was within parameters.

I left out a few things maybe I should have mentioned: the file has no
individual records--it's one, huge, 800 MB file of image data that gets
displayed in a rectangular format. As bytes get read in they're used
to populate an image of predetermined size in both width and height.

To A. Sinan Unur: I contemplated using sysread, but I'm not experienced
enough with programming, and I think I might miss adjacent bytes of data
that span different chunks of the file. For instance, if I find a data
hole (value 0) in a byte at the end of a chunk, I would need to somehow
remember that so that in the ensuing chunk I can look for another hole
that would coincide with being just 'below' the value I found in the
previous chunk. While this isn't completely out of the question, I
thought there might be another way. Also, I'll check the guidelines you
listed.

To Xho: Our system has 32 GB RAM, most of which often goes unused. I
haven't quite figured out how to raise the individual user memory
allocation limit, but it still seems like there should be a better way,
one that's not dependent on having a slew of memory to toss around.
Also, 'adjacent' in a two dimensional array would be those bytes that
are next to each other in the same row, or directly above or below each
other in the same column.

To Jean-Luc Romano: thanks, but there are no individual records in
this file.

To John: I'll check those links out. They sound promising.

Thanks again all!
 
Dr.Ruud

Monty wrote:
> one, huge, 800 MB file of image data that
> gets displayed in a rectangular format. As bytes get read in they're
> used to populate an image of predetermined size in both width and
> height. [...]
> 'adjacent' in a two dimensional array would be those
> bytes that are next to each other in the same row, or directly above
> or below each other in the same column.

You are (almost) implying there that there is 1 byte per pixel (or per
pixel.color or pixel.channel or pixel.layer, etc.).

Have you checked GD yet?
http://search.cpan.org/~lds/GD/
 
xhoster

Monty said:
> First off, thanks to the respondents. Secondly, I may have posted
> erroneously here or somehow committed a breach of etiquette, for which I
> apologize. I've read what I could find on guidelines for posting and
> thought I was within parameters.
>
> I left out a few things maybe I should have mentioned: the file has no
> individual records--it's one, huge, 800 MB file of image data that gets
> displayed in a rectangular format. As bytes get read in they're used
> to populate an image of predetermined size in both width and height.

There is actually a hierarchy of records. You have one record type,
containing some fixed number of bytes, which represents a row (or is it a
column) in the image. Within that, you have another record type,
presumably one byte (or is it one bit? Or something else?) representing a
pixel within that row in the image.

> To A. Sinan Unur: I contemplated using sysread, but am not experienced
> with programming enough that I think I may miss adjacent bytes of data
> across different chunks of the file.

I would use read rather than sysread and make each chunk exactly equal
to one row of the image (or just set $/ to a reference to the
number of bytes in a row).

> For instance, if I find a data
> hole (value 0) in a byte at the end of a chunk, I would need to somehow
> remember that so that in the ensuing chunk I can look for another hole
> that would coincide with being just 'below' the value I found in the
> previous chunk.

If the next chunk was the start of the next image-row, then you wouldn't
need to worry about it. (assuming your space is flat, like a chessboard.
If it is really a torus represented as a flat image, like a pac-man screen,
then that is different.)

> While this isn't completely out of the question, I
> thought there might be another way. Also, I'll check the guidelines you
> listed.
>
> To Xho: Our system has 32 GB RAM, most of which often goes unused. I
> haven't quite figured out how to up the individual user memory
> allocation limit, but it still seems like there should be a better way,
> one that's not dependent on having a slew of memory to toss around.
> Also, 'adjacent' in a two dimensional array would be those bytes that
> are next to each other in the same row, or directly above or below each
> other in the same column.

So at any one time you only need two rows worth of bytes in memory.

binmode $fh;
$/ = \$how_ever_many_bytes_in_a_row;   # each <$fh> now returns one image row

my $old_row;
while (<$fh>) {
    find_adjacent_in_row($_);
    find_adjacent_between_rows($old_row, $_) if defined $old_row;
    $old_row = $_;
}
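
The two helper functions called in the loop are not defined in the post. One possible way to write them, as a sketch only, assuming one byte per pixel and that "adjacent" means a pair of neighbouring zero bytes:

```perl
use strict;
use warnings;

# Columns where a zero byte is immediately followed by another zero byte
# in the same row. The lookahead lets overlapping pairs be reported.
sub find_adjacent_in_row {
    my ($row) = @_;
    my @cols;
    push @cols, $-[0] while $row =~ /\0(?=\0)/g;
    return @cols;
}

# Columns where both rows hold a zero byte.
sub find_adjacent_between_rows {
    my ($upper, $lower) = @_;
    # Bitwise OR on two strings works byte by byte: the result is \0 only
    # where both inputs are \0. (This relies on the classic string behaviour
    # of |, i.e. the 'bitwise' feature is not enabled.)
    my $both = $upper | $lower;
    my @cols;
    push @cols, $-[0] while $both =~ /\0/g;
    return @cols;
}
```

Both functions return column indexes counted from 0. Using the string OR avoids a Perl-level loop over every byte of the row, which matters when rows are long.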

Xho
 
Monty

I see what you're saying, but the number of adjacent rows containing a
zero value in a particular column could be very large (minimum of 10
vertically or horizontally adjacent bytes is considered a hole, and
often becomes 100 or more).

Secondly, are you saying (with $/=\$how_ever_many_bytes_in_a_row) that
the end-of-line delimiter can be set to a particular number of bytes
instead of an actual end-of-line value?
 
Guest

: I see what you're saying, but the number of adjacent rows containing a
: zero value in a particular column could be very large (minimum of 10
: vertically or horizontally adjacent bytes is considered a hole, and
: often becomes 100 or more).

100 lines is still easy to handle: you can push them into a sliding window
(a FIFO queue, first in, first out) implemented as an array. The array
combines well with the map function (perldoc -f map), so you can perform
checks of virtually arbitrary complexity on any desired number of rows
(simply by choosing the size of the list passed to map).
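
A sketch of such a sliding window, under stated assumptions: one byte per pixel, $width bytes per row, and a "hole" meaning $min_run zero bytes stacked vertically in one column. The names all_zero_columns and vertical_holes are made up for this example.

```perl
use strict;
use warnings;

# Columns that are zero in every row of the window.
sub all_zero_columns {
    my @window   = @_;
    my $combined = shift @window;
    $combined |= $_ for @window;    # string OR: \0 survives only if zero in all rows
    my @cols;
    push @cols, $-[0] while $combined =~ /\0/g;
    return @cols;
}

# Return [top_row, column] for every vertical run of $min_run zero bytes.
sub vertical_holes {
    my ($fh, $width, $min_run) = @_;
    local $/ = \$width;             # read one image row at a time
    my (@window, @hits);
    my $row_no = 0;                 # number of rows read so far
    while (my $row = <$fh>) {
        $row_no++;
        push @window, $row;                     # newest row in
        shift @window if @window > $min_run;    # oldest row out
        next if @window < $min_run;
        my $top = $row_no - $min_run;           # 0-based index of the window's first row
        push @hits, map { [$top, $_] } all_zero_columns(@window);
    }
    return @hits;
}
```

Only $min_run rows are held in memory at once. The window is recombined for every row, which is simple but does repeated work; a per-column counter of consecutive zeros would be a cheaper alternative if speed matters.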

: Secondly, are you saying (with $/=\$how_ever_many_bytes_in_a_row) that
: the end-of-line delimiter can be set to a particular number of bytes
: instead of an actual end-of-line value?

Yes, Perl allows for reading fixed-length "records" without any visible
EOL character. See $/ in perldoc perlvar, and also check read and sysread.
If you have any knowledge of your data _before_ you run your program, you
can hard-code the record length into your program, but you can also set
the record length dynamically, e.g. by reading specific bytes from your file.
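
As an illustration of setting the record length dynamically, the sketch below assumes a purely hypothetical file layout: a 4-byte big-endian unsigned integer giving the row width, followed directly by the pixel rows. The original poster's files may have no header at all; the function name each_row and the layout are invented for this example.

```perl
use strict;
use warnings;

# Read the row width from a (hypothetical) 4-byte header, then hand each
# row of that width to $callback. Returns the width that was read.
sub each_row {
    my ($fh, $callback) = @_;
    binmode $fh;

    my $n = read($fh, my $header, 4);
    die "could not read 4-byte header" unless defined $n && $n == 4;
    my $width = unpack 'N', $header;    # 'N' = 32-bit big-endian unsigned
    die "header gives a zero width" unless $width;

    local $/ = \$width;                 # record length taken from the file itself
    while (my $row = <$fh>) {
        $callback->($row);
    }
    return $width;
}
```

A call might look like each_row($fh, sub { my ($row) = @_; ... }). Using local keeps the changed $/ from leaking into other code that reads line-oriented files.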

Oliver.
 
Anno Siegel


[finding adjacent zeros in a matrix of bytes]

> So at any one time you only need two rows worth of bytes in memory.
>
> binmode $fh;
> $/=\$how_ever_many_bytes_in_a_row;
>
> my $old_row;
> while (<$fh>) {
>     find_adjacent_in_row($_);
>     find_adjacent_between_rows($old_row,$_) if defined $old_row;
>     $old_row=$_;
> };

If the majority of bytes are non-zero (as the term "hole" might suggest)
one could simply record their positions.

my %by_lines;
while ( <$fh> ) {
    next unless /\0/; # don't record lines without zeros
    push @{ $by_lines{ $.} }, $-[ 0] while /\0/g;
}

Finding chains of adjacent zeros in a line means searching a (sorted)
list of integers for runs of consecutive integers. That's not hard
to do.
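
For example, a run finder along these lines would do. This is a sketch; the name consecutive_runs and the [start, length] return format are choices made for this illustration.

```perl
use strict;
use warnings;

# Given a minimum length and a sorted list of integers, return [start, length]
# for each run of consecutive integers that is at least $min_len long.
sub consecutive_runs {
    my ($min_len, @sorted) = @_;
    my (@runs, $start, $len);
    for my $n (@sorted) {
        if (defined $start && $n == $start + $len) {
            $len++;                 # $n extends the current run
            next;
        }
        # $n starts a new run: keep the previous one if it was long enough.
        push @runs, [$start, $len] if defined $start && $len >= $min_len;
        ($start, $len) = ($n, 1);
    }
    push @runs, [$start, $len] if defined $start && $len >= $min_len;
    return @runs;
}
```

Calling consecutive_runs(10, @{ $by_lines{$line} }) would then list the horizontal holes of 10 or more zero bytes in one line.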

To do the same thing for columns, "invert" the %by_lines hash

my %by_columns;
for my $li ( sort { $a <=> $b } keys %by_lines ) {
    push @{ $by_columns{ $_} }, $li for @{ $by_lines{ $li} };
}

Then apply the same procedure to find adjacent zeros in each column.
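
Putting the two tables together on a tiny example may make the inversion clearer. This is a sketch with assumed names (zero_positions is not from the thread); it numbers rows from 0 with its own counter rather than using $., and assumes one byte per pixel with $width bytes per row.

```perl
use strict;
use warnings;

# Build both tables: zero columns per row, and zero rows per column.
sub zero_positions {
    my ($fh, $width) = @_;
    local $/ = \$width;             # one image row per read
    my (%by_lines, %by_columns);
    my $line = 0;
    while (my $row = <$fh>) {
        push @{ $by_lines{$line} }, $-[0] while $row =~ /\0/g;
        $line++;
    }
    # Invert: visiting lines in ascending order keeps each column's list sorted.
    for my $li (sort { $a <=> $b } keys %by_lines) {
        push @{ $by_columns{$_} }, $li for @{ $by_lines{$li} };
    }
    return (\%by_lines, \%by_columns);
}
```

Each list in %by_columns is already sorted, so it can be searched for runs of consecutive row numbers exactly as the per-line lists are searched for runs of consecutive columns.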

Other patterns of zeros could be detected, but may involve using both
tables at once. (Code untested)

Anno
 
xhoster

Monty said:
> I see what you're saying,

But I don't :)

Please quote some of the text you are replying to. That way I can more
easily see which part of what I said you are responding to.

> but the number of adjacent rows containing a
> zero value in a particular column could be very large (minimum of 10
> vertically or horizontally adjacent bytes is considered a hole, and
> often becomes 100 or more).

Ah, this is different. I thought you meant a pair of adjacent things,
not a whole run of them. What if it is a 9 by 9 square of zero values? No
direction reaches the minimum of 10, yet overall there are 81 missing pixels.

What other wrinkles are there that you haven't described yet? Once you
find these runs or streaks (or blotches or squares or whatever), what are
you going to do with them?

> Secondly, are you saying (with $/=\$how_ever_many_bytes_in_a_row) that
> the end-of-line delimiter can be set to a particular number of bytes
> instead of an actual end-of-line value?

Yes

Xho
 
Monty

>> I see what you're saying,
> But I don't :)

I meant that, in general, I got the drift of your advice.

> I thought you meant a pair of adjacent things,
> not a whole run of them.

It starts with adjacent members :)

> What if it is a 9 by 9 square of zero values?

That may come later. For now, finding a horizontal or vertical run of
zeroes will be a good start, and we've yet to decide what to do with
them.
 
Monty

To all:

Please disregard my previous post. I'm still learning how these
newsgroups and their protocols work.

Many thanks for the good advice. Let's end this thread before I spend
more time maintaining it instead of implementing these suggestions.
 
