Best way to make a few changes in a large data file


ccc31807

My big data file looks like this:
1,al
2,becky
3,carl
4,debbie
5,ed
6,frieda
.... for perhaps 200K or 300k lines

My change file looks like this:
5, edward
.... for perhaps ten or twelve lines

My script looks like this (SKIPPING THE DETAILS):
my %big_data_hash;
while (<BIG>)    { chomp; my ($id, $name) = split /,/; $big_data_hash{$id} = $name; }
while (<CHANGE>) { chomp; my ($id, $name) = split /,/; $big_data_hash{$id} = $name; }
foreach my $id (keys %big_data_hash) {
    print OUT qq($id,$big_data_hash{$id}\n);
}

This seems wasteful to me, loading several hundred thousand lines of data in memory just to make a few changes. Is there any way to tie the data file to a hash and make the changes directly?

Does anyone have any better ideas?

Thanks, CC.
 

Rainer Weikusat

ccc31807 said:
[...]

My script looks like this (SKIPPING THE DETAILS):
my %big_data_hash;
while (<BIG>)    { chomp; my ($id, $name) = split /,/; $big_data_hash{$id} = $name; }
while (<CHANGE>) { chomp; my ($id, $name) = split /,/; $big_data_hash{$id} = $name; }
foreach my $id (keys %big_data_hash) {
    print OUT qq($id,$big_data_hash{$id}\n);
}

This seems wasteful to me, loading several hundred thousand lines of
data in memory just to make a few changes. Is there any way to tie
the data file to a hash and make the changes directly?

For a text file, no, since insertion or removal of characters affects
the relative positions of all characters after the place where the
change occurred. But your algorithm could be improved: instead of
reading the data file and the changes file into memory completely,
changing the 'data hash' and looping over all keys of that to generate
the modified output, you could read the change file (which is
presumably much smaller) into memory and then process the data file
line by line, applying changes 'on the go' where necessary, i.e.
(uncompiled):

my %change_hash;
my ($id, $name); # don't create new variables for every iteration

while (<CHANGE>) {
    chomp;
    ($id, $name) = split /,/;
    $change_hash{$id} = $name;
}

while (<BIG>) {
    chomp;
    ($id, $name) = split /,/;
    $name = $change_hash{$id} if exists($change_hash{$id});

    print OUT qq($id,$name\n);
}
 

Rainer Weikusat

bugbear said:
[...]

Any improvement would need a change in the file structure - the big win would come from NOT having to modify
a significant number of the disc blocks that represent the file.

This would involve techniques such as indexing, trees of blocks, fixed size padding of the data, or having "pad" areas
to avoid always having to shuffle the data up and down on edits.
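As a rough illustration of the fixed-width idea (the record length, file
names and the assumption of dense, 1-based IDs are made up for the
example, not taken from the original problem): if every record is padded
to the same length, record N starts at byte (N-1) * record-length and can
be overwritten in place with seek and print.

use strict;
use warnings;

my $RECLEN = 32;   # hypothetical fixed record length, including the newline

open my $chg, '<',  'changes.csv' or die $!;   # "id,name" per line
open my $fh,  '+<', 'big.fixed'   or die $!;   # data file already stored as fixed-width records
while (my $line = <$chg>) {
    chomp $line;
    my ($id, $name) = split /,/, $line, 2;
    # each "id,name" pair must fit into $RECLEN - 1 characters
    seek $fh, ($id - 1) * $RECLEN, 0 or die $!;
    printf {$fh} "%-*s\n", $RECLEN - 1, "$id,$name";   # overwrite exactly one record in place
}
close $fh or die $!;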

I believe Oracle make a piece of software that does this.

s/makes/bought/. It is called BerkeleyDB. Any of the other freely
available 'hashed database' packages (eg, GDBM) should be usable as
well.
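
A minimal sketch of that approach with DB_File (file names are
placeholders): once the records live in a hashed database file, applying
the change file only touches the affected entries instead of rewriting
everything.

use strict;
use warnings;
use Fcntl;
use DB_File;

tie my %data, 'DB_File', 'big.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie big.db: $!";

open my $chg, '<', 'changes.csv' or die $!;
while (<$chg>) {
    chomp;
    my ($id, $name) = split /,/, $_, 2;
    $data{$id} = $name;   # updates the database file directly
}
close $chg;
untie %data;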
 

Hans Mulder

bugbear said:
[...]

Any improvement would need a change in the file structure - the big win would come from NOT having to modify
a significant number of the disc blocks that represent the file.

This would involve techniques such as indexing, trees of blocks, fixed size padding of the data, or having "pad" areas
to avoid always having to shuffle the data up and down on edits.

I believe Oracle make a piece of software that does this.

s/makes/bought/. It is called BerkeleyDB. Any of the other freely
available 'hashed database' packages (eg, GDBM) should be usable as
well.

Oracle have also bought a product called MySQL, which may or may not be
what bugbear was thinking of.

-- HansM
 

ccc31807

Rainer Weikusat said:
change occurred. But your algorithm could be improved: instead of
reading the data file and the changes file into memory completely,
changing the 'data hash' and looping over all keys of that to generate
the modified output, you could read the change file (which is
presumably much smaller) into memory and then process the data file
line by line, applying changes 'on the go' where necessary, i.e.

You would think so, anyway. This was the first thing I tried, and it turns out (on my setup at least) that printing the outfile line by line takes a lot longer than dumping the whole thing into memory then printing the DS once.

I also thought of using the ID as an index to an array and tying the disk file to an array, but to be honest I was just too lazy to try it. The array would be very sparse (several 100k rows out of a potential 10m array, IDs can go as high as 99999999) and it seemed more wasteful than using a hash with only the number of keys that I actually have.

It's not a big deal, it wouldn't matter if it took 5 seconds to run or 5 minutes to run, as long as it produces the correct results.

CC.
 

Rainer Weikusat

ccc31807 said:
You would think so, anyway. This was the first thing I tried, and it
turns out (on my setup at least) that printing the outfile line by
line takes a lot longer than dumping the whole thing into memory
then printing the DS once.

You are both reading and printing the output file 'line by line', at
least insofar as the (pseudo-)code you posted accurately represents the
code you are actually using. Consequently, your statement above
doesn't make sense, except insofar as it communicates that you tried
something which didn't work as you expected and that you (judging
from the text above) don't really have an idea why.

As can be determined by some experiments, constructing a big hash and
doing a few lookups on that is less expensive than constructing a
small hash and doing a lot of lookups on that.

The data file (d0) was created with

perl -e 'for ("a" .. "z", "A" .. "Z") { print $n++ . ",$_\n" }'

and by concatenating the output of that with itself multiple times (for
a total of 468 lines); the 'changes file' (d1) was

--------
17,X
41,y
22,W
--------

Provided relatively few replacements have to be done 'early on', the
basic idea of using a hash of changes is indeed faster than using a
data hash. Otherwise, things aren't that simple.

----------------
use Benchmark;

open($out, '>', '/dev/null');

timethese(-5,
          { ccc => sub
            {
                my ($fh, %h, $id, $d);

                open($fh, '<', 'd0');
                while (<$fh>) {
                    ($id, $d) = split /,/;
                    $h{$id} = $d;
                }
                $fh = undef;

                open($fh, '<', 'd1');
                while (<$fh>) {
                    ($id, $d) = split /,/;
                    $h{$id} = $d;
                }

                for (keys(%h)) {
                    print $out ($_, ',', $h{$_});
                }
            },

            sane => sub
            {
                my ($fh, %h, $id, $d, $v);

                open($fh, '<', 'd1');
                while (<$fh>) {
                    ($id, $d) = split /,/;
                    $h{$id} = $d;
                }
                $fh = undef;

                open($fh, '<', 'd0');
                while (<$fh>) {
                    ($id, $d) = split /,/;

                    $v = $h{$id};
                    print $out ($id, ',', $d), next unless defined($v);

                    print $out ($id, ',', $v);
                    delete $h{$id};   # drop applied changes so the hash can become empty
                    last unless %h;   # all changes applied; copy the rest verbatim below
                }

                print $out ($_) while <$fh>;
            }});
 

Rainer Weikusat

[...]
As can be determined by some experiments, constructing a big hash and
doing a few lookups on that is less expensive than constructing a
small hash and doing a lot of lookups on that.

As a quick addition to that: this is also too simplistic, because the
original code does exactly as many hash lookups, except that most of
them are successful. Judging from a few more tests, using 'small
integers' as hash keys doesn't seem to be something the perl hashing
algorithm likes very much, eg,

$h{17} = 'X';
$h{22} = 'Y';

will put (for 5.10.1) both data items in the same slot.
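
A quick way to see this (the behaviour differs between perl versions:
before 5.26, a hash evaluated in scalar context reported used/allocated
buckets, while newer perls just report the number of keys):

use strict;
use warnings;

my %h;
$h{17} = 'X';
$h{22} = 'Y';
print scalar(%h), "\n";   # e.g. "1/8" on older perls means both keys share one bucket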
 

Rainer Weikusat

Ben Morrow said:
If the numbers in the file are consecutive and contiguous, you could use
Tie::File. Otherwise you would be better off using some sort of database
in place of the large file.

That's going to suffer from the same problem as the 'put the changes
into a hash' idea I posted earlier: searching a small hash for a
particular key a large number of times, with most of these searches
being unsuccessful, is going to be slower than building a large hash
and (successfully) searching for a small number of keys in that. And
since there's no way to determine the key of a particular 'big file'
line except by reading this line (which implies reading everything up to
this line) and parsing it, and no way to generate the output stream
except by writing out all 'new' lines in the order they are supposed
to appear, it won't be possible to save any I/O in this way.

There are a number of possibilities here, but without knowing more
about the problem it is not really possible to make a sensible
suggestion (eg, what is supposed to be saved, memory or execution time?
Is it possible to change the process generating the 'big files'? If
not, how often is a file created and how often processed?).
 

C.DeRykus

You would think so, anyway. This was the first thing I tried, and it turns out (on my setup at least) that printing the outfile line by line takes a lot longer than dumping the whole thing into memory then printing the DS once.

I also thought of using the ID as an index to an array and tying the disk file to an array, but to be honest I was just too lazy to try it. The array would be very sparse (several 100k rows out of a potential 10m array, IDs can go as high as 99999999) and it seemed more wasteful than using a hash with only the number of keys that I actually have.

...
It's not a big deal, it wouldn't matter if it took 5 seconds to run or 5 minutes to run, as long as it produces the correct results.
...

Since speed isn't critical, the Tie::File suggestion
would simplify the code considerably. Since the whole
file isn't loaded, big files won't be problematic and
any changes to the tied array will update the file
at once. However, IDs will sync to actual file line
numbers, and Tie::File will automatically create empty
lines in the file if the array is sparse.

eg:

use Tie::File;

tie my @array, 'Tie::File', 'd0' or die $!;

open(my $fh, '<', 'd1') or die $!;
while (<$fh>) {
    chomp;
    my ($id, $value) = split /,/;
    $array[$id - 1] = "$id,$value";
}
 

Rainer Weikusat

[...]
Since speed isn't critical, the Tie::File suggestion
would simplify the code considerably.
[...]

use Tie::File;

tie my @array, 'Tie::File', 'd0' or die $!;

open(my $fh, '<', 'd1') or die $!;
while (<$fh>) {
    chomp;
    my ($id, $value) = split /,/;
    $array[$id - 1] = "$id,$value";
}

Including 1361 lines of code stored in another file does not
'simplify the code'. It makes it a hell of a lot more complicated. Assuming
that speed doesn't matter, a simple implementation could look like
this:

sub small
{
    my ($fh, %chgs);

    open($fh, '<', 'd1');
    %chgs = map { split /,/ } <$fh>;

    open($fh, '<', 'd0');
    /(.*),(.*)/s, print ($1, ',', $chgs{$1} // $2) while <$fh>;
}

It can be argued that 'using Tie::File', provided the semantics match,
makes the task of the programmer easier but not the code, and even this
isn't necessarily true --- Tie::File surely makes it easier to shoot
oneself in the foot here, see the 'Deferred Writing' section in the
manpage --- because reading through more than six pages of technical
documentation then also becomes part of the problem. But this shifts
the issue to a different plane: the problem is no longer a technical
one but a personal one -- how can $person with $skills get this done
with as little (intellectual) effort as possible. And while this may
well be a valid concern (eg, if someone has to solve the problem
quickly for his own use) it doesn't translate into a universal
recommendation.
 

Ted Zlatanov

c> You would think so, anyway. This was the first thing I tried, and it
c> turns out (on my setup at least) that printing the outfile line by
c> line takes a lot longer than dumping the whole thing into memory then
c> printing the DS once.

I have never experienced this. Could you, for instance, be reopening
the change file repeatedly? Any chance you could post that slow version
of the code?

I would recommend, if you are stuck with the text-based data files,
using perl -p -e 'BEGIN { # load your change file } ... process ...'

This doesn't have to be a one-liner, but it's a good way to quickly test
the "slow performance" issue, e.g.

perl -p -e 's/^5,.*/5,edward/' myfile > myrewrite

If that's slow, something's up.
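
Spelled out, the BEGIN-block variant might look something like this
(file names are placeholders and the sketch is untested):

perl -pe '
    BEGIN {
        open my $c, "<", "changes.csv" or die $!;
        %chg = map { chomp; split /,/, $_, 2 } <$c>;
    }
    my ($id) = split /,/;
    $_ = "$id,$chg{$id}\n" if exists $chg{$id};
' big.csv > new.csv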

Ted
 

Rainer Weikusat

Rainer Weikusat said:
[...]
Since speed isn't critical, the Tie::File suggestion
would simplify the code considerably.
[...]

use Tie::File;

tie my @array, 'Tie::File', 'd0' or die $!;

open(my $fh, '<', 'd1') or die $!;
while (<$fh>) {
    chomp;
    my ($id, $value) = split /,/;
    $array[$id - 1] = "$id,$value";
}
[...]

Assuming that speed doesn't matter, a simple implementation could
look like this

sub small
{
    my ($fh, %chgs);

    open($fh, '<', 'd1');
    %chgs = map { split /,/ } <$fh>;

    open($fh, '<', 'd0');
    /(.*),(.*)/s, print ($1, ',', $chgs{$1} // $2) while <$fh>;
}

As an afterthought: instead of guessing at what's taking the time when
executing the code above, I tested it. The 'small_hash'
implementation below (with data files constructed in the way I
described in an earlier posting) is either faster than big_hash or
runs at comparable speeds (tested with files up to 1004K in size). It
can also process a 251M file, which the big_hash one can't do within a
reasonable amount of time because it first causes perl to eat all RAM
available on the system where I tested this and then sends the system
into 'heavy thrashing' mode because 'all available RAM' is - by far -
not enough.

----------------
use Benchmark;

open($out, '>', '/dev/null');

timethese(-5,
          {
              big_hash => sub {
                  my ($fh, %data, $k, $d);

                  open($fh, '<', 'd0');
                  %data = map { split /,/ } <$fh>;

                  open($fh, '<', 'd1');
                  while (<$fh>) {
                      ($k, $d) = split /,/;
                      $data{$k} = $d;
                  }

                  print $out ($_, ',', $data{$_}) for keys(%data);
              },

              small_hash => sub {
                  my ($fh, %chgs, $k, $d);

                  open($fh, '<', 'd1');
                  %chgs = map { split /,/ } <$fh>;

                  open($fh, '<', 'd0');
                  while (<$fh>) {
                      ($k, $d) = split /,/;
                      print $out ($k, ',', $chgs{$k} // $d);
                  }
              }});
 

BobMCT

Just a thought, but did you ever consider loading the data into a
temporary indexed database table and 'batch' updating it using the
indexing keys? Then you could dump the table to a flat file when
done. You should be able to use shell commands to dump, run the php
script, then dump the table to a file.

Just my $.02 worth
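
A rough sketch of that temporary-table idea using DBI with an on-disk
SQLite database (the module choice, table name and file names are
assumptions):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=names.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS names (id INTEGER PRIMARY KEY, name TEXT)');

# load the big file once, inside a single transaction for speed
$dbh->begin_work;
my $ins = $dbh->prepare('INSERT OR REPLACE INTO names (id, name) VALUES (?, ?)');
open my $big, '<', 'big.csv' or die $!;
while (<$big>) { chomp; $ins->execute(split /,/, $_, 2) }
$dbh->commit;

# apply the handful of changes via the primary key
my $upd = $dbh->prepare('UPDATE names SET name = ? WHERE id = ?');
open my $chg, '<', 'changes.csv' or die $!;
while (<$chg>) { chomp; my ($id, $name) = split /,/, $_, 2; $upd->execute($name, $id) }

# dump the table back out as a flat file
open my $out, '>', 'big.new.csv' or die $!;
my $sel = $dbh->prepare('SELECT id, name FROM names ORDER BY id');
$sel->execute;
while (my ($id, $name) = $sel->fetchrow_array) { print {$out} "$id,$name\n" }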
 

Rainer Weikusat

BobMCT said:
Just a thought, but did you ever consider loading the data into a
temporary indexed database table and 'batch' updating it using the
indexing keys?

As I wrote in a reply to an earlier posting: This would be a
perfect job for one of the available 'flat file' database packages,
eg, DB_File. But unless the same 'base data' file is processed more
than once, this means 'read the big file', 'write a big file', 'read
this big file', 'write another big file' and the replacement step
would turn into 'modify the big file'. I doubt that this would be
worth the effort.
 

Xho Jingleheimerschmidt

Since speed isn't critical, the Tie::File suggestion would simplify
the code considerably. Since the whole file isn't loaded, big files
won't be problematic

I haven't used it in a while, but if I recall correctly Tie::File stores
the entire table of line-number/byte-offset in RAM, and that can often
be about as large as storing the entire file if the lines are fairly short.

Xho
 

C.DeRykus

I haven't used it in a while, but if I recall correctly Tie::File stores
the entire table of line-number/byte-offset in RAM, and that can often
be about as large as storing the entire file if the lines are fairly short.

Actually IIUC, Tie::File is more parsimonious of memory than even DB_File for instance and employs a
"lazy cache" whose size can be user-specified.

See: http://perl.plover.com/TieFile/why-not-DB_File
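
For example, the cache ceiling can be set explicitly when tying (the
20 MB figure here is just an illustrative value):

use Tie::File;
tie my @array, 'Tie::File', 'd0', memory => 20_000_000 or die $!;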

So, even with overhead of 310 bytes per record, that
would get slow only if the file gets really huge and
least-recently read records start to get tossed.
But the stated aim was accuracy rather than speed.

And, since there's a 10Mb record limit with only 200-300K records, that's unlikely to be show-stopper status. Only a couple of seconds to read a comparably sized file in my simple test.
 

Rainer Weikusat

[...]
Tie::File is more parsimonious of memory than even DB_File for instance and employs a
"lazy cache" whose size can be user-specified.

See: http://perl.plover.com/TieFile/why-not-DB_File

So, even with overhead of 310 bytes per record, that
would get slow only if the file gets really huge and
least-recently read records start to get tossed.
But the stated aim was accuracy rather than speed.

Nevertheless, Tie::File not only needs *much* more memory than a
line-by-line processing loop (roughly 138M vs ~5000 bytes for a 63M
file) but is also atrociously slow: replacing 10 randomly selected
lines in a 53,248-line file with a total size of 251K takes (on the
system where I tested this) about 0.02s when reading it line by line
but about 0.51s when using Tie::File (and it is probably still
completely unsuitable for solving the original problem to begin with).
 

C.DeRykus

[...]

Tie::File is more parsimonious of memory than even DB_File for instance and employs a
"lazy cache" whose size can be user-specified.

See: http://perl.plover.com/TieFile/why-not-DB_File

So, even with overhead of 310 bytes per record, that
would get slow only if the file gets really huge and
least-recently read records start to get tossed.
But the stated aim was accuracy rather than speed.

Nevertheless, Tie::File not only needs *much* more memory than a
line-by-line processing loop (roughly 138M vs ~5000 bytes for a 63M
file) but is also atrociously slow: replacing 10 randomly selected
lines in a 53,248-line file with a total size of 251K takes (on the
system where I tested this) about 0.02s when reading it line by line
but about 0.51s when using Tie::File (and it is probably still
completely unsuitable for solving the original problem to begin with).

In general I'd agree. But there's an upper bound of 10M records. If that scenario changed or some threshold was impacted, you could re-design. But, who cares here if you lose a second of runtime... or memory bumps during that short window. The OP said accuracy - not speed - was the objective: "it wouldn't matter if it took 5 seconds to run or 5 minutes to run, as long as it produces the correct results."

The code becomes simpler, more intuitive, timelier. You can quickly move on.... to more pressing/interesting/challenging issues.
 

Rainer Weikusat

[...]
In general I'd agree. But there's an upper bound of 10M records. If
that scenario changed or some threshold was impacted, you could
re-design. But, who cares here if you lose a second of runtime... or
memory bumps during that short window. The OP said accuracy - not
speed - was the objective: "it wouldn't matter if it took 5 seconds
to run or 5 minutes to run, as long as it produces the correct
results."

The code becomes simpler, more intuitive, timelier. You can quickly
move on... to more pressing/interesting/challenging issues.

The code does not 'become simpler'; it becomes a lot more complicated.
Not even the 'front-end code' which needs to be written specifically for
this is shorter than a sensible (meaning: it performs well)
implementation without Tie::File, since it was 8 lines of code in both
cases. A 'performance doesn't matter' implementation can be shorter
than that, as demonstrated. IMO, this is really an example of using a
module because it exists, even though it isn't suitable for solving the
described problem, is a lot more complicated than just using the
facilities already provided by perl and is vastly technically inferior
to these as well.
 
