Thanks, I'll work on improving this to help you help me more.
And even better, trimmed-down code is good for benchmarking and other
experimentation.
They were initially there because we have an ETL (Extract, Transform,
Load) tool that picks them up, and it was determined that this tool
could optimally use 10 threads to gather these processed records. It
would be best if these 10 records were the same size (or near enough).
I'm working on a 16-CPU machine, so that's why I created 16 forked
processes.
Since your processes have a mixed workload, needing to do both Net::FTP
(probably I/O bound) and per-line processing (probably CPU bound), it might
make sense to use more than 16 of them.
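Something like this, say (just a sketch; the factor of 2 is an assumed
starting point you'd tune by measuring throughput, not a recommendation):

use Parallel::ForkManager;
# Oversubscribe the CPUs a bit, since the children spend part of
# their time blocked on FTP I/O rather than burning CPU. The factor
# of 2 here is an assumption to tune, not a measured value.
my $ncpu = 16;
my $pm = Parallel::ForkManager->new( 2 * $ncpu );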
....
So I was thinking of creating a counter, incrementing it in the loop
that spawns the processes, and taking this counter mod 16. The mod'd
counter would be a value passed to the betaParse script/function, and
the parsing would use this value to choose which filehandle it writes
to.
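Roughly this, I mean (a sketch; betaParse's real interface and the list
of input files are assumptions on my part):

use Parallel::ForkManager;
my $pm = Parallel::ForkManager->new(16);
my $counter = 0;
foreach my $file (@files_to_parse) {   # assumed list of input files
    my $slot = $counter++ % 16;        # 0..15, reused round-robin
    $pm->start($slot) and next;        # parent moves on; child continues
    betaParse($file, $slot);           # assumed: picks filehandle by slot
    $pm->finish();
}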
Don't do that. Let's say you start children 0 through 15, writing to
files 0 through 15. Child 8 finishes first, so ForkManager starts child
16, which tries to write to file 16 % 16, i.e. file 0. But of course file
0 is still being used by child 0. If you wish to avoid doing a flock for
every row, you need to mandate that no two children can be using the same
file at the same time.
I see two good ways to accomplish that. The first is simply to have each
child, as one of the first things it does, loop through a list of
filenames. For each, it opens the file and attempts a nonblocking flock.
Once it finds a file it can successfully flock, it keeps that lock for the
rest of its life, and uses that filehandle for output.
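A sketch of that first approach (the file names and path are made up;
LOCK_EX and LOCK_NB come from Fcntl):

use Fcntl qw(:flock);
# Each child scans the candidate files and claims the first one it
# can lock, then holds that lock (and filehandle) until it exits.
my $fh;
foreach my $name ("file01" .. "file20") {   # assumed candidate files
    open my $try, ">>", "/tmp/$name" or die $!;
    if (flock $try, LOCK_EX | LOCK_NB) {    # nonblocking attempt
        $fh = $try;                          # claimed it; hold for life
        last;
    }
    close $try;                              # a sibling has it; move on
}
die "no free output file" unless $fh;
# ... all of this child's output now goes to $fh ...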
The other way is to have ForkManager (in the parent) manage the files that
the children will write to. This has the advantage that, as long as there
is only one parent process running at once, you don't actually need to do
any flocking in the children, as the parent ensures they don't interfere
(but I do the locking anyway if I'm on a system that supports it; better
safe than sorry):
use Fcntl qw(:flock);          # for LOCK_EX and LOCK_NB
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(16);

# Tokens for the output files.
my @outputID = ("file01" .. "file20");  # needs to be >= 16, of course

# Put the token back into the queue once the child is done.
# The callback's third argument is the ident passed to start().
$pm->run_on_finish( sub { push @outputID, $_[2] } );

#...
foreach my $whatever (@whatever) {
    # Get the next available token for output.
    my $oid = shift @outputID or die;
    $pm->start($oid) and next;   # parent keeps looping; child falls through

    open my $fh, ">>", "/tmp/$oid" or die $!;
    # Hold the lock for life.
    flock $fh, LOCK_EX | LOCK_NB or die "Hey, someone is using my file!";
    #...
    while (<$in>) {              # $in: input handle, opened in elided code
        #...
        print $fh $stuff_to_print;
    }
    close $fh or die $!;
    $pm->finish();
}
$pm->wait_all_children;          # reap any children still running
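(The $_[2] in the run_on_finish callback is the ident that was passed to
$pm->start, which is how each token finds its way back onto the queue; the
wait_all_children at the end just makes the parent stick around for the
stragglers.)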
Yeah, I never thought it would be easy.
Theme song from grad school days: "No one said it would be easy, but no one
said it would be this hard."
Xho