optimize log parsing

  • Thread starter it_says_BALLS_on_your forehead

Tassilo v. Parseval

Also sprach (e-mail address removed):
I agree that if you are going to do something incredibly silly, like
parallelizing something that has no legitimate need to be parallelized,
then using ForkManager is probably not the best choice. But then again,
doing silly things is generally not the best choice in the first place,
unless your goal is to be silly (which is a noble goal in itself,
sometimes).

Being silly sometimes helps in making a point. :)
Hardly equivalent. The equivalent thread implementation to the above fork
code would be to spawn one thread per item.

No, it wouldn't. The point is that spawning of a new process per data
item is necessary with Parallel::ForkManager but not with threads since
they can be reused due to their ability to share data. It would be
deliberate stupidity to spawn off a new thread per item instead of
reusing the existing ones.
The equivalent fork implementation to the below threaded code would be
to use forks::shared (which, BTW, is even worse than the ForkManager
way).

That only appears to work for a small set of shared data. With 10,000
items the forks version merely sits there and doesn't appear to do
anything.
use threads;
use threads::shared;

use constant NUM_THREADS => 30;

my @queue : shared = 1 .. shift;
my @threads;

push @threads, threads->new("run") for 1 .. NUM_THREADS;
$_->join for @threads;

sub run {
    while (defined(my $element = shift @queue)) {
        print "$element\n";
    }
}

On my machine I get: ...
If you increase the number further to, say, 10,000, it's already

[processes]
real 0m45.605s
user 0m24.320s
sys 0m21.130s

[threads]
real 0m8.671s
user 0m1.090s
sys 0m7.580s

Parallelization is inherently an optimization step. As such, there is
no general solution and the appropriate way to parallelize is highly
dependent on the details of what is to be parallelized. If I wanted to
parallelize something like your test case, consisting of a large number of
very very fast operations, I would use the unpublished "Parallel_Proc"
module, which divides the work into chunks up front.

I am sure this is all very nice but also, as you say, unpublished. :)
use Parallel_Proc;

my $pm   = Parallel_Proc->new();
my @data = 1 .. shift;

my ($l, $r) = $pm->spawn(30, scalar @data);
foreach (@data[$l .. $r]) {
    print "$_\n";
}
$pm->harvest(sub { print $_[0] });   # pass each child's output to the parent's STDOUT
$pm->Done();

This is already going in the direction of MPI, where the global pool of
data is (ideally) distributed evenly among the processors once, which
then do their work. After each one has computed its sub-result, the
results are gathered again.

But this is a different domain altogether. Threads and fork() are useful
for those problems that are not CPU- but network- or event-bound. As
such their job is to make a program more responsive by doing things in
parallel that are slow because a resource is involved that may be shared
without making each request even slower.

I am not even sure that the problem at hand (namely parsing files) is
one of those. If the files are on separate disks it could benefit from
using as many threads/processes as there are disks. More parallel units
are likely to make it slower than a single-threaded application because
of additional overhead incurred by the reader's head jumping around on
the disk.
While this is true, it is not particularly relevant to the poster's
problem. There are cases where threading wins hands down. This original
problem is not one of them.

It was undoubtedly a contrived example I chose. The work per work unit
was so minuscule that the particular overhead of each of the two
solutions became the dominating factor.
I somewhat agree. Parallel::ForkManager was (apparently) designed so that
you can usually take code originally written to be serial and make it
parallel by simply adding 2 carefully-placed lines (plus 3 house-keeping
lines). The threaded code is pretty much written from the ground up to be
threaded. The threaded code structure tends to be dominated by the
threading, while the ForkManager code tends to be dominated by whatever you
are fundamentally trying to do, with just a few lines making a nod to the
parallelization. This makes it easier to thoughtlessly add code that breaks
parallelization under ForkManager. So when I substantially refactor code
that uses ForkManager, I simply remove the parallelization, refactor the
code as serial code, then add ForkManager back in at the end.
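The "2 carefully-placed lines plus 3 house-keeping lines" pattern looks roughly like this, using Parallel::ForkManager's documented API (@files and process() stand in for whatever the original serial loop did):

```perl
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(30);   # house-keeping: cap concurrent children

foreach my $file (@files) {
    $pm->start and next;    # carefully-placed line 1: fork; parent moves on
    process($file);         # the original serial body, unchanged
    $pm->finish;            # carefully-placed line 2: child exits here
}
$pm->wait_all_children;     # house-keeping: reap all the children
```

Everything except the marked lines is the serial program, which is exactly why it is easy to later edit the loop body without noticing it now runs in a short-lived child.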

Parallelizing is a huge intellectual problem for every programmer and
many parallel programs are inherently hard to understand. I haven't yet
found a paradigm that is truly intuitive. The best I've come across so
far is Ada's task-oriented approach but I've seen no other programming
language using this model.

Second best is threads but an already existing serial solution needs to
be rewritten to fit into it.

Then there are processes which are good when no communication between
those is required. Once pieces of data have to be exchanged it gets
ugly. The code is inflated with boring code that keeps the processes
synchronized, reads from pipes, makes the programmer wonder why there is
a deadlock etc.
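Even the simplest possible exchange shows the inflation: passing one value back from one child already requires a pipe, careful closing of the unused ends (forgetting the parent's write end is a classic source of the deadlocks mentioned above), and a waitpid. A minimal sketch:

```perl
use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe: $!";

my $pid = fork();
die "fork: $!" unless defined $pid;

if ($pid == 0) {            # child: compute, write the result, exit
    close $reader;
    print {$writer} "42\n";
    close $writer;
    exit 0;
}

close $writer;              # parent: close its copy, or the read never sees EOF
my $result = <$reader>;
close $reader;
waitpid($pid, 0);           # reap the child to avoid a zombie
print "child said: $result";
```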

The fourth approach, explicit message-passing through means like MPI, I
don't really count as this is clearly targeted at scientific computation
and requires spiffy multi-processor mainframes to be beneficial.
Oh, one more thing I discovered. Threaded code with a shared queue is
tricky to do if the queue holds references or objects.

If you change the queue to:

my @queue : shared = map {[$_]} 1 .. shift;

Then it dies with "Invalid value for shared scalar". Since the forking
code doesn't use shared values, it doesn't have this particular problem.

You can circumvent this with the rather ugly:

my @queue : shared = map {my $x =[$_]; share $x; $x} 1 .. shift;

With blessed references, even this doesn't work.

Yes, objects don't play nice with threads as of now. The manpage of
threads::shared says this is going to be fixed some day. We'll see.
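For what it's worth, newer versions of threads::shared ship a shared_clone() function that deep-copies a structure into shared space, which makes the queue-of-references case much less ugly (and its documentation says it handles blessed references as well):

```perl
use threads;
use threads::shared;

# shared_clone deep-copies the arrayref and its contents into shared
# space, so each queue element can itself be a reference.
my @queue : shared = map { shared_clone([$_]) } 1 .. 10;

print $queue[0][0], "\n";   # the inner arrayref is shared too
```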

Tassilo
 

xhoster

Tassilo v. Parseval said:
Also sprach (e-mail address removed):

No, it wouldn't. The point is that spawning of a new process per data
item is necessary with Parallel::ForkManager but not with threads since
they can be reused due to their ability to share data.

I would argue that it is only necessary with Parallel::ForkManager if you
can't be bothered to come up with a better way to do it. (Which I usually
can't be bothered to do, as the overhead is rarely significant compared
to the overall task.) The things I've done generally fall into two classes:
either you can pre-distribute the tasks and use what you referred to as the
MPI-like way, or the tasks are substantial enough that the forking is
irrelevant.
It would be
deliberate stupidity to spawn off a new thread per item instead of
reusing the existing ones.

Deliberately stupid or not, it is certainly much more convenient to
get a return value from "join" than to create both a producer
queue for your threads and some kind of response queue so the threads
can respond without returning, and then come up with some mechanism to
correlate the responses in the response queue with the requests in the
request queue.
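That request/response plumbing looks something like this with Thread::Queue (the "id:payload" string tagging is just one ad-hoc way to do the correlation that join() would otherwise give for free):

```perl
use threads;
use Thread::Queue;

my $requests  = Thread::Queue->new;
my $responses = Thread::Queue->new;

# Worker: read "id:payload" strings, push "id:result" strings back.
my $worker = threads->create(sub {
    while (defined(my $job = $requests->dequeue)) {
        my ($id, $n) = split /:/, $job;
        $responses->enqueue("$id:" . $n * $n);
    }
});

$requests->enqueue("$_:$_") for 1 .. 5;
$requests->enqueue(undef);          # sentinel: tell the worker to stop
$worker->join;

# Correlate replies to requests by id.
my %result;
for (1 .. 5) {
    my ($id, $val) = split /:/, $responses->dequeue;
    $result{$id} = $val;
}
print "$_ -> $result{$_}\n" for sort { $a <=> $b } keys %result;
```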

.....
Second best is threads but an already existing serial solution needs to
be rewritten to fit into it.

I think non-blocking IO is often better than threads for the tasks for
which I think threads are clearly superior to forking. Unfortunately,
non-blocking IO tends to dominate your code even more than
producer/consumer threads do.
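A taste of why it dominates: the select loop, not the task, ends up owning the control flow. A sketch with core IO::Select, multiplexing reads over several handles (self-fed pipes stand in for real sockets):

```perl
use strict;
use warnings;
use IO::Select;

my @handles;
for my $msg (qw(alpha beta gamma)) {
    pipe(my $r, my $w) or die "pipe: $!";
    print {$w} "$msg\n";
    close $w;
    push @handles, $r;
}

my $sel = IO::Select->new(@handles);
while (my @ready = $sel->can_read) {
    for my $fh (@ready) {
        if (defined(my $line = <$fh>)) {
            print "got: $line";
        } else {
            $sel->remove($fh);      # EOF: stop watching this handle
            close $fh;
        }
    }
    last unless $sel->count;        # nothing left to watch
}
```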

Then there are processes which are good when no communication between
those is required. Once pieces of data have to be exchanged it gets
ugly. The code is inflated with boring code that keeps the processes
synchronized, reads from pipes, makes the programmer wonder why there is
a deadlock etc.

Maybe my view is colored by the fact that I've already had to learn to deal
with those things, because threading wasn't a realistic option when I
needed to do a lot of the parallelization I've done. (Ironically, it still
isn't much of an option for me. On the computers I have access to, all the
multi-CPU machines were built with nonthreaded Perl and all the single-CPU
machines have threaded Perl.) Anyway, now we've largely moved from big
32-CPU machines to clusters of many two-CPU machines, so neither forking
nor threading is all that useful.

Xho
 
